Project Spelunker

_NOPE_ · June 26, 2019

***UPDATED ON 07.27.2021***

Want to help bring back the old COH Forums in a readable, searchable way? Well, now you can!

Just perform the following simple steps:

Head over to the Project Spelunker homepage - http://www.cityofplayers.com/project-spelunker/, and download the "CoH WARC Processor" (you can also optionally download the source code if you want to peek at the working guts of the program and/or want to make sure you can trust what the program is actually doing)
Unzip the program to its own folder, wherever you want
Run the program
Type in your anonymous username (for credit on the eventual Credits page!) and click the big "Process" button.

That's it! Just sit back and watch it spin! Here's what the program does, step by step:

It picks a random number between 1 and the number of WARC files still left to process, then downloads that file number to your Temp directory
It then creates a "CoH_Forums_Output" sub-directory inside of your temp directory (if it doesn't yet exist)
It starts at the number 1, checks to see if there's a Temp\CoH_Forums_Output\1\ folder. If there isn't, it creates one. If there is, it increments the number until it finds an unused number, and creates a temp working directory. This allows multiple instances of the program to run without them crashing into each other.
It then "extracts" the WARC data, restoring the ORIGINAL files that Archive.org archived from the old COH forums back in 2012, to that working directory.
It then ZIPs all of those files up, and sends them back to my server.
Finally, the program cleans up after itself, deleting all of the temp files, queries the FTP server to see how many files are left, picks a new random number and starts all over again
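For the curious, the working-directory step in the list above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `claim_working_dir` is made up, and the real program may do this differently. The idea it shows is that creating the numbered folder and checking for it happen in one call, so two instances can never grab the same number.

```python
import os


def claim_working_dir(base_dir: str) -> str:
    """Create and return the first unused numbered folder under base_dir.

    Hypothetical helper: the name and signature are assumptions, not taken
    from the actual CoH WARC Processor source.
    """
    # Make sure the output root (e.g. Temp\CoH_Forums_Output) exists.
    os.makedirs(base_dir, exist_ok=True)
    n = 1
    while True:
        candidate = os.path.join(base_dir, str(n))
        try:
            # os.mkdir raises FileExistsError if the folder is already there,
            # so only one instance can ever claim a given number.
            os.mkdir(candidate)
            return candidate
        except FileExistsError:
            n += 1
```

Calling it with something like `os.path.join(tempfile.gettempdir(), "CoH_Forums_Output")` would hand back the `1` folder for the first instance, `2` for the second, and so on.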

It will keep doing this until either we have processed all of the files or you click the "Stop After This Iteration" button, which, as the name suggests, stops processing once the current iteration is complete.
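A "stop after this iteration" button is usually just a flag that is checked between iterations and never in the middle of one. Here is a minimal Python sketch of that pattern; `run_until_stopped` and `process_one` are hypothetical names, and nothing here is taken from the real program.

```python
import threading
from typing import Callable


def run_until_stopped(process_one: Callable[[], bool],
                      stop_requested: threading.Event) -> int:
    """Run process_one repeatedly and return how many iterations completed.

    Assumption: process_one handles one whole file (download, extract, zip,
    upload, clean up) and returns False when there are no files left.
    """
    completed = 0
    while True:
        if not process_one():
            break  # nothing left to process
        completed += 1
        # The flag is only looked at here, between iterations, so the file
        # currently being processed always finishes first.
        if stop_requested.is_set():
            break
    return completed
```

A GUI button handler would simply call `stop_requested.set()` from its own thread.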

If you're curious to see what the program is extracting, you can find all of the output files under the Temp\CoH_Forums_Output\#\ folder, where # is the current running instance. But please, just don't mess with or touch these files, or it might interfere with the process!

Any questions, comments, concerns, bugs, or errors, don't hesitate to contact me, either in this thread or via PM.

And thanks for Spelunking!

Old Outdated Info below...

Want to be able to search the old CoH forums? Me too! So did Zep! So, I decided to take the data from the Internet Archive, parse it, and try to provide access to it to everyone! I decided to start "Project Spelunker" in order to coordinate these efforts. What is there to coordinate? Well, as I've discovered, it's going to be a LONG long process to get all of this data parsed and uploaded to a remote server accessible by all. There's over 18 THOUSAND files, and they each take quite some time to parse and upsert into the database. Who knows how long this would take me if I tried to parse it all alone on my own PC. So, I've made a Parsing program that I've released for anyone to download, that'll let YOU help in the parsing process to bring the old CoH Forums back... sort of... at least it'll be searchable...

If you don't want to help in the project, and just want to see what we've done so far, check out the second post below to see about the Project Spelunker Viewer. However, if you'd like to help with Project Spelunker, here's what you can do:

~~[*]Support my Patreon! http://www.cityofplayers.com/2019/07/04/please-support-city-of-players/ (not required to do the rest, but we have to meet the goal if we want to have a server!)~~

[*]Download the Internet Archive copy of the forums. There are several options - CityOfPlayers.com hosted Torrent (453 KB), CityOfPlayers.com Direct HTTP Download (95 GB), Internet Archive Direct HTTP Download (230 GB), Internet Archive hosted Torrent (1.1 GB). These are in order of recommended download. You only have to download one of these, as EACH of them contains a FULL copy of the archived forums! Note: the entire archived forums will be about 735 GB, so if you have that space, feel free to extract it all! If you don't have that much space, you'll need to fire up the Viewer program and pick and choose which file(s) to extract that have NOT already been processed, one by one to save your space!

~~[*]Extract the Archives, using your favorite archive extractor program (I like 7Zip myself) into one directory. If you use the Internet Archive version, you'll have to extract a couple of times.~~

[*]Download the CoH Forums Parser program here. I'm going to trust the community not to decompile the program, steal from it my SQL server information, and screw with my server. That would just be... unheroic.

~~[*]Run the program, pointing it to the location where you have the archive extracted to (better make sure that you extracted them into one directory, or that'll be a slow process!)~~

~~[*]Enter the anonymous UserName by which you want to receive credit for your work.~~

[*]As it's running, the program will parse the WARC files, upsert the data into my database, and then when it's done with a file, it'll delete the local copy and notify the database that the file has been processed, thus telling everyone else's parsers to also delete their copy of that file on the next loop.

~~[*]If you run across any errors with the Parser, it will create an "\Errors\" subdirectory underneath the directory you chose. Please share the Errors.log text file that you find there in this thread!~~

That's it! After you've fired up the program, just let it sit and do its thing, that's all there is to it! When you want to stop it, just close the window and you can restart it at any time. If you want to run multiple instances of the Parser at the same time, just go ahead and start the program multiple times, and point all of those same instances to whatever folders you want to parse. On every loop, the program automatically checks for and deletes already parsed files, so there's no danger of the programs conflicting with each other.
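For anyone wondering what "upsert the data, then notify the database that the file has been processed" from the steps above means in practice, here is a rough Python sketch. Everything in it is an assumption: SQLite stands in for the real database server, and the table names, column names, and function names are invented for the example.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a stand-in database with two hypothetical tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        "file_name TEXT PRIMARY KEY, user_name TEXT NOT NULL)"
    )
    return conn


def upsert_post(conn: sqlite3.Connection, url: str, html: str) -> None:
    """Insert a row, or update it if the same URL was already stored.

    The ON CONFLICT ... DO UPDATE syntax needs SQLite 3.24 or newer.
    """
    conn.execute(
        "INSERT INTO posts (url, html) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET html = excluded.html",
        (url, html),
    )


def mark_processed(conn: sqlite3.Connection, file_name: str, user_name: str) -> None:
    """Record that a file is done; the first person to finish it keeps the credit."""
    conn.execute(
        "INSERT OR IGNORE INTO processed_files (file_name, user_name) VALUES (?, ?)",
        (file_name, user_name),
    )
    conn.commit()
```

Because the insert is an upsert, two people parsing the same file by accident just overwrite each other's rows instead of creating duplicates.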

I'll work on patches to fix up the parser when it inevitably breaks. It's literally parsing archived HTML text stored off the internet seven years ago... there are GOING to be some problems with the data!

The thread title is a reference to the Spelunker mission and badge, because we're literally spelunking into the old archives and trying to pull something back up!

Let's get Spelunking!

Edited July 27, 2021 by The Philotic Knight

_NOPE_ · June 26, 2019

Project Spelunker Viewer

This application will let you search, navigate, and view the results of Project Spelunker. Please keep in mind that this program is in BETA, and will be constantly updated and improved to add additional features as they are requested. The data is not going to be the best, but as more and more people contribute to the project and parse more and more files, more data will become available. And as more people request additional searches, we'll create additional indexing tables that will allow those searches to be performed in a timely manner (meaning, won't time out!).

Here's the breakdown of the program:

[*]Contributors Menu Item - Opens the Contributors window

[*]Search Type ComboBox - Lets you choose which search to run

[*]Value to Search TextBox - Lets you enter the value to search by

[*]Search Button - Performs the search

[*]Results Label - Shows the number of results found, when applicable

[*]Results Grid - Displays the Results

[*]Copy Context Menu - Lets you copy data from one selected cell of data

[*]Export Context Menu - Lets you export the data currently shown onscreen to an Excel file

[*]View HTML Button - Lets you view the HTML content associated with the current row, where applicable

[*]View Wayback Button - Lets you view the related URL in the Internet Archive's Wayback Machine

[*]Contributors Section - Displays the list of contributors, sorted by the number of files parsed/processed

[*]Files Processed Section - Displays information about the files that have already been processed

[*]Files To Be Processed Section - Displays information about the files that have not yet been processed

Not seeing what you want to see? Well, that's because currently only 0.2% of the forum's data has been Spelunked into the database. Help out by running the Parser whenever you can! It uses VERY few system resources!

Download the Viewer here.

Oubliette_Red · June 26, 2019

Outstanding!

+1 inf for this PK.

Regarding Step 1, does it matter which type of file we download? I see several listed.

SmalltalkJava · June 26, 2019

Holy cow! This project would work great as a grid computing program like SETI or World Community Grid

I wish I could help but I’m in Alaska on vacation until the 5th.

_NOPE_ · June 26, 2019

Regarding Step 1, does it matter which type of file we download? I see several listed.

The TAR is a direct HTTP download from their website. The TORRENT uses a torrent program to download the same darn TAR. I suggest the torrent if you have a torrent program so that you can download from multiple sources at once. That being said, REDOWNLOAD THE PARSER if you have already, because I just pushed out a critical update that does the following:

You no longer technically "have to" report completing the parsing of a file, I'm now recording that in the database as well, and I'm using a database check to see if a file's already been processed by someone before. Though, if you post here that you've parsed a file, people can still know to delete that file to save the database check.

I've also added a "UserName" function that's required, so that I can give credit at the end of the project to all of the parsers, just give whatever screen name you want to be thanked:

Tomorrow, if I have time at work, I'll think about the best way to release the front end, if I think it's ready. Heading to bed shortly.

Oubliette_Red · June 26, 2019

I would definitely like to help, could someone suggest a good torrent app?

EDIT: Picked up qBittorrent, appears it will still take a considerable amount of time to db the archive.

Amatyr · June 26, 2019

Grabbing the archive now, will leave it running while I work :D

_NOPE_ · June 26, 2019

Parser Update:

I just realized that by the default behavior, the program was going through all of the files in the directory by filename IN ORDER. So... um... that would mean that everyone would be working on the same file at the same time unless they pointed the parser into a directory that only contained certain files rather than all of them. OOPS! To reduce duplication of efforts, I added a randomizer into the program, so it'll now count the total number of files, and on every loop pick a random number and process that file. So please download and run the latest version. I might add a version check in the next version, if I have to make any more versions! But it'll just be a check and won't force an update.
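The randomizer described above boils down to "list the files, pick one at random". A minimal Python sketch of that idea follows; `pick_random_file` is a hypothetical name and the real parser's code may look nothing like this.

```python
import random
from pathlib import Path
from typing import Optional


def pick_random_file(directory: str,
                     rng: Optional[random.Random] = None) -> Optional[Path]:
    """Return one randomly chosen file from directory, or None if it is empty.

    Hypothetical helper: the optional rng argument only exists so the choice
    can be made repeatable when testing.
    """
    rng = rng or random.Random()
    # Re-list the folder on every call so files deleted since the last loop
    # are never picked.
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    if not files:
        return None
    return rng.choice(files)
```

With every contributor picking at random instead of going in filename order, the odds that two people are chewing on the same file at the same time drop a lot, although they never reach zero.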

I've also added code so that if a file's already been found to be processed, it's automatically deleted from the computer. If it was deleted because you processed it, it'll still create the skip file so that you can know which files you processed if you want to report them.

I also added the specific filename that you're processing into the console message now:

_NOPE_ · June 26, 2019

I've updated the program again, and made the following changes (OP link updated, please re-download!):

[*]Snazzy new icon based on the Spelunker badge!

[*]Rudimentary Version Check - meaning, I've got a database table that just stores the version number, and before starting each new file, the program will check to make sure that the version number is the same as the number that I have recorded in the database. This way, when bug fixes need to go out, I can force the programs to stop working, but it won't automatically update without you doing it!

[*]Deleted all of the old code that was referencing DataTables - this code was used back when I thought all of this data could fit into memory and then be pushed all at once to the database. Silly me!
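The rudimentary version check from the list above could be as small as the following Python sketch. The table name `app_version`, the constant, and the function name are all assumptions made for illustration, and SQLite is only a stand-in for the real database.

```python
import sqlite3

# Hypothetical version string baked into this build of the parser.
PARSER_VERSION = "1.0.3"


def version_is_current(conn: sqlite3.Connection, local_version: str) -> bool:
    """Return True only if the database's recorded version matches ours.

    Assumes a one-row table named app_version with a single version column.
    """
    row = conn.execute("SELECT version FROM app_version LIMIT 1").fetchone()
    return row is not None and row[0] == local_version
```

The parser would call this before starting each new file and refuse to continue when it returns False, which is what lets a bug fix stop old copies without forcing an automatic update.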

I plan to release the source code (minus the connectionString, for what should be obvious reasons), shortly so that y'all can be more confident that the program isn't doing anything shiesty.

Then, I plan on polishing up the code for the front end viewer as well, and adding into the viewer a screen that shows the current progress of the project and the contributors, so that you know how far along the project is, and who's been helping. Call it a "Hall of Fame". I intend to rank the listing by number of files processed.

Zep · June 26, 2019

Looks like I have a project when I get home from work. :)

_NOPE_ · June 26, 2019

Here's what I'm working on right now:

A modification to the parser, so that it fetches the global list of all files from my server, then fetches the list of all files that have been already processed on the server. Then, on the loop between files being processed, I'll check those values again, and automatically delete/remove any local files in the working directory that have already been processed. Should save some time and confusion.
Working on cleaning up the Viewer interface a bit, adding a Help menu option and a Contributors window that'll hold all the data about who contributed what, and running totals.
Re-downloading the archive myself to get a fresh unaltered copy to populate the "all files" list, since I'd already deleted the ones that I'd already processed! It'll be interesting to see which downloads first, the HTTP request or the torrent, as they're both going EXCRUCIATINGLY SLOWLY.
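The first item above (compare the global file list with the processed list, then clean up local copies) is basically set arithmetic. Here is a hedged Python sketch; the function names are invented, and the two lists are assumed to arrive as plain file names from the server.

```python
from pathlib import Path
from typing import Iterable, List


def remaining_files(all_names: Iterable[str],
                    processed_names: Iterable[str]) -> List[str]:
    """Names that are in the global list but not yet marked as processed."""
    return sorted(set(all_names) - set(processed_names))


def prune_processed(working_dir: str, processed_names: Iterable[str]) -> List[str]:
    """Delete local files someone has already processed; return their names.

    Hypothetical helper: matching is done on the bare file name only.
    """
    processed = set(processed_names)
    removed = []
    for path in sorted(Path(working_dir).iterdir()):
        if path.is_file() and path.name in processed:
            path.unlink()
            removed.append(path.name)
    return removed
```

Running `prune_processed` between files keeps the working directory shrinking as other contributors finish their share.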

Amatyr · June 26, 2019

Using the torrent myself - I've been downloading for 9h, 6.5hrs left to go.

Surprised that I'm not seeing any uploads from my partial files, never connected to more than the seed either that I've noticed.

edit: I see some errors from the trackers so it's actually not using BT at all to do the download.

I see this appears to be a common archive.org problem e.g. https://archive.org/post/1092178/bit-trackers-not-working

Oubliette_Red · June 26, 2019

Using the torrent myself - I've been downloading for 9h, 6.5hrs left to go.

Surprised that I'm not seeing any uploads from my partial files, never connected to more than the seed either that I've noticed.

edit: I see some errors from the trackers so it's actually not using BT at all to do the download.

I see this appears to be a common archive.org problem e.g. https://archive.org/post/1092178/bit-trackers-not-working

My experience was similar.

And when it reached 99% it reported that it was stalled. Although it also reported that 219G was dl'd which was the size of the file. I think it was just a reporting issue and that it actually completed, as the archive folder reported the same size.

Fair warning, be sure that your unzip location has enough space. I started unzipping before I realized that my destination was not my data drive with adequate storage. XD

_NOPE_ · June 26, 2019

I've uploaded the beta version of the Viewer. You can find the link in the second post of this thread, along with a basic breakdown of the program's functionality. So now you can start to see what you're working for!

_NOPE_ · June 26, 2019

I've also just updated the Parser to automatically delete files that have already been parsed. Redownload the Parser if you want this new functionality.

Oubliette_Red · June 26, 2019

Whelp, it appears to be doing something. ^.^

'boards.cityofheroes.com-threads-range-25339-20120905-082202' done

Currently parsing 'boards.cityofheroes.com-threads-range-12304-20120905-142216' file.

_NOPE_ · June 26, 2019

And our first "non-me" Contributor is on the board! Thanks Oubliette_Red!

WanderingAries · June 27, 2019

I got lost somewhere in there. Is there not a way to specifically tell it what to process and therefore allow you to dictate who takes what chunk?

_NOPE_ · June 27, 2019

Sure you can! Just move a specific file to a folder by itself and then point the program to that directory.

Zep · June 27, 2019

Torrent DL Started

Work folder created on my fastest drive. Bit big for me to make a ram drive :(

Oubliette_Red · June 27, 2019

It's a good thing you built the 'remove already-parsed file' feature into this app, PK. After a while, just listing and manually removing those will get lengthy.

_NOPE_ · June 27, 2019

Holy shniekies, how are you processing files so fast? Did you just pick a bunch of small files and stick them in a folder by themselves?

Curious about your system specs now.

Oubliette_Red · June 27, 2019

PK, I got the following error parsing a group of 5 files.

Oubliette_Red · June 27, 2019

Holy shniekies, how are you processing files so fast? Did you just pick a bunch of small files and stick them in a folder by themselves?

Curious about your system specs now.

Caught in the act. :D

I figured I'd grab some smaller ones since the second one I parsed was something like 250k.

_NOPE_ · June 27, 2019

CHEATER CHEATER PUMPKIN EATER!

Heh, that's fine, it all goes to the cause. And we're in this for the long haul. I just wonder what it would take to get others to help, because really, the program takes VERY little resources, and runs in the background, and doesn't really interfere with anything (unless they come across an error).

Speaking of which, I'll get that patched up shortly. Also, with the Viewer now, you don't really have to report the files that you processed. Hell, I probably should do away with creating those WARCSKIP files too, since I'm also using the database check, it's kind of redundant...
