Project Spelunker

 

 

Want to be able to search the old CoH forums? Me too! So did Zep! So I decided to take the data from the Internet Archive, parse it, and try to provide access to it for everyone. I started "Project Spelunker" to coordinate these efforts. What is there to coordinate? Well, as I've discovered, it's going to be a LONG process to get all of this data parsed and uploaded to a remote server accessible by all. There are over 18 THOUSAND files, and each takes quite some time to parse and upsert into the database. Who knows how long this would take if I tried to parse it all alone on my own PC. So I've released a Parser program that anyone can download, which lets YOU help in the parsing process to bring the old CoH Forums back... sort of... at least they'll be searchable...

 

If you don't want to help in the project, and just want to see what we've done so far, check out the second post below to see about the Project Spelunker Viewer. However, if you'd like to help with Project Spelunker, here's what you can do:

 

[*]Support my Patreon! http://www.cityofplayers.com/2019/07/04/please-support-city-of-players/ (not required to do the rest, but we have to meet the goal if we want to have a server!)

[*]Download the Internet Archive copy of the forums. There are several options: CityOfPlayers.com hosted Torrent (453 KB), CityOfPlayers.com Direct HTTP Download (95GB), Internet Archive Direct HTTP Download (230 GB), Internet Archive hosted Torrent (1.1 GB). These are listed in recommended order. You only have to download one of these, as EACH of them contains a FULL copy of the archived forums! Note: the entire archived forums will be about 735 GB once extracted, so if you have that space, feel free to extract it all! If you don't have that much space, you'll need to fire up the Viewer program and pick and choose which file(s) to extract that have NOT already been processed, one by one, to save space!

[*]Extract the Archives, using your favorite archive extractor program (I like 7Zip myself) into one directory. If you use the Internet Archive version, you'll have to extract a couple of times.

[*]Download the CoH Forums Parser program here. I'm going to trust the community not to decompile the program, steal my SQL server information from it, and screw with my server. That would just be... unheroic.

[*]Run the program, pointing it to the location where you have the archive extracted to (better make sure that you extracted them into one directory, or that'll be a slow process!)

[*]Enter the anonymous UserName by which you want to receive credit for your work.

[*]As it's running, the program will parse the WARC files, upsert the data into my database, and then when it's done with a file, it'll delete the local copy and notify the database that the file has been processed, thus telling everyone else's parsers to also delete their copy of that file on the next loop.

[*]If you run across any errors with the Parser, it will create an "\Errors\" subdirectory underneath the directory you chose. Please share the Errors.log text file that you find there in this thread!

That's it! After you've fired up the program, just let it sit and do its thing; that's all there is to it! When you want to stop it, just close the window; you can restart it at any time. If you want to run multiple instances of the Parser at the same time, just start the program multiple times and point each instance to whatever folders you want to parse. On every loop, the program automatically checks for and deletes already parsed files, so there's no danger of the instances conflicting with each other.
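
For anyone curious what the parse, upsert, and "mark as processed" cycle described in the steps above might look like, here is a minimal Python sketch. It is not the actual Parser's code: the real program talks to a remote SQL server, while this uses an in-memory SQLite database, and the table names (`posts`, `processed_files`), column names, and function names are all made up for illustration. SQLite's `INSERT OR REPLACE` stands in for the upsert.

```python
import sqlite3


def open_db():
    """Create a throwaway in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute(
        "CREATE TABLE processed_files (filename TEXT PRIMARY KEY, username TEXT)"
    )
    return conn


def upsert_posts(conn, posts):
    """Insert each (post_id, body) pair, replacing any row with the same post_id."""
    conn.executemany(
        "INSERT OR REPLACE INTO posts (post_id, body) VALUES (?, ?)", posts
    )
    conn.commit()


def mark_processed(conn, filename, username):
    """Record that a file is finished so other parsers can skip or delete it."""
    conn.execute(
        "INSERT OR REPLACE INTO processed_files (filename, username) VALUES (?, ?)",
        (filename, username),
    )
    conn.commit()


def is_processed(conn, filename):
    """Return True if some parser has already reported this file as done."""
    row = conn.execute(
        "SELECT 1 FROM processed_files WHERE filename = ?", (filename,)
    ).fetchone()
    return row is not None


conn = open_db()
upsert_posts(conn, [(1, "first version"), (2, "another post")])
# Parsing the same post again updates the row instead of duplicating it.
upsert_posts(conn, [(1, "re-parsed version")])
mark_processed(conn, "example-file.warc", "SomeHero")

post_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
post_one_body = conn.execute(
    "SELECT body FROM posts WHERE post_id = 1"
).fetchone()[0]
```

Because the upsert is keyed on the post ID, two people parsing the same file by accident would leave the data unchanged rather than doubling it up, which is why running several instances at once is harmless in a design like this.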

 

I'll work on patches to fix up the parser when it inevitably breaks. It's literally parsing archived HTML text stored off the internet seven years ago... there are GOING to be some problems with the data!

 

The thread title is a reference to the Spelunker mission and badge, because we're literally spelunking into the old archives and trying to pull something back up!

 

Let's get Spelunking!

 


Project Spelunker Viewer

 

This application will let you search, navigate, and view the results of Project Spelunker. Please keep in mind that this program is in BETA, and will be constantly updated and improved to add additional features as they are requested. The data is not going to be the best, but as more and more people contribute to the project and parse more and more files, more data will become available. And as more people request additional searches, we'll create additional indexing tables that will allow those searches to be performed in a timely manner (meaning, won't time out!).

 

Here's the breakdown of the program:

sEX6G9U.png

UVzPgd9.png

wuYFxbQ.png

 

[*]Contributors Menu Item - Opens the Contributors window

[*]Search Type ComboBox - Lets you choose which search to run

[*]Value to Search TextBox - Lets you enter the value to search by

[*]Search Button - Performs the search

[*]Results Label - Shows the number of results found, when applicable

[*]Results Grid - Displays the Results

[*]Copy Context Menu - Lets you copy data from one selected cell of data

[*]Export Context Menu - Lets you export the data currently shown onscreen to an Excel file

[*]View HTML Button - Lets you view the HTML content associated with the current row, where applicable

[*]View Wayback Button - Lets you view the related URL in the Internet Archive's Wayback Machine

[*]Contributors Section - Displays the list of contributors, sorted by the number of files parsed/processed

[*]File Processed Section - Displays information about the files that have already been processed

[*]Files To Be Processed Section - Displays information about the files that have not yet been processed
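
As a rough idea of how a Contributors list sorted by files processed could be built, here is a small Python sketch. The log format (a list of username and filename pairs) and the names `rank_contributors`, `HeroA`, and so on are assumptions for illustration only, not the Viewer's actual data model.

```python
from collections import Counter


def rank_contributors(processed_log):
    """Return (username, files_processed) pairs, most files first.

    Ties are broken alphabetically by username so the order is stable.
    """
    counts = Counter(username for username, _filename in processed_log)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical log of who processed which file.
log = [
    ("HeroA", "f1"),
    ("HeroB", "f2"),
    ("HeroA", "f3"),
    ("HeroC", "f4"),
    ("HeroB", "f5"),
    ("HeroA", "f6"),
]
ranking = rank_contributors(log)
```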

 

Not seeing what you want to see? Well, that's because currently only 0.2% of the forum's data has been Spelunked into the database. Help out by running the Parser whenever you can! It uses VERY few system resources!

 

Download the Viewer here.

 


Holy cow!  This project would work great as a grid computing program like SETI or World Community Grid

 

I wish I could help but I’m in Alaska on vacation until the 5th.

Regarding Step 1, does it matter which type of file we download? I see several listed.

 

The TAR is a direct HTTP download from their website. The TORRENT uses a torrent program to download the same darn TAR. I suggest the torrent if you have a torrent program so that you can download from multiple sources at once. That being said, REDOWNLOAD THE PARSER if you have already, because I just pushed out a critical update that does the following:

 

You no longer technically "have to" report completing the parsing of a file, I'm now recording that in the database as well, and I'm using a database check to see if a file's already been processed by someone before. Though, if you post here that you've parsed a file, people can still know to delete that file to save the database check.

 

I've also added a required "UserName" field, so that I can give credit at the end of the project to all of the parsers. Just give whatever screen name you want to be thanked under:

OgHMz5O.png

 

Tomorrow, if I have time at work, I'll think about the best way to release the front end, if I think it's ready. Heading to bed shortly.


I would definitely like to help, could someone suggest a good torrent app?

EDIT: Picked up qBittorrent, appears it will still take a considerable amount of time to db the archive.

 


Dislike certain sounds? Head down to Silence/Modify specific sounds in Guides.


Grabbing the archive now, will leave it running while I work  :D


Excelsior Global Channel - for your server wide chat and forming TFs, Trials, Radios, Farms, whatever you want to do - /chan_join Excelsior today!


Parser Update:

 

I just realized that by the default behavior, the program was going through all of the files in the directory by filename IN ORDER. So... um... that would mean that everyone would be working on the same file at the same time unless they pointed the parser into a directory that only contained certain files rather than all of them. OOPS! To reduce duplication of efforts, I added a randomizer into the program, so it'll now count the total number of files, and on every loop pick a random number and process that file. So please download and run the latest version. I might add a version check in the next version, if I have to make any more versions! But it'll just be a check and won't force an update.
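
The randomizer idea (count the files, pick a random number, process that file) can be sketched in Python roughly like this. The function name `pick_next_file` and the example filenames are hypothetical, and the real Parser may well do it differently.

```python
import random


def pick_next_file(filenames, rng=random):
    """Pick one file at random so parsers don't all start on the same file.

    Returns None when there is nothing left to process.
    """
    files = sorted(filenames)
    if not files:
        return None
    index = rng.randrange(len(files))  # random index from 0 to len(files) - 1
    return files[index]


# Hypothetical filenames standing in for the extracted archive files.
files = ["range-12304.warc", "range-25339.warc", "range-30001.warc"]
chosen = pick_next_file(files)
```

Two parsers can still land on the same file by chance, so the database check for already-processed files is still needed; the random pick only makes collisions much less likely than everyone working through the list in filename order.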

 

I've also added code so that if a file's already been found to be processed, it's automatically deleted from the computer. If it was deleted because you processed it, it'll still create the skip file so that you can know which files you processed if you want to report them.

 

I also added the specific filename that you're processing into the console message now:

 

uO22DLm.png

 


I've updated the program again, and made the following changes (OP link updated, please re-download!):

 

[*]Snazzy new icon based on the Spelunker badge!

[*]Rudimentary Version Check - meaning, I've got a database table that just stores the version number, and before starting each new file, the program will check to make sure that the version number is the same as the number that I have recorded in the database. This way, when bug fixes need to go out, I can force the programs to stop working, but it won't automatically update without you doing it!

[*]Deleted all of the old code that was referencing DataTables - this code was used back when I thought all of this data could fit into memory and then be pushed all at once to the database. Silly me!
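
A version check of the kind described above (compare the program's version against the one stored in the database before starting each new file) could be sketched in Python as follows. The version string, the function names, and the way the required version is fetched are all assumptions; here the database lookup is faked with a plain callable.

```python
PARSER_VERSION = "1.2.0"  # hypothetical version baked into this build


def run_parser(files, fetch_required_version, process_file,
               local_version=PARSER_VERSION):
    """Process files one by one, re-checking the required version each time.

    Stops (without auto-updating) as soon as the versions differ.
    """
    processed = []
    for name in files:
        if fetch_required_version() != local_version:
            print(f"Version {local_version} is out of date; "
                  "please re-download the Parser.")
            break
        process_file(name)
        processed.append(name)
    return processed


# Fake "database" holding the required version.
required = {"version": "1.2.0"}


def fake_process(name):
    # Simulate a bug-fix release being published while b.warc is being parsed.
    if name == "b.warc":
        required["version"] = "1.2.1"


done = run_parser(
    ["a.warc", "b.warc", "c.warc"],
    lambda: required["version"],
    fake_process,
)
```

Checking before each file rather than only at startup means a long-running instance stops picking up new work shortly after a fix goes out, while the file it is already working on still finishes.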

 

I plan to release the source code (minus the connectionString, for what should be obvious reasons) shortly, so that y'all can be more confident that the program isn't doing anything shiesty.

 

Then, I plan on polishing up the code for the front end viewer as well, and adding a screen to the viewer that shows the current progress of the project and the contributors, so that you know how far along the project is and who's been helping. Call it a "Hall of Fame". I intend to rank the listing by number of files processed.


Looks like I have a project when I get home from work. :)


** Asus Crosshair VI Hero, Ryzen 1800x, TridentZ 64GB DDR4 @ 3000, GTX 1080 ti, 48" 4K Samsung 3d & 56" 4k UHD, Storage:  WD-Blue 4TB, Intel M.2 NVME 500 GB On MOBO, StarTech PCIE 3.0 x4, 1x M.2 NVMe 1TB 2X M.2 SATA III [Crucial MX 500gb], [samsung Evo 850], Optical Asus BluRay Quad Layer 16x **


Here's what I'm working on right now:

 

  • A modification to the parser, so that it fetches the global list of all files from my server, then fetches the list of all files that have been already processed on the server. Then, on the loop between files being processed, I'll check those values again, and automatically delete/remove any local files in the working directory that have already been processed. Should save some time and confusion.
  • Working on cleaning up the Viewer interface a bit, adding a Help menu option and a Contributors window that'll hold all the data about who contributed what, and running totals.
  • Re-downloading the archive myself to get a fresh unaltered copy to populate the "all files" list, since I'd already deleted the ones that I'd already processed! It'll be interesting to see which downloads first, the HTTP request or the torrent, as they're both going EXCRUCIATINGLY SLOWLY:
    tbv0pAo.png
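
The first item in the list above (fetch the list of all files and the list of already-processed files, then clean up the working directory between files) boils down to a set difference plus a delete pass. Here is a minimal Python sketch of that idea; the function names and placeholder filenames are hypothetical, and the two lists are hard-coded instead of being fetched from a server.

```python
import tempfile
from pathlib import Path


def remaining_files(all_files, processed_files):
    """Files on the global list that nobody has processed yet."""
    return sorted(set(all_files) - set(processed_files))


def delete_processed_local_files(work_dir, processed_files):
    """Delete local copies of files that are already done; return their names."""
    processed = set(processed_files)
    deleted = []
    for path in sorted(Path(work_dir).iterdir()):
        if path.is_file() and path.name in processed:
            path.unlink()
            deleted.append(path.name)
    return deleted


# Stand-ins for the two lists the parser would fetch from the server.
all_files = ["file-a", "file-b", "file-c"]
processed_on_server = ["file-b"]

# Use a temporary directory so the demo never touches real archive files.
with tempfile.TemporaryDirectory() as work_dir:
    for name in all_files:
        (Path(work_dir) / name).write_text("placeholder")
    deleted = delete_processed_local_files(work_dir, processed_on_server)
    left_on_disk = sorted(p.name for p in Path(work_dir).iterdir())

todo = remaining_files(all_files, processed_on_server)
```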
     


Using the torrent myself - I've been downloading for 9h, 6.5hrs left to go.

Surprised that I'm not seeing any uploads from my partial files, never connected to more than the seed either that I've noticed.

 

edit: I see some errors from the trackers so it's actually not using BT at all to do the download.

I see this appears to be a common archive.org problem e.g. https://archive.org/post/1092178/bit-trackers-not-working




My experience was similar.

 

And when it reached 99% it reported that it was stalled, although it also reported that 219G was dl'd, which was the size of the file. I think it was just a reporting issue and that it actually completed, as the archive folder reported the same size.

 

Fair warning, be sure that your unzip location has enough space. I started unzipping before I realized that my destination was not my data drive with adequate storage. XD



Whelp, it appears to be doing something. ^.^

 

'boards.cityofheroes.com-threads-range-25339-20120905-082202' done

Currently parsing 'boards.cityofheroes.com-threads-range-12304-20120905-142216' file.



I got lost somewhere in there. Is there not a way to specifically tell it what to process and therefore allow you to dictate who takes what chunk?


OG Server: Pinnacle || Current Server: Torchbearer  || Also found on Indomitable & Excelsior (when needed)

Getting Started  ||  Dredd's Guide to Loading City of Heroes  ||  Install and Troubleshooting Guide v2

Mids' Reborn  ||  HeroStats  ||  Vidiot Maps  ||  QoL, Mods, etc  ||  Heroica! (by @Shenanigunner)

Old Forums  ||  Titan Network  ||  The City Representative (3rd party Info site for all servers)


Torrent DL Started

 

Work folder created on my fastest drive. Bit big for me to make a ram drive :(

 

 



Holy shniekies, how are you processing files so fast? Did you just pick a bunch of small files and stick them in a folder by themselves?

 

Curious about your system specs now.

 

Caught in the act. :D

 

 

I figured I'd grab some smaller ones since the second one I parsed was something like 250k.


CHEATER CHEATER PUMPKIN EATER!

 

MUgEwc4.png

 

Heh, that's fine, it all goes to the cause. And we're in this for the long haul. I just wonder what it would take to get others to help, because really, the program takes VERY little resources, and runs in the background, and doesn't really interfere with anything (unless they come across an error).

 

Speaking of which, I'll get the patch up shortly. Also, with the Viewer now, you don't really have to report the files that you processed. Hell, I probably should do away with creating those WARCSKIP files too, since I'm also using the database check, it's kind of redundant...