Project Spelunker

WanderingAries · September 2, 2021

It said there were still some in the queue, and I hadn't read here yet before grabbing, but I hit this quickly before just shutting them all down. It looks like we were somehow grabbing the same files too, as I had the 117 sitting for a day, but it attempted to redownload it today.

There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-21273-20120904-160852.warc' ---> System.Exception: Failed to handle large WARC file ---> System.Exception: Failed to get filename from string 'http://badge-hunter.com/castle_says.php?r1=V%20pehfu%20lbh.&r2=...&r3=Lbh%20yvxr%20guvf.'. ---> System.ArgumentException: String cannot be of zero length.
Parameter name: oldValue
   at System.String.ReplaceInternal(String oldValue, String newValue)
   at System.String.Replace(String oldValue, String newValue)
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 555
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 666
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 241
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 261
   at WHBennett.WarcExtractionRecord..ctor(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 33
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 126
   --- End of inner exception stack trace ---

_NOPE_ · September 3, 2021

Yeah, just cool it for now. Those last four will probably be in and out of the input queue as I'm trying to troubleshoot the issues with them. Patience, sir!

In the meantime, trying to re-torrent the original archive from archive.org... has been stuck at 99.9% for like a week now 😞:

So I started trying to download from the HTTP link on that page, and I sure as hell hope that I don't lose connection any time soon, because I'm unsure if that's resumable:

😢

_NOPE_ · September 3, 2021

Kind of fun to watch it work live:

_NOPE_ · September 3, 2021

This is what happens when you want to make sure things didn't lock up, and you want to SEE a "large" file processing byte by byte. It REALLY slows down the process!

This is precisely why I DIDN'T have "byte by byte reporting" on the main app, and only save that for debugging "problem" files. If I hadn't, this process would have taken YEARS just to update the screen for everyone.
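The slowdown described above comes from refreshing the display once per byte. A common middle ground is to keep the progress readout but only refresh it every so many bytes. Here is a minimal Python sketch of that idea; it is not the C# code from the actual app, and `process_stream`, `on_progress`, and `report_every` are hypothetical names chosen for illustration.

```python
from typing import BinaryIO, Callable, Optional


def process_stream(
    stream: BinaryIO,
    on_progress: Callable[[int], None],
    report_every: int = 1024 * 1024,  # assumed default: one update per MiB
    chunk_size: int = 64 * 1024,
) -> int:
    """Read a stream in chunks, calling on_progress at most once per report_every bytes."""
    total = 0
    next_report = report_every
    last_reported: Optional[int] = None
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        # ... the real per-chunk work would go here ...
        if total >= next_report:
            on_progress(total)
            last_reported = total
            # Jump to the next threshold so a big chunk triggers only one update.
            next_report = (total // report_every + 1) * report_every
    if last_reported != total:
        on_progress(total)  # make sure the final byte count is shown once
    return total
```

With `report_every` set to 1 this degrades to the byte-by-byte behavior, which makes it easy to switch between "debug a problem file" and "normal run" with a single setting.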

Edited September 3, 2021 by The Philotic Knight

_NOPE_ · September 3, 2021

Here's that final list, by the way, the "Final Four":

~~boards.cityofheroes.com-threads-range-18694-20120904-094534~~
boards.cityofheroes.com-threads-range-18341-20120905-142114
boards.cityofheroes.com-threads-range-21273-20120904-160852
boards.cityofheroes.com-threads-range-19175-20120905-235117

And, after some modification to my code, I was able to successfully parse boards.cityofheroes.com-threads-range-18694-20120904-094534. We'll have to wait and see about the rest. That one took a few hours to parse, and it's only a 20MB file. I blame the additional visual display of progress for that ridiculous processing time, because even files that were gigabytes in size didn't take that long!

Working on boards.cityofheroes.com-threads-range-19175-20120905-235117 right now.

Edited September 3, 2021 by The Philotic Knight

_NOPE_ · September 3, 2021

Whatever changes I made appear to be working now, because 117 just processed with no issues, and left me with a working copy of Vanden's animated gif avatar:

Vanden · September 3, 2021

Neat

_NOPE_ · September 3, 2021

8 minutes ago, Vanden said:

Neat

Fun fact - I use the largest image file inside of the zip file as the "test" to see if the parsing was a success. Because if ANY file corruption is going to happen, it's going to happen there. And your avatar just happened to be the largest image in that file.
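For anyone who wants to automate that kind of spot check, here is a rough Python sketch. It assumes the output is an ordinary zip file and that "image" means one of a handful of extensions; `largest_image` and `IMAGE_EXTENSIONS` are made-up names, and this is not how the actual processor does it. Reading the entry through `zipfile` also verifies its CRC-32, so a corrupted entry raises `zipfile.BadZipFile`.

```python
import zipfile

# Assumed set of extensions that count as "images" for this check.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".bmp")


def largest_image(zip_source):
    """Return (name, bytes) of the largest image entry in a zip, or None if there is none.

    zip_source may be a path or a file-like object. Reading the entry checks its
    CRC-32, so zipfile.BadZipFile is raised if the stored data is corrupt.
    """
    with zipfile.ZipFile(zip_source) as zf:
        images = [
            info
            for info in zf.infolist()
            if not info.is_dir() and info.filename.lower().endswith(IMAGE_EXTENSIONS)
        ]
        if not images:
            return None
        biggest = max(images, key=lambda info: info.file_size)
        return biggest.filename, zf.read(biggest)
```

A CRC check only proves the bytes match what was zipped; opening the image in a viewer is still the real test that the parse itself produced a valid file.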

WanderingAries · September 4, 2021

Yup, those were the problem children alright.

WanderingAries · September 5, 2021

On 9/3/2021 at 11:38 AM, The Philotic Knight said:

Here's that final list, by the way, the "Final Four":

~~boards.cityofheroes.com-threads-range-18694-20120904-094534~~

boards.cityofheroes.com-threads-range-18341-20120905-142114

boards.cityofheroes.com-threads-range-21273-20120904-160852

boards.cityofheroes.com-threads-range-19175-20120905-235117

And, after some modification to my code, I was able to successfully parse boards.cityofheroes.com-threads-range-18694-20120904-094534. We'll have to wait and see about the rest. That one took a few hours to parse, and it's only a 20MB file. I blame the additional visual display of progress for that ridiculous processing time, because even files that were gigabytes in size didn't take that long!

Working on boards.cityofheroes.com-threads-range-19175-20120905-235117 right now.

So...."Are we there yet?" :p

_NOPE_ · September 5, 2021

54 minutes ago, WanderingAries said:

So...."Are we there yet?" 😛

Down to two files, and this is one of them:

_NOPE_ · September 5, 2021

Down to the last file with about 60% left to go, when I ran across THIS asshole:

On 9/2/2021 at 2:37 PM, WanderingAries said:

It said there were still some in the queue, and I hadn't read here yet before grabbing, but I hit this quickly before just shutting them all down. It looks like we were somehow grabbing the same files too, as I had the 117 sitting for a day, but it attempted to redownload it today.


There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-21273-20120904-160852.warc' ---> System.Exception: Failed to handle large WARC file ---> System.Exception: Failed to get filename from string 'http://badge-hunter.com/castle_says.php?r1=V%20pehfu%20lbh.&r2=...&r3=Lbh%20yvxr%20guvf.'. ---> System.ArgumentException: String cannot be of zero length.
Parameter name: oldValue
   at System.String.ReplaceInternal(String oldValue, String newValue)
   at System.String.Replace(String oldValue, String newValue)
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 555
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 666
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 241
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 261
   at WHBennett.WarcExtractionRecord..ctor(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 33
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 126
   --- End of inner exception stack trace ---

And even though I'd put a breakpoint right on the line that failed, for some reason Visual Studio would NOT let me unwind that fucker. So, I made the modifications that I "THINK" I'd need to make to turn that URL into a proper filename without fail, and we're starting over with that file from scratch **sigh**. 😢
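For reference, the innermost exception in that trace ("String cannot be of zero length. Parameter name: oldValue") is what .NET's `String.Replace` throws when the string to search for is empty. One way to sidestep that whole class of failure is to stop replacing specific substrings and instead replace every character that is not on an allow-list. Below is a minimal Python sketch of that approach; `sanitize_file_name` is a hypothetical stand-in, not the actual `GetSanitizedFileName` method, and the allowed character set, the `index` fallback, and the length cap are all assumptions.

```python
import re
from urllib.parse import unquote, urlsplit

# Anything outside this allow-list is collapsed into a single underscore.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize_file_name(url: str, max_length: int = 150) -> str:
    """Turn a URL into a filesystem-friendly name without substring replacement."""
    parts = urlsplit(url)
    # Last path segment, e.g. "castle_says.php"; empty for "http://host/".
    name = unquote(parts.path.rsplit("/", 1)[-1])
    if parts.query:
        # Keep the query so different parameter sets map to different names.
        name += "_" + unquote(parts.query)
    name = _UNSAFE.sub("_", name).strip("._")
    if not name:
        name = "index"  # assumed fallback for URLs with no usable name
    return name[:max_length]


if __name__ == "__main__":
    print(sanitize_file_name(
        "http://badge-hunter.com/castle_says.php"
        "?r1=V%20pehfu%20lbh.&r2=...&r3=Lbh%20yvxr%20guvf."
    ))
```

Because the regular expression never takes a caller-supplied search string, there is no "empty oldValue" case to guard against, and an empty result is handled explicitly by the fallback name.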

But still... final file, y'all. Then I can unzip them all into their individual sub-folders, then re-zip them into a single 7z archive at the maximum compression level, and release Phase 1 (The Imperfect Collection™) into the wild, while I work on Phase 2, which will be the following:

1. Consolidate and condense the files to remove duplicates (this one I think I have to do on my own), then
2. Go through EVERY single former PHP/now HTML file and replace all boards.cityofheroes.com or boards.cityofvillains.com URL references with file references that have been standardized through my "GetSanitizedFileName" method, and also point those references to the relative local path of that file (if I can find it on the PC after step 1 above!), to make all internal files link to each other successfully.

Now, that second step I really strongly believe I'll have to crowdsource, just due to the sheer massive VOLUME of HTML files there are. So I've been thinking out how I'm going to do that, and I think I know how:

1. Make a "Phase 2" folder, and copy the Phase 1 folder over to the Phase 2 folder, maintaining the folder structure.
2. Make a "mini-app" that scans the Phase 2 directory for all HTML files and creates a Dictionary file - basically a CSV file with only two "columns", containing the filename as column 1 and the full "relative path" as column 2. Hell, might as well make this dictionary for ALL files, rather than just HTML files.
3. COPY all of the HTML files over into a single new "input" directory for the Phase 2 processor to work with.
4. Modify the former WARC processor to be a Phase 2 processor that behaves in a similar way to how the Phase 1 processor was behaving (downloading, uploading, etc.), except instead of parsing WARCs, it performs the following tasks:
   - Download a copy of my "dictionary" file (created above), and load it into memory
   - Download each individual HTML file, one at a time, from the input directory, and move the file to a processing directory
   - Using the HTML Agility Pack, find all of those file references, one at a time, in the downloaded file
   - Nab JUST the "filename" portion of that reference, and standardize it using that same method I used to process the original files in Phase 1
   - Look up that filename in the dictionary in column 1, and if I can find it, swap out the reference in that link with the relative path that I can find in column 2
   - Save the file at the end
5. Once all references are replaced in the HTML file, the original HTML file on the server is moved into the "processed" directory, while the newly "fixed" file is slotted over into its twin's location back in the "Phase 2" folder in its original relative path (again, using the dictionary to find its relative location)
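To make the dictionary and link-swapping steps above a bit more concrete, here is a rough Python sketch of both. It is only an illustration under several assumptions: it uses a regular expression on `href`/`src` attributes instead of the HTML Agility Pack (which is a .NET library), the `normalize` argument stands in for the "GetSanitizedFileName" method, and `build_dictionary` and `rewrite_links` are made-up names.

```python
import csv
import os
import posixpath
import re
from urllib.parse import urlsplit

_OLD_HOSTS = ("boards.cityofheroes.com", "boards.cityofvillains.com")
_LINK = re.compile(
    r"""(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["'])(?P<url>.*?)(?P=q)""",
    re.IGNORECASE,
)


def build_dictionary(root, csv_path):
    """Write 'filename,relative path' rows for every file under root; return the mapping."""
    mapping = {}
    for folder, dirs, files in os.walk(root):
        dirs.sort()  # deterministic walk order
        for name in sorted(files):
            rel = os.path.relpath(os.path.join(folder, name), root)
            # Assumption: the first path found wins when a filename repeats.
            mapping.setdefault(name, rel.replace(os.sep, "/"))
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(sorted(mapping.items()))
    return mapping


def rewrite_links(html, mapping, page_rel_path, normalize):
    """Swap old-forum URLs for local paths relative to the page being rewritten."""
    page_dir = posixpath.dirname(page_rel_path) or "."

    def swap(match):
        url = match.group("url")
        if urlsplit(url).netloc.lower() not in _OLD_HOSTS:
            return match.group(0)  # not an old forum link: leave untouched
        target = mapping.get(normalize(url))
        if target is None:
            return match.group(0)  # file not in the dictionary: leave untouched
        local = posixpath.relpath(target, page_dir)
        return f"{match.group('attr')}{match.group('q')}{local}{match.group('q')}"

    return _LINK.sub(swap, html)
```

Links that cannot be resolved are deliberately left alone rather than blanked, so a later pass (or a better dictionary) can still pick them up.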

I think this Phase 2 process will actually end up going MUCH faster than Phase 1 ONCE IT GETS STARTED, but it will be much slower to get started, because of all of the modifications to the original engine I'll have to make, and all of the testing/troubleshooting I'll have to do to get the process "just right".

Luckily, I'll keep the original Phase 1 files completely separate from this whole thing to prevent any corruption from my testing in this next process.

Once Phase 2 completes, then it's just a simple matter of making a final "mini-app" that scans the Phase 2 folder and creates a final web index site map file to pass to Google. Then pass it to Google and wait for them to properly index it. The cool thing about this is that if this works the way that I think it'll work, I can bundle this site map along with all of the other Phase 2 files in a second 7z file, release it to the world, and then anyone can host their own local copy of the old CoH forums if they wish!
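As a rough idea of what that site map "mini-app" could look like, here is a short Python sketch. The sitemaps.org protocol caps a single sitemap file at 50,000 URLs, so with millions of files the output has to be split across many files (normally tied together with a sitemap index file, which is not shown here). `build_sitemaps` and the example base URL are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol


def build_sitemaps(relative_paths, base_url, max_urls=MAX_URLS_PER_FILE):
    """Return a list of sitemap XML strings, each holding at most max_urls URLs."""
    base = base_url.rstrip("/") + "/"
    paths = list(relative_paths)
    documents = []
    for start in range(0, len(paths), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for rel in paths[start:start + max_urls]:
            url = ET.SubElement(urlset, "url")
            # Percent-encode spaces and other unsafe characters, keeping "/".
            ET.SubElement(url, "loc").text = base + quote(rel)
        documents.append(ET.tostring(urlset, encoding="unicode"))
    return documents


if __name__ == "__main__":
    # Hypothetical host; the real one depends on where the archive ends up.
    for doc in build_sitemaps(["threads/one.html", "threads/two.html"], "https://example.org/forums"):
        print(doc)
```

The relative paths can come straight from column 2 of the dictionary file, which is one reason to build that dictionary for every file and not only the HTML ones.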

Hell of a road to get there though, and part of me thinks "Was it really worth all this? Or am I Don Quixote tilting at windmills?" Then I remember that even doing this work will help me be a better programmer from sheer practice and figuring out/solving the problem, so the answer is YES, even if only for my own selfish purposes. 🙂

_NOPE_ · September 5, 2021

Latest update on that final problem child:

Now, remember, this is a holiday three day weekend, so I most likely won't be able to stay on top of this. It may have to wait until Tuesday for the conclusion of Phase 1, as I do have a family that I must interact with.

WanderingAries · September 5, 2021

I didn't mean to do it, I swear! 😛

_NOPE_ · September 6, 2021

Final file - processed. Now I'm starting to unzip them all. I'll see you sometime tomorrow!

And that's not counting re-zipping them into a 7z file with maximum compression!

Paulus · September 6, 2021

2 hours ago, The Philotic Knight said:

Final file - processed. Now I'm starting to unzip them all. I'll see you sometime tomorrow!

And that's not counting re-zipping them into a 7z file with maximum compression!

You do amazing work, I’m thoroughly impressed by your knowledge and creativity in getting this thing done.

Can’t wait for part two, my CPU cores are ready.

_NOPE_ · September 6, 2021

Status update before I take my family out for family day:

As you can see, the 7Zip process has been running over 90 minutes so far, and it hasn't even STARTED compressing yet. It's still at the stage where it's indexing the existing files before compression. It's up to thread range 19480 as you can see above in an hour and a half of indexing. Here's my "dir *.*" of the folder, so you can see what the ranges go up to:

So.... YEAH..... it's about maybe 65% done just INDEXING the files, and has already found over 5 MILLION files. Kind of puts into perspective why this has been such a process. I'm going to leave it running all day. MAYBE by the time I get home tonight, it might have started to actually compress the files into an archive.... we'll see...

It's times like this that I wish I had money for a supercomputer...

_NOPE_ · September 6, 2021

Evening update:

Looks like 7Zip estimates that with "Ultra" compression, it'll be able to compress the files down to about 4% of their original size. So the final file (if 7Zip is correct in its estimations) should be less than 30GB, for the whole forums (including duplicate files!). This process apparently will take another 63 hours, so I'll see you in three days! 😛

WanderingAries · September 6, 2021

I'd assume the dupes are really just the avatars? How does it know things are duped for sure?

_NOPE_ · September 7, 2021

3 minutes ago, WanderingAries said:

I'd assume the dupes are really just the avatars? How does it know things are duped for sure?

What are you calling "it"? 7Zip? It has no clue, it's doing nothing with "dupes".

I haven't written the program yet to "trim the fat". The first release, the Phase 1 release will be ALL of the files that were parsed, dupes and all. That's what's zipping now. The "pure" results from the parsing project before I start monkeying with it and trying to "fix" things.

Then I'm going to write a "mini-app" to do two things at once: Remove the dupes by filename, while at the same time build the dictionary file for use during the rest of Phase 2. This dictionary will also be used for the eventual Site Map.
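On the "how does it know things are duped for sure" question: matching by filename alone can't tell two different avatars that happen to share a name apart. One common safeguard, shown here purely as an assumption and not as the planned mini-app, is to treat files as duplicates only when they share a filename AND have identical content hashes. `file_digest` and `find_duplicates` are hypothetical names in this Python sketch.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return {kept_path: [redundant_paths]} for files with the same name and content."""
    by_name = defaultdict(list)
    for folder, dirs, files in os.walk(root):
        dirs.sort()  # deterministic order, so the "kept" copy is predictable
        for name in sorted(files):
            by_name[name].append(os.path.join(folder, name))

    duplicates = {}
    for paths in by_name.values():
        if len(paths) < 2:
            continue  # unique filename: nothing to compare, no hashing needed
        first_with_digest = {}
        for path in paths:
            digest = file_digest(path)
            if digest in first_with_digest:
                duplicates.setdefault(first_with_digest[digest], []).append(path)
            else:
                first_with_digest[digest] = path
    return duplicates
```

Only files whose names collide get hashed, which matters when there are millions of files; same-named files with different content are both kept.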

WanderingAries · September 7, 2021

1 hour ago, The Philotic Knight said:

What are you calling "it"? 7Zip? It has no clue, it's doing nothing with "dupes".

I haven't written the program yet to "trim the fat". The first release, the Phase 1 release will be ALL of the files that were parsed, dupes and all. That's what's zipping now. The "pure" results from the parsing project before I start monkeying with it and trying to "fix" things.

Then I'm going to write a "mini-app" to do two things at once: Remove the dupes by filename, while at the same time build the dictionary file for use during the rest of Phase 2. This dictionary will also be used for the eventual Site Map.

Aside from how it compresses (saving space by identifying patterns), I didn't figure it was smart enough to do that, but I guess I misread what was stated. I can just imagine the fun it'll be to create something to keep the links intact for the duplicate files though. X.X

_NOPE_ · September 9, 2021

Project Status Update - The files are now compressed at maximum compression into a single zip file. I also zipped them into a "zip of zips", in case anyone wants to have something a BIT more manageable.

What I'm doing now is writing a "mini-app" to scan all of the directories and files, find all of the "Credit" files, tabulate the totals and create a "Credits" HTML file. This file will show the final credits sorted by both file size and quantity, as well as containing a total list of every file that was processed.

At the same time, my system is copying the items from the output folder over to a Phase 2 folder to begin Phase 2 without messing with the original output - just in case.

Thank goodness I bought that 10TB hard drive! Slow, but at least I have the space to do all this!

WanderingAries · September 10, 2021

10 hours ago, The Philotic Knight said:

Thank goodness I bought that 10TB hard drive! Slow, but at least I have the space to do all this!

That reminds me, I wonder when those 20-30 TB drives release?

SuperPlyx · September 10, 2021

13 minutes ago, WanderingAries said:

I wonder when those 20-30 TB drives release?

Get yourself a 100TB SSD from Nimbus, it only costs around $40,000

_NOPE_ · September 10, 2021

13 hours later....

Quote

Exception thrown: 'System.OutOfMemoryException' in System.Data.dll

...well... f***.

I was running it as "AnyCPU". Not sure what the default architecture is when you do that. I'd assumed that it'd be x64 and only fall back on x86 if it had no choice, but maybe that's wrong. I'm going to force the issue and try running it again with a forced x64 architecture.
