
Project Spelunker



 

It said there were still some in the queue, and I hadn't read here yet before grabbing, but I hit this quickly before just shutting them all down. It looks like we were somehow grabbing the same files too, as I'd had the 117 sitting for a day, but it attempted to redownload it today.

 

There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-21273-20120904-160852.warc' ---> System.Exception: Failed to handle large WARC file ---> System.Exception: Failed to get filename from string 'http://badge-hunter.com/castle_says.php?r1=V%20pehfu%20lbh.&r2=...&r3=Lbh%20yvxr%20guvf.'. ---> System.ArgumentException: String cannot be of zero length.
Parameter name: oldValue
   at System.String.ReplaceInternal(String oldValue, String newValue)
   at System.String.Replace(String oldValue, String newValue)
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 555
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 666
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 241
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 261
   at WHBennett.WarcExtractionRecord..ctor(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 33
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 126
   --- End of inner exception stack trace ---
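For what it's worth, the root cause is visible in the trace: GetSanitizedFileName hands a zero-length oldValue to String.Replace, which .NET rejects. A guard-first sketch of that kind of URL-to-filename sanitizer, in Python for illustration (the real tool is C#; the names here are hypothetical, not the actual GetSanitizedFileName code):

```python
import hashlib
import re
from urllib.parse import unquote, urlsplit

def sanitize_filename(url: str, max_len: int = 120) -> str:
    """Derive a safe local filename from a URL, never failing on odd
    inputs like the query-heavy castle_says.php URL that crashed above."""
    parts = urlsplit(url)
    # Keep the last path segment plus the query, so distinct
    # ?r1=...&r2=... requests map to distinct files.
    base = parts.path.rsplit("/", 1)[-1]
    if parts.query:
        base += "_" + parts.query
    base = unquote(base)
    # Collapse anything Windows disallows (or that reads badly) to "_".
    base = re.sub(r'[<>:"/\\|?*&=%\s]+', "_", base).strip("_")
    if not base:  # guard: never emit a zero-length name
        base = "file_" + hashlib.md5(url.encode()).hexdigest()[:12]
    return base[:max_len]
```

The key difference from the failing code path is the explicit empty-string fallback, so no replace/trim step ever sees a zero-length token.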

 

OG Server: Pinnacle  <||>  Current Primary Server: Torchbearer  ||  Also found on the others if desired  <||> Generally Inactive


Installing CoX:  Windows  ||  MacOS  ||  MacOS for M1  <||>  Migrating Data from an Older Installation


Clubs: Mid's Hero Designer  ||  PC Builders  ||  HC Wiki  ||  Jerk Hackers


Old Forums  <||>  Titan Network  <||>  Heroica! (by @Shenanigunner)

 


Yeah, just cool it for now. Those last four will probably be in and out of the input queue as I'm trying to troubleshoot the issues with them. Patience, sir!

 

In the meantime, I'm trying to re-torrent the original archive from archive.org... it has been stuck at 99.9% for like a week now 😞 :

image.thumb.png.7b870591b17ee339bafb53b2ba14d797.png

 

So I started trying to download from the HTTP link on that page, and I sure as hell hope that I don't lose connection any time soon, because I'm not sure if that download is resumable:

image.png.03421b760937a81c00048ed649a7b997.png
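For what it's worth, resumability depends on the server honoring HTTP Range requests, which archive.org generally does. A rough Python sketch of resuming from the bytes already on disk (illustrative only; the helper names are made up):

```python
import os
import urllib.request

def range_header(already_have: int) -> dict:
    """Header asking the server to resume at the given byte offset."""
    return {"Range": f"bytes={already_have}-"}

def resume_download(url: str, dest: str, chunk: int = 1 << 20) -> None:
    """Resume an HTTP download from wherever the local copy left off.
    Only actually resumes if the server answers 206 Partial Content."""
    have = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url, headers=range_header(have))
    with urllib.request.urlopen(req) as resp:
        # 206: server accepted the range, so append to what we have.
        # Plain 200: server restarted from byte zero, so rewrite.
        mode = "ab" if resp.status == 206 else "wb"
        with open(dest, mode) as f:
            while block := resp.read(chunk):
                f.write(block)
```

Tools like wget -c and curl -C - do exactly this under the hood, which is the safer bet than a bare browser download for a file this size.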

 

😢

I'm out.

This is what happens when you want to make sure things didn't lock up and you want to SEE a "large" file processing byte by byte. It REALLY slows down the process!

 

 

This is precisely why I DIDN'T have "byte by byte reporting" on the main app, and only save that for debugging "problem" files. If I had, this process would have taken YEARS just to update the screen for everyone.

Edited by The Philotic Knight
  • Like 1

Here's that final list, by the way, the "Final Four":

  • boards.cityofheroes.com-threads-range-18694-20120904-094534
  • boards.cityofheroes.com-threads-range-18341-20120905-142114
  • boards.cityofheroes.com-threads-range-21273-20120904-160852
  • boards.cityofheroes.com-threads-range-19175-20120905-235117

And, after some modification to my code, I was able to successfully parse boards.cityofheroes.com-threads-range-18694-20120904-094534. We'll have to wait and see about the rest. That one took a few hours to parse, and it's only a 20MB file. I blame the additional visual display of progress for that ridiculous processing time, because even files gigabytes in size didn't take that long!

 

Working on boards.cityofheroes.com-threads-range-19175-20120905-235117 right now.

Edited by The Philotic Knight
  • Like 1
  • Thumbs Up 2

8 minutes ago, Vanden said:

Neat

Fun fact - I use the largest image file inside of the zip file as the "test" to see if the parsing was a success. Because if ANY file corruption is going to happen, it's going to happen there. And your avatar just happened to be the largest image in that file.
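That largest-file canary is easy to sketch with Python's zipfile module (an illustrative sketch, not the actual C# test):

```python
import zipfile

def largest_member(zip_path: str) -> str:
    """Return the name of the biggest file in the archive -- the most
    likely place for corruption to show up, per the heuristic above."""
    with zipfile.ZipFile(zip_path) as zf:
        info = max(zf.infolist(), key=lambda i: i.file_size)
        # Fully reading the member forces a CRC check; a corrupt
        # archive raises BadZipFile / zlib.error right here.
        zf.read(info.filename)
        return info.filename
```

It's a decent spot check because big members span the most compressed blocks, though zipfile's testzip() would verify every member if you can afford the time.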

  • Thumbs Up 1

Yup, those were the problem children alright.


 


On 9/3/2021 at 11:38 AM, The Philotic Knight said:

Here's that final list, by the way, the "Final Four":

  • boards.cityofheroes.com-threads-range-18694-20120904-094534
  • boards.cityofheroes.com-threads-range-18341-20120905-142114
  • boards.cityofheroes.com-threads-range-21273-20120904-160852
  • boards.cityofheroes.com-threads-range-19175-20120905-235117

And, after some modification to my code, I was able to successfully parse boards.cityofheroes.com-threads-range-18694-20120904-094534. We'll have to wait and see about the rest. That one took a few hours to parse, and it's only a 20MB file. I blame the additional visual display of progress for that ridiculous processing time, because even files gigabytes in size didn't take that long!

 

Working on boards.cityofheroes.com-threads-range-19175-20120905-235117 right now.

 

So... "Are we there yet?" :p


 


Down to the last file with about 60% left to go, when I ran across THIS asshole:

 

On 9/2/2021 at 2:37 PM, WanderingAries said:

 

It said there were still some in the queue, and I hadn't read here yet before grabbing, but I hit this quickly before just shutting them all down. It looks like we were somehow grabbing the same files too, as I'd had the 117 sitting for a day, but it attempted to redownload it today.

 


There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-21273-20120904-160852.warc' ---> System.Exception: Failed to handle large WARC file ---> System.Exception: Failed to get filename from string 'http://badge-hunter.com/castle_says.php?r1=V%20pehfu%20lbh.&r2=...&r3=Lbh%20yvxr%20guvf.'. ---> System.ArgumentException: String cannot be of zero length.
Parameter name: oldValue
   at System.String.ReplaceInternal(String oldValue, String newValue)
   at System.String.Replace(String oldValue, String newValue)
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 555
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.GetSanitizedFileName(String input, String outputDir) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 666
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 241
   --- End of inner exception stack trace ---
   at WHBennett.WarcExtractionRecord.StreamWARCFile(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 261
   at WHBennett.WarcExtractionRecord..ctor(String sourceFileName, String outputDirectory, ToolStripStatusLabel& tssl) in C:\Projects\COH WARC Handler\WHB WARC Handler\WarcRecord.cs:line 33
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 126
   --- End of inner exception stack trace ---

 

 

 

And even though I'd put a breakpoint right on the line that failed, for some reason Visual Studio would NOT let me unwind that fucker. So, I made the modifications that I "THINK" I'd need to turn that URL into a proper filename without fail, and we're starting over with that file from scratch **sigh**. 😢

 

But still... final file, y'all. Then I can unzip them all into their individual sub-folders, then re-zip them into a single 7z archive at the maximum compression level, and release Phase 1 (The Imperfect Collection™) into the wild, while I work on Phase 2, which will be the following:

  • Consolidate and condense the files to remove duplicates (this one I think I have to do on my own), then
  • Go through EVERY single former-PHP/now-HTML file and replace all boards.cityofheroes.com or boards.cityofvillains.com URL references with file references that have been standardized through my "GetSanitizedFileName" method, and point those references to the relative local path of that file (if I can find it on the PC after step 1 above!), so that all internal files link to each other successfully.

Now, that second step I really strongly believe I'll have to crowdsource, just due to the sheer massive VOLUME of HTML files there are. So I've been thinking about how I'm going to do that, and I think I know how:

  • Make a "Phase 2" folder, and copy the Phase 1 folder over to the Phase 2 folder, maintaining the folder structure.
  • Make a "mini-app" that scans the Phase 2 directory for all HTML files and creates a Dictionary file: basically a CSV file with only two "columns", containing the filename as column 1 and the full "relative path" as column 2. Hell, might as well make this dictionary for ALL files, rather than just HTML files.
  • COPY all of the HTML files over into a single new "input" directory for the Phase 2 processor to work with.
  • Modify the former WARC processor into a Phase 2 processor that behaves much like the Phase 1 processor did (downloading, uploading, etc.), except instead of parsing WARCs, it performs the following tasks:
    • Download a copy of my "dictionary" file from step 1 above and load it into memory
    • Download each individual HTML file, one at a time, from the input directory, and move it to a processing directory
    • Using the HTML Agility Pack, find all of those file references, one at a time, in the downloaded file
    • Nab JUST the "filename" portion of each reference and standardize it using the same method I used to process the original files in Phase 1
    • Look up that filename in column 1 of the dictionary and, if found, swap out the reference in that link with the relative path from column 2
    • Save the file at the end
  • Once all references in the HTML file are replaced, the original HTML file on the server is moved into the "processed" directory, while the newly "fixed" file is slotted into its twin's location back in the "Phase 2" folder at its original relative path (again, using the dictionary to find that location)
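The dictionary lookup and reference swap in the steps above might be sketched like this (Python for illustration; the real processor is C# with the HTML Agility Pack, and the regex and helper names here are assumptions, not the actual code):

```python
import csv
import re

def load_dictionary(csv_path: str) -> dict:
    """Column 1: sanitized filename, column 2: relative path."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}

# Any URL pointing at either old forum domain.
FORUM_LINK = re.compile(
    r'https?://boards\.cityof(?:heroes|villains)\.com/[^"\'\s>]+')

def rewrite_links(html: str, lookup: dict, sanitize) -> str:
    """Swap each forum URL for the relative local path of the file it
    maps to, leaving it untouched when the dictionary has no entry."""
    def swap(match):
        name = sanitize(match.group(0))  # same standardizing step as Phase 1
        return lookup.get(name, match.group(0))
    return FORUM_LINK.sub(swap, html)
```

A real HTML parser (like the Agility Pack in the C# version) is more robust than a regex for attribute values, but the lookup-and-swap structure is the same either way.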

I think this Phase 2 process will actually end up going MUCH faster than Phase 1 ONCE IT GETS STARTED, but it will be much slower to get started, because of all of the modifications to the original engine I'll have to make, and all of the testing/troubleshooting I'll have to do to get the process "just right".

 

Luckily, I'll keep the original Phase 1 files completely separate from this whole thing to prevent any corruption from my testing in this next process.

 

Once Phase 2 completes, then it's just a simple matter of making a final "mini-app" that scans the Phase 2 folder and creates a final web index site map file to pass to Google. Then pass it to Google and wait for them to properly index it. The cool thing about this is that if this works the way that I think it'll work, I can bundle this site map along with all of the other Phase 2 files in a second 7z file, release it to the world, and then anyone can host their own local copy of the old CoH forums if they wish!
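That sitemap mini-app is essentially a directory walk emitting the sitemaps.org XML format; a rough sketch (Python for illustration; the base URL and function name are hypothetical):

```python
import os
from xml.sax.saxutils import escape

def build_sitemap(root_dir: str, base_url: str) -> str:
    """Walk the Phase 2 folder and emit a minimal sitemap.xml body
    listing every HTML file, per the sitemaps.org protocol."""
    urls = []
    for dirpath, _, files in os.walk(root_dir):
        for name in sorted(files):
            if not name.lower().endswith((".html", ".htm")):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root_dir)
            loc = base_url.rstrip("/") + "/" + rel.replace(os.sep, "/")
            urls.append(f"  <url><loc>{escape(loc)}</loc></url>")
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(urls) + "\n</urlset>")
```

One caveat: the sitemap protocol caps each file at 50,000 URLs, so millions of pages would need a sitemap index file splitting the list across multiple sitemaps.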
 

Hell of a road to get there though, and part of me thinks, "Was it really worth all this? Or am I Don Quixote, tilting at windmills?" Then I remember that even doing this work will help me be a better programmer from sheer practice and figuring out/solving the problem, so the answer is YES, even if only for my own selfish purposes. 🙂

 

  • Like 1

Latest update on that final problem child:

image.png.020f9c8cbd118f2c83ffc5b9a9d04581.png

 

Now, remember, this is a three-day holiday weekend, so I most likely won't be able to be on top of this; it may have to wait until Tuesday for the conclusion of Phase 1, as I do have a family that I must interact with.


I didn't mean to do it, I swear! 😛

990180012_ScreenShot2021-09-05at2_02_12PM.png.b556f908094e33117bbc104ea0e66a42.png

 

  • Haha 2


 


2 hours ago, The Philotic Knight said:

Final file - processed. Now I'm starting to unzip them all. I'll see you sometime tomorrow!

 

image.png.23c8928895bb6a9415c715aa49d5d3ce.png

 

And that's not counting re-zipping them into a 7z file with maximum compression!


You do amazing work, I’m thoroughly impressed by your knowledge and creativity in getting this thing done.

 

Can’t wait for part two, my CPU cores are ready.

  • Thanks 1

Status update before I take my family out for family day:

image.png.16040c8b1a828c82f81f0df16e2e8ca6.png

  

As you can see, the 7Zip process has been running for over 90 minutes so far, and it hasn't even STARTED compressing yet. It's still at the stage where it's indexing the existing files before compression; it's up to thread range 19480, as you can see above, after an hour and a half of indexing. Here's my "dir *.*" of the folder, so you can see what the ranges go up to:

image.png.86c3acbf197f02dcedc4fcdc4c274bf8.png

 

So.... YEAH..... it's maybe about 65% done just INDEXING the files, and has already found over 5 MILLION files. Kind of shows the perspective of why this has been such a process. I'm going to leave it running all day. MAYBE by the time I get home tonight, it might have started to actually compress the files into an archive.... we'll see...

 

It's times like this that I wish I had money for a supercomputer...


Evening update:

 

image.png.b8dadd6b725ecdbb26c3da4fc6b30667.png

 

Looks like 7Zip estimates that with "Ultra" compression, it'll be able to compress the files down to about 4% of their original size. So the final file (if 7Zip is correct in its estimate) should be less than 30GB for the whole forums (including duplicate files!). This process will apparently take another 63 hours, so I'll see you in three days! 😛
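Working backwards from those numbers: if under 30GB is about 4% of the input, the uncompressed tree is on the order of 750GB. A quick sanity check:

```python
# 7-Zip's estimate from the post: output is roughly 4% of input,
# with the final archive coming in under 30 GB.
compressed_gb = 30
ratio = 0.04
original_gb = compressed_gb / ratio  # implied size of the raw tree
print(f"~{original_gb:.0f} GB uncompressed")
```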

  • Thumbs Up 2

I'd assume the dupes are really just the avatars? How does it know things are duped for sure?


 


3 minutes ago, WanderingAries said:

I'd assume the dupes are really just the avatars? How does it know things are duped for sure?

What are you calling "it"? 7Zip? It has no clue; it's doing nothing with "dupes".

 

I haven't written the program yet to "trim the fat". The first release, the Phase 1 release, will be ALL of the files that were parsed, dupes and all. That's what's zipping now: the "pure" results from the parsing project, before I start monkeying with it and trying to "fix" things.

 

Then I'm going to write a "mini-app" to do two things at once: remove the dupes by filename, while at the same time building the dictionary file for use during the rest of Phase 2. This dictionary will also be used for the eventual Site Map.

  • Like 1

1 hour ago, The Philotic Knight said:

What are you calling "it"? 7Zip? It has no clue; it's doing nothing with "dupes".

 

I haven't written the program yet to "trim the fat". The first release, the Phase 1 release, will be ALL of the files that were parsed, dupes and all. That's what's zipping now: the "pure" results from the parsing project, before I start monkeying with it and trying to "fix" things.

 

Then I'm going to write a "mini-app" to do two things at once: Remove the dupes by filename, while at the same time build the dictionary file for use during the rest of Phase 2. This dictionary will also be used for the eventual Site Map.

 

Aside from how it compresses (saving space by identifying patterns), I didn't figure it was smart enough to do that, but I guess I misread what was stated. I can just imagine the fun it'll be to create something to keep the links intact for the duplicate files, though. X.X


 


Project Status Update - The files are now compressed at maximum compression into a single zip file. I also zipped them into a "zip of zips", in case anyone wants something a BIT more manageable.

 

What I'm doing now is writing a "mini-app" to scan all of the directories and files, find all of the "Credit" files, tabulate the totals, and create a "Credits" HTML file. This file will show the final credits sorted by both file size and quantity, as well as contain a total list of every file that was processed.
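Since the actual layout of those "Credit" files isn't shown in the thread, this is only a guess at the tabulator, assuming (hypothetically) each credit file is a CSV of contributor,bytes lines:

```python
import csv
import os
from collections import Counter

def tabulate_credits(root_dir: str):
    """Find every file with 'credit' in its name and total per
    contributor: file counts and byte counts. The one-line-per-
    contribution CSV layout here is an assumption, not the real
    format used by the project."""
    files_by = Counter()
    bytes_by = Counter()
    for dirpath, _, files in os.walk(root_dir):
        for name in files:
            if "credit" not in name.lower():
                continue
            with open(os.path.join(dirpath, name), newline="") as f:
                for who, size in csv.reader(f):
                    files_by[who] += 1
                    bytes_by[who] += int(size)
    return files_by, bytes_by
```

From the two Counters, sorting by value gives the two orderings (by quantity and by total size) the Credits page needs.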

 

At the same time, my system is copying the items from the output folder over to a Phase 2 folder to begin Phase 2 without messing with the original output - just in case.

 

Thank goodness I bought that 10TB hard drive! Slow, but at least I have the space to do all this!

  • Thanks 1
  • Thumbs Up 3

10 hours ago, The Philotic Knight said:

Thank goodness I bought that 10TB hard drive! Slow, but at least I have the space to do all this!

 

That reminds me, I wonder when those 20-30 TB drives release?


 


13 hours later....

 

Quote

Exception thrown: 'System.OutOfMemoryException' in System.Data.dll

 

...well... f***.

 

I was running it as "AnyCPU", and I'm not sure what the default architecture is when you do that. I'd assumed it'd be x64 and only fall back to x86 if it had no choice, but maybe that's wrong. I'm going to force the issue and try running it again with a forced x64 architecture.
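For what it's worth, on .NET Framework an AnyCPU executable often runs 32-bit anyway, because newer project templates default to "Prefer 32-bit" (C# exposes Environment.Is64BitProcess to check at runtime). A quick Python equivalent for checking what bitness a process actually got:

```python
import struct
import sys

# A 32-bit process caps out at a few GB of address space, which is
# where large in-memory loads tend to die with OutOfMemoryException.
bits = 8 * struct.calcsize("P")  # pointer size -> 32 or 64
print(f"running as {bits}-bit")
assert (sys.maxsize > 2**32) == (bits == 64)
```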

