Project Spelunker


_NOPE_

8 minutes ago, The Philotic Knight said:

George Takei Oh My GIF - GeorgeTakei OhMy Wink - Discover & Share GIFs

 

Keep it up and I'm tellin' Uncle George!

 

Anyway, you may wanna see if the last 5 are stuck in the queue. Also, the total =/= reported instances remaining. At this point, you can't start a new instance, as it'll basically tell you nothing's left.

OG Server: Pinnacle  <||>  Current Primary Server: Torchbearer  ||  Also found on the others if desired  <||> Generally Inactive


Installing CoX:  Windows  ||  MacOS  ||  MacOS for M1  <||>  Migrating Data from an Older Installation


Clubs: Mid's Hero Designer  ||  PC Builders  ||  HC Wiki  ||  Jerk Hackers


Old Forums  <||>  Titan Network  <||>  Heroica! (by @Shenanigunner)

 


Figured, Next!


I just tried to run one in 8.1 and one in 10, but both quickly gave this error:

 

There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-26219-20120906-072319.warc' ---> System.ApplicationException: Target file /processed/boards.cityofheroes.com-threads-range-26219-20120906-072319.warc already exists
   at COH_WARC_Processor.FTPclient.FtpRename(String sourceFilename, String newName) in C:\Projects\COH WARC Handler\WHB WARC Processor\FTP Client.cs:line 580
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 140
   --- End of inner exception stack trace ---

 

Trying just one on Win10 now

Edited by WanderingAries


2 minutes ago, WanderingAries said:

I just tried to run one in 8.1 and one in 10, but both quickly gave this error:

 


There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-26219-20120906-072319.warc' ---> System.ApplicationException: Target file /processed/boards.cityofheroes.com-threads-range-26219-20120906-072319.warc already exists
   at COH_WARC_Processor.FTPclient.FtpRename(String sourceFilename, String newName) in C:\Projects\COH WARC Handler\WHB WARC Processor\FTP Client.cs:line 580
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 140
   --- End of inner exception stack trace ---

 

Go ahead and try again. There were duplicates in both directories, not sure how that happened. But should be fixed now. May happen with the other files.

I'm out.

Current error

 

There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-14503-20120911-094551.warc' ---> System.Exception: Failed to move file to process from '/input/boards.cityofheroes.com-threads-range-14503-20120911-094551.warc' to '/processing/boards.cityofheroes.com-threads-range-14503-20120911-094551.warc'.
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 105
   --- End of inner exception stack trace ---

 

Now grabbing boards.cityofheroes.com-threads-range-14589-20120911-131326.warc

Edited by WanderingAries


1 minute ago, The Philotic Knight said:

Probably because it's already there... are you still running multiple instances???

 

There's a possibility it was run before clearing temp files as well...cleared them and ran with the above newer file


Oh wow, 1.6 GB file. That'll take a minute to process. 😛

 

God this is painful to watch inside a VM. 8 GB RAM allocated and nothing else running in the VM.

 

1474476244_ScreenShot2021-06-15at7_06_24PM.png.05566b6ea9deee1c5e965d45951b805a.png

Edited by WanderingAries


Came back to check on it and got this. Gonna shut down the VM and run it on the main machine in a bit if the file is accessible then.

 

There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-14589-20120911-131326.warc' ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at Ionic.Zlib.ZlibBaseStream.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\projects\dotnetzip-semverd\src\Zlib.Shared\ZlibBaseStream.cs:line 149
   at Ionic.Zlib.DeflateStream.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\projects\dotnetzip-semverd\src\Zlib.Shared\DeflateStream.cs:line 617
   at Ionic.Crc.CrcCalculatorStream.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\projects\dotnetzip-semverd\src\Zlib.Shared\CRC32.cs:line 710
   at Ionic.Zip.ZipEntry._WriteEntryData(Stream s) in C:\projects\dotnetzip-semverd\src\Zip.Shared\ZipEntry.Write.cs:line 1460
   at Ionic.Zip.ZipEntry.Write(Stream s) in C:\projects\dotnetzip-semverd\src\Zip.Shared\ZipEntry.Write.cs:line 2220
   at Ionic.Zip.ZipFile.Save() in C:\projects\dotnetzip-semverd\src\Zip.Shared\ZipFile.Save.cs:line 168
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 125
   --- End of inner exception stack trace ---

 

And nope, no file available it says, but it also says there are 6 pending with only 4 remaining according to the Max-Complete equation.

Edited by WanderingAries


5 minutes ago, The Philotic Knight said:

Let me just handle the final few, alright?

 

"But Moooooom" :p


Project Update: Thanks to everyone's help (mostly @WanderingAries I suspect, but thank you everyone!), all files have now gone through the first phase of the process, where the WARC data is extracted from the WARC files, and into the original files.

 

There were, however, a few exceptions that I will need to look at manually: files that ended up being completely empty on output (with the exception of the "credit" file that I'm generating for each zip):

image.png.578453e6d9b7e6dc7fd726424569aaca.png

 

Also, there may be SOME files that didn't get zipped properly, and I'll only know this after I go through the 12,710 files and try to extract them one at a time to my local PC. I'm going to automate this process on my end, and keep a log of all files that fail to extract, then investigate those.
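For anyone curious what that kind of automated check might look like, here is a minimal sketch. It is in Python rather than the C# the actual tool is written in, and it uses `zipfile.testzip()` (which reads every member and checks its CRC) instead of a full extraction to disk. The function name `find_bad_zips` and the log format are made up for illustration.

```python
import zipfile
from pathlib import Path


def find_bad_zips(folder, log_path):
    """Check every .zip under `folder`; log and return the ones that fail."""
    failures = []
    for zip_path in sorted(Path(folder).rglob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as archive:
                # testzip() reads each member and verifies its CRC.
                # It returns the first bad member name, or None if all pass.
                bad_member = archive.testzip()
            if bad_member is not None:
                failures.append((zip_path.name, f"bad CRC in {bad_member}"))
        except (zipfile.BadZipFile, OSError) as err:
            # Not a zip at all, truncated, or unreadable.
            failures.append((zip_path.name, str(err)))
    # One tab-separated line per failed archive (hypothetical log layout).
    with open(log_path, "w", encoding="utf-8") as log:
        for name, reason in failures:
            log.write(f"{name}\t{reason}\n")
    return failures
```

Anything that lands in the log would then be a candidate for manual investigation or reprocessing.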

 

I will let the community know when and if I need assistance with further phases of the project, but in the meantime, I'm probably going to "go dark" for a while as I work on this, as there won't be anything useful to report.

 

Thank you again to everyone for your assistance, that was a HUGE accomplishment, just getting those original files extracted. Now the next steps are to figure out what to DO with those files.

 

As I've said before, but I'll repeat it here, here's the rest of my plan, in a bit more detail:

  1. Validate the veracity of the data (what I mentioned above with the "exception" files) and correct any errors
  2. Determine the appropriate file structure - if these can all just be dumped into one folder, or if that'll cause errors just due to the massive number of files - they may need to each be placed in a separate folder to stand on their own separately, and then linked appropriately.
  3. Standardize all of the file names, then standardize URL references to those files to match the local filenames (this step may require more borrowed processing time once I know I have a working process!). This may require having an algorithm perform a "search" for the appropriately named file in the local system amongst many subdirectories (as mentioned in step 2 above).
  4. Submit all of this to Google to determine if they can index them
  5. If Google can't, then I'll have to create my own method/algorithm for searching these files.
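To make step 3 a bit more concrete, here is a rough Python sketch of the two pieces it describes: turning a URL into a standardized local filename, and "searching" for a file by name across many subdirectories. The naming scheme, the helper names `local_name_for` and `build_index`, and the 100-character cap are all assumptions for illustration, not the scheme the project actually uses. Note that this naive version does not try to preserve file extensions.

```python
import hashlib
import os
import re
from urllib.parse import urlsplit


def local_name_for(url, max_len=100):
    """Map a URL to a filesystem-safe name (hypothetical naming scheme)."""
    parts = urlsplit(url)
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "_" + parts.query
    # Collapse anything that isn't a letter, digit, dot, or dash into "_".
    safe = re.sub(r"[^A-Za-z0-9.\-]+", "_", raw).strip("_")
    # A short hash of the full URL keeps truncated names from colliding.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{safe[:max_len]}_{digest}"


def build_index(root):
    """Walk every subdirectory once and remember where each filename lives."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Keep the first location found if a name appears twice.
            index.setdefault(name, os.path.join(dirpath, name))
    return index
```

Building the index once and looking names up in the dictionary avoids re-walking the directory tree for every link that needs rewriting.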

 

So, please sit tight, this... could take a while. Plus, I've still got my day job that, you know, pays me money... so I gotta give them some time too. 😛

 

 

I'm out.

On 6/17/2021 at 8:55 AM, The Philotic Knight said:

I will let the community know when and if I need assistance with further phases of the project, but in the meantime, I'm probably going to "go dark" for a while as I work on this, as there won't be anything useful to report.

 

"Are we there yet?"


Alright, so those last ten files that failed were also the ten largest files, no surprise.

 

They failed because they are larger than 2 GB. "Ahhh", says a wizened and experienced programmer, "you must have just faced the 32-bit limitation! Just recompile and run the program in 64-bit!"

 

Normally, yes, you'd be right, Chuckles. But not this time. I've been reading the files byte by byte, and putting them into a byte array for processing.

 

And, sadly, a byte array, in fact an array of ANY object, is limited by the definition of an array itself, whose length is defined by... dun dun DUNNNNNNN..... the Int32 numerical type.

 

Which... tops out at 2,147,483,647, so a byte array can't hold more than about 2 GB.

 

https://docs.microsoft.com/en-us/dotnet/api/system.array.length?view=net-5.0
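The arithmetic behind that ceiling, as a quick sketch (Python here, purely to show the numbers; `fits_in_one_byte_array` is a made-up helper name, and the runtime's real per-array limit may sit slightly below the Int32 maximum):

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_one_byte_array(file_size_bytes):
    """True if a file this size could fit in a single Int32-indexed byte array."""
    return file_size_bytes <= INT32_MAX


print(INT32_MAX)             # 2147483647
print(INT32_MAX / 1024**3)   # just under 2.0, i.e. just under 2 GiB
```

So any file over roughly 2 GB cannot be read into one byte array, no matter how much RAM the machine has or whether the process is 64-bit.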

 

I have to think of another approach to handle these larger, probably more important, files.

 

And I think I have an idea, relating to using a constant stream to split the file into smaller chunks by using the WARC header marker as a delimiter.

 

But that will have to wait until Monday at least.

I'm out.

I've written my method today to attempt to extract all of the data from the ten largest WARC files, by doing the following:

  1. Opening the file via a StreamReader, and reading it line by line, rather than all at once.
  2. When I come across a line that doesn't start with "WARC/" (the new record identifier), I'm just adding the line to a StringBuilder that's accumulating the lines found thus far.
  3. When I DO come across a new record marker, I dump the StringBuilder to a sequentially numbered file, and start a new one. I also add it to a list of files to process further.

When I'm done "splitting" the WARC file into sub-files, then I'm going to go through that list of files using the same methods that everyone was using to help me extract the source files.
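The splitting steps above can be sketched roughly like this. It is a Python illustration, not the actual C# method, and it differs in two labeled ways: it reads in binary mode so image data is not mangled by text decoding, and it writes each record straight to its output file instead of accumulating it in memory first. The function name `split_warc` and the `record-000000.warc` naming are assumptions. It is also naive in the same way as the description: any payload line that happens to start with "WARC/" would be treated as a new record.

```python
from pathlib import Path


def split_warc(warc_path, out_dir, marker=b"WARC/"):
    """Stream a WARC file and write each record to its own numbered file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []    # list of sub-files to process further
    current = None  # output file for the record being copied
    with open(warc_path, "rb") as source:
        for line in source:  # one line at a time, never the whole file
            if line.startswith(marker):
                # New record marker: close the previous file, start the next.
                if current is not None:
                    current.close()
                part = out_dir / f"record-{len(written):06d}.warc"
                current = open(part, "wb")
                written.append(part)
            if current is not None:
                current.write(line)
    if current is not None:
        current.close()
    return written
```

The returned list plays the role of the "list of files to process further" in step 3.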

 

I just ran this on the first large file, and... holy buckets! It contained over 70,000 WARC records in that one file!

image.png.a3cef0664856ff2af760a9917eb5b9ea.png

 

This.... could take a while. 😮

I'm out.

22 hours ago, The Philotic Knight said:

I've written my method today

 

Silly question maybe, but how were these compacted in the first place?


43 minutes ago, WanderingAries said:

 

Silly question maybe, but how were these compacted in the first place?

They weren't "compacted", not at all.

 

The difficulty was... well, read this and you might understand:

 

https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/

 

Add to this the fact that the Internet Archive added their very own HTTP header on TOP of the WARC format, and it became... challenging... to parse it properly and accurately. Add to THAT the fact that these large files had to be parsed as a continuous stream, rather than all at once as I did with all the rest of the files, and I hope you can understand the challenges.
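To give a feel for the layering, here is a minimal Python sketch of pulling one response record apart: WARC header lines, a blank line, then a block whose size is given by the WARC `Content-Length` field and which itself contains HTTP header lines, a blank line, and the actual file bytes. This follows the record layout in the WARC 1.0 spec linked above, not the project's real parser. The helper names are made up, and it ignores complications such as folded header lines and repeated header names.

```python
def parse_headers(block):
    """Turn b'Name: value' lines into a dict; the first line is stored under ''."""
    lines = block.split(b"\r\n")
    headers = {"": lines[0].decode("latin-1")}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        key = name.decode("latin-1").strip().lower()
        headers[key] = value.decode("latin-1").strip()
    return headers


def parse_response_record(record):
    """Split one WARC response record into (warc_headers, http_headers, body)."""
    # Layer 1: WARC headers end at the first blank line.
    warc_block, _, rest = record.partition(b"\r\n\r\n")
    warc_headers = parse_headers(warc_block)
    # Content-Length counts the bytes of the block after the WARC headers.
    length = int(warc_headers["content-length"])
    payload = rest[:length]
    # Layer 2: inside that block, HTTP headers end at the next blank line.
    http_block, _, body = payload.partition(b"\r\n\r\n")
    return warc_headers, parse_headers(http_block), body
```

Slicing the payload by byte count rather than searching for the next marker is what keeps binary content such as images intact.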

 

I'm out.

So layered (HTML speaking), but not compacted (Zip, etc). K, not crystal but I get the gist.


  • 2 weeks later
15 minutes ago, Hardship said:

The Good News. I get enough bad news at work every day.

 

The good news is that after many hours of banging my head on a concrete wall, I finally sussed out how to correctly process large WARC files with NO errors in the data content, which means that HTML files no longer have random checksum digits arbitrarily strewn throughout, and binary files (like images) now show up correctly. You MAY recognize a few of these...

 

image.thumb.png.c7db072b2d42a5d31c5eedd6d7a8e8fe.png

I'm out.

Here's some that I can recognize right away from the old forums...

 

image.png.e21f1db873e63e58c133f6f4955685e9.png @Heraclea

 

image.png.b94a3c876780e95369ad5324000f7f25.png Starflier

 

image.png.73fbae43c4e7c88edc4416cc2ad50f68.png Backasswards

 

image.png.25213658b17978f206d595d4aae269bd.png Dragonberry

 

image.png.90aad224a59d2ab5f7b8c3425b00f69e.png The infamous Golden Girl

 

image.png.51afb631395bbc92d02e8d8b4e856220.png @Christopher Robin

 

image.png.022d4628e8622b6fa4e745f9406c3a3a.png @Oubliette_Red .... oh wait, they're still using that one! 😛

 

image.png.992012e77d1ffc9273a9d1d143b38e20.png ... wasn't this one Memphis Bill's? I know I remember seeing this one!

 

image.png.640faaacc091588c0a7c7f3e329b4acb.png @Zekiran Immortal

 

image.png.041fb054ba5f4bb506f3f9c8a46c06f1.png @Samuel Tow

 

image.png.becec1a55acac3fd979eaa102f4f64d2.png @Hyperstrike hasn't changed his either (not that I can point fingers either... heh)

 

image.png.def4320ecaa909042864088756dc20f0.png @Samuraiko also hasn't changed hers in all these years!

 

Oh, and who could forget these things around the time Going Rogue came out:

image.png.be00b630d705f84a2b0b092da0b506eb.png

image.png.8105f9490ebc16936e9c4e9df43d6e1e.png

 

Oh, this is kind of a cool find:

image.thumb.png.db278c924b37abbeddb5b2f32a5aee99.png

 

 

 

I'm out.