WanderingAries Posted June 15, 2021
8 minutes ago, The Philotic Knight said: Keep it up and I'm tellin' Uncle George! Anyway, you may wanna see if the last 5 are stuck in the queue. Also, the total =/= reported instances remaining. At this point, you can't start a new instance, as it'll basically tell you nothing's left.
OG Server: Pinnacle <||> Current Primary Server: Torchbearer || Also found on the others if desired <||> Generally Inactive
Installing CoX: Windows || MacOS || MacOS for M1 <||> Migrating Data from an Older Installation
Clubs: Mid's Hero Designer || PC Builders || HC Wiki || Jerk Hackers
Old Forums <||> Titan Network <||> Heroica! (by @Shenanigunner)

_NOPE_ (Author) Posted June 15, 2021
They're stuck. I just unstuck them. Do one at a time now. If they still stick, I got stuff to look at.
I'm out.

WanderingAries Posted June 15, 2021
Figured, Next!

WanderingAries Posted June 15, 2021 (edited)
I just tried to run one in 8.1 and one in 10, but both quickly gave this error:
There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-26219-20120906-072319.warc' ---> System.ApplicationException: Target file /processed/boards.cityofheroes.com-threads-range-26219-20120906-072319.warc already exists
   at COH_WARC_Processor.FTPclient.FtpRename(String sourceFilename, String newName) in C:\Projects\COH WARC Handler\WHB WARC Processor\FTP Client.cs:line 580
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 140
   --- End of inner exception stack trace ---
Trying just one on Win10 now
Edited June 15, 2021 by WanderingAries

_NOPE_ (Author) Posted June 15, 2021
2 minutes ago, WanderingAries said: I just tried to run one in 8.1 and one in 10, but both quickly gave this error: [same error text and stack trace as in the post above]
Go ahead and try again. There were duplicates in both directories; not sure how that happened. But it should be fixed now. It may happen with the other files.

WanderingAries Posted June 15, 2021 (edited)
Current error:
There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-14503-20120911-094551.warc' ---> System.Exception: Failed to move file to process from '/input/boards.cityofheroes.com-threads-range-14503-20120911-094551.warc' to '/processing/boards.cityofheroes.com-threads-range-14503-20120911-094551.warc'.
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 105
   --- End of inner exception stack trace ---
Now grabbing boards.cityofheroes.com-threads-range-14589-20120911-131326.warc
Edited June 15, 2021 by WanderingAries

_NOPE_ (Author) Posted June 15, 2021
Probably because it's already there... are you still running multiple instances???

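A "failed to move" error like this is what a claim-by-move work queue tends to produce when two workers go after the same file. Below is a minimal sketch of that pattern in Python, assuming a local filesystem rather than the tool's FTP server. The function name `claim` and its directory arguments are hypothetical, not taken from the actual C# program.

```python
from pathlib import Path


def claim(name, input_dir, processing_dir):
    """Try to claim one file by moving it from input_dir to processing_dir.

    Returns the new path on success, or None if the file was already
    claimed (or a copy already sits in processing_dir).
    """
    src = Path(input_dir) / name
    dst = Path(processing_dir) / name
    if dst.exists():
        # A duplicate in both directories: leave it for a human to sort out.
        return None
    try:
        src.rename(dst)
    except (FileNotFoundError, FileExistsError):
        # Another worker moved the file first, or created dst in the meantime.
        return None
    return dst
```

A worker that gets `None` back would simply ask for the next file instead of raising an error. Note that the `dst.exists()` check is not atomic, so this only narrows the race; it does not remove it.
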
WanderingAries Posted June 15, 2021
1 minute ago, The Philotic Knight said: Probably because it's already there... are you still running multiple instances???
There's a possibility it was run before clearing temp files as well... cleared them and ran with the above newer file

WanderingAries Posted June 15, 2021 (edited)
Oh wow, 1.6 GB file. That'll take a minute to process. 😛
God, this is painful to watch inside a VM. 8 GB RAM allocated and nothing else running in the VM.
Edited June 15, 2021 by WanderingAries

WanderingAries Posted June 16, 2021 (edited)
Came back to check on it and got this. Gonna shut down the VM and run it on the main machine in a bit if the file is accessible then.
There was an error in the application. Please copy and paste the following text in a message to the program author:
System.Exception: Error while processing file 'boards.cityofheroes.com-threads-range-14589-20120911-131326.warc' ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at Ionic.Zlib.ZlibBaseStream.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\projects\dotnetzip-semverd\src\Zlib.Shared\ZlibBaseStream.cs:line 149
   at Ionic.Zlib.DeflateStream.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\projects\dotnetzip-semverd\src\Zlib.Shared\DeflateStream.cs:line 617
   at Ionic.Crc.CrcCalculatorStream.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\projects\dotnetzip-semverd\src\Zlib.Shared\CRC32.cs:line 710
   at Ionic.Zip.ZipEntry._WriteEntryData(Stream s) in C:\projects\dotnetzip-semverd\src\Zip.Shared\ZipEntry.Write.cs:line 1460
   at Ionic.Zip.ZipEntry.Write(Stream s) in C:\projects\dotnetzip-semverd\src\Zip.Shared\ZipEntry.Write.cs:line 2220
   at Ionic.Zip.ZipFile.Save() in C:\projects\dotnetzip-semverd\src\Zip.Shared\ZipFile.Save.cs:line 168
   at COH_WARC_Processor.MainForm.BtnProcess_Click(Object sender, EventArgs e) in C:\Projects\COH WARC Handler\WHB WARC Processor\MainForm.cs:line 125
   --- End of inner exception stack trace ---
And nope, it says no file is available, but it also says there are 6 pending with only 4 remaining according to the Max-Complete equation.
Edited June 16, 2021 by WanderingAries

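The trace shows the out-of-memory error surfacing while the zip is being saved. The thread doesn't show how the entries were added, so the following is only an illustration of the general idea of streaming a large file into an archive in small pieces. It is a Python sketch using the standard `zipfile` module, not the DotNetZip calls the tool uses; `add_file_streaming` and its parameters are hypothetical names.

```python
import shutil
import zipfile


def add_file_streaming(zip_path, source_path, arcname, chunk_size=1024 * 1024):
    """Append source_path to the archive at zip_path without loading it whole.

    Data is copied chunk_size bytes at a time, so memory use stays roughly
    constant no matter how large the source file is.
    """
    with zipfile.ZipFile(zip_path, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        # force_zip64 lets the entry grow past 2 GiB if it needs to.
        with open(source_path, "rb") as src, \
                zf.open(arcname, "w", force_zip64=True) as dst:
            shutil.copyfileobj(src, dst, chunk_size)
```

The trade-off is mostly about memory: only one chunk is held at a time, at the cost of many small writes.
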
_NOPE_ (Author) Posted June 16, 2021
Let me just handle the final few, alright?

WanderingAries Posted June 16, 2021
5 minutes ago, The Philotic Knight said: Let me just handle the final few, alright?
"But Moooooom" :p

_NOPE_ (Author) Posted June 17, 2021
Project Update: Thanks to everyone's help (mostly @WanderingAries I suspect, but thank you everyone!), all files have now gone through the first phase of the process, where the WARC data is extracted from the WARC files into the original files. There were, however, a few exceptions that I will need to look at manually - files that ended up being completely empty on output (with the exception of the "credit" file that I'm generating for each zip):
Also, there may be SOME files that didn't get zipped properly, and I'll only know this after I go through the 12,710 files and try to extract them one at a time to my local PC. I'm going to automate this process on my end and keep a log of all files that fail to extract, then investigate those.
I will let the community know when and if I need assistance with further phases of the project, but for the meantime, I'm probably going to "go dark" for a while while I work on this, as there won't be anything useful to report. Thank you again to everyone for your assistance; that was a HUGE accomplishment, just getting those original files extracted.
Now the next steps are to figure out what to DO with those files. As I've said before, but I'll repeat it here, here's the rest of my plan, in a bit more detail:
1. Validate the veracity of the data (what I mentioned above with the "exception" files) and correct any errors.
2. Determine the appropriate file structure - if these can all just be dumped into one folder, or if that'll cause errors just due to the massive number of files - they may need to each be placed in a separate folder to stand on their own separately, and then linked appropriately.
3. Standardize all of the file names, then standardize URL references to those files to match the local filenames (this step may require more borrowed processing time once I know I have a working process!). This may require having an algorithm perform a "search" for the appropriately named file in the local system amongst many subdirectories (as mentioned in step 2 above).
4. Submit all of this to Google to determine if they can index them.
5. If Google can't, then I'll have to create my own method/algorithm for searching these files.
So, please sit tight, this... could take a while. Plus, I've still got my day job that, you know, pays me money... so I gotta give them some time too. 😛

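The automated extraction check described in the post could look roughly like this. It is a Python sketch, not the author's tool: `check_archives` is a hypothetical name, and `credit.txt` is an assumed filename for the "credit" file, which the thread never names.

```python
import zipfile
import zlib
from pathlib import Path


def check_archives(folder, credit_name="credit.txt"):
    """Test every .zip in folder; return (failed, empty).

    failed: list of (archive name, reason) for archives that cannot be read.
    empty:  list of archive names holding nothing besides the credit file.
    """
    failed, empty = [], []
    for path in sorted(Path(folder).glob("*.zip")):
        try:
            with zipfile.ZipFile(path) as zf:
                bad = zf.testzip()  # name of the first corrupt member, or None
                names = [n for n in zf.namelist() if n != credit_name]
        except (zipfile.BadZipFile, zlib.error, OSError) as exc:
            failed.append((path.name, str(exc)))
            continue
        if bad is not None:
            failed.append((path.name, f"corrupt member: {bad}"))
        elif not names:
            empty.append(path.name)
    return failed, empty
```

Writing `failed` and `empty` out to a log file would give the list of archives to investigate by hand.
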
WanderingAries Posted June 18, 2021
On 6/17/2021 at 8:55 AM, The Philotic Knight said: I will let the community know when and if I need assistance with further phases of the project, but for the meantime, I'm probably going to "go dark" for a while while I work on this, as there won't be anything useful to report.
"Are we there yet?"

_NOPE_ (Author) Posted June 20, 2021
Alright, so those last ten files that failed were also the ten largest files, no surprise. They failed because they are larger than 2 GB.
"Ahhh", says a wizened and experienced programmer, "you must have just faced the 32 bit limitation! Just recompile and run the program in 64 bit!"
Normally, yes, you'd be right, Chuckles. But not this time. I've been reading the files byte by byte, and putting them into a byte array for processing. And, sadly, a byte array, in fact an array of ANY object, is limited by the definition of an array itself, whose length is defined by... dun dun DUNNNNNNN..... the int32 numerical type. Which... caps a byte array at roughly 2 GB.
https://docs.microsoft.com/en-us/dotnet/api/system.array.length?view=net-5.0
I have to think of another approach to handle these larger, probably more important, files. And I think I have an idea, relating to using a constant stream to split the file into smaller chunks by using the WARC header marker as a delimiter. But that will have to wait until Monday at least.

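To make the arithmetic concrete: a signed 32-bit length tops out at 2^31 - 1 = 2,147,483,647, so a single array indexed that way holds just under 2 GiB of bytes (the exact ceiling in .NET is slightly lower still). Here is a small Python sketch of the limit and of the usual workaround, reading in bounded chunks. `read_chunks` and `fits_in_one_array` are hypothetical helper names, and Python itself has no such limit; it is only used here for illustration.

```python
import os

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_one_array(path):
    """True if the file is small enough for one Int32-indexed byte array."""
    return os.path.getsize(path) <= INT32_MAX


def read_chunks(path, chunk_size=64 * 1024):
    """Yield the file's contents chunk_size bytes at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```

With chunked reads, memory use depends on `chunk_size` rather than on the file size, which is why streaming sidesteps the array limit regardless of 32-bit or 64-bit builds.
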
_NOPE_ (Author) Posted June 20, 2021
A fascinating discussion that has taken place over two years on a fix for this: https://github.com/dotnet/runtime/issues/12221
A lot of people WAY smarter than me thinking A LOT about this.

_NOPE_ (Author) Posted June 23, 2021
I've written my method today to attempt to extract all of the data from the ten largest WARC files, by doing the following:
1. Opening the file via a StreamReader, and reading it line by line, rather than all at once.
2. When I come across a line that doesn't start with "WARC/" (the new record identifier), I'm just adding the line to a StringBuilder that's accumulating the lines found thus far.
3. When I DO come across a new record marker, I dump the StringBuilder to a sequentially numbered file, and start a new one. I also add it to a list of files to process further.
4. When I'm done "splitting" the WARC file into sub-files, then I'm going to go through that list of files using the same methods that everyone was using to help me extract the source files.
I just ran this on the first large file, and... holy buckets! It contained over 70,000 WARC records in that one file! This.... could take a while. 😮

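The splitting steps above can be sketched as follows. This is a Python approximation, not the author's C# code: it reads in binary mode instead of using a StreamReader and StringBuilder, so that image bytes pass through untouched, and the names `split_records`, `split_to_files` and the `record-NNNNNN.warc` naming scheme are all hypothetical.

```python
from pathlib import Path


def split_records(stream, marker=b"WARC/"):
    """Yield the raw bytes of each record from a binary stream, one at a time."""
    current = []
    for line in stream:  # binary streams iterate line by line, split on b"\n"
        if line.startswith(marker) and current:
            # A new record begins: hand back the one accumulated so far.
            yield b"".join(current)
            current = []
        current.append(line)
    if current:
        yield b"".join(current)


def split_to_files(warc_path, out_dir):
    """Write each record to a sequentially numbered file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with open(warc_path, "rb") as stream:
        for number, record in enumerate(split_records(stream), start=1):
            part = out_dir / f"record-{number:06d}.warc"
            part.write_bytes(record)
            parts.append(part)
    return parts
```

Only one record is held in memory at a time. One known weakness of marker-based splitting: a payload line that happens to begin with `WARC/` would be mistaken for a record boundary.
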
_NOPE_ (Author) Posted June 23, 2021
I was able to "extract" the files from the largest WARC file - however, the images aren't showing properly, so I have an error to troubleshoot.
In other news, all OTHER files can now be browsed at http://cohforums.cityofplayers.com/files/
Almost the entire forum is there... it's now just a matter of making it searchable!

WanderingAries Posted June 24, 2021
22 hours ago, The Philotic Knight said: I've written my method today
Silly question maybe, but how were these compacted in the first place?

_NOPE_ (Author) Posted June 24, 2021
43 minutes ago, WanderingAries said: Silly question maybe, but how were these compacted in the first place?
It wasn't "compacted", not at all. The difficulty was... well, read this and you might understand: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
Add to this the fact that the Internet Archive added their very own HTTP header on TOP of the WARC format, and it became... challenging... to parse it properly and accurately. Add to THAT the fact that these large files had to be parsed as a continuous stream, rather than all at once as I did with all the rest of the files, and I hope you can understand the challenges.

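For readers who don't want to wade through the spec: in WARC 1.0, each record is a block of `Name: value` header lines, a blank line, then exactly `Content-Length` bytes of content, followed by two CRLFs. For a `response` record that content is the captured HTTP response itself, status line and headers included, so a parser has to peel off two header layers. A minimal Python sketch of that layering follows; the function names are hypothetical, folded header lines are not handled, and this is not the author's implementation.

```python
import io


def read_headers(stream):
    """Read a first line plus 'Name: value' lines up to the blank line."""
    first = stream.readline().strip().decode("latin-1")
    fields = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("latin-1").partition(":")
        fields[name.strip().lower()] = value.strip()
    return first, fields


def read_record(stream):
    """Return (version, warc_headers, block) or None at end of stream."""
    version, warc = read_headers(stream)
    if not version:
        return None
    block = stream.read(int(warc["content-length"]))
    stream.read(4)  # skip the two CRLFs that terminate every record
    return version, warc, block


def split_http(block):
    """Split a response record's block into (status_line, headers, body)."""
    inner = io.BytesIO(block)
    status, http = read_headers(inner)
    return status, http, inner.read()
```

Because the block is read by byte count rather than by scanning for a marker, binary bodies that contain line breaks or the text `WARC/` come through intact.
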
WanderingAries Posted June 24, 2021
So layered (HTML speaking), but not compacted (Zip, etc). K, not crystal but the gist.

_NOPE_ (Author) Posted July 7, 2021
<Inserting cryptic comment here:>
You guys have NO idea just how happy I am to see this image:
[image not preserved]
Good News and Bad News coming Soon™
Which do you want to hear first? 😄

Hardship Posted July 7, 2021
The Good News. I get enough bad news at work every day.

_NOPE_ (Author) Posted July 7, 2021
15 minutes ago, Hardship said: The Good News. I get enough bad news at work every day.
The good news is that after many hours of banging my head on a concrete wall, I finally sussed out how to correctly process large WARC files with NO errors in the data content, which means that HTML files no longer have random checksum digits arbitrarily strewn throughout, and binary files (like images) now show up correctly. You MAY recognize a few of these...

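The post doesn't say what the stray digits actually were or how they were removed. One possible explanation, offered purely as an assumption that the thread does not confirm, is HTTP/1.1 chunked transfer encoding: a body sent that way is interleaved with hexadecimal chunk-size lines, which look like random digits if the body is saved as-is. A Python sketch of decoding such a body, with the hypothetical name `dechunk`:

```python
def dechunk(body):
    """Decode an HTTP/1.1 chunked body; raises ValueError if it is malformed."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)  # end of the chunk-size line
        size_text = body[pos:eol].split(b";")[0].strip()  # drop any extensions
        size = int(size_text.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            return bytes(out)  # last chunk; trailers, if any, are ignored
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF after it
```

For example, `dechunk(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")` returns `b"hello world"`. This would only apply to responses whose headers declare `Transfer-Encoding: chunked`.
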
_NOPE_ (Author) Posted July 7, 2021
Here's some that I can recognize right away from the old forums...
@Heraclea
Starflier
Backasswards
Dragonberry
The infamous Golden Girl
@Christopher Robin
@Oubliette_Red .... oh wait, they're still using that one! 😛
... wasn't this one Memphis Bill's?
I know I remember seeing this one!
@Zekiran Immortal
@Samuel Tow
@Hyperstrike hasn't changed his either (not that I can point fingers either... heh)
@Samuraiko also hasn't changed hers in all these years!
Oh, and who could forget these things around the time Going Rogue came out:
Oh, this is kind of a cool find:
