_NOPE_

Members
  • Posts

    2543
  • Joined

  • Days Won

    7

Everything posted by _NOPE_

  1. Whatever changes I made appear to be working now, because 117 just processed with no issues, and left me with a working copy of Vanden's animated gif avatar:
  2. Here's that final list, by the way, the "Final Four": boards.cityofheroes.com-threads-range-18694-20120904-094534, boards.cityofheroes.com-threads-range-18341-20120905-142114, boards.cityofheroes.com-threads-range-21273-20120904-160852, and boards.cityofheroes.com-threads-range-19175-20120905-235117. And, after some modification to my code, I was able to successfully parse boards.cityofheroes.com-threads-range-18694-20120904-094534. We'll have to wait and see about the rest. That one took a few hours to parse, and it's only a 20MB file. I blame the additional visual display of progress for that ridiculous processing time, because even files gigabytes in size didn't take that long! Working on boards.cityofheroes.com-threads-range-19175-20120905-235117 right now.
  3. This is what happens when you want to make sure things didn't lock up, and you want to SEE a "large" file processing byte by byte. It REALLY slows down the process! This is precisely why I DIDN'T have "byte by byte reporting" on the main app, and only save that for debugging "problem" files. If I hadn't, this process would have taken YEARS of time just to update the screen for everyone.
  4. Kind of fun to watch it work live:
  5. Yeah, just cool it for now. Those last four will probably be in and out of the input queue as I'm trying to troubleshoot the issues with them. Patience, sir! In the meantime, trying to re-torrent the original archive from archive.org... has been stuck at 99.9% for like a week now 😞 : So I started trying to download from the HTTP link on that page, and I sure as hell hope that I don't lose connection any time soon, because I'm unsure if that's resumable: 😢
  6. I'm now processing the last four in Debug mode, with byte-by-byte reporting, which is REALLY going to slow this down. But it's the only way to be SURE that it's still working, and to see the progress and know that it's not locked up. Refreshing this display every few hundred bytes makes things REALLY slow, but at least I'll know:
  7. These three files may have some unforeseen issue that's making them lock up. I may have to process them manually. 🤨
  8. I'm just going to keep throwing these files back into the input queue over and over again until we either get them processed, or I get an actual coding error, rather than a transmission or "file not found" error (which is usually caused by multiple people trying to grab the same file). Down to THREE! Plus the one I'll have to manually process when I have time.
  9. We're now down to FIVE files left, and one additional one that I'll have to process manually. I've moved the other five back to the input folder. Grab them while you can!
  10. I've moved this one over to my "manual processing" folder, and will get to it when I can. Hell, out of 18,000+ WARC files that contain hundreds of URLs within them, I think not being able to turn ONE URL into a filename is pretty damn good odds. I may just have to manually name that one myself.
  11. Maybe... let's just throw two or three instances at these last few each, so there's less chance of bumping into each other? 😉
  12. Version 2.3 is out on the server now if you want it. The only change is that I swapped the error out with a message saying basically that the input directory is empty. Another rookie mistake. I'm going through the last few now and will push them into the input folder if they don't have a good valid zip on my end. But I think we're about done here!
  13. I'll push the ones stuck in processing back to the input folder sometime tomorrow. Thanks everyone.
  14. Already back up to 80%, good job everyone!
  15. I am also still re-downloading the ORIGINAL file, just to make TRIPLE sure that I didn't miss anything else, but I've got a good feeling that the 18,000 something number is the correct number of files:
  16. Sooooooo.... It's a really damn good thing that I double checked the original source files a second time, because.... ...ooops.... Found about another 6,000 files that I'd somehow lost between when I started this project two years ago and now. My best guess is that I was using a sample amount for testing and split them out. I've also uploaded a version 2.2 of the parser, if y'all want to get it. The only difference is that I'm calculating the total files based on what's actually IN the directories now, instead of my previous rookie mistake of hard-coding a "magic number" of the number of files I THOUGHT there were... I apologize for any inconvenience this may cause. I'm still re-downloading the original file just to make SURE that I'm not missing anything else, either from my original download or from when I extracted the files from the three layers of archives they were stored under and then re-archived them into the smaller 7z format.
  17. By the way, these files that we've been working with are the same files that I downloaded over two years ago, from here: https://archive.org/details/archiveteam-city-of-heroes-main Between then and now, there is a chance that I misplaced some files during all of my testing. I'm redownloading that same file now, unzipping my original "backup" of the files that I made about a year ago, and will compare all three directories together. If there are any missing files that I find in the original archive from archive.org, I'll add them into the queue.
  18. You'll have to talk with the Mod author @Starhammer about that one. Maybe they can help you troubleshoot it:
  19. Two of those have not had a successful zip file upload, @WanderingAries, so their WARCs will remain in the input. The other three did, so I've moved them over to the processed folder. Thanks! 93.3% complete!
  20. I've moved all files from the "processing" queue back to the input queue, so needless to say, you'll probably get a bunch of failures this time around. Please post them here, because they'll probably be erroneous failures. It just means that while you were able to successfully upload the zip file, since I moved the processing file, it couldn't find it to move it to the "processed" folder. I'll move them manually once I've confirmed that we have a good Zip file for the WARC. As it stands right now, we're at 91.53% complete, and I can TASTE it. So close. I don't think I'm going to have to do any manual processing @WanderingAries, as all errors so far have been transmission errors, NOT programming/parsing errors. So I finally got this thing working correctly for all files. We might just have to retry a few that failed to upload or download. Thank you EVERYONE for your work. When I make my "SiteMap" mini-app, I'll also write code in there to grab a final tally of how many files each username processed, and create a "Thanks To" file showing everyone's usernames and totals. I'm excited! 😄
  21. Please see my post from earlier in the thread:
  22. Thanks for that Aries, I've moved the files around as necessary. Current project status - nearly 60% done! After these are all done, I'll go ahead and unzip them all into my current public facing web server, and make a "mini-app" to just go through the raw file list and index every file into a sitemap HTML file and submit to Google for processing. Then things will really start to show up and be searchable. I'll also see if I can zip up the whole thing into a single 7z archive at maximum compression and provide it to everyone as a "Phase 1" file. If there's enough interest, I can make it a torrent. Then from there, anyone can do anything with these files that they want to on their own, as I move onto my own Phase 2 - removing duplicate documents, and processing the HTML files themselves to swap out all boards.cityofheroes.com or boards.cityofvillains.com references for relative local paths to the files that can be found in the archive. THAT will in fact take even LONGER than this phase, as it'll have to be on a file-by-file basis, actually scanning all links in each page, and then comparing them to what's out there on the file system (using my own SanitizeFileName method on both sides of the equation of course to make things match up properly). Not sure how I'd even begin to crowd source that thing, as it would create "race conditions" even more than the current process...
  23. @Blood Speaker you should totally make that mod! I'm sure people would love it!
  24. @SuggestorK my thread is only talking about ports needed for MY program to operate, not the game itself.
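
Posts 3 and 6 above describe how refreshing the display every byte (or every few hundred bytes) made a 20MB file take hours to parse. One common way to keep a visible "still alive" indicator without that cost is to throttle updates by elapsed time instead of by byte count. The following is only a minimal Python sketch of that idea, not the parser's actual code: the thread does not show the parser's source or language, and `process_stream`, `on_progress`, `chunk_size`, `min_interval`, and `clock` are all hypothetical names chosen for illustration.

```python
import io
import time


def process_stream(stream, on_progress, chunk_size=64 * 1024,
                   min_interval=0.5, clock=time.monotonic):
    """Read `stream` to the end, calling `on_progress(bytes_so_far)`
    at most once every `min_interval` seconds (hypothetical helper)."""
    total = 0
    last_report = clock()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)  # a real parser would handle the chunk here
        now = clock()
        # Only touch the display if enough time has passed since the last update.
        if now - last_report >= min_interval:
            on_progress(total)
            last_report = now
    on_progress(total)  # always report the final byte count once
    return total


if __name__ == "__main__":
    data = io.BytesIO(b"x" * 1_000_000)
    process_stream(data, lambda n: print(f"{n} bytes processed"))
```

With this shape, the number of screen refreshes depends on how long the file takes to process, not on how many bytes it contains, so a lock-up is still visible (the counter stops moving) while the display cost stays roughly constant.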