Project Spelunker

Gobbledygook · July 7, 2019

Maybe, but it looks like it's something that almost nobody wants, if you look at the lack of Patreon response. Homecoming wants to remain hands off, and....

I also did post on the Titan Network forums about two days ago, and got something like 43 views, no response. Not very encouraging.

https://www.cohtitan.com/forum/index.php?topic=13560.0

It looks like the only ones that want this are those that have posted here and already contributed processing time. It's a bit disheartening.

I plan on doing the Patreon thing, but it will have to wait a couple of days.

Zep · July 8, 2019

Wanting access and having money are two separate things.

I can contribute a little... just waiting to see how the wind is blowing, so to speak. I am much better at a one-time fee than an ongoing one.

What would be the possibility of importing all the forum data into a blank site? Or, once processed, could we make it torrent downloadable? Of course online is best; I just want to think through every option.

_NOPE_ · July 8, 2019

Hmmm... you gave me an idea... I've got to spend some time thinking about it, but... it's an idea. I'll let you know when I've had more time to process.

WanderingAries · July 8, 2019

In the meantime, couldn't we just keep working on the crunching? Albeit at the slower than you desired pace?

_NOPE_ · July 9, 2019

No, we really can't. The database size has gotten so massive for my host now that every attempt to add a new record fails due to timeout. We're stuck without a new host. But I do have another idea, though it'll probably take me a few weeks to put together.

_NOPE_ · July 24, 2019

So, a little tiny update here... remember my spider that I had running? Well... I never stopped it.

It just finished. Here's a preview:

My plan now is to upload these to my website, and then get Google to just index them, and perhaps add a header page that you can start from that has a Google search bar at the top of it, showing the results in the bottom pane. It might be easier to manage than a full indexed database, by shunting that work and data storage off to Google, while I just host the files (which, by the way, if they turn out to be good files, I'll zip them up and provide them to anyone so that anyone can host a "mirror" of the old CoH forums).

This is what I meant when I said I was looking at alternate paths, since it looks like the Patreon is going to be a bust, and neither Homecoming, nor the Titan Network seems interested in hosting.

Oubliette_Red · July 24, 2019

That's great news PK.

I was willing to support Patreon to the max for a few months if it looked like we'd reach our goals. Sadly that wasn't the case and since I'm currently looking for work I had to cancel.

Zep · July 25, 2019

Good job!

_NOPE_ · July 26, 2019

FYI, I'm still working on a new parser in case the spider's outputs end up being crap that Google can't index. Here's a preview of what I have so far. It seems to be working, but I haven't opened up the files to test out the internal links. I'm trying to rewrite the internal links to point to files within the same directory so that in the end, when all is said and done, these files could sit on someone's hard drive, and they could be clicking around on a locally stored copy of the forums instead of one hosted on a website if they wanted:

Zep · July 26, 2019

2 hours ago, The Philotic Knight said:

FYI, I'm still working on a new parser in case the spider's outputs end up being crap that Google can't index. Here's a preview of what I have so far. It seems to be working, but I haven't opened up the files to test out the internal links. I'm trying to rewrite the internal links to point to files within the same directory so that in the end, when all is said and done, these files could sit on someone's hard drive, and they could be clicking around on a locally stored copy of the forums instead of one hosted on a website if they wanted:

As compressible as the data is that could work (at least for me) too. Use box and/or a torrent to share it out with indexing.

_NOPE_ · July 31, 2019

Update:

http://www.cityofplayers.com/2019/07/31/patreon-failure-and-cancellation/

_NOPE_ · August 2, 2019

New update, I'm now working on making a generic "WARC Handler" library/DLL to make use of in a new version of the Project Spelunker Parser. What I was doing was WAY too custom and confusing to work with to be honest. Why am I doing this? A few reasons:

Apparently nobody has EVER written a .NET WARC DLL... I have no idea why, but I can find libraries in Java and Python, and nothing on the .NET platform, and I think one should exist. So I'll make that part and release it to the world, if anyone wants to use it.
It will simplify my future code for Project Spelunker. I know it's going to be a one-time process, but I can imagine its application in future projects.
It's a learning exercise for me to practice implementation by taking an ISO standard, reading it, attempting to comprehend it, and creating an implementation of it in the .NET environment.
It's a fun challenge!

After I have this written, it should be MUCH easier working with my new "WarcRecord" object to process the files and spit them out onto the user's drives, so that we can continue down the path of getting this project done via web indexing rather than the previous database method that I was attempting. This will (hopefully) mean that I won't have to spring out of pocket $120 a month on a beefier server. Just let Google do the work.
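For readers curious what "reading a WARC record" involves, here is a minimal sketch in Python rather than .NET. It is not the WARC Handler library described above. It assumes an uncompressed WARC stream in which each record is a version line, named header fields, a blank line, and then exactly Content-Length bytes of content. The function name read_warc_records and the sample data are made up for illustration.

```python
import io

def read_warc_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream.

    Illustrative only: header names are lower-cased, and no validation
    beyond the version line is attempted.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The block is exactly Content-Length bytes, whatever it contains.
        block = stream.read(int(headers["content-length"]))
        yield headers, block

# Hypothetical two-record sample, built by hand for this example.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://boards.cityofheroes.com/a.txt\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://boards.cityofheroes.com/b.html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
    b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, block in records:
    print(headers["warc-target-uri"], len(block))
```

WARC files are often distributed gzip-compressed; in that case the stream could be opened with Python's gzip module before being passed to a reader like this one.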

Now, why am I doing this? Because frankly, I don't trust my spider to have done a good enough job. I trust the Internet Archive's spider more than I trust my own.

So, long story short (too late!) the project is still being worked on, albeit slowly.

WanderingAries · August 3, 2019

Possibly dumb question, so you're still talking about a local app that talks to a server OR a local app that later sends you results data?

_NOPE_ · August 3, 2019

The way I'm working it now, the program will "extract" the original directory structure and files from the WARC files, and recreate them on your hard drive. And then, instead of sending that as data to a database, it'll send the files to my server via FTP.

_NOPE_ · May 6, 2021

It has begun...

_NOPE_ · May 6, 2021

Progress is slow and steady, but this is just phase 1:

Here's some interesting things that I've found so far:

With a special appearance by @Healix and @Hyperstrike:

_NOPE_ · May 7, 2021

So, I'm about 20% done processing, I'd say the initial processing should be done in about a week, assuming nothing interrupts the process:

And so far, with that 20% processed, I have this many files so far extracted from the WARCs (with my own WARC parser that I built from the ground up):

So, with 20% being 579,873 files taking up 39.745 GB, I estimate that the final total will be about 2,899,365 files taking up roughly 198.728 GB of space.
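The extrapolation above is a straight scale-up from the 20% sample, which can be checked with a few lines of Python. The variable names are only for illustration. Scaling the rounded 39.745 GB figure gives roughly 198.725 GB; the 198.728 GB quoted above presumably comes from a size that had not been rounded to three decimals.

```python
percent_done = 20        # share of the input processed so far
files_so_far = 579_873   # files extracted at that point
gb_so_far = 39.745       # disk space used at that point, in GB

# Scale both figures up to 100%, assuming the remaining 80% looks like the first 20%.
est_files = files_so_far * 100 // percent_done
est_gb = gb_so_far * 100 / percent_done

print(f"estimated files: {est_files:,}")
print(f"estimated size:  {est_gb:.3f} GB")
```

The estimate only holds if the processed 20% is representative of the rest of the input.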

Now, that's just raw, unprocessed files, written directly from a bytestream. But they'll be out there in file format, and I'll stick them in a web directory and make it searchable by Google, so that's something.

The unfortunate thing is that since I started this in Debug mode, if anything "breaks", I have to start over. But I'm going to take that risk. It's been a few years, what's another couple of weeks?

The second step will be to ATTEMPT to replace all of the internal URLs (those pointing to boards.cityofheroes.com) with relative path URLs that link correctly to the appropriate internal documents. Now, to get the correct filename to replace it with should be easy, because I had already written a "GenerateSanitizedFileName" method that allowed me to turn all of the return URIs into the files' final filenames, so I can just pass the references to that. The trick will be for me to learn enough about the HTML Agility pack to make the swap and replace happen, and I'm not sure how much processing time that whole process will take, given the number of files. While Step 1 here I consider to be reasonable enough to just run on my PC, for step 2, I might have to make a "mini-app" and actually try to resurrect "Project Spelunker" and ask others for their assistance with processing these files. We'll see how that goes.
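A rough sketch of the link-swapping idea in step two, written in Python with a regular expression instead of the HTML Agility Pack. The function generate_sanitized_file_name is a hypothetical stand-in for the GenerateSanitizedFileName method mentioned above; its real naming rules are not given in this thread, so the ones here are invented. It also assumes all extracted files sit in one directory, so a bare file name works as a relative link.

```python
import re
from html import unescape
from urllib.parse import urlsplit

FORUM_HOST = "boards.cityofheroes.com"

def generate_sanitized_file_name(url):
    """Hypothetical naming rule: path plus query, unsafe characters replaced."""
    parts = urlsplit(url)
    raw = parts.path.strip("/") or "index"
    if parts.query:
        raw += "_" + parts.query
    return re.sub(r"[^A-Za-z0-9._-]", "_", raw)

# Matches href="..." and src="..." attributes with either quote style.
LINK_ATTR = re.compile(
    r"""(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["'])(?P<url>.*?)(?P=q)""",
    re.IGNORECASE,
)

def rewrite_internal_links(page):
    """Point links to the old forum host at local file names instead."""
    def swap(match):
        url = unescape(match.group("url"))  # turn &amp; back into &
        if urlsplit(url).hostname != FORUM_HOST:
            return match.group(0)  # external or relative link: leave untouched
        local = generate_sanitized_file_name(url)
        return f"{match.group('attr')}{match.group('q')}{local}{match.group('q')}"
    return LINK_ATTR.sub(swap, page)

page = (
    '<a href="http://boards.cityofheroes.com/showthread.php?t=123">thread</a> '
    '<a href="http://example.com/">elsewhere</a>'
)
rewritten = rewrite_internal_links(page)
print(rewritten)
```

A real HTML parser is sturdier than a regular expression on messy markup, so this is only meant to show the shape of the swap, not to replace the parser-based approach.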

The third step will be to attempt to correct for any encoding errors. I already notice that a few of the image files won't work because of corruption (most work though), and a few of the HTML files appear to have a couple of extra digits at the very start of them. Not sure what that is, maybe a "checksum" value or something, but I'd like to strip those off if I can figure it out.
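One possible explanation for the extra digits, offered as a guess and not confirmed anywhere in this thread, is HTTP chunked transfer encoding. In that scheme the server sends the body in pieces, each preceded by its length in hexadecimal on its own line, and a WARC response record typically stores the HTTP message as it was captured. If that is the cause, the files would also have similar digit lines further down and a "0" line at the end. A minimal Python sketch of undoing that encoding, with the helper name decode_chunked made up for illustration:

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble an HTTP body that uses chunked transfer encoding.

    Raises ValueError if the data does not look like chunked encoding.
    Trailer headers after the final chunk are ignored.
    """
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)               # end of the size line
        size_field = body[pos:eol].split(b";", 1)[0]  # drop any chunk extensions
        size = int(size_field.strip().decode("ascii"), 16)
        if size == 0:
            return bytes(out)                         # zero-length chunk marks the end
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2                        # skip the CRLF after the chunk data

# Hand-made example: two chunks of 5 and 6 bytes, then the terminator.
raw = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
decoded = decode_chunked(raw)
print(decoded)
```

Before applying something like this to every file, it would be worth checking whether the stored HTTP headers for that record actually say "Transfer-Encoding: chunked".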

We'll have to wait and see how this all pans out. Time will tell.

Grouchybeast · May 8, 2021

This is amazing! Fingers crossed for a successful run.

_NOPE_ · May 11, 2021

FYI, still chugging along. Not quite halfway there, but I would have expected any major errors to have happened by now:

Now, cross my fingers that some app on my computer doesn't decide to perform an unattended "update and restart" without my permission!

WanderingAries · May 12, 2021

12 hours ago, The Philotic Knight said:

FYI, still chugging along. Not quite halfway there, but I would have expected any major errors to have happened by now:

Now, cross my fingers that some app on my computer doesn't decide to perform an unattended "update and restart" without my permission!

Set the process to realtime? Manually pause system updates for like a week? Bribe it with candy?

_NOPE_ · May 14, 2021

So.... uh.... slight update. Remember how it looked like I was almost half done?

Yeah, it started slowing down... A LOT.

I couldn't figure out why, but then I remembered that I had pre-sorted the file list from smallest to largest.

So.... we're probably not even 25% done yet.... sorry. 😪
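To illustrate why a smallest-to-largest sort throws off the estimate, here is a small Python sketch with invented file sizes. Counting finished files makes the job look half done, while counting bytes shows most of the work is still ahead.

```python
def progress(sizes, done_count):
    """Return (fraction of files done, fraction of bytes done)."""
    by_count = done_count / len(sizes)
    by_bytes = sum(sizes[:done_count]) / sum(sizes)
    return by_count, by_bytes

# Hypothetical input sizes in MB, processed smallest first.
sizes = sorted([1, 1, 2, 2, 4, 10, 30, 50])
by_count, by_bytes = progress(sizes, done_count=4)

print(f"by file count: {by_count:.0%}")
print(f"by bytes:      {by_bytes:.0%}")
```

Tracking progress by bytes processed, or processing the inputs in random order, would give a steadier estimate.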

WanderingAries · May 14, 2021

That's ok, historically, computers can't estimate time properly either. 😉

_NOPE_ · May 17, 2021

So.... good news/bad news.

The bad news is, yes, my system restarted for an automatic update. The good news is, a whole lot of files got extracted before that, so while I reconsider my code and recode it to make it a more sustainable process, I'm also in the process of copying what I HAVE extracted so far over to http://oldcohforums.cityofplayers.com/

So now you can at least get a "taste" of what I've "scavenged" so far. There's THOUSANDS of files, and frankly, my computer may not have the storage capacity to hold them all and keep processing in the same directory without breaking... so, I'm reconsidering how I process these. I may have to make a subfolder for each one of the WARC files, so that everything's not just in one GIANT folder... I'll have to think on this for a bit. I've also got to figure out if Google can just start indexing these things, or if I have to do anything special to make it index them. Otherwise, it's just a bunch of loose files sitting in a directory, and finding the content that you want would be like a needle in a haystack!

More to come... Soon™.

WanderingAries · May 17, 2021

If you're looking for a little extra storage... O>:p

_NOPE_ · May 18, 2021

I might have to consider something like that @WanderingAries, my system seems to be super slow for what I'm trying to accomplish.

However, I think it may also be more efficient to extract the files, transfer them to the server, and then delete the original source files when there's a successful transfer. That way, there are fewer files just sitting around doing nothing. I may also "crowdsource" this like I tried to do last time with the database solution that failed.

By the way, in the meantime, I'm still uploading what I have so far, over a million files and still copying:
