
Project Spelunker


_NOPE_


Maybe, but it looks like it's something that almost nobody wants, if you look at the lack of Patreon response. Homecoming wants to remain hands off, and....

 

I also posted on the Titan Network forums about two days ago, and got something like 43 views and no response. Not very encouraging.

 

https://www.cohtitan.com/forum/index.php?topic=13560.0

 

It looks like the only ones that want this are those that have posted here and already contributed processing time. It's a bit disheartening.

 

I plan on doing the Patreon thing, but it will have to wait a couple of days.


Wanting access and having money are two separate things.

 

I can contribute a little... just waiting to see how the wind is blowing, so to speak. I am much better with a one-time fee than an ongoing one.

 

What would be the possibility of importing all the forum data into a blank site? Or, once processed, could we make it downloadable as a torrent? Of course online is best; I just want to think through every option.

** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB **


** Corsair Voyager a1600 **


In the meantime, couldn't we just keep working on the crunching? Albeit at a slower pace than you wanted?

OG Server: Pinnacle  <||>  Current Primary Server: Torchbearer  ||  Also found on the others if desired  <||> Generally Inactive


Installing CoX:  Windows  ||  MacOS  ||  MacOS for M1  <||>  Migrating Data from an Older Installation


Clubs: Mid's Hero Designer  ||  PC Builders  ||  HC Wiki  ||  Jerk Hackers


Old Forums  <||>  Titan Network  <||>  Heroica! (by @Shenanigunner)

 


No, we really can't. The database size has gotten so massive for my host now that every attempt to add a new record fails due to timeout. We're stuck without a new host. But I do have another idea, though it'll probably take me a few weeks to put together.

I'm out.

  • 3 weeks later

So, a little tiny update here... remember my spider that I had running? Well... I never stopped it.

It just finished. Here's a preview:

image.thumb.png.de1a5d432f1a0185d0b39381bc88981b.png

My plan now is to upload these to my website, and then get Google to just index them, and perhaps add a header page that you can start from that has a Google search bar at the top of it, showing the results in the bottom pane. It might be easier to manage than a full indexed database, by shunting that work and data storage off to Google, while I just host the files (which, by the way, if they turn out to be good files, I'll zip them up and provide them to anyone so that anyone can host a "mirror" of the old CoH forums).

This is what I meant when I said I was looking at alternate paths, since it looks like the Patreon is going to be a bust, and neither Homecoming nor the Titan Network seems interested in hosting.

I'm out.

That's great news, PK.

I was willing to support Patreon to the max for a few months if it looked like we'd reach our goals. Sadly that wasn't the case and since I'm currently looking for work I had to cancel.

Dislike certain sounds? Silence/Modify specific sounds. Looking for modified whole powerset sfx?

Check out Michiyo's modder or Solerverse's thread.  Got a punny character? You should share it.


Good job!


FYI, I'm still working on a new parser in case the spider's outputs end up being crap that Google can't index. Here's a preview of what I have so far. It seems to be working, but I haven't opened up the files to test the internal links. I'm trying to rewrite the internal links to point to files within the same directory so that, when all is said and done, these files could sit on someone's hard drive, and they could click around a locally stored copy of the forums instead of one hosted on a website if they wanted:

 

image.png.f6211c036504f79d1b0be2d8863b3535.png

I'm out.


As compressible as the data is, that could work (at least for me) too. Use Box and/or a torrent to share it out with indexing.


New update, I'm now working on making a generic "WARC Handler" library/DLL to make use of in a new version of the Project Spelunker Parser. What I was doing was WAY too custom and confusing to work with to be honest. Why am I doing this? A few reasons:

 

  1. Apparently nobody has EVER written a .NET WARC DLL... I have no idea why. I can find libraries in Java and Python, but nothing for the .NET platform, and I think one should exist. So I'll make that part and release it to the world, if anyone wants to use it.
  2. It will simplify my future code for Project Spelunker. I know it's going to be a one-time process, but I can imagine its application in future projects.
  3. It's a learning exercise for me to practice implementation by taking an ISO standard, reading it, attempting to comprehend it, and creating an implementation of it in the .NET environment.
  4. It's a fun challenge!

After I have this written, it should be MUCH easier working with my new "WarcRecord" object to process the files and spit them out onto the user's drives, so that we can continue down the path of getting this project done via web indexing rather than the previous database method that I was attempting. This will (hopefully) mean that I won't have to spring out of pocket $120 a month on a beefier server. Just let Google do the work.
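To give a rough idea of what a "WarcRecord" object involves, here is a minimal sketch in Python rather than .NET. It is not the parser described in this thread. It assumes an uncompressed WARC stream (real archives are often gzip-compressed per record), ignores header continuation lines, and expects the `Content-Length` header name exactly as written. The names `WarcRecord`, `read_warc_records` and `parse_warc_bytes` are illustrative only.

```python
from dataclasses import dataclass
from io import BytesIO
from typing import BinaryIO, Dict, Iterator, List


@dataclass
class WarcRecord:
    """One record: a version line, named headers, and the raw content block."""
    version: str
    headers: Dict[str, str]
    block: bytes


def read_warc_records(stream: BinaryIO) -> Iterator[WarcRecord]:
    """Yield records from an uncompressed WARC stream, one at a time."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers: Dict[str, str] = {}
        while True:
            header_line = stream.readline()
            if header_line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header section
            name, _, value = header_line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers["Content-Length"])
        block = stream.read(length)
        if len(block) != length:
            raise ValueError("truncated record block")
        yield WarcRecord(version, headers, block)


def parse_warc_bytes(data: bytes) -> List[WarcRecord]:
    """Convenience wrapper for data that is already in memory."""
    return list(read_warc_records(BytesIO(data)))
```

For `response` records, the block is the raw HTTP response (status line, HTTP headers, body), so a second parsing pass is still needed before a file can be written to disk.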

 

Now, why am I doing this? Because frankly, I don't trust my spider to have done a good enough job. I trust the Internet Archive's spider more than I trust my own.

 

So, long story short (too late!) the project is still being worked on, albeit slowly.

I'm out.

Possibly dumb question, so you're still talking about a local app that talks to a server OR a local app that later sends you results data?


The way I'm working it now, the program will "extract" the original directory structure and files from the WARC files, and recreate them on your hard drive. And then, instead of sending that as data to a database, it'll send the files to my server via FTP.

I'm out.

  • 1 year later

So, I'm about 20% done processing. I'd say the initial processing should be done in about a week, assuming nothing interrupts the process:

image.png.70f676edb1fa492861b986d061982be2.png

 

And with that 20% processed, I have this many files extracted from the WARCs so far (with my own WARC parser that I built from the ground up):

image.png.b19a5fb5fcd148e63e512424cebb2c0a.png

 

So, with 20% being 579,873 files taking up 39.745 GB, I estimate that the final total will be about 2,899,365 files taking up roughly 198.7 GB of space.

 

Now, that's just raw, unprocessed files, written directly from a bytestream. But they'll be out there in file format, and I'll stick them in a web directory and make it searchable by Google, so that's something.

 

The unfortunate thing is that since I started this in Debug mode, if anything "breaks", I have to start over. But I'm going to take that risk. It's been a few years, what's another couple of weeks?

 

The second step will be to ATTEMPT to replace all of the internal URLs (those pointing to boards.cityofheroes.com) with relative path URLs that link correctly to the appropriate internal documents. Getting the correct filename to replace each one with should be easy, because I had already written a "GenerateSanitizedFileName" method that turns all of the return URIs into the files' final filenames, so I can just pass the references to that. The trick will be for me to learn enough about the HTML Agility Pack to make the swap and replace happen, and I'm not sure how much processing time that whole process will take, given the number of files. While I consider Step 1 reasonable enough to just run on my PC, for Step 2 I might have to make a "mini-app" and actually try to resurrect "Project Spelunker" and ask others for their assistance with processing these files. We'll see how that goes.
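The link-rewriting step could look roughly like the sketch below. Several things here are assumptions: `generate_sanitized_file_name` is a hypothetical Python stand-in (the rules of the real "GenerateSanitizedFileName" method are not shown in this thread), and a regular expression is swapped in for a proper HTML parser such as the HTML Agility Pack, which would be more robust on messy markup.

```python
import re
from urllib.parse import urljoin, urlsplit

FORUM_HOST = "boards.cityofheroes.com"

# Matches href="..." or src='...' attributes and captures the URL inside.
ATTR_PATTERN = re.compile(
    r'(?P<attr>\b(?:href|src))=(?P<quote>["\'])(?P<url>.*?)(?P=quote)',
    re.IGNORECASE,
)


def generate_sanitized_file_name(url: str) -> str:
    """Hypothetical stand-in: map a URL to a flat, filesystem-safe filename."""
    parts = urlsplit(url)
    raw = parts.path.strip("/") or "index"
    if parts.query:
        raw += "_" + parts.query
    return re.sub(r"[^A-Za-z0-9._-]", "_", raw)


def rewrite_internal_links(html: str, page_url: str) -> str:
    """Point links to the old forum host at local files in the same directory."""

    def replace(match: "re.Match[str]") -> str:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, match.group("url"))
        parts = urlsplit(absolute)
        if parts.hostname != FORUM_HOST:
            return match.group(0)  # external link: leave untouched
        local_name = generate_sanitized_file_name(absolute)
        if parts.fragment:
            local_name += "#" + parts.fragment  # keep in-page anchors
        quote = match.group("quote")
        return f'{match.group("attr")}={quote}{local_name}{quote}'

    return ATTR_PATTERN.sub(replace, html)
```

Because every page ends up in one flat directory in this sketch, a bare filename is already a valid relative link. A nested directory layout would need a relative path computed per page instead.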

 

The third step will be to attempt to correct for any encoding errors. I already notice that a few of the image files won't work because of corruption (most work though), and a few of the HTML files appear to have a couple of extra digits at the very start of them. Not sure what that is, maybe a "checksum" value or something, but I'd like to strip those off if I can figure it out.
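One possible explanation for those leading digits, which this thread does not confirm, is HTTP chunked transfer encoding. When a response is archived exactly as it came over the wire, each chunk of the body is preceded by its size in hexadecimal on its own line. If that is what happened here, a decoder along these lines would strip the sizes. The function names are illustrative, and bodies that do not parse as chunked are returned unchanged.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; raise ValueError if it is not chunked."""
    decoded = bytearray()
    position = 0
    while True:
        line_end = body.find(b"\r\n", position)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # The size may be followed by ';extension', which is ignored here.
        size_field = body[position:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)  # raises ValueError if not hexadecimal
        if size < 0:
            raise ValueError("negative chunk size")
        position = line_end + 2
        if size == 0:
            return bytes(decoded)  # a zero-size chunk marks the end
        chunk = body[position:position + size]
        terminator = body[position + size:position + size + 2]
        if len(chunk) != size or terminator != b"\r\n":
            raise ValueError("malformed chunk")
        decoded += chunk
        position += size + 2


def strip_chunking_if_present(body: bytes) -> bytes:
    """Return the decoded body, or the original bytes if it was not chunked."""
    try:
        return decode_chunked(body)
    except ValueError:
        return body
```

A more reliable check than trial decoding would be to look for a `Transfer-Encoding: chunked` header in the archived HTTP response before stripping anything.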

 

We'll have to wait and see how this all pans out. Time will tell.

I'm out.

FYI, still chugging along. Not quite halfway there, but I would have expected any major errors to have happened by now:

 

image.png.1d91b81ee8fec4d111d7d2fe72b15bca.png

 

Now, cross my fingers that some app on my computer doesn't decide to perform an unattended "update and restart" without my permission!

I'm out.


 

Set the process to realtime? Manually pause system updates for like a week? Bribe it with candy?


So.... uh.... slight update. Remember how it looked like I was almost half done?

 

Yeah, it started slowing down... A LOT.

 

I couldn't figure out why, but then I remembered that I had pre-sorted the file list from smallest to largest.

 

So.... we're probably not even 25% done yet.... sorry. 😪
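A small sketch of why the estimate slipped: when a list is sorted smallest to largest, the fraction of files finished runs well ahead of the fraction of bytes finished. The sizes below are made up for illustration, and the sketch assumes that processing time is roughly proportional to bytes.

```python
from typing import Sequence


def progress_by_count(sizes: Sequence[int], done: int) -> float:
    """Fraction of files processed so far."""
    return done / len(sizes)


def progress_by_bytes(sizes: Sequence[int], done: int) -> float:
    """Fraction of total bytes processed so far (files are handled in list order)."""
    return sum(sizes[:done]) / sum(sizes)


# Hypothetical file sizes, already sorted smallest to largest.
sizes = sorted([1, 1, 2, 4, 8, 16, 32, 64])
files_done = 4

print(f"by file count: {progress_by_count(sizes, files_done):.1%}")
print(f"by bytes:      {progress_by_bytes(sizes, files_done):.1%}")
```

With these made-up numbers, half of the files account for only 8 of 128 size units, so a progress bar based on file count looks far more optimistic than one based on bytes.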

I'm out.

That's ok, historically, computers can't estimate time properly either. 😉

 


So.... good news/bad news.

 

 

The bad news is, yes, my system restarted for an automatic update. The good news is, a whole lot of files got extracted before that, so while I reconsider my code and recode it to make it a more sustainable process, I'm also in the process of copying what I HAVE extracted so far over to http://oldcohforums.cityofplayers.com/

 

So now you can at least get a "taste" of what I've "scavenged" so far. There are THOUSANDS of files, and frankly, my computer may not have the storage capacity to hold them all and keep processing in the same directory without breaking... so, I'm reconsidering how I process these. I may have to make a subfolder for each one of the WARC files, so that everything's not just in one GIANT folder... I'll have to think on this for a bit. I've also got to figure out if Google can just start indexing these things, or if I have to do anything special to make it index them. Otherwise, it's just a bunch of loose files sitting in a directory, and finding the content that you want would be like finding a needle in a haystack!
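The subfolder-per-WARC idea could be sketched like this. The WARC filename and the `extracted` root directory are made-up examples, and `output_path` is an illustrative name, not part of the actual parser.

```python
from pathlib import PurePosixPath


def output_path(root: str, warc_name: str, file_name: str) -> PurePosixPath:
    """Place each extracted file in a subfolder named after its source WARC."""
    stem = warc_name
    # Drop a trailing ".gz" and then ".warc" so the folder name stays short.
    for suffix in (".gz", ".warc"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
    return PurePosixPath(root) / stem / file_name


print(output_path("extracted", "boards-00042.warc.gz", "showthread.php_t_123"))
```

One trade-off of this layout is that two pages from different WARC files no longer share a directory, so any rewritten relative links between them would need a `../other-folder/` prefix.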

 

More to come... Soon™.

I'm out.

If you're looking for a little extra storage...  O>:p


I might have to consider something like that @WanderingAries, my system seems to be super slow for what I'm trying to accomplish.

 

However, I think it may also be more efficient to extract the files, transfer them to the server, and then delete the original source files when there's a successful transfer. That way, there are fewer files just sitting around doing nothing. I may also "crowdsource" this like I tried to do last time with the database solution that failed.
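A minimal sketch of that transfer-then-delete loop, under a few assumptions: the files being deleted are the local extracted copies, the upload goes over FTP as mentioned earlier in the thread, and the function names are illustrative. The upload step is passed in as a callable so that the loop can be tried without a real server.

```python
from ftplib import FTP
from pathlib import Path
from typing import Callable, Iterable, List


def make_ftp_upload(ftp: FTP) -> Callable[[Path], None]:
    """Build an upload function that stores a file on an already-connected FTP server."""

    def upload(path: Path) -> None:
        with path.open("rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)

    return upload


def transfer_and_clean(files: Iterable[Path], upload: Callable[[Path], None]) -> List[Path]:
    """Upload each file and delete the local copy only if the upload succeeded.

    Returns the files whose upload failed, so they can be retried later.
    """
    failed: List[Path] = []
    for path in files:
        try:
            upload(path)
        except Exception:
            failed.append(path)  # keep the local copy for a retry
            continue
        path.unlink()  # safe to remove: the server now has it
    return failed
```

Deleting only after a confirmed upload means an interrupted run (for example, an unexpected restart) loses nothing: whatever is still on disk simply has not been transferred yet.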

 

By the way, in the meantime, I'm still uploading what I have so far, over a million files and still copying:

 

image.png.600575ee0ef0e6fc97ef2a02fbcf27cb.png

I'm out.