Jump to content

Recommended Posts

Posted

I can, and will, thank you just for trying. I appreciate it and I am sure others will as well.

** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB **


** Corsair Voyager a1600 **

Posted

A bit farther, sadly... some of the XML files from the spider got corrupted, and had to be deleted. Here's the latest:

 

(9310/61242) 15.2%

 

But at least we know the start and stop of it... I think. We're in this for the long haul.

I'm out.
Posted

But at least we know the start and stop of it... I think. We're in this for the long haul.

 

PK, there's an entire back-up of the public CoH forums in static form.  Would it be easier to work with that -- or just instantiate them into a new BB?

 

I was not aware that that existed! Thanks for the link! I'll get to looking at that as soon as I get some time to, that may be the best way going forward, thank you! MmcrG7p.gif

I'm out.
Posted

But at least we know the start and stop of it... I think. We're in this for the long haul.

 

PK, there's an entire back-up of the public CoH forums in static form.  Would it be easier to work with that -- or just instantiate them into a new BB?

Dang, 219Gb.

Defender Smash!

Posted

Yeah... but it looks like there's a torrent.. and I DO have a terabyte hard drive... so it shouldn't be an issue... I bet it'll download faster than the spider can crawl the Wayback Engine!

I'm out.
Posted

I've got the sucker downloaded it, decompressing the archives within the archives now.... ETA for just decompression.... 9 hours.

 

But hey, at least I have the data. I've got about 487 GB free on my terabyte hard drive, and with it looking to have about a 20% compression rate, I should be good to go to get the whole thing uncompressed.

 

Then I have to learn how the WARC file format works, and how to work with it... Then I have to look at the data and make a program to parse it and upsert it into a database... no ETA.

I'm out.
Posted

Thanks!

** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB **


** Corsair Voyager a1600 **

Posted

I've got the sucker downloaded it, decompressing the archives within the archives now.... ETA for just decompression.... 9 hours.

 

But hey, at least I have the data. I've got about 487 GB free on my terabyte hard drive, and with it looking to have about a 20% compression rate, I should be good to go to get the whole thing uncompressed.

 

Then I have to learn how the WARC file format works, and how to work with it... Then I have to look at the data and make a program to parse it and upsert it into a database... no ETA.

 

Is that stuff you'd have had to do with the wayback machine download, too?

Posted

Nope, because with the way back machine scraping, I'd be working with the RAW files instead of the Internet Archives' WARC format. I still have the scraping happening, but it will probably be slower.

I'm out.
Posted

Okay, so I had a few minutes today to look at the files. It looks like they are straight up flat text files, that have a header that looks something like this:

 

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:68cc19fb-7c62-4272-bc46-cba8947d160d>
WARC-Warcinfo-ID: <urn:uuid:0e8969e5-67a8-4bb8-b607-e79a4836ebbb>
WARC-Concurrent-To: <urn:uuid:16c1fe48-6d6e-43be-8b13-87be55268e5e>
WARC-Target-URI: http://boards.cityofheroes.com/showthread.php?p=3389674
WARC-Date: 2012-09-05T03:16:33Z
WARC-IP-Address: 64.25.35.208
WARC-Block-Digest: sha1:LXWNSCSQQ556BB2MHJKI4RM4QS7NZQWA
WARC-Payload-Digest: sha1:VMESDRD3S6VLSNDJG76JRJAGAJW3K5H2
Content-Type: application/http;msgtype=response
Content-Length: 159891

HTTP/1.1 200 OK
Date: Wed, 05 Sep 2012 03:16:33 GMT
Set-Cookie: bbsessionhash=ff91b849f3c8e8e8e9441213ec5aa751; path=/; HttpOnly
Set-Cookie: bblastvisit=1346814993; expires=Thu, 05-Sep-2013 03:16:33 GMT; path=/
Set-Cookie: bblastactivity=0; expires=Thu, 05-Sep-2013 03:16:33 GMT; path=/
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-UA-Compatible: IE=7
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Set-Cookie: bbreferrerid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bbuserid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bbpassword=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bblastvisit=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bblastactivity=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bbthreadedmode=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bbsessionhash=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bbstyleid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Set-Cookie: bblanguageid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/
Keep-Alive: timeout=2, max=44
Connection: Keep-Alive
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: TS32e4e1=af9d022137a22c65678290c519cf5e4595d529eff757377d5046c3ca; Path=/
Transfer-Encoding: chunked

74

 

And then after that cryptic number 74 (it's a random two character string, I'm guessing some sort of checksum), then you get the actual CONTENT in raw text format. Usually, the contents of an entire HTML page. Sometimes, it's the "gobledygook" of the raw code for an image, for example. I don't think we want or care about those images, right? Or is that a bad assumption to make?

 

I can parse the text, and scrape the actual webpages for the HTML values, and shove them into a database (along with the unique URL as an identifier) so that it's searchable, either by text, or by the URL string. Then, after you search, you can be presented with a page like my Dev Tracker, where I "recreate" the HTML in a sub-window. The caveat to this is that it will most likely be without ANY images in it, since the images were NOT stored in the WARC files, it seems. The URL will be pointing to probably broken images. But all of the HTML should still be good, so the words will be readable and have pretty much the same formatting you would have seen on the original forums.

 

What say you?

I'm out.
Posted

Works for me.... I seem to recall having to host images to put on the old forums, but it has been some time I may not be remembering correctly.

 

Will your database be able to search by username? For example _Zep_ ? :)

 

Thank you again!

 

-Zep

** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB **


** Corsair Voyager a1600 **

Posted

I don't see why not... Because the format in the pages is consistent across all of the pages. I just have to think of the right schema to use. And it will take a LOT of processing time to get these flat files shoved into whatever schema I decide on.

I'm out.
Posted

I don't see why not... Because the format in the pages is consistent across all of the pages. I just have to think of the right schema to use. And it will take a LOT of processing time to get these flat files shoved into whatever schema I decide on.

 

Dont know your specs, am willing to offer some over night time on mine.

 

I have an x4 nvme with about 700gb free for fast IO. I am guessing the processing is mostly single core work though. Not this machines best point.

** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB **


** Corsair Voyager a1600 **

Posted

If someone posted an image, that's content of their post. And the goal is to be able to see people's posts, right? I think I'd like to be able to see images.

 

You misunderstand. These are not images stored on the server within people's posts. Those were always externally hosted. These were images stored on the server for official CoH website images. I don't think it's worth the time to figure out how to make those work properly, when they aren't largely relevant to the forums, except in the cases of things like "online/offline" bubbles and stuff like that, which I see as largely superfluous anyway.

 

Basically, post content was either stored as formatted text in the HTML, or it was referenced via hyperlink to offsite resources. Those resources may or may not still exist. If they still exist, they will show. If they don't, they won't. I mean, I COULD try to take the URL for the resources and try to plug THOSE URLs back into the WayBack Machine and try to resolve to good data, but is it worth it? It'll really slow down the entire process, all around.

I'm out.
Posted

I don't see why not... Because the format in the pages is consistent across all of the pages. I just have to think of the right schema to use. And it will take a LOT of processing time to get these flat files shoved into whatever schema I decide on.

 

Dont know your specs, am willing to offer some over night time on mine.

 

I have an x4 nvme with about 700gb free for fast IO. I am guessing the processing is mostly single core work though. Not this machines best point.

 

Yeah, if I had the knowledge to setup "distributed computing", sure I'd bite on that. But I'm not that knowledgeable about that subject, sorry, and I don't think this project is worth learning about it to set that up. Once I get the parsing into the data buckets that I have setup, it'll be largely an automated process, and won't really require any intervention at all. It'll be basically a one-time process. After that, the data's in a format that can be queried, so we'll all be good.

 

Right now, I'm in the process of creating the parsing of the WARC files. I've already figured out how I'm going to process the header elements, what I'll keep and what I'll toss, now I have to figure out how to process the actual page content itself into a form that's searchable. So it's a LOT of.... stare at a forum page and look for a data bit that I want, go to the source of the page, and find that data bit in code... then figure out how I'll grab that data bit via the HTML Agility pack in a way that's UNIQUE to that data bit that won't give me false positives for other elements...

 

That's where I'm at right now. But, hey, it's on the move. I'll let you know when I think I have the actual processing started once I think I'm getting good data in my temp tables. Then, once I have good data in my temp tables, it's a matter of pushing that data to actual real SQL tables, and then setting up a website/front end for y'all to query with.

 

So I guess the next question for y'all is - how do you want to be able to query? I'm assuming:

 

  • Threads/Posts by UserName/ID
  • Search by Partial Subject Line
  • Threads/Posts by Date
  • Words inside of posts (might be REALLY slow to do this! Just a warning!)
  • Seach by URL - would be SUPER easy, but what would that get you that you can't by using the actual Wayback Machine? Maybe this isn't necessary

 

Anything else?

I'm out.
Posted

Progress Update:

 

I've got the WARC files parsed to grab the data that I need in HTML format. I'm currently writing code now to go through that HTML data and parse it further, ripping from it PostIDs, UserIDs, Post Content HTML data, and so on and so on.

 

Here's my schema:

            DataHolder p = new DataHolder();

            p.Pages = new DataTable(nameof(p.Pages));
            p.Pages.Columns.AddRange(new DataColumn[] {
                new DataColumn("URL", typeof(string)),
                new DataColumn("DateSaved", typeof(DateTime)),
                new DataColumn("Content", typeof(string))
            });
            p.Threads = new DataTable(nameof(p.Threads));
            p.Threads.Columns.AddRange(new DataColumn[] {
                new DataColumn("ThreadID", typeof(string)),
                new DataColumn("DateSaved", typeof(DateTime)),
                new DataColumn("Content", typeof(string))
            });
            p.Posts = new DataTable(nameof(p.Posts));
            p.Posts.Columns.AddRange(new DataColumn[] {
                new DataColumn("ThreadID", typeof(int)),
                new DataColumn("PostID", typeof(int)),
                new DataColumn("UserID", typeof(int)),
                new DataColumn("DatePosted", typeof(DateTime)),
                new DataColumn("Content", typeof(string))
            });
            p.Users = new DataTable(nameof(p.Users));
            p.Users.Columns.AddRange(new DataColumn[] {
                new DataColumn("UserID", typeof(int)),
                new DataColumn("UserName", typeof(string)),
                new DataColumn("UserPageContent", typeof(string))
            });
            DataSet ds = new DataSet("CoH Forum Data");
            ds.Tables.AddRange(new DataTable[] { p.Pages, p.Threads, p.Posts, p.Users});

 

I've just created the AddPost method that'll add a new Post to the Posts table, and that works successfully (took a bit to parse all of the proper databits in a valid way, there were some... irregularities in what appeared onscreen because of how vBulletin did things). Now I need to work on:

 

  • AddThread - a collection of posts that are linked by the same Thread ID, sorted Ascnendingly by Post DateTime
  • AddUser - a method so that when I find a "user page" inside of the WARC file, parsing that and adding it as another unique entry in the Users table, so that it's searchable

 

After that, I'll have a fully formed DataSet. Then, once I have the DataSet... Next Steps after that:

  • Setup a SQL server.
  • Shove the data from the DataTables into the SQL server
  • Decide on a user interface - web based? Application based? I'm more familiar with application development, and can do WAY more cool stuff in that world, but I know everyone's moving to web based tools, and it'd be good to learn...
  • Create the Interface
  • Publish the program/website for all to enjoy

 

Now, EACH one of those steps has multiple steps underneath it, I just thought you'd like to know the process thus far. Having access to the files on my hard drive right now is way nicer than just sitting there watching the spider scrape (which it's still doing, by the way, just to have an alternative format someday possibly).

I'm out.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...