Zep Posted June 3, 2019 Author Posted June 3, 2019 I can, and will, thank you just for trying. I appreciate it and I am sure others will as well. ** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB ** ** Corsair Voyager a1600 **
Vanden Posted June 9, 2019 Posted June 9, 2019 So PK, any news about your efforts here? Closer to being finished? Farther? A Cheat Sheet for efficient Endurance Recovery slotting Invention Set Designer Tool Spreadsheet with every Ancillary Power Pool
_NOPE_ Posted June 10, 2019 Posted June 10, 2019 A bit farther, sadly... some of the XML files from the spider got corrupted, and had to be deleted. Here's the latest: (9310/61242) 15.2% But at least we know the start and stop of it... I think. We're in this for the long haul. I'm out.
jack_nomind Posted June 10, 2019 Posted June 10, 2019 But at least we know the start and stop of it... I think. We're in this for the long haul. PK, there's an entire back-up of the public CoH forums in static form. Would it be easier to work with that -- or just instantiate them into a new BB? No-Set Builds: Tanker Scrapper Brute Stalker
_NOPE_ Posted June 10, 2019 Posted June 10, 2019 But at least we know the start and stop of it... I think. We're in this for the long haul. PK, there's an entire back-up of the public CoH forums in static form. Would it be easier to work with that -- or just instantiate them into a new BB? I was not aware that that existed! Thanks for the link! I'll get to looking at that as soon as I get some time to, that may be the best way going forward, thank you! I'm out.
Adeon Hawkwood Posted June 10, 2019 Posted June 10, 2019 But at least we know the start and stop of it... I think. We're in this for the long haul. PK, there's an entire back-up of the public CoH forums in static form. Would it be easier to work with that -- or just instantiate them into a new BB? Dang, 219Gb. Defender Smash!
_NOPE_ Posted June 10, 2019 Posted June 10, 2019 Yeah... but it looks like there's a torrent.. and I DO have a terabyte hard drive... so it shouldn't be an issue... I bet it'll download faster than the spider can crawl the Wayback Engine! I'm out.
The Fifth Horseman Posted June 10, 2019 Posted June 10, 2019 Would it be easier to work with thatIt would be easier to work with that, yes. or just instantiate them into a new BB?About as plausible as reconstructing an anatomically correct cow from a wagon of ground beef. :P
jack_nomind Posted June 10, 2019 Posted June 10, 2019 About as plausible as reconstructing an anatomically correct cow from a wagon of ground beef. :P so... regex? No-Set Builds: Tanker Scrapper Brute Stalker
Marshal_General Posted June 11, 2019 Posted June 11, 2019 I have been trying to find the old WOTC D&D 5e forums using the wayback machine. So far no luck Any tips?
_NOPE_ Posted June 12, 2019 Posted June 12, 2019 I've got the sucker downloaded it, decompressing the archives within the archives now.... ETA for just decompression.... 9 hours. But hey, at least I have the data. I've got about 487 GB free on my terabyte hard drive, and with it looking to have about a 20% compression rate, I should be good to go to get the whole thing uncompressed. Then I have to learn how the WARC file format works, and how to work with it... Then I have to look at the data and make a program to parse it and upsert it into a database... no ETA. I'm out.
Zep Posted June 14, 2019 Author Posted June 14, 2019 Thanks! ** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB ** ** Corsair Voyager a1600 **
Vanden Posted June 15, 2019 Posted June 15, 2019 I've got the sucker downloaded it, decompressing the archives within the archives now.... ETA for just decompression.... 9 hours. But hey, at least I have the data. I've got about 487 GB free on my terabyte hard drive, and with it looking to have about a 20% compression rate, I should be good to go to get the whole thing uncompressed. Then I have to learn how the WARC file format works, and how to work with it... Then I have to look at the data and make a program to parse it and upsert it into a database... no ETA. Is that stuff you'd have had to do with the wayback machine download, too? A Cheat Sheet for efficient Endurance Recovery slotting Invention Set Designer Tool Spreadsheet with every Ancillary Power Pool
_NOPE_ Posted June 15, 2019 Posted June 15, 2019 Nope, because with the way back machine scraping, I'd be working with the RAW files instead of the Internet Archives' WARC format. I still have the scraping happening, but it will probably be slower. I'm out.
Vanden Posted June 15, 2019 Posted June 15, 2019 Also, does the Internet Archive version have any hotlinked images that were being hosted on external sites? The wayback machine version does, that’s how I got my avatar back. Thanks for doing this PK. I’m sure it’ll be a great resource if you can get it working. A Cheat Sheet for efficient Endurance Recovery slotting Invention Set Designer Tool Spreadsheet with every Ancillary Power Pool
_NOPE_ Posted June 15, 2019 Posted June 15, 2019 I haven't had a chance to crack the files open yet, so I can't say at this time. I'm out.
_NOPE_ Posted June 17, 2019 Posted June 17, 2019 Okay, so I had a few minutes today to look at the files. It looks like they are straight up flat text files, that have a header that looks something like this: WARC/1.0 WARC-Type: response WARC-Record-ID: <urn:uuid:68cc19fb-7c62-4272-bc46-cba8947d160d> WARC-Warcinfo-ID: <urn:uuid:0e8969e5-67a8-4bb8-b607-e79a4836ebbb> WARC-Concurrent-To: <urn:uuid:16c1fe48-6d6e-43be-8b13-87be55268e5e> WARC-Target-URI: http://boards.cityofheroes.com/showthread.php?p=3389674 WARC-Date: 2012-09-05T03:16:33Z WARC-IP-Address: 64.25.35.208 WARC-Block-Digest: sha1:LXWNSCSQQ556BB2MHJKI4RM4QS7NZQWA WARC-Payload-Digest: sha1:VMESDRD3S6VLSNDJG76JRJAGAJW3K5H2 Content-Type: application/http;msgtype=response Content-Length: 159891 HTTP/1.1 200 OK Date: Wed, 05 Sep 2012 03:16:33 GMT Set-Cookie: bbsessionhash=ff91b849f3c8e8e8e9441213ec5aa751; path=/; HttpOnly Set-Cookie: bblastvisit=1346814993; expires=Thu, 05-Sep-2013 03:16:33 GMT; path=/ Set-Cookie: bblastactivity=0; expires=Thu, 05-Sep-2013 03:16:33 GMT; path=/ Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0 Pragma: no-cache X-UA-Compatible: IE=7 Expires: Thu, 19 Nov 1981 08:52:00 GMT Set-Cookie: bbreferrerid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bbuserid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bbpassword=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bblastvisit=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bblastactivity=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bbthreadedmode=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bbsessionhash=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bbstyleid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Set-Cookie: bblanguageid=deleted; expires=Tue, 06-Sep-2011 03:16:32 GMT; path=/ Keep-Alive: timeout=2, max=44 Connection: Keep-Alive Content-Type: text/html; charset=ISO-8859-1 Set-Cookie: TS32e4e1=af9d022137a22c65678290c519cf5e4595d529eff757377d5046c3ca; Path=/ Transfer-Encoding: chunked 74 And then after that cryptic number 74 (it's a random two character string, I'm guessing some sort of checksum), then you get the actual CONTENT in raw text format. Usually, the contents of an entire HTML page. Sometimes, it's the "gobledygook" of the raw code for an image, for example. I don't think we want or care about those images, right? Or is that a bad assumption to make? I can parse the text, and scrape the actual webpages for the HTML values, and shove them into a database (along with the unique URL as an identifier) so that it's searchable, either by text, or by the URL string. Then, after you search, you can be presented with a page like my Dev Tracker, where I "recreate" the HTML in a sub-window. The caveat to this is that it will most likely be without ANY images in it, since the images were NOT stored in the WARC files, it seems. The URL will be pointing to probably broken images. But all of the HTML should still be good, so the words will be readable and have pretty much the same formatting you would have seen on the original forums. What say you? I'm out.
Zep Posted June 17, 2019 Author Posted June 17, 2019 Works for me.... I seem to recall having to host images to put on the old forums, but it has been some time I may not be remembering correctly. Will your database be able to search by username? For example _Zep_ ? :) Thank you again! -Zep ** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB ** ** Corsair Voyager a1600 **
_NOPE_ Posted June 17, 2019 Posted June 17, 2019 I don't see why not... Because the format in the pages is consistent across all of the pages. I just have to think of the right schema to use. And it will take a LOT of processing time to get these flat files shoved into whatever schema I decide on. I'm out.
Vanden Posted June 17, 2019 Posted June 17, 2019 If someone posted an image, that's content of their post. And the goal is to be able to see people's posts, right? I think I'd like to be able to see images. A Cheat Sheet for efficient Endurance Recovery slotting Invention Set Designer Tool Spreadsheet with every Ancillary Power Pool
Zep Posted June 18, 2019 Author Posted June 18, 2019 I don't see why not... Because the format in the pages is consistent across all of the pages. I just have to think of the right schema to use. And it will take a LOT of processing time to get these flat files shoved into whatever schema I decide on. Dont know your specs, am willing to offer some over night time on mine. I have an x4 nvme with about 700gb free for fast IO. I am guessing the processing is mostly single core work though. Not this machines best point. ** Asus TUF x670E Gaming, Ryzen 7950x, AIO Corsair H150i Elite, TridentZ 192GB DDR5 6400, Sapphire 7900XTX, 48" 4K Samsung 3d & 56" 4k UHD, NVME Sabrent Rocket 2TB, MP600 Pro 8tb, MP700 2 TB. HDD Seagate 12TB ** ** Corsair Voyager a1600 **
_NOPE_ Posted June 18, 2019 Posted June 18, 2019 If someone posted an image, that's content of their post. And the goal is to be able to see people's posts, right? I think I'd like to be able to see images. You misunderstand. These are not images stored on the server within people's posts. Those were always externally hosted. These were images stored on the server for official CoH website images. I don't think it's worth the time to figure out how to make those work properly, when they aren't largely relevant to the forums, except in the cases of things like "online/offline" bubbles and stuff like that, which I see as largely superfluous anyway. Basically, post content was either stored as formatted text in the HTML, or it was referenced via hyperlink to offsite resources. Those resources may or may not still exist. If they still exist, they will show. If they don't, they won't. I mean, I COULD try to take the URL for the resources and try to plug THOSE URLs back into the WayBack Machine and try to resolve to good data, but is it worth it? It'll really slow down the entire process, all around. I'm out.
_NOPE_ Posted June 18, 2019 Posted June 18, 2019 I don't see why not... Because the format in the pages is consistent across all of the pages. I just have to think of the right schema to use. And it will take a LOT of processing time to get these flat files shoved into whatever schema I decide on. Dont know your specs, am willing to offer some over night time on mine. I have an x4 nvme with about 700gb free for fast IO. I am guessing the processing is mostly single core work though. Not this machines best point. Yeah, if I had the knowledge to setup "distributed computing", sure I'd bite on that. But I'm not that knowledgeable about that subject, sorry, and I don't think this project is worth learning about it to set that up. Once I get the parsing into the data buckets that I have setup, it'll be largely an automated process, and won't really require any intervention at all. It'll be basically a one-time process. After that, the data's in a format that can be queried, so we'll all be good. Right now, I'm in the process of creating the parsing of the WARC files. I've already figured out how I'm going to process the header elements, what I'll keep and what I'll toss, now I have to figure out how to process the actual page content itself into a form that's searchable. So it's a LOT of.... stare at a forum page and look for a data bit that I want, go to the source of the page, and find that data bit in code... then figure out how I'll grab that data bit via the HTML Agility pack in a way that's UNIQUE to that data bit that won't give me false positives for other elements... That's where I'm at right now. But, hey, it's on the move. I'll let you know when I think I have the actual processing started once I think I'm getting good data in my temp tables. Then, once I have good data in my temp tables, it's a matter of pushing that data to actual real SQL tables, and then setting up a website/front end for y'all to query with. So I guess the next question for y'all is - how do you want to be able to query? I'm assuming: Threads/Posts by UserName/ID Search by Partial Subject Line Threads/Posts by Date Words inside of posts (might be REALLY slow to do this! Just a warning!) Seach by URL - would be SUPER easy, but what would that get you that you can't by using the actual Wayback Machine? Maybe this isn't necessary Anything else? I'm out.
_NOPE_ Posted June 19, 2019 Posted June 19, 2019 Progress Update: I've got the WARC files parsed to grab the data that I need in HTML format. I'm currently writing code now to go through that HTML data and parse it further, ripping from it PostIDs, UserIDs, Post Content HTML data, and so on and so on. Here's my schema: DataHolder p = new DataHolder(); p.Pages = new DataTable(nameof(p.Pages)); p.Pages.Columns.AddRange(new DataColumn[] { new DataColumn("URL", typeof(string)), new DataColumn("DateSaved", typeof(DateTime)), new DataColumn("Content", typeof(string)) }); p.Threads = new DataTable(nameof(p.Threads)); p.Threads.Columns.AddRange(new DataColumn[] { new DataColumn("ThreadID", typeof(string)), new DataColumn("DateSaved", typeof(DateTime)), new DataColumn("Content", typeof(string)) }); p.Posts = new DataTable(nameof(p.Posts)); p.Posts.Columns.AddRange(new DataColumn[] { new DataColumn("ThreadID", typeof(int)), new DataColumn("PostID", typeof(int)), new DataColumn("UserID", typeof(int)), new DataColumn("DatePosted", typeof(DateTime)), new DataColumn("Content", typeof(string)) }); p.Users = new DataTable(nameof(p.Users)); p.Users.Columns.AddRange(new DataColumn[] { new DataColumn("UserID", typeof(int)), new DataColumn("UserName", typeof(string)), new DataColumn("UserPageContent", typeof(string)) }); DataSet ds = new DataSet("CoH Forum Data"); ds.Tables.AddRange(new DataTable[] { p.Pages, p.Threads, p.Posts, p.Users}); I've just created the AddPost method that'll add a new Post to the Posts table, and that works successfully (took a bit to parse all of the proper databits in a valid way, there were some... irregularities in what appeared onscreen because of how vBulletin did things). Now I need to work on: AddThread - a collection of posts that are linked by the same Thread ID, sorted Ascnendingly by Post DateTime AddUser - a method so that when I find a "user page" inside of the WARC file, parsing that and adding it as another unique entry in the Users table, so that it's searchable After that, I'll have a fully formed DataSet. Then, once I have the DataSet... Next Steps after that: Setup a SQL server. Shove the data from the DataTables into the SQL server Decide on a user interface - web based? Application based? I'm more familiar with application development, and can do WAY more cool stuff in that world, but I know everyone's moving to web based tools, and it'd be good to learn... Create the Interface Publish the program/website for all to enjoy Now, EACH one of those steps has multiple steps underneath it, I just thought you'd like to know the process thus far. Having access to the files on my hard drive right now is way nicer than just sitting there watching the spider scrape (which it's still doing, by the way, just to have an alternative format someday possibly). I'm out.
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now