mechahamham Posted September 15

Someone mentioned LLM scrapers already, which are typically HORRIBLE netizens, but there are a few other kinds of automated queries that are better behaved.

Search Indexers typically respect robots.txt, which is a standard for telling search engines which pages you do and do not want indexed so people can find them with web searches. For example, static pages or older discussions are good to index, while some generated pages or a brand new discussion that hasn't been edited yet are poor subjects for indexing.

Cache Servers are another source of 'anonymous' web traffic. These LOOK LIKE they're sucking up all of a website's valuable resources, but what they actually do is fetch those resources once and then serve your content again as often as necessary to the browsers that connect to the internet through them. This is another one of the reasons that it's REALLY IMPORTANT to properly configure your robots.txt, use CDN servers, or otherwise separate your generated content and static content when making websites. If you do so, cache servers can save you a ton of money, even if they belong to someone else. On database-driven forum sites like this one, they mostly scan for images and other static Binary Large Objects, sometimes called 'blobs': files that aren't likely to change. (Big providers of BLOBs like Netflix and Amazon Video often deploy custom cache servers to ISPs' and telecoms' local offices in order to reduce costs and increase reliability for everyone involved.)

An important resource for us personally, the City of Heroes community, is the Wayback Machine archive provided by Archive.org. https://web.archive.org/web/20120701000000*/http://boards.cityofheroes.com/ is an interface that Archive.org has into archived pages from the old City of Heroes forums. There are other archives, including an SQL dump of the whole damn forum, floating around.
Without archivers, all that stuff would just be straight up lost to the sands of time. So while I wish hordes of angry Canadian geese into the homes, businesses, and bedrooms of for-profit LLM scrapers, some guest use of the site is not only good, but ABSOLUTELY NECESSARY!
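To make the "separate your generated content and static content" advice a bit more concrete, here is a minimal sketch of how a site might pick a Cache-Control header per request path. The extension list, the one-day max-age, and the function name `cache_control_for` are all assumptions made up for illustration, not anything this forum's software is known to do. It also expects a bare path with no query string.

```python
from pathlib import PurePosixPath

# Hypothetical list of "blob" file types that rarely change (an assumption).
STATIC_EXTENSIONS = {".png", ".jpg", ".gif", ".css", ".js", ".woff2"}


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for a request path (no query string)."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in STATIC_EXTENSIONS:
        # Shared caches (CDNs, ISP proxies) may store and reuse this for a day.
        return "public, max-age=86400"
    # Generated pages: a cache may keep a copy but must revalidate it
    # with the origin server before reusing it.
    return "no-cache"


if __name__ == "__main__":
    for example in ("/uploads/avatar.png", "/forums/topic/123"):
        print(example, "->", cache_control_for(example))
```

With headers along these lines, a cache server can answer repeat requests for images by itself while still checking back with the origin for discussion pages that change.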
Lunar Ronin Posted September 15

2 hours ago, mechahamham said: "Someone mentioned LLM scrapers already, which are typically HORRIBLE netizens, but there are a few other kinds of automated queries that are better behaved."

Some LLM scrapers are better than others, theoretically. OpenAI and Anthropic both say now that they'll respect robots.txt, but they also keep changing their crawlers' names. Perplexity will just outright ignore robots.txt. I try to keep my robots.txt updated on all three websites I run to block all known LLM crawlers, but who knows how many (if any) will actually respect it.
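For anyone who hasn't written one, here is a rough sketch of what a robots.txt like that might look like, checked with Python's standard `urllib.robotparser`. The crawler names `GPTBot` and `ClaudeBot` are used only as examples and may well be out of date given how often the names change. The `example.com` URLs, the `/search/` rule, and `ExampleSearchBot` are made-up placeholders. The parser only shows what a crawler that honors the file would conclude; it cannot make anyone obey.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block two example LLM crawlers entirely and keep
# everyone else out of a hypothetical generated-search path.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /search/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text from memory, with no network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    checks = [
        ("GPTBot", "https://example.com/topic/1"),
        ("ExampleSearchBot", "https://example.com/topic/1"),
        ("ExampleSearchBot", "https://example.com/search/heroes"),
    ]
    for agent, url in checks:
        print(agent, url, "allowed" if parser.can_fetch(agent, url) else "blocked")
```

A well-behaved indexer would skip `/search/` and crawl the topics, while a compliant LLM crawler matching one of the named agents would stay out altogether.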