AI scrapers request commented scripts
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM.
Fun to see practical applications of interesting research[1]
It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".
The title is confusing, should be "commented-out".
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented-out URLs would have been added to my queue.
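Roughly that kind of scan, as a sketch (the pattern and the page here are made up): a plain text pass has no notion of HTML comments, so commented-out URLs match just like live ones.

```python
import re

# A hypothetical page with one live script and one commented-out script.
html = """
<script src="https://example.com/js/app.js"></script>
<!-- <script src="https://example.com/js/old-build.js"></script> -->
"""

# A text-level scan has no notion of HTML comments, so both URLs match
# and both would land in the crawl queue.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")
print(URL_RE.findall(html))
# ['https://example.com/js/app.js', 'https://example.com/js/old-build.js']
```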
Sounds like you should give the bots exactly what they want... a 512MB file of random data.
Two thoughts here on poisoning the data collected by unwanted LLM training scrapers:
1) A coordinated effort among different sites will have a much greater chance of poisoning a model's training data, so long as they can avoid any post-scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of poisoning here. If the poisoned content is something that has already been aggressively litigated over, the copyright owner might go after the scrapers once the model can be shown to contain that banned data. This may open site owners up to the legal risk of distributing that content themselves, though… not sure. A cooperative effort with a copyright holder might sidestep that risk, but they would have to have the means and the will to litigate.
> most likely trying to non-consensually collect content for training LLMs
No, it's just background internet scanning noise
Well, if they're going to request commented-out scripts, serve them up some very large scripts…
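Something along these lines, as a rough sketch using only the standard library (the trap path is made up): stream incompressible random bytes so the response is large on the wire but cheap to produce.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PATH = "/js/legacy.js"                  # made-up commented-out script URL
CHUNK, TOTAL = 64 * 1024, 512 * 1024 * 1024  # stream 512 MB of junk

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != TRAP_PATH:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/javascript")
        self.end_headers()
        sent = 0
        try:
            while sent < TOTAL:
                # Random bytes are incompressible, so the scraper pays the
                # full transfer cost while we generate them on the fly.
                self.wfile.write(os.urandom(CHUNK))
                sent += CHUNK
        except (BrokenPipeError, ConnectionResetError):
            pass  # the client gave up early, which is the point

if __name__ == "__main__":
    HTTPServer(("", 8080), TrapHandler).serve_forever()
```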
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.
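For contrast, a minimal sketch with the standard-library html.parser: a real parser delivers tags and comments as separate events, so commented-out scripts never end up in the link set.

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """A real parser raises separate events for tags and for comments."""
    def __init__(self):
        super().__init__()
        self.live, self.commented = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.live.append(src)

    def handle_comment(self, data):
        # Commented-out markup arrives here as raw text, not as tags.
        self.commented.append(data.strip())

p = ScriptCollector()
p.feed('<script src="/js/app.js"></script>'
       '<!-- <script src="/js/old.js"></script> -->')
print(p.live)       # ['/js/app.js']
print(p.commented)  # ['<script src="/js/old.js"></script>']
```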
Maybe that's a way to defend against bots that ignore robots.txt: reference a honeypot HTML file full of garbage text, but put the link to it inside an HTML comment.
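A rough sketch of that idea (the trap path and port are made up): the trap link exists only inside an HTML comment and is disallowed in robots.txt, so anything that requests it is scanning raw text rather than rendering the page.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP = "/trap-7f3a.html"   # made-up path, reachable only via the comment below

PAGE = f"""<!doctype html>
<html><body>
  <p>Normal content for human visitors.</p>
  <!-- <a href="{TRAP}">old page</a> -->
</body></html>"""

GARBAGE = "<html><body>" + "lorem gibberish " * 5000 + "</body></html>"

class Honeypot(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == TRAP:
            # No browser follows a commented-out link, so this is a bot
            # scanning raw text -- and it ignored robots.txt too.
            print(f"flagging scraper at {self.client_address[0]}")
            body = GARBAGE
        elif self.path == "/robots.txt":
            body = f"User-agent: *\nDisallow: {TRAP}\n"
        else:
            body = PAGE
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("", 8081), Honeypot).serve_forever()
```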
If you want humans to read your website, I would suggest making your website readable to humans. Green on blue is both hideous and painful.
> These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
Most web scrapers, even the illegal ones, are for... business. So they scrape Amazon, or shops. So yeah: most unwanted traffic is from big tech, or from bad actors trying to sniff out vulnerabilities.
I know a thing or two about web scraping.
Some sites return a 404 status code as protection, so that you skip the site; my crawler therefore falls back, as a hammer, to several faster crawling methods (curl_cffi among them).
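A sketch of that kind of fallback, assuming curl_cffi's requests-like API and an arbitrary set of "protective" status codes:

```python
import requests
from curl_cffi import requests as curl_requests

SUSPICIOUS = {403, 404, 429}   # statuses some sites use just to shoo crawlers away

def fetch(url: str, timeout: float = 20.0) -> bytes:
    """Plain request first; on a 'protective' status, retry with a
    client that impersonates a real browser's TLS fingerprint."""
    resp = requests.get(url, timeout=timeout)
    if resp.status_code not in SUSPICIOUS:
        resp.raise_for_status()
        return resp.content
    resp = curl_requests.get(url, impersonate="chrome", timeout=timeout)
    resp.raise_for_status()
    return resp.content
```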
Zip bombs also don't work on me. Reading the Content-Length header is enough to decide not to read the page/file, and I pass a byte limit to check that a response isn't too big for me. For the other cases a read timeout is enough.
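Roughly like this, as a sketch with requests (the byte budget is arbitrary):

```python
import requests

MAX_BYTES = 5 * 1024 * 1024   # arbitrary 5 MB budget per page

def bounded_get(url: str) -> bytes | None:
    with requests.get(url, stream=True, timeout=(5, 10)) as r:
        # 1) If the server declares a size, bail out before reading anything.
        declared = r.headers.get("Content-Length")
        if declared and int(declared) > MAX_BYTES:
            return None
        # 2) Declared or not, never read more than the byte budget.
        total, parts = 0, []
        for chunk in r.iter_content(chunk_size=64 * 1024):
            total += len(chunk)
            if total > MAX_BYTES:
                return None
            parts.append(chunk)
        return b"".join(parts)
```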
Oh, and did you know that the requests timeout is not really a timeout for the whole page read? It only bounds the connect and the wait between bytes, so a server can spoonfeed you bytes, one after another, and there will never be a timeout.
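A sketch of the workaround: keep the per-chunk timeout, but also enforce a wall-clock deadline over the whole download.

```python
import time
import requests

def get_with_deadline(url: str, deadline_s: float = 30.0) -> bytes:
    """timeout=(5, 10) only bounds the connect and each silent gap between
    bytes; the monotonic check bounds the download as a whole."""
    start = time.monotonic()
    parts = []
    with requests.get(url, stream=True, timeout=(5, 10)) as r:
        for chunk in r.iter_content(chunk_size=16 * 1024):
            if time.monotonic() - start > deadline_s:
                raise TimeoutError(f"{url}: exceeded {deadline_s}s total read time")
            parts.append(chunk)
    return b"".join(parts)
```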
That is why I created my own crawling system to mitigate these problems and to have one consistent means of running Selenium:
https://github.com/rumca-js/crawler-buddy
It is based on this library:
https://github.com/rumca-js/webtoolkit