AI scrapers request commented scripts
I'm not overly surprised; it's probably faster to search the text for http/https than to parse the DOM.
Fun to see practical applications of interesting research[1]
It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".
The title is confusing, should be "commented-out".
When I used to crawl the web, battle-tested Perl regexes were more reliable than anything else; commented-out URLs would have been added to my queue.
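Roughly that kind of scan, as a sketch (the pattern and the page here are made up): a plain text pass has no notion of HTML comments, so commented-out URLs match just like live ones.

```python
import re

# A hypothetical page with one live script and one commented-out script.
html = """
<script src="https://example.com/js/app.js"></script>
<!-- <script src="https://example.com/js/old-build.js"></script> -->
"""

# A text-level scan has no notion of HTML comments, so both URLs match
# and both would land in the crawl queue.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")
print(URL_RE.findall(html))
# ['https://example.com/js/app.js', 'https://example.com/js/old-build.js']
```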
Sounds like you should give the bots exactly what they want... a 512MB file of random data.
Two thoughts here on poisoning the data collected by unwanted LLM training scrapers:
1) A coordinated effort among different sites will have a much greater chance of poisoning a model's training data, so long as they can avoid any post-scraping deduplication or filtering.
2) I wonder if copyright law can be used to amplify the cost of poisoning here. If the poisoned content is something that has already been aggressively litigated over, the copyright owner might go after the scrapers once the model can be shown to contain that banned data. This may open site owners up to the legal risk of distributing that content themselves, though… not sure. A cooperative effort with a copyright holder might sidestep that risk, but they would have to have the means and the will to litigate.
> most likely trying to non-consensually collect content for training LLMs
No, it's just background internet scanning noise
Well, if they're going to request commented-out scripts, serve them up some very large scripts…
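Something along these lines, as a rough sketch using only the standard library (the trap path is made up): stream incompressible random bytes so the response is large on the wire but cheap to produce.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PATH = "/js/legacy.js"                  # made-up commented-out script URL
CHUNK, TOTAL = 64 * 1024, 512 * 1024 * 1024  # stream 512 MB of junk

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != TRAP_PATH:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/javascript")
        self.end_headers()
        sent = 0
        try:
            while sent < TOTAL:
                # Random bytes are incompressible, so the scraper pays the
                # full transfer cost while we generate them on the fly.
                self.wfile.write(os.urandom(CHUNK))
                sent += CHUNK
        except (BrokenPipeError, ConnectionResetError):
            pass  # the client gave up early, which is the point

if __name__ == "__main__":
    HTTPServer(("", 8080), TrapHandler).serve_forever()
```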
I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" HTML to find various references.
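For contrast, a minimal sketch with the standard-library html.parser: a real parser delivers tags and comments as separate events, so commented-out scripts never end up in the link set.

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """A real parser raises separate events for tags and for comments."""
    def __init__(self):
        super().__init__()
        self.live, self.commented = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.live.append(src)

    def handle_comment(self, data):
        # Commented-out markup arrives here as raw text, not as tags.
        self.commented.append(data.strip())

p = ScriptCollector()
p.feed('<script src="/js/app.js"></script>'
       '<!-- <script src="/js/old.js"></script> -->')
print(p.live)       # ['/js/app.js']
print(p.commented)  # ['<script src="/js/old.js"></script>']
```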
Maybe that's a way to defend against bots that ignore robots.txt: reference a honeypot HTML file full of garbage text, but put the link to it inside an HTML comment.
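A rough sketch of that idea (the trap path and port are made up): the trap link exists only inside an HTML comment and is disallowed in robots.txt, so anything that requests it is scanning raw text rather than rendering the page.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP = "/trap-7f3a.html"   # made-up path, reachable only via the comment below

PAGE = f"""<!doctype html>
<html><body>
  <p>Normal content for human visitors.</p>
  <!-- <a href="{TRAP}">old page</a> -->
</body></html>"""

GARBAGE = "<html><body>" + "lorem gibberish " * 5000 + "</body></html>"

class Honeypot(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == TRAP:
            # No browser follows a commented-out link, so this is a bot
            # scanning raw text -- and it ignored robots.txt too.
            print(f"flagging scraper at {self.client_address[0]}")
            body = GARBAGE
        elif self.path == "/robots.txt":
            body = f"User-agent: *\nDisallow: {TRAP}\n"
        else:
            body = PAGE
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("", 8081), Honeypot).serve_forever()
```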
If you want humans to read your website, I would suggest making your website readable to humans. Green on blue is both hideous and painful.
> These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.
"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.
Yes, I know about weev. That was a travesty.
Most web scrapers, even the illegal ones, are for... business. So they scrape Amazon, or shops. So yeah: most unwanted traffic is from big tech, or from bad actors trying to sniff out vulnerabilities.
I know a thing or two about web scraping.
Some sites return a 404 status code as protection, so that you skip the site; my crawler therefore falls back, as a hammer, to several faster crawling methods (curl_cffi among them).
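A sketch of that kind of fallback, assuming curl_cffi's requests-like API and an arbitrary set of "protective" status codes:

```python
import requests
from curl_cffi import requests as curl_requests

SUSPICIOUS = {403, 404, 429}   # statuses some sites use just to shoo crawlers away

def fetch(url: str, timeout: float = 20.0) -> bytes:
    """Plain request first; on a 'protective' status, retry with a
    client that impersonates a real browser's TLS fingerprint."""
    resp = requests.get(url, timeout=timeout)
    if resp.status_code not in SUSPICIOUS:
        resp.raise_for_status()
        return resp.content
    resp = curl_requests.get(url, impersonate="chrome", timeout=timeout)
    resp.raise_for_status()
    return resp.content
```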
Zip bombs also don't work on me. Reading the Content-Length header is enough to decide not to read the page/file, and I pass a byte limit to check that a response isn't too big for me. For the other cases a read timeout is enough.
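Roughly like this, as a sketch with requests (the byte budget is arbitrary):

```python
import requests

MAX_BYTES = 5 * 1024 * 1024   # arbitrary 5 MB budget per page

def bounded_get(url: str) -> bytes | None:
    with requests.get(url, stream=True, timeout=(5, 10)) as r:
        # 1) If the server declares a size, bail out before reading anything.
        declared = r.headers.get("Content-Length")
        if declared and int(declared) > MAX_BYTES:
            return None
        # 2) Declared or not, never read more than the byte budget.
        total, parts = 0, []
        for chunk in r.iter_content(chunk_size=64 * 1024):
            total += len(chunk)
            if total > MAX_BYTES:
                return None
            parts.append(chunk)
        return b"".join(parts)
```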
Oh, and did you know that the requests timeout is not really a timeout for the whole page read? It only bounds the connect and the wait between bytes, so a server can spoonfeed you bytes, one after another, and there will never be a timeout.
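A sketch of the workaround: keep the per-chunk timeout, but also enforce a wall-clock deadline over the whole download.

```python
import time
import requests

def get_with_deadline(url: str, deadline_s: float = 30.0) -> bytes:
    """timeout=(5, 10) only bounds the connect and each silent gap between
    bytes; the monotonic check bounds the download as a whole."""
    start = time.monotonic()
    parts = []
    with requests.get(url, stream=True, timeout=(5, 10)) as r:
        for chunk in r.iter_content(chunk_size=16 * 1024):
            if time.monotonic() - start > deadline_s:
                raise TimeoutError(f"{url}: exceeded {deadline_s}s total read time")
            parts.append(chunk)
    return b"".join(parts)
```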
That is why I created my own crawling system to mitigate these problems and to have one consistent means of running Selenium:
https://github.com/rumca-js/crawler-buddy
It is based on this library:
https://github.com/rumca-js/webtoolkit