I actually don't understand how Anubis is supposed to "make sure you're not a bot". It seems to be more of a rate limiter than anything else. It describes itself:
> Anubis sits in the background and weighs the risk of incoming requests. If it asks a client to complete a challenge, no user interaction is required.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums. Anubis has a customizable difficulty for this proof-of-work challenge, but defaults to 5 leading zeroes.
When I go to Codeberg or any other site using it, I'm never asked to perform any kind of in-browser task. It just has my browser run some JavaScript to do that calculation, or uses a signed JWT so the result is cached and I skip the process entirely.
Why shouldn't an automated agent be able to deal with that just as easily, by just feeding that JavaScript to its own interpreter?
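For illustration, here's a minimal sketch of what that could look like, assuming a challenge of the form "find a nonce such that SHA-256(challenge + nonce) starts with N zero hex digits"; the payload and endpoint details are assumptions, not Anubis's actual wire format:

```typescript
// Minimal sketch of a headless solver. The challenge/nonce format is an
// assumption for illustration, not Anubis's actual protocol.
import { createHash } from "node:crypto";

function solveChallenge(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty); // e.g. difficulty 5 -> "00000"
  for (let nonce = 0; ; nonce++) {
    const hash = createHash("sha256")
      .update(challenge + nonce)
      .digest("hex");
    if (hash.startsWith(prefix)) return nonce; // found a passing nonce
  }
}

// A bot would fetch the challenge, run this loop natively, and POST the nonce
// back; no browser needed.
console.log(solveChallenge("example-challenge-string", 5));
```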
Presumably they just finally decided they were willing to spend the CPU time ($) to pass the Anubis check. That was always my understanding of Anubis: of course a bot can pass it; it's just going to cost them a bunch of CPU time (and therefore money) to do it.
I've seen a lot of traffic from Huawei bypassing Anubis on some of the things I host as well. The funny thing is, I work for Huawei... Asking around, it seems most of it is coming from Huawei Cloud (like AWS), but their Artifactory cache also shows a few other captcha-bypassing libraries for Arkose/FunCaptcha, so they're definitely doing it themselves too.
Anonymous account for obvious reasons.
Last time I checked, Anubis used SHA256 for PoW. This is very GPU/ASIC friendly, so there's a big disparity between the amount of compute available in a legit browser vs a datacentre-scale scraping operation.
A more memory-hard "mining" algorithm could help.
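As a rough illustration of the idea (not something Anubis ships today), the same challenge loop built on a memory-hard KDF like scrypt makes every attempt cost tens of megabytes of RAM, which is cheap for one browser tab but awkward to run massively in parallel on GPUs. The parameters below are illustrative assumptions:

```typescript
// Sketch of a memory-hard variant using scrypt instead of raw SHA-256.
// All parameters here are illustrative assumptions, not an Anubis feature.
import { scryptSync } from "node:crypto";

function solveMemoryHard(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    // N=2^15, r=8 -> roughly 32 MiB touched per attempt: trivial for one
    // browser tab, painful to replicate thousands of times on a GPU.
    const hash = scryptSync(String(nonce), challenge, 32, {
      N: 2 ** 15,
      r: 8,
      p: 1,
      maxmem: 64 * 1024 * 1024,
    }).toString("hex");
    if (hash.startsWith(prefix)) return nonce;
  }
}

// Low difficulty on purpose: each scrypt attempt is far slower than one SHA-256.
console.log(solveMemoryHard("example-challenge-string", 2));
```

Since each attempt is orders of magnitude slower than a single SHA-256, the leading-zero difficulty would have to be dialed way down to keep legitimate users waiting under a second or so.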
Really feels like this needs some sort of unified, possibly legal, approach to get these fkers to behave.
The search era clearly proved it is possible to crawl respectfully; the AI crawlers have just decided not to. They need to be disincentivized from doing this.
How about using on-chain proof-of-work? It flips the script.
If a bot wants access, let it earn it, and let that work be captured rather than discarded. Each request becomes compensation to the site itself: the crawler feeds the very system it scrapes, and its computational effort goes into a pool share that pays out to the site owner's wallet (a rough sketch below).
The cost becomes the contract.
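Very roughly, it could look like this: the server hands out a pool job whose payout goes to the site, the visitor grinds on it, and the server submits any valid share. The pool functions here are hypothetical stubs, not a real pool API:

```typescript
// Conceptual sketch only. The pool functions are hypothetical stubs standing in
// for a real pool protocol (e.g. Stratum); nothing here is an existing API.
import { createHash } from "node:crypto";

interface PoolJob { header: string; target: string } // simplified share job

async function fetchJobFromPool(payoutAddress: string): Promise<PoolJob> {
  // Hypothetical: ask a pool for work whose reward is credited to payoutAddress.
  return { header: `job-for-${payoutAddress}:`, target: "0000".padEnd(64, "f") };
}

async function submitShareToPool(job: PoolJob, nonce: number): Promise<void> {
  // Hypothetical: a valid share submitted here becomes revenue for the site.
  console.log(`share accepted, nonce=${nonce}`);
}

function grind(job: PoolJob): number {
  // In reality this loop would run in the visitor's browser; it *is* the challenge.
  for (let nonce = 0; ; nonce++) {
    const h = createHash("sha256").update(job.header + nonce).digest("hex");
    if (h <= job.target) return nonce; // meets the pool's share difficulty
  }
}

async function challengeVisitor(siteWallet: string): Promise<void> {
  const job = await fetchJobFromPool(siteWallet);
  const nonce = grind(job); // the crawler's effort is captured, not discarded
  await submitShareToPool(job, nonce);
}

challengeVisitor("example-site-wallet").catch(console.error);
```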
I feel for Codeberg and people who are against AI, but I also think Anubis can't die soon enough. It breaks archiving and is very annoying when JS is disabled or when faced with an aggressive ad-blocker. It breaks the web in more ways than one.
It is just sad that we are at a point where measures like Anubis are necessary. The author's efforts are admirable, so I don't mean this personally, but Anubis is a bad product IMHO.
It doesn't quite do what it is advertised to do, as evidenced by this post, and it degrades the user experience for everybody. It also stops the website from being indexed by search engines (unless specifically configured otherwise). For example, gitlab.freedesktop.org pages have simply disappeared from Google.
We need to find a better way.
I'm not anti-the-tech-behind-AI, but this behavior is just awful and makes the world worse for everyone. I wish AI companies would instead, I don't know, fund Common Crawl or something, so that a single organization and set of bots could collect all the training data they need and then share it, instead of a bunch of different AI companies doing duplicated work and generating a swath of duplicated requests. Also, I don't understand why they have to make so many requests so often. Why wouldn't one crawl of each site a day, at a reasonable rate, be enough? It's not like up-to-the-minute info actually matters, since LLM training cutoffs are always out of date anyway. I don't get it.
Really looks like the last remaining solution is a legal one: using the DMCA against them under the clause on circumventing technological protection measures or access controls, or something like that.
Making my web resources IPv6-only has solved the problem for me. I don't consider this a solution forever, but for now it's apparently way too modern or complicated for the A-so-called-I companies.
What do these crawlers gather? Just make this data accessible via API calls or direct database download, like Wikipedia did (https://en.wikipedia.org/wiki/Wikipedia:Database_download).
Crazy thought, but what if you made the work required to access the site equal the work required to host the site? Host the public part of the database on something like WebTorrent and render the website from the DB locally. You want to run expensive queries? Suit yourself. Not easy, but maybe possible?
thanks for making everything that much shittier just so you can steal everyone's data and present it as your own, AI companies!
Why not ask it to directly mine some bitcoin, or do some protein folding? Let's make proof-of-work challenges proof-of-useful-work challenges. The server could even directly serve status 402 with the challenge.
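A minimal sketch of that handshake, with an invented challenge payload (only the 402 status code itself is an existing standard):

```typescript
// Hypothetical sketch: answer unverified requests with 402 plus a challenge
// descriptor. The header name and payload format are invented for illustration.
import { createServer } from "node:http";
import { randomBytes } from "node:crypto";

const server = createServer((req, res) => {
  const receipt = req.headers["x-work-receipt"]; // hypothetical proof-of-work receipt
  if (!receipt) {
    res.writeHead(402, { "Content-Type": "application/json" });
    res.end(JSON.stringify({
      kind: "useful-work",                        // e.g. a pool share or folding unit
      challenge: randomBytes(16).toString("hex"),
      difficulty: 4,
    }));
    return;
  }
  // A real server would verify the receipt before serving anything.
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("the actual page\n");
});

server.listen(8080);
```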
Anubis and others allow some user agents to pass without proof of work. Bad bots (and users) just use an extension that detects Anubis and changes the user agent instead (see the snippet below).
It's well intentioned, but in the end it just wastes electricity from good people.
Anubis does nothing to impact bad crawlers, well, only the laziest ones; against the rest, generating fake infinite content on the fly is a much more efficient defense.
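To illustrate, that kind of "bypass" is nothing more exotic than sending a different User-Agent header; the UA string below is just a placeholder, not a known allowlist entry:

```typescript
// Illustration of the claim above: if certain user agents skip the challenge,
// the "bypass" is one header.
const res = await fetch("https://example.org/", {
  headers: { "User-Agent": "Mozilla/5.0 (compatible; SomeAllowlistedBot/1.0)" },
});
console.log(res.status);
```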
Are AI crawlers equipped to get past reCAPTCHA or hCAPTCHA? This seems like exactly the thing these services were meant to stop.
This was beyond predictable. The monetary cost of proof of work is several orders of magnitude too small to deter scraping (let alone higher yield abuse), and passing the challenges requires no technical finesse basically by construction.
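Back-of-the-envelope, with every number below an assumption (5 leading hex zeros, a few million SHA-256 hashes per second per core, about $0.05 per core-hour of cloud compute):

```typescript
// Rough cost estimate; every number here is an assumption, not a measurement.
const expectedHashes = 16 ** 5;      // ~1.05M tries on average for 5 leading hex zeros
const hashesPerSecond = 5_000_000;   // assumed native SHA-256 throughput per core
const costPerCoreHour = 0.05;        // assumed cloud price in USD

const secondsPerSolve = expectedHashes / hashesPerSecond;           // ~0.21 s
const dollarsPerSolve = (secondsPerSolve / 3600) * costPerCoreHour; // ~2.9e-6 USD

console.log(secondsPerSolve.toFixed(2), dollarsPerSolve.toExponential(1));
```

Even if those assumptions are off by an order of magnitude, scraping a million protected pages still costs single-digit dollars.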
It's PoW, AI crawlers didn't learn shit, their admins just increased their CPU/GPU/ASIC budget
How long will it be until there is nothing left to scrape and this activity ends? I mean, at some point the payback will drop below zero, because they will be consuming more AI-generated stuff than real content (even adjusted for their ability to detect and filter it), and the content itself will be of very little value (because all the high-value content will already have been scraped). Has anyone tried to estimate how long that might take?
I'm calling it now: this is the beginning of all of the remaining non-commercial properties on the web either going away or getting hidden inside some trusted overlay network. Unless the "AI" race slows down or changes, or some other act of god happens, the incentives are aligned such that I foresee wide swaths of the net getting flogged to death.
Fight fire with fire by serving these guys LLM output of made-up news. Wish them good luck noticing that in their dataset.
This is sad, but predictable. At the end of the day, if I can follow a link to an Anubis-protected site and view it on my phone, the crawlers will be able to as well.
I see a lot more private networks in our future, unfortunately.
By their own admission, they failed to properly block/throttle the IP subnet, and are now blaming others for that failure.
Duh. The author of Anubis really should advertise it as a DDoS guard, not an AI guard. Otherwise, xe is just misleading people, while being unnecessarily discriminatory against robots (robotkin) and cyborgs (those using AI agents as an extension of their selves).
I just found out about this when it came to the front page of Hacker News. I really wish I had been given advance notice. I haven't been able to put as much energy into Anubis as I've wanted because I've been incredibly overwhelmed by life and need to be able to afford to make this my full-time job. Support contracts are being roadblocked, and I just wish I had the time and energy to focus on this without having to worry about being the single income for the household.