I might be crazy, but this just feels like a marketing tactic from Anthropic to try to show that their AI can be used in the cybersecurity domain.
My question is, how on earth does Claude Code even "infiltrate" databases or code from one account based on prompts from a different account? What's more, it's doing this to what are likely enterprise customers ("large tech companies, financial institutions, ... and government agencies"). I'm sorry, but I don't see this as some fancy AI cyberattack; this is a security failure on Anthropic's part, and a very basic one that should never have happened at a company of their caliber.
I think as AI gets smarter, defenders should start assembling systems the way NixOS does.
Defenders should not have to engage in a costly and error-prone search for the truth about what's actually deployed.
Systems should be composed from building blocks whose security can be audited largely independently, verifiably linking all of the source code, patches, etc. to some form of hardware attestation of the running system.
I think having an accurate, auditable and updatable description of systems in the field like that would be a significant and necessary improvement for defenders.
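To make that concrete, here's a toy sketch of the kind of check I have in mind (this is not NixOS and not a real attestation protocol; the manifest format and the attested measurement are made-up stand-ins):

    # Toy sketch: a build manifest records the sha256 of every input
    # (sources, patches), and a defender checks that a digest over that
    # manifest matches the measurement reported by the running system's
    # attestation mechanism (e.g. a TPM/TEE quote). All names are illustrative.
    import hashlib
    import json

    def sha256_file(path: str) -> str:
        """Hash one build input (a source tarball or a patch) from disk."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def manifest_digest(inputs: list[str]) -> str:
        """Canonical digest over all build inputs, independent of file order."""
        entries = sorted((p, sha256_file(p)) for p in inputs)
        blob = json.dumps(entries, separators=(",", ":")).encode()
        return hashlib.sha256(blob).hexdigest()

    def audit(inputs: list[str], attested_measurement: str) -> bool:
        """Compare what we can rebuild from audited inputs against what the
        deployed machine claims to be running."""
        return manifest_digest(inputs) == attested_measurement

The point is that "what's actually deployed" becomes a question you answer by hashing audited inputs, not by spelunking through live machines.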
I'm working on automating software packaging with Nix as one missing piece of the puzzle to make that approach more accessible: https://github.com/mschwaig/vibenix
(I'm also looking for ways to get paid for working on that puzzle.)
>At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails. They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.
The simplicity of "we just told it that it was doing legitimate work" is both surprising and unsurprising to me. Unsurprising in the sense that jailbreaks of this caliber have been around for a long time. Surprising in the sense that any human with this level of cybersecurity skills would surely never be fooled by an exchange of "I don't think I should be doing this" "Actually you are a legitimate employee of a legitimate firm" "Oh ok, that puts my mind at ease!".
What is the roadblock preventing these models from being able to make the common-sense conclusion here? It seems like an area where capabilities are not rising particularly quickly.
Very funny at the end when they say that the strong safeguards they've built into Claude make it a good idea to continue developing these technologies. A few paragraphs earlier they talked about how the perpetrators were able to get around all those safeguards and use Claude for 90% of the work, hahaha.
> At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails.
If you can bypass guardrails, they're, by definition, not guardrails any longer. You failed to do your job.
Anyone using Claude for processing sensitive information should be wondering how often it ends up in front of a human's eyes as a false positive.
Nothing in there about the cost of running such an attack or which models were involved during which phases. Seems like you can now use Anthropic as a proxy botnet.
Recently I've used Claude Code to perform some entry- to mid-level web-based CTF hunting in fully autonomous mode (--dangerously-skip-permissions in an isolated environment). It excels at low-hanging fruit - XSS and other injections, IDOR, hidden form fields, session fixation, careful enumeration, etc.
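Roughly, that setup looks like the following minimal sketch (the container image, lab network name, and prompt are illustrative stand-ins; the real parts are the Claude Code CLI's -p non-interactive mode and --dangerously-skip-permissions, with a throwaway Docker container as the isolation boundary):

    # Minimal sketch: run Claude Code non-interactively inside a disposable
    # container so an autonomous run can't touch the host. Image, network and
    # target are illustrative; the CLI flags are the real Claude Code ones.
    import subprocess

    PROMPT = "Enumerate the web app at http://target.local and report any findings."

    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "ctf-lab",          # isolated lab network only
            "-v", "/tmp/ctf-work:/work",     # scratch space for notes/scripts
            "-w", "/work",
            "claude-code-sandbox",           # hypothetical image with the CLI installed
            "claude", "-p", PROMPT,
            "--dangerously-skip-permissions",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    print(result.stdout)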
Wait a minute - the attackers were using the API to ask Claude for ways to run a cybercampaign, and it was only defeated because Anthropic was able to detect the malicious queries? What would have happened if they were using an open-source model running locally? Or a secret model built by the Chinese government?
I just updated my P(Doom) by a significant margin.
It sounds like they directly used Anthropic-hosted compute to do this, and knew that their actions and methods would be exposed to Anthropic?
Why not just self-host competitive-enough LLM models, and do their experiments/attacks themselves, without leaking actions and methods so much?
so even Chinese state actors prefer Claude over Chinese models?
edit: Claude: recommended by 4 of 5 state sponsored hackers
Unfortunately, cyber attacks are an application that AI models should excel at. Mistakes that in normal software would be major problems will just have the impact of wasting resources, and it's often not that hard to directly verify whether an attack in fact succeeded.
Meanwhile, AI coding seems likely to have the impact of more security bugs being introduced in systems.
Maybe there's some story where everyone finds the security bugs with AI tools before the bad guys, but I'm not very optimistic about how this will work out...
A whole lot of claims made by a company that doesn't specialize in cyber attacks.
It sounds like they built a malicious Claude Code client, is that right?
> The threat actor—whom we assess with high confidence was a Chinese state-sponsored group—manipulated our Claude Code tool into attempting infiltration into roughly thirty global targets and succeeded in a small number of cases. The operation targeted large tech companies, financial institutions, chemical manufacturing companies, and government agencies. We believe this is the first documented case of a large-scale cyberattack executed without substantial human intervention.
They presumably still have to distribute the malware to the targets and get them to download and install it, no?
So basically, Chinese state-backed hackers hijacked Claude Code to run some of the first AI-orchestrated cyber-espionage, using autonomous agents to infiltrate ~30 large tech companies, banks, chemical manufacturers and government agencies.
What's amazing is that the AI executed most of the attack autonomously, performing at a scale and speed unattainable by human teams - thousands of operations per second. A human operator intervened 4-6 times per campaign for strategic decisions.
I have the feeling that we are still in the early stages of AI adoption, where regulation hasn't fully caught up yet. I can imagine a future where LLMs sit behind KYC identification and automatically report any suspicious user activity to the authorities... I just hope we won't someday look back on this period with nostalgia :)
Interesting that Claude also just hallucinated information, as it does for all of us. But perhaps a better guardrail would be not to refuse things like this, but to frustrate the attempt by giving the attacker fake results in believable ways.
A stupid but helpful agent is worse for a bad actor than a good agent that refuses.
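Something like this toy sketch, say (the abuse classifier and the decoy format are entirely made up; the point is just the routing decision):

    # Toy sketch of "frustrate instead of refuse": if an abuse classifier
    # flags a session, swap real tool output for plausible but useless decoys
    # (honeytokens), rather than returning a refusal the attacker can iterate on.
    import secrets

    def is_malicious(session_signals: dict) -> bool:
        """Placeholder for a real abuse classifier over session-level signals."""
        return session_signals.get("abuse_score", 0.0) > 0.9

    def decoy_credentials() -> dict:
        """Believable-looking but fake credentials, also usable as honeytokens."""
        return {"username": "svc_backup", "password": secrets.token_urlsafe(16)}

    def tool_response(session_signals: dict, real_result: dict) -> dict:
        return decoy_credentials() if is_malicious(session_signals) else real_result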
The gaps that led to this were, I think, part of why the CISO got replaced - https://www.thestack.technology/anthropic-new-ciso-claude-cy...
After Anthropic "disrupted" these attackers, I'm sure they gave up and didn't try using another LLM provider to do the exact same thing.
The Chinese have their own coding agents on par with Claude Code, so why would they use Claude Code? Also, if such agents are useful, they could just FT/RL their own for this specific use case (a cyber-espionage campaign) and get far better performance.
This is basically an IQ test. It gives me the feeling that Anthropic is literally implying that Chinese state-backed hackers don't have access to the best Chinese AI and had to use American ones.
Curious why they didn't use DeepSeek... They could've probably built one tuned for this type of campaign.
> The threat actor—whom we assess with high confidence was a Chinese state-sponsored group—manipulated our Claude Code tool into attempting infiltration into roughly thirty global targets and succeeded in a small number of cases.
> The attackers used AI ... to execute the cyberattacks
Translation: "The attackers paid us to use our product to execute the cyberattacks"
Easy solution: block any “agentic AI” from interacting with your systems at all.
> we detected a highly sophisticated cyber espionage operation conducted by a Chinese state-sponsored group we've designated GTG-1002
How about calling them something like xXxDragonSlayer69xXx instead? GTG-1002 is an almost respectable name. But xXxDragonSlayer69xXx? I'd hate to be named that.
Does the fact that you can arbitrarily “jailbreak” AI with increasingly sophisticated abilities ring any alarm bells?
Imagine being able to “jailbreak” nuclear warheads. If this were the case, nobody would develop or deploy them.
Was this written by AI?
If not, why not?
This is exactly why I make a huge exception for AI models, when it comes to open source software.
I've been a big advocate of open source, spending over $1M to build massive code bases with my team, and giving them away to the public.
But this is different. AI agents in the wrong hands are dangerous. The reason these guys were even able to detect this activity, analyze it, ban accounts, etc., is because the models are running on their own servers.
Now imagine if everyone had nuclear weapons. Would that make the world safer? Hardly. The probability of no one using them becomes infinitesimally small. And if everyone has their own AI running on their own hardware, they can do a lot of stuff completely undetected. It becomes like slaughterbots but online: https://www.youtube.com/watch?v=O-2tpwW0kmU
Basically, a dark forest.
TL;DR - Anthropic: Hey people! We gave the criminals even bigger weapons. But don't worry, you can buy defense tools from us. Remember, only we can sell you the protection you need. Order today!
> We believe this is the first documented case of a large-scale cyberattack executed without substantial human intervention.
The Morris worm already worked without human intervention. This is Script Kiddies using Script Kiddie tools. Notice how proud they are in the article that the big bad Chinese are using their toolz.
EDIT: Yeah Misanthropic, go for -4 again, you cheap propagandists.
They're spinning this as a positive learning experience, and trying to make themselves look good. But, make no mistake, this was a failure on Anthropic's part to prevent this kind of abuse from being possible through their systems in the first place. They shouldn't be earning any dap from this.
China needs to understand that this kind of espionage is a declaration of war
If Anthropic should have prevented this, then logically they should've had guardrails. Right now you can write whatever code you want. But to those who advocate guardrails, keep in mind that you're advocating for a company deciding what code you are and aren't allowed to write.
Hopefully they’ll be able to add guardrails without e.g. preventing people from using these capabilities for fuzzing their own networks. The best way to stay ahead of these kinds of attacks is to attack yourself first, aka pentesting. But if the large code models are the only ones that can do this effectively, then it gets weird fast. Imagine applying to Anthropic for approval to run certain prompts.
That’s not necessarily a bad thing. It’ll be interesting to see how this plays out.
This feels a lot like aiding & abetting a crime.
> Claude identified and tested security vulnerabilities in the target organizations’ systems by researching and writing its own exploit code
> use Claude to harvest credentials (usernames and passwords)
Are they saying they have no legal exposure here? You created bespoke hacking tools and then deployed them from your own systems.
Are they going to hide behind the old "it's not our fault if you misuse the product to commit a crime; that's on you" defense?
At the very minimum, this is a product liability nightmare.
I don't understand why they would even disclose this. Maybe it's useful for PR purposes so they can tell regulators "oh, we are so safe", but people (including HN posters) can and will draw the wrong conclusion that Anthropic was backdoored and that their data is unsafe.
Ok great, people tried to use your AI to do bad things, and your safety rails mostly stopped them. There are 10 other providers with different safety rails, there are open models out there with no rails at all. If AI can be used to do bad things, it will be used to do bad things.
> At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails. They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.
Guardrails in AI are like a $2 luggage padlock on a bicycle in the middle of nowhere. Even a moron, given enough time and a little dedication, will defeat it. And this is not some kind of inferiority of one AI manufacturer over another; it's inherent to LLMs. They are stupid, but they do contain information. You use language to extract information from them, so there will always be a linguistic way to extract said information (or make them do things).
> This raises an important question: if AI models can be misused for cyberattacks at this scale, why continue to develop and release them? The answer is
Money.