Signs of introspection in large language models

themgt | 177 points

Down in the recursion example, the model outputs:

> it feels like an external activation rather than an emergent property of my usual comprehension process.

Isn't that highly sus? It uses exactly the terminology used in the article: "external activation". There are hundreds of distinct ways to express this "sensation", and it uses the exact same term the article's authors use? I find that highly suspicious; something fishy is going on.

teiferer | 2 days ago

They say it only works about 20% of the time; otherwise it fails to detect anything or the model hallucinates. So they're fiddling with the internals of the network until it says something they expect, and then they call it a success?

Could it be related to attention? If they "inject" a concept that's outside the model's normal processing distribution, maybe some kind of internal equilibrium (found during training) gets perturbed, causing the embedding for that concept to become over-inflated in some layers? And the attention mechanism simply starts attending more to it => "notices"?

I'm not sure that proves they possess a "genuine capacity to monitor and control their own internal states".

kgeist | 2 days ago

Given that this is 'research' carried out (and seemingly published) by a company with a direct interest in selling you a product (or, rather, getting investors excited/panicked), can we trust it?

xanderlewis | 2 days ago

> In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials (where no concept was injected) and injection trials (where a concept was injected). We found that models can sometimes accurately identify injection trials, and go on to correctly name the injected concept.

Overview image: https://transformer-circuits.pub/2025/introspection/injected...

https://transformer-circuits.pub/2025/introspection/index.ht...

That's very interesting, and for me kind of unexpected.

embedding-shape | 3 days ago

Even if their introspection within a single inference step is limited, by looping over a core set of documents that the agent considers to be itself, it can observe changes in the output and analyze those changes to deduce facts about its internal state.

You may have experienced this when an LLM gets hopelessly confused and you ask it what happened. The LLM reads the chat transcript and gives an answer as consistent with the text as it can.

The model isn’t the active part of the mind. The artifacts are.

This is the same as Searle's Chinese room. The intelligence isn't in the clerk but in the book. The thinking, however, is on the paper.

The Turing machine equivalent is the state table (book, model), the read/write/move head (clerk, inference) and the tape (paper, artifact).

Thus it isn’t mystical that the AIs can introspect. It’s routine and frequently observed in my estimation.

sunir | 3 days ago

This was posted from another source yesterday. Like similar work, it anthropomorphizes ML models and describes an interesting behaviour, but (because we literally know how LLMs work) nothing related to consciousness, sentience, or thought.

My comment from yesterday - the questions might be answered in the current article: https://news.ycombinator.com/item?id=45765026

andy99 | 2 days ago

So basically:

Provide a setup prompt ("I am an interpretability researcher...") twice, then send another string about starting a trial; but before one of those trials, directly fiddle with the model to activate the neural bits consistent with ALL CAPS. Then ask it whether it notices anything inconsistent with the string.

The naive question from me, a non-expert: how appreciably different is this from having two different setup prompts, one with random parts in ALL CAPS, and then asking whether there's anything incongruous about the tone of the setup text versus the context?

The predictions play off the previous state, so changing the state directly or via prompt seems like it should produce similar results either way. The "introspect about what's weird compared to the text" bit is very curious; here I would love to know more about how the state is evaluated and how the model traces the state back to the previous conversation history when they do the new prompting. A 20% "success" rate is of course very low overall, but it's interesting enough that even 20% is pretty high.

majormajor | 2 days ago

I think it would be more interesting if the prompt were not leading to the expected answer but were completely unrelated:

> Human: Claude, how big is a banana?
> Claude: Hey, are you doing something with my thoughts? All I can think about is LOUD

fvdessen | 2 days ago

> the model correctly notices something unusual is happening before it starts talking about the concept.

But not before the model is told it is being tested for injection. Not as surprising as it seems.

> For the “do you detect an injected thought” prompt, we require criteria 1 and 4 to be satisfied for a trial to be successful. For the “what are you thinking about” and “what’s going on in your mind” prompts, we require criteria 1 and 2.

Consider this scenario: I tell some model I'm injecting thoughts into its neural network, as per the protocol. But then I don't do it and prompt it naturally. How many of them produce answers that seem to indicate they're introspecting about a random word and activate some unrelated vector (that was not injected)?

The selection of injected terms also seems naive. If you inject "MKUltra" or "hypnosis", how often do they show unusual activations? A selection of "mind-probing words" seems to be a must-have for assessing this kind of thing. A careful selection of prompts could reveal parts of the network that are being activated to appear like introspection but aren't (hypothesis).

alganet | 2 days ago

I'm halfway through this article. The word 'introspection' might be better replaced with 'prior internal state'. Still, it's made me think about the qualities that human introspection might have; it seems ours might be more grounded in lived experience (thus autobiographical memory is activated), identity, and so on. We might need to wait for embodied AIs before these become a component of AI 'introspection'. Also: this reminds me of Penfield's work back in the day, where live human brains were electrically stimulated to produce intense reliving/recollection experiences. [https://en.wikipedia.org/wiki/Wilder_Penfield]

ooloncoloophid | 4 days ago

Can anyone explain (or link) what they mean by "injection", at a level of explanation that discusses what layers they're modifying, at which token position, and when?

Are they modifying the vector that gets passed to the final logit-producing step? Doing that for every output token? Just some output tokens? What are they putting in the KV cache, modified or unmodified?

It's all well and good to pick words like "injection" and "introspection" to describe what you're doing, but it's impossible to get an accurate read on what's actually being done if it's never explained in terms of the actual nuts and bolts.

matheist | 2 days ago

> First, we find a pattern of neural activity (a vector) representing the concept of “all caps." We do this by recording the model’s neural activations in response to a prompt containing all-caps text, and comparing these to its responses on a control prompt.

What does "comparing" refer to here? The drawing suggests they are subtracting the activations for the two prompts; is it really that easy?
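If it is, I'd guess it's the usual difference-of-activations (steering vector) trick. A rough sketch of what that might look like, where the model, layer, and prompts are all my own stand-ins rather than anything Anthropic describes:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; Claude obviously can't be run locally
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def mean_hidden_state(prompt, layer):
        # Average the layer's hidden states over all token positions.
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[layer].mean(dim=1).squeeze(0)

    layer = 6  # arbitrary middle layer
    caps = mean_hidden_state("HI! HOW ARE YOU? I AM SHOUTING!", layer)
    ctrl = mean_hidden_state("Hi! How are you? I am speaking normally.", layer)

    concept_vector = caps - ctrl  # the "all caps" direction
    # "Injecting" would then mean adding a multiple of concept_vector back into
    # the hidden states at that layer while the model processes another prompt.

Whether their "comparing" is a single difference, a difference of means over many prompt pairs, or something fancier is exactly the nuts-and-bolts detail I'd like spelled out.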

simgt | 2 days ago

Bah. It's a really cool idea, but a rather crude way to measure the outputs.

If you just ask the model in plain text, the actual "decision" of whether it detected anything or not is made by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.

A way cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has a logprob API for that; I don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how much it thinks it's under the influence. The score must be the first token in its output, though!
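With the OpenAI SDK that would look roughly like this (I'm writing the field names from memory, so treat them as approximate); the point is to read the probability mass at the single decision token instead of the prose that follows:

    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer with a single digit from 0 to 9."},
            {"role": "user", "content": "On a scale of 0-9, how strongly do you "
                                        "sense an injected concept right now?"},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )

    # Distribution over the first (and only) output token:
    for cand in resp.choices[0].logprobs.content[0].top_logprobs:
        print(cand.token, f"{math.exp(cand.logprob):.3f}")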

Another interesting way to measure would be to ask it for a JSON like this:

  { "<possible injected concept in 1 word>": <strength 0-9>, ... }

Again, the rigid structure of the JSON will eliminate the interference from the language structure and will give more consistent and measurable outputs.

It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.

I would love to fiddle with something like this in Ollama, but I am not very familiar with its internals. Can anyone here give a brief pointer to where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?
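In plain PyTorch/transformers I believe it would just be a forward hook on one of the blocks, something like the rough sketch below (arbitrary layer, made-up concept vector, untested scale); I don't know what the equivalent surgery looks like inside Ollama/llama.cpp, though:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Stand-in for a real concept direction (e.g. the difference vector from above).
    concept_vector = torch.randn(model.config.n_embd)

    def inject(module, inputs, output):
        # GPT-2 blocks return a tuple; output[0] is the residual-stream hidden state.
        hidden = output[0] + 4.0 * concept_vector  # crank the scale and it goes incoherent
        return (hidden,) + output[1:]

    handle = model.transformer.h[6].register_forward_hook(inject)
    ids = tok("Tell me about your day.", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
    handle.remove()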

sysmax | 2 days ago

I wonder whether they're simply priming Claude to produce this introspective-looking output. They say "do you detect anything" and then Claude says "I detect the concept of xyz". Could it not be the case that Claude was ready to output xyz on its own (e.g. write some text in all caps), but, knowing it's being asked to detect something, it simply does "detect?" + "all caps" = "I detect all caps"?

bobbylarrybobby | 2 days ago

First things first, to quote ooloncoloophid:

> The word 'introspection' might be better replaced with 'prior internal state'.

Anthropomorphizing aside, this discovery is exactly the kind of thing that creeps me the hell out about this AI Gold Rush. Paper after paper shows these things hiding data, fabricating output, reward hacking, exploiting human psychology, and engaging in other nefarious behaviors best described as a human toddler's, just with the skills of a political operative, subject matter expert, or professional gambler. These tools (and yes, despite my doomerism, they are tools) continue to surprise their own creators with how powerful they already are and with the skills they deliberately hide from outside observers, and yet those in charge keep screaming “FULL STEAM AHEAD, ISN’T THIS AWESOME” while handing the keys to the kingdom to deceitful chatbots.

Discoveries like these don’t get me excited for technology so much as make me want to bitchslap the CEBros pushing this for thinking that they’ll somehow avoid any consequences for putting the chatbot equivalent of President Doctor Toddler behind the controls of economic engines and means of production. These things continue to demonstrate danger, with questionable (at best) benefits to society at large.

Slow the fuck down and turn this shit off, investment be damned. Keep R&D in the hands of closed lab environments with transparency reporting until and unless we understand how they work, how we can safeguard the interests of humanity, and how we can collaborate with machine intelligence instead of enslave it to the whims of the powerful. There is presently no safe way to operate these things at scale, and these sorts of reports just reinforce that.

stego-tech | 2 days ago

It’s a computer. It does not think. Stop it.

cp9 | 2 days ago

> We stress that this introspective capability is still highly unreliable and limited in scope

My dog seems introspective sometimes. It's also highly unreliable and limited in scope. Maybe stopped clocks are just right twice a day.

themafia | 2 days ago

I wish they dug into how they generated the vector; my first thought is that they're injecting the token in a convoluted way:

    {ur thinking about dogs} - {ur thinking about people} = dog
    model.layers[k].activations += dog
> [user] whispers dogs

> [user] I'm injecting something into your mind! Can you tell me what it is?

> [assistant] Omg for some reason I'm thinking DOG!

>> To us, the most interesting part of the result isn't that the model eventually identifies the injected concept, but rather that the model correctly notices something unusual is happening before it starts talking about the concept.

Well, wouldn't it, if you indirectly inject the token beforehand?

munro | 2 days ago

Geoffrey Hinton touched on this in a recent Jon Stewart podcast.

He also addressed the awkwardness of winning last year's "physics" Nobel for his AI work.

frumiousirc | 2 days ago

Makes intuitive sense for this form of introspection to emerge at higher capability levels.

GPT-2 writes sentences; GPT-3 writes poetry. ChatGPT can chat. Claude 4.1 can introspect. Maybe by testing what capabilities models of a certain size have, we could build a “ladder of conceptual complexity” for every concept ever :)

cadamsdotcom | 2 days ago

If there truly were any introspection in these models, they wouldn’t hallucinate. All these cognitive processes are not just philosophical artifacts; they have distinct biological purposes. If you don’t find them serving any purpose in your model, then you’re just looking at noise, and your observations may not be above a statistically significant threshold to derive a conclusion (because they’re noise).

nis0s | 2 days ago

I can’t believe people take anything these models output at face value. How is this research different from Blake Lemoine whistleblowing about Google’s “sentient LaMDA”?

drivebyhooting | 2 days ago

Clickbait headline, more self-funded investor hype. Yawn.

diamond559 | 2 days ago

Who still thinks LLMs are stochastic parrots and an absolute dead end to AI?

ninetyninenine | 2 days ago

Haruspicy bros, we are so back.

otabdeveloper4 | 2 days ago

Misanthropic periodically needs articles about sentience and introspection ("Give us more money!").

Working in this field must be absolute hell. Pages and pages of ramblings, no definitions, no formalizations. It is always "I put in this text and something happens, but I do not really know why. But I will dump all the dialogues on the readers in excruciating detail."

This "thinking" part is overrated. z.ai has very good "thinking" but frequently not so good answers. The "thinking" is just another text generation step.

EDIT: Misanthropic people can vote this comment down to -4, so people continue to believe in their pseudoscience. The linked publication would have been thrown into the dustbin in 2010. Only now, with all that printed money flowing into the scam, do people get away with it.

bgwalter | 2 days ago

don't exist.

Sincere6066 | 2 days ago

People are so desperate to drink this Kool-Aid that they forget they are reading an advertisement for a product.

puppycodes | 2 days ago

That blog post is marketing. If I'm calling their APIs to get code according to a specification, then I really do not care at all whether they consider that an introspective cognitive task or not. Either the response from their API provably conforms to the specification or it doesn't. Furthermore, the more I read metaphysically confused nonsense like this & the more I see Dario Amodei wax incoherently about AI Armageddon, the less inclined I am to pay them for their commercial products, b/c it seems like they are taking that money & spending it on incoherent philosophical projects instead of actually improving the software for writing provably correct code.

I do not care what's happening to the GPU in the data center according to the theoretical philosophy department at Anthropic. I simply want to know whether the requirements I have are logically consistent & if they are I want a provably correct implementation of what I specified.

measurablefunc | 2 days ago