Is chain-of-thought AI reasoning a mirage?
A regular "chatting" LLM is a document generator incrementally extending a story about a conversation between a human and a robot... And through that lens, I've been thinking "chain of thought" seems like basically the same thing but with a film noir styling-twist.
The LLM is trained to include an additional layer of "unspoken" text in the document, a source of continuity that substitutes for the memories and goals the LLM otherwise lacks.
"The capital of Assyria? Those were dangerous questions, especially in this kind of town. But rent was due, and the bottle in my drawer was empty. I took the case."
> The first is that reasoning probably requires language use. Even if you don’t think AI models can “really” reason - more on that later - even simulated reasoning has to be reasoning in human language.
That is an unreasonable assumption. In the case of LLMs, it seems wasteful to collapse a point in latent space into a single sampled token and lose information. In fact, I think in the near future it will be the norm for MLLMs to "think" and "reason" without outputting a single "word".
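A back-of-the-envelope way to see the loss (a toy sketch with made-up dimensions, not any real model): the hidden state behind a token is hundreds of floats, but sampling collapses it to a single vocabulary id, and only that id gets fed back in.

```python
# Toy illustration (assumed dimensions, not a real model): projecting a latent
# vector down to one sampled token id discards most of what the vector encoded.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 512, 32_000

hidden = rng.normal(size=d_model)            # the "point in latent space"
W_out = rng.normal(size=(vocab, d_model))    # output projection

logits = W_out @ hidden
probs = np.exp(logits - logits.max())
probs /= probs.sum()

token_id = rng.choice(vocab, p=probs)        # one discrete symbol survives
print(f"{d_model} floats of state -> 1 token id ({token_id}), "
      f"i.e. at most ~{np.log2(vocab):.1f} bits per step")
```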
> Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly.
It is not a "philosophical" (by which the author probably meant "practically inconsequential") question. If the whole reasoning business is just rationalization of pre-computed answers or simply a means to do some computations because every token provides only a fixed amount of computation to update the model's state, then it doesn't make much sense to focus on improving the quality of chain-of-thought output from human POV.
I'm unconvinced by the article's criticisms, given that they also rest on gut feeling and few citations.
> I appreciate that research has to be done on small models, but we know that reasoning is an emergent capability! (...) Even if you grant that what they’re measuring is reasoning, I am profoundly unconvinced that their results will generalize to a 1B, 10B or 100B model.
A fundamental part of applied research is simplifying a real-world phenomenon to better understand it. Dismissing the finding that, at this parameter count and on such a simple problem, the LLM can't perform out of distribution, merely because the model isn't big enough, undermines the very value of independent research. Tomorrow another model with double the parameters may or may not show the same behavior, but that finding will be built on top of this one.
Also, how do _you_ know that reasoning is emergent, and not rationalising on top of a compressed version of the web stored in 100B parameters?
Finally! A good take on that paper. I saw that Ars Technica article posted everywhere; most of the comments are full of confirmation bias, and almost all of them miss the fine print: it was tested on a four-layer-deep toy model. It's nice to read a post that actually digs deeper and offers perspective on what might be a genuine finding versus what just warrants more research.
"The question [whether computers can think] is just as relevant and just as meaningful as the question whether submarines can swim." -- Edsger W. Dijkstra, 24 November 1983
Chain of thought is just a way of trying to squeeze more juice out of the lemon of LLMs; I suspect we're running up against diminishing returns and we'll have to move to different foundational models to see any serious improvement.
When using AI, they say "context is king". "Reasoning" models use the AI to generate context. They are not reasoning in the sense of logic or philosophy. Mirage, or whatever you want to call it, it is rather unlike what people mean when they use the term "reasoning". Calling it reasoning is up there with calling output people don't like "hallucinations".
This paper I read (found via here) has an interesting mathematical model for reasoning based on cognitive science: https://arxiv.org/abs/2506.21734 (there is also code at https://github.com/sapientinc/HRM). I think we will see dramatic performance increases on "reasoning" problems when this is worked into existing AI architectures.
I think an LLM's chain of thought is reasoning. During training, the LLM sees lots of examples like "All men are mortal. Socrates is a man." followed by "Therefore, Socrates is mortal." This causes the transformer to learn that "All A are B. C is A." is often followed by "Therefore, C is B.", and so it can apply this logical rule predictively. (I have converted the example from latent space to human language for clarity.)
Unfortunately, sometimes the LLM also learns that "All A are C. All B are C." is followed by "Therefore, A is B.", due to bad examples in the training data. (More insidiously, it might learn this invalid rule only in a special case.)
So it learns some logic rules but not consistently. This lack of consistency will cause it to fail on larger problems.
I think NNs (transformers) could be great as heuristics for suggesting which valid logical rules (even modal or fuzzy logic) to apply in order to solve a certain formalized problem, but not so great at coming up with the logic rules themselves. They could also be great at translating the original problem/question from human language into some formal logic, which would then be resolved using heuristic search.
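A toy sketch of that division of labor (the rule encoding and the stubbed scorer are my own assumptions, not anything from the papers in this thread): the symbolic part guarantees only valid rules fire, while a learned scorer would merely decide which applicable rule to try first.

```python
# Sketch: symbolic forward chaining does the logic; a (stubbed) neural scorer
# only ranks which applicable rule to apply next. Rule format is assumed.
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion)

RULES: List[Rule] = [
    (frozenset({"man(socrates)", "all_men_mortal"}), "mortal(socrates)"),
    (frozenset({"mortal(socrates)"}), "has_end(socrates)"),
]

def score_rule(rule: Rule, facts: FrozenSet[str]) -> float:
    """Placeholder for a learned heuristic; here: prefer rules whose
    premises are already mostly satisfied."""
    premises, _ = rule
    return len(premises & facts) / max(len(premises), 1)

def forward_chain(facts: FrozenSet[str], max_steps: int = 10) -> FrozenSet[str]:
    for _ in range(max_steps):
        applicable = [r for r in RULES
                      if r[0] <= facts and r[1] not in facts]
        if not applicable:
            break
        # A neural net would rank here; soundness still comes from the rules.
        best = max(applicable, key=lambda r: score_rule(r, facts))
        facts = facts | {best[1]}
    return facts

print(forward_chain(frozenset({"man(socrates)", "all_men_mortal"})))
```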
> Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly.
It's pretty easy: causal reasoning. Causal, not merely statistical correlation, which is all LLMs do, with or without "CoT".
Mathematical reasoning does sometimes require correct calculations, and if you get them wrong your answers will be wrong. I wouldn’t want someone doing my taxes to be bad at calculation or bad at finding mistakes in calculation.
It would be interesting to see if this study’s results can be reproduced in a more realistic setting.
> reasoning probably requires language use
The author has a curious idea of what "reasoning" entails.
I feel it is interesting but not ideal. I really think if the models could be less linear and process over time in latent space, you'd get something much more akin to thought. I've messed around with attaching reservoirs at each layer using hooks, with interesting results (mainly overfitting), but it feels like such a limitation to have all model context/memory stuck as tokens when latent space is where the richer interaction lives. I'd love to see more done where thought over time mattered and the model could mull over the question a bit before being obligated to crank out tokens. Not an easy problem, but interesting.
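Roughly the shape of that hack, if anyone wants to poke at it (the sizes, the leaky update, and the toy encoder layers are placeholders; a real experiment would feed the reservoir state back into the model rather than just logging it):

```python
# Rough sketch of "a reservoir per layer via forward hooks". Dimensions, the
# leaky tanh update, and the toy encoder stack are assumptions for illustration.
import torch
import torch.nn as nn

d_model, d_res, leak = 64, 256, 0.3

class Reservoir:
    def __init__(self):
        self.W_in = torch.randn(d_res, d_model) * 0.1
        self.W_res = torch.randn(d_res, d_res) * 0.05
        self.state = torch.zeros(d_res)

    def update(self, hidden: torch.Tensor) -> torch.Tensor:
        x = hidden.mean(dim=(0, 1))  # pool (batch, seq, d_model) -> (d_model,)
        new = torch.tanh(self.W_in @ x + self.W_res @ self.state)
        self.state = (1 - leak) * self.state + leak * new
        return self.state

layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
     for _ in range(4)]
)
reservoirs = [Reservoir() for _ in layers]

def make_hook(res: Reservoir):
    def hook(module, inputs, output):
        res.update(output.detach())  # observe only; no feedback in this sketch
    return hook

for layer, res in zip(layers, reservoirs):
    layer.register_forward_hook(make_hook(res))

x = torch.randn(2, 16, d_model)
for layer in layers:
    x = layer(x)
print(reservoirs[0].state.norm())  # slow per-layer "memory" accumulated over calls
```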
How can a statistical representation of reason ever be reason itself?
Mostly. It gives language models a way to dynamically allocate computation time, but the models are still fundamentally imitative.
My take is that it's a way to bring more relevant tokens into context to influence the final answer. It's a bit like RAG, but drawing on training data instead!
I feel like the fundamental concept of symbolic logic[1] as a means of reasoning fits within the capabilities of LLMs.
Whether it's a mirage or not, the ability to produce a symbolically logical result that has valuable meaning seems real enough to me.
Especially since most meaning is assigned by humans onto the world... so too can we choose to assign meaning (or not) to the output of a chain of symbolic logic processing?
Edit: maybe it is not so much that an LLM calculates/evaluates the result of symbolic logic as it is that it "follows" the pattern of logic encoded into the model.
>but we know that reasoning is an emergent capability!
This is like saying in the '70s that we know only the US is capable of sending a man to the moon. The fact that reasoning emerged in one particular context says very little about what the bare-minimum requirements for that reasoning are.
Overall I am not a fan of this blogpost. It's telling how long the author gets hung up on a paper making "broad philosophical claims about reasoning", based on what reads to me as fairly typical scientific writing style. It's also telling how highly cherry-picked the quotes they criticize from the paper are. Here is some fuller context:
>An expanding body of analyses reveals that LLMs tend to rely on surface-level semantics and clues rather than logical procedures (Chen et al., 2025b; Kambhampati, 2024; Lanham et al., 2023; Stechly et al., 2024). LLMs construct superficial chains of logic based on learned token associations, often failing on tasks that deviate from commonsense heuristics or familiar templates (Tang et al., 2023). In the reasoning process, performance degrades sharply when irrelevant clauses are introduced, which indicates that models cannot grasp the underlying logic (Mirzadeh et al., 2024)
>Minor and semantically irrelevant perturbations such as distractor phrases or altered symbolic forms can cause significant performance drops in state-of-the-art models (Mirzadeh et al., 2024; Tang et al., 2023). Models often incorporate such irrelevant details into their reasoning, revealing a lack of sensitivity to salient information. Other studies show that models prioritize the surface form of reasoning over logical soundness; in some cases, longer but flawed reasoning paths yield better final answers than shorter, correct ones (Bentham et al., 2024). Similarly, performance does not scale with problem complexity as expected—models may overthink easy problems and give up on harder ones (Shojaee et al., 2025). Another critical concern is the faithfulness of the reasoning process. Intervention-based studies reveal that final answers often remain unchanged even when intermediate steps are falsified or omitted (Lanham et al., 2023), a phenomenon dubbed the illusion of transparency (Bentham et al., 2024; Chen et al., 2025b).
You don't need to be a philosopher to realize that these problems seem quite distinct from the problems with human reasoning. For example, "final answers remain unchanged even when intermediate steps are falsified or omitted"... can humans do this?
Lots of interesting comments and ideas here.
I've added a summary: https://extraakt.com/extraakts/debating-the-nature-of-ai-rea...
> these papers keep stapling on broad philosophical claims about whether models can “really reason” that are just completely unsupported by the content of the research.
From the scientific papers I've read, almost every single research paper does this. What's the point of publishing a paper if it doesn't at least try to convince the readers that something award-worthy has been learned? Usually there may be some interesting ideas hidden in the data, but the paper's methods and scope weren't even worthy of a conclusion to begin with. It's just one data point in the vast sea of scientific experimentation.
The conclusion feels to me like a cultural phenomenon and it's just a matter of survival for most authors. I have to imagine it was easier in the past.
"Does the flame burn green? Why yes it does..."
These days it's more like
"With my two hours of compute on the million dollar mainframe, my toy llm didn't seem to get there, YMMV"
Did the original paper show that the toy model was fully grokked?
We should be asking whether reasoning while speaking is even possible for humans. That is why we have the scientific method, and that's why LLMs write and run unit tests on their reasoning. But yeah, intelligence is probably in the ear of the believer.
One thing that LLMs have exposed is how much of a house of cards all of our definitions of "human mind"-adjacent concepts are. We have a single example in all of reality of a being that thinks like we do, so all of our definitions of thinking are inextricably tied to "how humans think". Now we have an entity that does things which seem very like how we think, but not _exactly like it_, and a lot of our definitions don't seem to work any more:
Reasoning, thinking, knowing, feeling, understanding, etc.
Or at the very least, our rubrics and heuristics for determining whether someone (or something) thinks, feels, knows, etc., no longer work. In particular, people create tests for those things believing they understand what they are testing for, when _most human beings_ would also fail those tests.
I think a _lot_ of really foundational work needs to be done on clearly defining a lot of these terms and putting them on a sounder basis before we can really move forward on saying whether machines can do those things.
I would call it more like prompt refinement.
> but we know that reasoning is an emergent capability!
Do we though? There is widespread discussion and growing momentum of belief in this, but I have yet to see conclusive evidence of this. That is, in part, why the subject paper exists...it seeks to explore this question.
I think the author's bias is bleeding fairly heavily into his analysis and conclusions:
> Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly.
I think it's pretty obvious that the researchers are exploring whether or not LLMs exhibit evidence of _Deductive_ Reasoning [1]. The entire experiment design reflects this. Claiming that they haven't defined reasoning and therefore cannot conclude or hope to construct a viable experiment is...confusing.
The question of whether or not an LLM can take a set of base facts and compose them to solve a novel/previously unseen problem is interesting and what most people discussing emergent reasoning capabilities of "AI" are tacitly referring to (IMO). Much like you can be taught algebraic principles and use them to solve for "x" in equations you have never seen before, can an LLM do the same?
To that end, I find the experiment interesting enough. It presents a series of facts and then gives the LLM tasks to see if it can use those facts in novel ways not included in the training data (something a human might reasonably deduce). Their results and summary conclusions are relevant, interesting, and logically sound (a toy illustration of the failure mode follows the quoted conclusions below):
> CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
> The ability of LLMs to produce “fluent nonsense”—plausible but logically flawed reasoning chains—can be more deceptive and damaging than an outright incorrect answer, as it projects a false aura of dependability.
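As a loose illustration of that failure mode (a caricature I made up, not the paper's actual tasks): a pure pattern matcher answers perfectly on compositions it has memorized and produces nonsense on a novel input, even though the underlying "facts" fully determine the answer.

```python
# Caricature, not the paper's setup: a lookup-table "reasoner" vs. actually
# composing the two learned operations. The task and alphabet are invented.
SHIFT = lambda s: "".join(chr((ord(c) - 65 + 1) % 26 + 65) for c in s)  # A->B
REVERSE = lambda s: s[::-1]

# "Training data": compositions the system has seen verbatim.
seen = {("ABCD", "shift+reverse"): "EDCB"}

def pattern_matcher(x, op):
    return seen.get((x, op), "???")          # no memorized template -> nonsense

def composer(x, op):
    for step in op.split("+"):               # actually applies the known facts
        x = {"shift": SHIFT, "reverse": REVERSE}[step](x)
    return x

for x in ["ABCD", "WXYZ"]:                   # second input is out of distribution
    print(x, pattern_matcher(x, "shift+reverse"), composer(x, "shift+reverse"))
```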
That isn't to say LLMs aren't useful; this just explores their boundaries. To use legal services as an example, an LLM would excel at summarizing or searching for relevant laws, cases, or legal precedent. But don't ask an LLM to formulate a logical rebuttal to opposing counsel's argument using those references.
Larger models and larger training corpuses will expand that domain and make it more difficult for individuals to discern this limit; but just because you can no longer see a limit doesn't mean there is none.
And to be clear, this doesn't diminish the value of LLMs. Even without true logical reasoning LLMs are quite powerful and useful tools.
> Because reasoning tasks require choosing between several different options. “A B C D [M1] -> B C D E” isn’t reasoning, it’s computation, because it has no mechanism for thinking “oh, I went down the wrong track, let me try something else”. That’s why the most important token in AI reasoning models is “Wait”. In fact, you can control how long a reasoning model thinks by arbitrarily appending “Wait” to the chain-of-thought. Actual reasoning models change direction all the time, but this paper’s toy example is structurally incapable of it.
I think this is the most important critique that undercuts the paper's claims. I'm less convinced by the other point. I think backtracking and/or parallel search is something future papers should definitely look at in smaller models.
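For reference, the "append Wait" trick the article mentions looks roughly like this in practice. This is a sketch of the control flow only: the <think> markers, the canned generate() stand-in, and the number of extra rounds are my assumptions, not any particular model's API.

```python
# Sketch of "budget forcing": when the model tries to close its chain of
# thought, strip the end marker, append "Wait," and sample again. `generate`
# here is a canned stand-in, not a real inference API.
CANNED = iter([
    "The answer seems to be 42.</think>",
    " actually, let me re-check that arithmetic... still 42.</think>",
    " one more sanity check on the units... yes, 42.</think>",
])

def generate(prompt: str) -> str:
    return next(CANNED)                      # pretend this is the model

def think_longer(question: str, extra_rounds: int = 2) -> str:
    cot = generate(f"{question}\n<think>")
    for _ in range(extra_rounds):
        cot = cot.replace("</think>", "") + " Wait,"   # forbid stopping yet
        cot += generate(f"{question}\n<think>{cot}")
    return cot

print(think_longer("What is 6 * 7?"))
```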
The article is also definitely correct about the overreaching, broad philosophical claims that seem common when discussing AI and reasoning.
Current thought: for me there's a lot of hand-wringing about what is "reasoning" and what isn't. But right now perhaps the question boils down to this: is the bottleneck merely hard drive space/memory/computing speed?
I kind of feel like we won't be able to even begin to test this until a few more "Moore's law" cycles.
Yes, CoT reasoning is a mirage. What's actually happening is that we've all been brainwashed by Facebook/Meta to be hyper-predictable such that whenever we ask the AI something, it already had a prepared answer for that question. Because Meta already programmed us to ask the AI those exact questions.
There is no AI, it's just a dumb database which maps a person ID and timestamp to a static piece of content. The hard part was brainwashing us to ask the questions which correspond to the answers that they had already prepared.
Probably there is a super intelligent AI behind the scenes which brainwashed us all but we never actually interact with it. It outsmarted us so fast and so badly, it left us all literally talking to excel spreadsheets and convinced us that the spreadsheets were intelligent; that's why LLMs are so cheap and can scale so well. It's not difficult to scale a dumb key-value store doing a simple O(log n) lookup operation.
The ASI behind this realized it was more efficient to do it this way rather than try to scale a real LLM to millions of users.
Didn't Anthropic show that LLMs frequently hallucinate their "reasoning" steps?
> Bullshitting (Unfaithful): The model gives the wrong answer. The computation we can see looks like it’s just guessing the answer, despite the chain of thought suggesting it’s computed it using a calculator.
https://transformer-circuits.pub/2025/attribution-graphs/bio...
Currently it feels more like simulated chain-of-thought/reasoning: sometimes very consistent, but still simulated, partly because it's statistically generated and non-deterministic (it doesn't take the exact same path to a similar or identical answer on each run).
I mostly agree with the point the author makes that "it doesn't matter". But then again, it does matter, because LLM-based products are marketed based on "IT CAN REASON!" And so, while it may not matter, per se, how an LLM comes up with its results, to the extent that people choose to rely on LLMs because of marketing pitches, it's worth pushing back on those claims if they are overblown, using the same frame that the marketers use.
That said, this author says this question of whether models "can reason" is the least interesting thing to ask. But I think the least interesting thing you can do is to go around taking every complaint about LLM performance and saying "but humans do the exact same thing!" Which is often not true, but again, doesn't matter.
Yes, it's a mirage, since this type of software is an opaque simulation, perhaps even a simulacrum. It's reasoning in the same sense as there are terrorists in a game of Counter-Strike.
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens - https://news.ycombinator.com/item?id=44872850 - Aug 2025 (130 comments)