A Knockout Blow for LLMs?
"They're super expensive pattern matchers that break as soon as we step outside their training distribution" - I find it really weird that things like these are seen as some groundbreaking endgame discovery about LLMs
The discussion around LLMs has a real problem with polarisation. It's probably smart people saying all this stuff about knockout blows and LLM uselessness, but I find LLMs really useful. Is there some emperor's-new-clothes thing going on here - am I just a dumbass who can't see that he's getting excited about a random noise generator?
It's like if I saw a headline about a knockout blow for cars because SomeBigName discovered it's possible to crash them.
It wouldn't change my normal behaviour, it would just make me think "huh, I should avoid anything SomeBigName is doing with cars then if they only just realised that."
From a quick glance it seems to be about spatial reasoning problems. I think there are good reasons why it's tricky to become extremely good at these from being trained on text and static images. Future models trained further multimodally, on video and then on physics simulators, should handle this much better, I think.
There's a recent talk about this by Jim Fan from Nvidia https://youtu.be/_2NijXqBESI
It would be awesome to develop some theory around what kinds of problems LLMs can and cannot solve. That would deter some leads from pushing to solve the unsolvable with this technology.
That being said, this isn’t a knockout blow by any stretch. The strength of LLMs lies in the people who are excited about them. And there’s a perfect reinforcing mechanism for the excitement - the chatbots that use the models.
Admit for a second that you’re a human with biases. If you see something more frequently, you’ll think it’s more important. If you feel good when doing something, you’ll feel good about that thing. If all your friends say something, you’re likely to adopt it as your own belief.
If you have a chatbot that can talk to you more coherently than anyone you’ve ever met, and implement these two nested loops that you’ve always struggled with, you’re poised to become a fan, an enthusiast. You start to believe.
And belief is power. Just as progress in neuroscience hasn't been able to retire the concept of body-soul dualism, testing LLMs won't be able to retire the idea that AI is poised to dominate everything soon.
The first figure in the paper, Accuracy vs Complexity, makes the whole debate moot. The authors find that Claude 3.7's performance collapses around complexity 3, while Claude 3.7 Thinking's collapses around complexity 7 - a massive improvement in the complexity horizon that can be handled. It's real and it's quantitative, so what's the point of philosophical arguments about whether it is truly "reasoning" or not? All LLMs have various horizons: a context horizon/length, a complexity horizon, etc. Reasoning pushes these out further, but not to some infinite, algorithmically perfect recurrent-reasoning effect. But I bet humans pretty much just have a complexity horizon of 12 or 20 or whatever, and bigger models trained on bigger data, with bigger reasoning post-training and better distillation, will keep pushing the horizons further.
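To put rough numbers on that horizon shift (assuming I'm reading the setup right and Hanoi "complexity" is just the disc count), the minimum solution is 2^n - 1 moves, so collapsing at ~3 discs versus ~7 discs is the difference between ~7-move and ~127-move solutions:

```python
# Minimum Tower of Hanoi moves for n discs: each extra disc doubles the work.
for n in range(1, 11):
    print(f"{n} discs -> {2**n - 1} moves")
```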
How about we stop treating AI (I guess LLMs here) as some monolithic block that applies to every problem we could ever have on this earth? Claude might suck at Tower of Hanoi, but in the end we'll just use something better suited to the job. Nobody complains that ML or vision models "fail spectacularly" at something they're not suited for.
I get the criticism, but "on the ground" there's real stuff getting done that couldn't be done before. All of this boils down to an intellectual study which, while good to know, is meaningless in the long run. The only thing that matters is whether the dollars put in can be recouped at the level of hype created, and that answer is probably "maybe" in some areas but not others.
This AI doomerism is getting just as annoying as people claiming AI will replace everyone and everything.
I always assumed LLMs would be one component of “AGI”, but there would be “coprocessors” like logic engines or general purpose code interpreters that would be driven by code or data produced by LLMs just in time.
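Something like this, as a rough sketch of the "coprocessor" idea (the constraint list stands in for hypothetical LLM output, and Z3 is just one example of a logic engine, not something the article proposes):

```python
# Sketch: the LLM emits a formal sub-problem; a deterministic solver
# (here Z3, via `pip install z3-solver`) does the exact reasoning.
from z3 import Ints, Solver, sat

x, y = Ints("x y")
# Placeholder for constraints an LLM might extract from a word problem.
llm_emitted_constraints = [x + y == 10, x - y == 4, x > 0, y > 0]

solver = Solver()
solver.add(*llm_emitted_constraints)
if solver.check() == sat:
    model = solver.model()
    print("x =", model[x], "y =", model[y])  # x = 7, y = 3
else:
    print("unsatisfiable - hand the failure back to the LLM to revise")
```

The LLM does the translation and orchestration; the exact, long-horizon computation happens in the engine.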
> neural networks of various kinds can generalize within a training distribution of data they are exposed to, but their generalizations tend to break down outside that distribution.
So the AI companies really took that to heart and tried to put everything into the training distribution. My stuff, your stuff, their stuff. I remember the good old days of feeding the Wikimedia dump into a Markov chain.
The paper shows that reasoning is better than no reasoning, that reasoning needs more tokens to work even on simple tasks, and that models get confused when things get too complicated. Nothing interesting - it's on the level of what an undergrad would write for a side project. If it weren't "from Apple", no one would be mentioning it.
Oh, another LLM skepticism paper from Apple.
This paper from last year hasn't aged well, given the rapid proliferation of reasoning models.
Related discussion: https://news.ycombinator.com/item?id=44203562
Argh please stop. Everyone knows LLMs aren't AGI currently and they have annoying limitations like hallucinations. Even the "giving up" thing was known before Apple's paper.
You aren't winning anything by saying "aha! I told you they are useless!" because they demonstrably aren't.
Yes, everybody is hoping that someone will come up with a better algorithm that solves these problems, but until they do, it's a little like complaining about the invention of the railway because it can only run on tracks while humans can go pretty much anywhere.
Before you write your comment, just consider that the author very specifically says "LLMs are not useless".
Why would they need to execute the algorithm? That feels like complaining your fork doesn’t cut things like a knife would…
He talks about the limits of reasoning using the Tower of Hanoi game. So I asked Gemini to make it — and then to solve it. You can try it yourself: https://g.co/gemini/share/eb8b68d1dace
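For what it's worth, the actual algorithm is tiny. Here's a minimal sketch (the textbook recursion, not Gemini's output) of what "make it and solve it" amounts to - the model writes the solver once instead of tracking every move in its head:

```python
# Classic recursive Tower of Hanoi solver: returns the full move list.
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, dst, aux, moves)  # park n-1 discs on the spare peg
        moves.append((src, dst))            # move the largest disc
        hanoi(n - 1, aux, src, dst, moves)  # bring the n-1 discs back on top
    return moves

print(len(hanoi(8)))  # 255 moves (2^8 - 1) for the 8-disc version
```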
I made a zero-copy video recorder in C for Linux, and I barely know anything about C, pointers, or Vulkan at all. LLMs are improving rapidly.
Algorithms made to imitate humans exhibit human weaknesses. What a terrible, unexpected outcome! I love how the article is written, but it is literally proving the opposite of its premise.
Playbook:
1) you want to "disprove" some version of AI. Doesn't really matter what.
Take a problem humans face. For example, an almost total inability to follow simple rules for a long time to make a calculation. It's almost impossible to get a human to do this.
Check if AI algorithms, which are algorithms made to imitate humans, have this same problem. Now of course, if in practice they do have that problem, that is actually a success: an algorithm made to imitate humans ... imitates humans successfully, strengths and weaknesses alike! But of course, if you find it, you describe it as total proof that the algorithm is worthless.
An easy source for these problems is of course computers. Anything humans use computers for ... it's because humans suck at doing it themselves. Keeping track of history or facts. Exact calculation. Symbolic computation. Logic (ie. exactly correct answers). More generally math and even positive sciences as a whole are an endless supply of such problems.
2) you want to "prove" some version of AI.
Find something humans are good at. Point out AIs do this. Humans are social animals so how about influencing other humans? From convincing your boss, or on a larger scale using a social network to win an election, right up to actual seduction. Use what humans use to do it, of course (ie. be inaccurate, lie, ...)
Point out what a great success this is. How magical it is that machines can now do this.
3) you want to make a boatload of money
Take something humans are good at but hate, have an AI do it for money.
In other news, water is wet.
I don't think anybody who uses LLMs professionally day-to-day thinks that they can reason like human beings... If some people do think this, they fundamentally don't understand how LLMs work under the hood.
Citing a few points to justify my own conclusion:
> Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs.
> LLMs are no substitute for good well-specified conventional algorithms.
> will continue to have their uses, especially for coding and brainstorming and writing
> But anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves.
I agree with the assessment but disagree with the conclusion:
Being good at coding, writing, etc. is precisely the sort of labor that is both "general intelligence" and will radically change society when clerical jobs are mechanized — and LLMs' ability to write (and interface with) classical algorithms to buttress their performance will only improve.
This is like when machines came for artisans.
The material finding of this paper is that reasoning models are better than non-reasoning models at solving puzzles of intermediate complexity (where that's defined, essentially, by how many steps are required), but that performance collapses past a certain threshold. This threshold differs for different puzzle types. It occurs even if a model is explicitly supplied with an algorithm it can use to solve the puzzle, and it's not a consequence of limited context window size.
The authors speculate that this pattern is a consequence of reasoning models actually solving these puzzles by way of pattern-matching to training data, which covers some puzzles at greater depth than others.
Great. That's one possible explanation. How might you support it?
- You could systematically examine the training data, to see if less representation of a puzzle type there reliably correlates with worse LLM performance.
- You could test how successfully LLMs can play novel games that have no representation in the training data, given instructions (a rough sketch of such a harness is at the end of this comment).
- Ultimately, using mechanistic interpretability techniques, you could look at what's actually going on inside a reasoning model.
This paper, however, doesn't attempt any of these. People are getting way out ahead of the evidence in accepting its speculation as fact.
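For the second suggestion, the harness isn't hard to sketch. This is my own hypothetical setup, not anything from the paper; the made-up game, the model name, and the prompt are all placeholders:

```python
# Invent a puzzle that is unlikely to appear in any training set, hand the
# model the rules, and score its answer with a deterministic referee.
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

RULES = """We are playing a made-up token game.
State: a list of integers. Legal move: replace two adjacent equal numbers
x, x with a single x+1. Goal: reduce the list to a single number.
Reply only with the moves, one pair of 0-based indices per line."""

START = [1, 1, 2, 3]  # solvable: (0,1), then (0,1), then (0,1) leaves [4]

def referee(state, moves):
    """Replay the model's proposed moves against the real rules."""
    state = list(state)
    for i, j in moves:
        if j != i + 1 or i < 0 or j >= len(state) or state[i] != state[j]:
            return False  # illegal move
        state[i:j + 1] = [state[i] + 1]
    return len(state) == 1

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": f"{RULES}\nStart state: {START}"}],
)
# ...parse reply.choices[0].message.content into (i, j) pairs, then:
# print(referee(START, parsed_moves))
```

Scale the start state up and you get a complexity knob with no training-data coverage to pattern-match against.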