I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with lossy text descriptions of such things which they have to re-read and re-interpret at every step.
LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.
This sounds interesting.
I would really like to read a full research paper based on this, one that describes the method in more detail, gives more examples, does more analysis, etc.
Btw, does this use LLMs purely at the text level? Why not images? Most of these patterns are easy to detect at the image level, but I assume they're much harder when presented as text.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
I think this argument is a bit flawed. Yes, you can define AGI as being better than (average) humans at every possible task, but isn't that very arbitrary? It seems more reasonable to expect that different intelligent systems (including animals and humans) have different strengths, and unreasonable to expect one system to be better at everything. Maybe that's a better definition of ASI, but even for ASI, I'd say a system that is better at the majority of tasks (though not necessarily every task) should already count. Being better at literally every possible task may simply not be achievable: you could always design a task that is specifically tailored to human intelligence.
Are there any existing scripts/tools for running these evolutionary algorithms at home with, e.g., Codex, GPT-5, or Claude Code?
That's a super neat approach.
But the core issue seems to be: How do you come up with the fitness function that drives the evolutionary process without human intervention in the first place?
(I've tried something similar with a coding agent where I let the agent modify parts of its system prompt... But it got stuck very fast since there was no clear fitness function)
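To make that concrete, here is a toy sketch of the kind of loop I mean, where llm_propose and score_on_tests are hypothetical stand-ins (not anything from the article). The loop only goes anywhere because score_on_tests exists, and that is exactly the part that needed human intervention in my experiment:

    import random

    def evolve(seed_prompts, llm_propose, score_on_tests,
               generations=10, population=8, keep=3):
        """Toy evolutionary loop over candidate prompts/programs.

        llm_propose(parent_a, parent_b) -> new candidate (LLM-driven mutation/crossover)
        score_on_tests(candidate)       -> float fitness; without this the loop has no direction
        """
        pop = [(score_on_tests(p), p) for p in seed_prompts]
        for _ in range(generations):
            pop.sort(key=lambda t: t[0], reverse=True)
            survivors = pop[:keep]                      # elitism: keep the current best
            children = []
            while len(children) < population - keep:
                a = random.choice(survivors)[1]
                b = random.choice(survivors)[1]
                child = llm_propose(a, b)               # the LLM proposes a mixed/mutated child
                children.append((score_on_tests(child), child))
            pop = survivors + children
        return max(pop, key=lambda t: t[0])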
Actually really promising stuff. I think a lot of the recent advances in the last 6mo-1yr have been in the outer loop (for example, the Google Deep Think model that got IMO gold and the OAI IMO gold result both appear to use substantive outer-loop search strategies [though it's unclear what these are], perhaps to parallelize some generation/verification process). So there's no reason we can't have huge advances in this area outside the industry labs, in my view (I'm uninformed in general, so take this comment with a large grain of salt).
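To be clear about what I mean by "outer loop", here is a toy best-of-N sketch. This is purely my guess at the shape of it; generate and verify are hypothetical stand-ins, and nobody outside the labs knows what they actually do:

    from concurrent.futures import ThreadPoolExecutor

    def outer_loop(problem, generate, verify, n_candidates=64, workers=8):
        """Toy best-of-N outer loop: sample many candidates in parallel,
        score each with a verifier, keep the best.

        generate(problem, seed) -> candidate solution (e.g. a proof attempt)
        verify(problem, cand)   -> float score, e.g. from a checker or critic model
        """
        with ThreadPoolExecutor(max_workers=workers) as pool:
            candidates = list(pool.map(lambda s: generate(problem, s), range(n_candidates)))
        return max(candidates, key=lambda c: verify(problem, c))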
Isn't the author actually overfitting a solution? He'll surely beat ARC-AGI, but that will be all.
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.
blank stare
> With RL, models no longer just learn what sounds correct based on patterns they've seen. They learn what words to output to be correct. RL is the process of forcing the pre-trained weights to be logically consistent.
How does reinforcement learning force the weights to be logically consistent? Isn't it just training with a coarser, fuzzier fitness signal?
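That's my reading too: pre-training gives a dense per-token loss against reference text, while RL gives one scalar reward for the whole answer, i.e. a much coarser fitness signal. Here is a toy REINFORCE-style sketch of the contrast (my own illustration, not any lab's actual objective; all tensors are random stand-ins):

    import torch
    import torch.nn.functional as F

    # Toy contrast: dense per-token pre-training loss vs. one sequence-level RL reward.
    vocab, seq_len = 100, 12
    logits = torch.randn(seq_len, vocab, requires_grad=True)  # stand-in for model outputs
    targets = torch.randint(0, vocab, (seq_len,))              # reference next tokens

    # Pre-training: supervision at every token position.
    pretrain_loss = F.cross_entropy(logits, targets)

    # RL (REINFORCE-style): sample a whole answer, get ONE scalar reward for it,
    # and scale the log-probabilities of all sampled tokens by that reward.
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()
    reward = 1.0 if torch.equal(sampled, targets) else 0.0     # "was the final answer correct?"
    rl_loss = -(dist.log_prob(sampled).sum() * reward)

    # The reward says nothing about which tokens were right; that's the coarser granularity.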
More generally, is it really solving the task if it's given a large number of attempts and an oracle to say whether it's correct? Humans can answer the questions in one shot and self-check the answer, whereas this is like trial and error with an external expert who tells you to try again.
This sounds like it is just slightly smarter than brute forcing your way to a solution.
Oh well, more support for my prediction: nobody will win a Nobel prize for reaching AGI.
The biggest issue I have with ARC-AGI is that it's a visual problem. LLMs (even the newfangled multi-modal ones) are still far worse at vision than at purely text-based problems. I don't think it's possible to build a test of purely text-based questions that would be easy for humans and hard for SOTA models. Yes, there are a few gotchas you can throw at them, but not 500.
Those are bold claims
Congrats, this solution resembles AlphaEvolve: text serves as the high-level search space, and genetic mixing (MAP-Elites in AE) merges attempts at lower levels.
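For anyone who hasn't seen MAP-Elites: instead of a single ranked population, it keeps one elite per cell of a behavior-descriptor grid, which preserves diverse strategies to mix from. A minimal sketch, where mutate, fitness, and descriptor are hypothetical placeholders (in AlphaEvolve-style setups the LLM plays the mutate/merge role):

    import random

    def map_elites(init, mutate, fitness, descriptor, iterations=1000):
        """Minimal MAP-Elites: an archive keyed by a behavior descriptor,
        keeping only the fittest solution (the 'elite') per cell.

        mutate(parent)  -> child            (in AlphaEvolve-style setups, an LLM edit/merge)
        fitness(x)      -> float, higher is better
        descriptor(x)   -> hashable cell id (e.g. program length bucket, strategy type)
        """
        archive = {}
        for x in init:
            cell, f = descriptor(x), fitness(x)
            if cell not in archive or f > archive[cell][0]:
                archive[cell] = (f, x)
        for _ in range(iterations):
            parent = random.choice(list(archive.values()))[1]
            child = mutate(parent)
            cell, f = descriptor(child), fitness(child)
            if cell not in archive or f > archive[cell][0]:
                archive[cell] = (f, child)          # replace the cell's elite
        return archive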
You would be interested in DSPy.
Congrats, you made LLMs perform slightly better at a contrived puzzle. This finally proves that we've cracked intelligence and are well on our way towards AGI.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
Because they are not.
Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
It’s the same reason most of the people who pass your leetcode tests don’t actually know how to build anything real: they are taught to the test, not to reality.