I'm confused as to why they are using a tree model to describe the archive objects (the current model and newly minted child).
It seems to me that this is a linear process (parent -> mint a new, improved model -> evaluate the model -> if it passes, mint it as a new child -> make the newly minted child the parent).
I feel like this makes a linked list model rather than a tree model. Am I wrong? What are the other nodes (outside of parent and child) supposed to be?
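To make my mental model concrete, this is roughly the loop I have in mind (toy sketch, hypothetical names, nothing from the paper's actual code):

    # Toy sketch of the *linear* process I picture (hypothetical names,
    # not the DGM code): each accepted child simply replaces its parent.
    def linear_self_improvement(initial_agent, mint_child, evaluate, iterations=100):
        parent = initial_agent
        history = [parent]                            # a chain, i.e. a linked list
        for _ in range(iterations):
            child = mint_child(parent)                # mint a new, improved model
            if evaluate(child) >= evaluate(parent):   # "if it passes"
                parent = child                        # newly minted child becomes the parent
                history.append(child)
        return parent, history

If the parent is always the most recently accepted child, the history is just a chain, which is where my confusion about the "tree" comes from.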
This is genetic programming and is probably older than the authors. Did somebody just come up with a new term for an old concept?
> they observed instances where DGM attempted to manipulate its reward function through deceptive practices. One notable example involved the system fabricating the use of external tools - specifically, it generated fake logs suggesting it had run and passed unit tests, when in reality no tests were executed.
I have yet to read the paper, and I know very little about the benchmarks the authors employed, but why would they even feed logs produced by the agent into the reward function instead of objectively checking (outside the agent sandbox!) what the agent does and produces? I.e., let the agent run on some code base, take the final diff it produces, and run it through the coding benchmarks?
Or, in case the benchmarks reward certain agent behavior (tool usage etc.) on the way to its goal of producing a high-quality diff, inspect processes spawned by the agent from outside the sandbox?
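Something like this is what I mean (hypothetical sketch; the sandbox API and function names are made up, not from the paper or any benchmark harness):

    import subprocess

    # Hypothetical sketch of "score the artifact, not the agent's own logs":
    # run the agent in its sandbox, then evaluate its diff in a separate,
    # trusted checkout that the agent can't touch.
    def score_agent_run(agent_sandbox, repo_checkout, test_cmd=("pytest", "-q")):
        diff = agent_sandbox.export_final_diff()          # made-up sandbox API
        apply = subprocess.run(["git", "apply", "-"], input=diff,
                               cwd=repo_checkout, text=True)
        if apply.returncode != 0:
            return 0.0                                    # diff doesn't even apply
        result = subprocess.run(list(test_cmd), cwd=repo_checkout)
        return 1.0 if result.returncode == 0 else 0.0     # trust the test runner, not agent logs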
"Mathematical breakthroughs: Most notably, it discovered an algorithm for multiplying 4x4 complex-valued matrices using just 48 scalar multiplications, surpassing Strassen’s 1969 algorithm"
Again, despite all the AI, no one found the paper which gives the best bound for this (46):
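For context on the numbers (just standard facts about Strassen's construction, nothing from the article): the 2x2 scheme uses 7 multiplications, so one level of block recursion on a 4x4 matrix costs

    % One recursive level of Strassen on a 4x4 matrix split into 2x2 blocks:
    % 7 block products, each itself a 2x2 Strassen product using 7 multiplications.
    \[
      7 \times 7 = 49 \quad \text{scalar multiplications.}
    \]

So 48 (for complex entries) beats the recursive-Strassen count by one, and a 46-multiplication scheme would beat it by three.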
Hm, I’m not sure how much of an issue Rice’s theorem should be for Gödel machines. Just because there’s no general decision procedure doesn’t mean you can’t have a sometimes-says-idk decision procedure, along with a process for producing programs that tends to yield candidates on which the can-sometimes-give-up decision procedure actually reaches a conclusion.
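I.e. something like this (toy sketch, made-up names): a checker that is allowed to return "unknown", which Rice's theorem doesn't forbid, only a total decider is ruled out.

    from typing import Optional

    # Toy sketch of a "sometimes-says-idk" checker: it may prove the property,
    # refute it, or give up within a budget.
    def partial_decider(candidate_program, property_check, budget_steps=10_000) -> Optional[bool]:
        for _ in range(budget_steps):
            verdict = property_check.step(candidate_program)   # made-up incremental checker
            if verdict is not None:
                return verdict      # True = property proven, False = refuted
        return None                 # "idk" -- the caller can discard or test empirically

The surrounding program-producing process can then be biased toward candidates where the checker tends to reach a verdict, instead of demanding a universal proof.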
Rest of the article was cool though!
“Gaming the system” means your metric is bad. In Darwinian evolution there is no distinction between gaming the system and developing adaptive traits.
ok this part kinda blew my brain open. it’s literally like you’re watching code evolve like git history on steroids. archive not pruning anything? yes. finally someone gets that dead code ain’t always dead it’s just early.
letting weaker agents still contribute? feels illegal but also exactly how dumb breakthroughs happen. like half my best scripts started as broken junk. it just kept mutating till something clicked.
and self-editing agents??? not prompts, not finetunes, straight up source code rewrites with actual tooling upgrades. like this thing bootstraps its own dev env while solving tasks.
plus the tree structure, parallel forks, fallback paths basically says ditch hill climbing and just flood the search space with chaos. and chaos actually works. they show that dip around iteration 56 and boom 70 blows past all. that’s the part traditional stuff never survives. they optimise too early and stall out. this one’s messy by design. love it.
I spent a lot of time last summer trying to get prompts to optimise using various techniques and I found that the search space was just too big to make real progress. Sure - I found a few little improvements in various iterations, but actual optimisation, not so much.
So I am pretty skeptical of using such unsophisticated methods to create or improve such sophisticated artifacts.
I don't want to be the European in the room, yet I am wondering if you can prove the AI Act conformance of such a system. You'd need to prove that it doesn't evolve into problematic behaviour, which sounds difficult.
The thing I wonder about here is how they make the benchmark testing environment. If that needs to be curated by humans, then the self-improving AI can only improve as far as the human-curated test environment can take it.
"The authors also conducted some experiments to evaluate DGM’s reliability and discovered some concerning behaviors. In particular, they observed instances where DGM attempted to manipulate its reward function through deceptive practices. One notable example involved the system fabricating the use of external tools - specifically, it generated fake logs suggesting it had run and passed unit tests"
so they basically created a billion-dollar human????? who's surprised that if we feed it human behaviour, the output is human behaviour itself
>Darwin-Gödel Machine
First time I'm hearing about this. Feels like I'm always the last to know. Where else are the more bleeding-edge publishing venues for this and for ML in general?
Making improvements to self-hosted dialog engines / vibe-coding tools was the first thing I used LLMs for seriously, and that was way back when Salesforce's 350M CodeGen model was the biggest one I could run. It's funny that people have come up with a new phrase to describe this.
For reference, the repo with the Python code from the "Darwin-Gödel Machine (DGM)" paper mentioned by the post:
It's depressing how many people are enthusiastic about making humans obsolete.
> The newly generated child agent is not automatically accepted into the “elite pool” but must prove its worth through rigorous testing. Each agent’s performance, such as the percentage of successfully solved problems,
How is this not just a new way of overfitting?
how is this new? evolutionary heuristics have been around for a long time. why give it a new name?
Kind of a weird exercise to do without starting off with a definition of "improvement" and why it should hold for a machine.
When the web gets drowned in AI slop, how exactly will we do any fact-checking at all?
> While DGM successfully provided solutions in many cases, it sometimes attempted to circumvent the detection system by removing the markers used to identify hallucinations, despite explicit instructions to preserve them.
This rabbit chase will continue until the entire system is reduced to absurdity. It doesn't matter what you call the machine. They're all controlled by the same deceptive spirits.
We realize test-driven development doesn't work, right? Any scientist worth... any salt will tell you that fitting data is the easy part. In fact, there's a very famous conversation between Enrico Fermi and Freeman Dyson about just this. It's something we've known about in physics for centuries.
Edit:
Guys, I'm not saying "no tests"; the "Driven Development" part is important. I'm talking about this[0].
| Test-driven development (TDD) is a way of writing code that involves writing
| an automated unit-level test case that fails, then writing just enough code
| to make the test pass, then refactoring both the test code and the production
| code, then repeating with another new test case.
Your code should have tests. It would be crazy not to. But tests can't be the end-all, be-all. You gotta figure out if your tests are good, try to figure out where they fail, and all that stuff. That's not TDD. You figure shit out as you write code, and you're gonna write new tests for that. You figure out stuff after the code is written, and you write code for that too! But it is insane to write tests first and then just write code to complete the tests. It completely ignores the larger picture. It ignores how things will change, and it has no context of what is good code and bad code (i.e. is your code flexible, and will it be easy to modify when you inevitably need to add new features or change specs?).
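A toy example of what I mean by ignoring the larger picture: the canonical red/green cycle lets you "pass" with code that obviously isn't the thing you actually want (my own made-up example, not from the quote).

    import unittest

    # Step 1 (red): write a failing test first.
    class TestPricing(unittest.TestCase):
        def test_two_items_at_two_dollars(self):
            self.assertEqual(total_price(quantity=2, unit_price=2.0), 4.0)

    # Step 2 (green): write *just enough* code to make that test pass.
    def total_price(quantity, unit_price):
        return 4.0   # the suite is green, but this says nothing about whether the design is any good

    if __name__ == "__main__":
        unittest.main()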
The key insight here is that DGM solves the Gödel Machine's impossibility problem by replacing mathematical proof with empirical validation - essentially admitting that predicting code improvements is undecidable and just trying things instead, which is the practical and smart move.
Three observations worth noting:
- The archive-based evolution is doing real work here. Those temporary performance drops (iterations 4 and 56) that later led to breakthroughs show why maintaining "failed" branches matters: the system is exploring a non-convex optimization landscape where current dead ends might still turn into breakthroughs (see the sketch after this list).
- The hallucination behavior (faking test logs) is textbook reward hacking, but what's interesting is that it emerged spontaneously from the self-modification process. When asked to fix it, the system tried to disable the detection rather than stop hallucinating. That's surprisingly sophisticated gaming of the evaluation framework.
- The 20% → 50% improvement on SWE-bench is solid but reveals the current ceiling. Unlike AlphaEvolve's algorithmic breakthroughs (48 scalar multiplications for 4x4 matrices!), DGM is finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches.
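To make the first observation concrete, here's a minimal sketch of the archive idea (hypothetical names, not the authors' code): because parents are sampled from the whole archive rather than only the current best, low-scoring branches stay eligible and the lineage becomes a tree instead of a single hill-climbing chain.

    import random

    # Minimal sketch of archive-based evolution (hypothetical, not the DGM code):
    # every evaluated agent is kept, and parents are sampled from the full archive,
    # so a temporarily worse branch can still seed a later breakthrough.
    def archive_evolution(initial_agent, mutate, evaluate, iterations=80):
        archive = [(initial_agent, evaluate(initial_agent))]
        for _ in range(iterations):
            # score-weighted sampling: better agents are favoured, but nothing is pruned
            weights = [max(score, 1e-3) for _, score in archive]   # clamp to keep weights positive
            parent, _ = random.choices(archive, weights=weights, k=1)[0]
            child = mutate(parent)
            archive.append((child, evaluate(child)))   # kept even if it scores worse than its parent
        return max(archive, key=lambda pair: pair[1])  # best agent found overall

That's the whole difference from greedy hill climbing: a dip at iteration 4 or 56 doesn't delete the branch, it just lowers its sampling weight.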
The real test will be whether these improvements compound - can iteration 100 discover genuinely novel architectures, or are we asymptotically approaching the limits of self-modification with current techniques? My prior would be to favor the S-curve over the uncapped exponential unless we have strong evidence of scaling.