> AlphaEvolve achieved up to a 32.5% speedup for the FlashAttention kernel implementation in Transformer-based AI models
> In roughly 75% of cases, it rediscovered state-of-the-art solutions, to the best of our knowledge.
> And in 20% of cases, AlphaEvolve improved the previously best known solutions
These sound like incredible results. I'd be curious what kind of improvements were actually made.
Like, was that "up to a 32.5% speedup" on some weird edge case, with negligible speedup otherwise? Would love to see the benchmarks.
This is great.
But how incremental are these advancements?
I picked one at random (B.2 -- the second autocorrelation inequality). Then, I looked up the paper that produced the previous state of the art (https://arxiv.org/pdf/0907.1379). It turns out that the authors had themselves found the upper bound by performing a numerical search using "Mathematica 6" (p.4). Not only did the authors consider this a secondary contribution (p.2), but they also argued that finding something better was very doable, but not worth the pain:
"We remark that all this could be done rigorously, but one needs to control the error arising from the discretization, and the sheer documentation of it is simply not worth the effort, in view of the minimal gain." (p.5)
So at least in this case it looks like the advancement produced by AlphaEvolve was quite incremental (still cool!).
Cool, but don't get me wrong: isn't this essentially similar to Google's Co-Scientist, where multiple models are in a loop, passing context back and forth and validating things? At its core, it's still a system of LLMs, which is impressive in execution but not fundamentally new.
LLMs are undoubtedly useful at tasks like code "optimisation" and detecting patterns or redundancies that humans might overlook, but this announcement feels like another polished, hypey blog post from Google.
What's also becoming increasingly confusing is their use of the "Alpha" branding. Originally, it was for breakthroughs like AlphaGo or AlphaFold, where there was a clear leap in performance and methodology. Now it's being applied to systems that, while sophisticated, don't really rise to the same level of impact.
edit: I missed the evaluator in my description, but an evaluation method is also applied in Co-Scientist:
"The AI co-scientist leverages test-time compute scaling to iteratively reason, evolve, and improve outputs. Key reasoning steps include self-play–based scientific debate for novel hypothesis generation, ranking tournaments for hypothesis comparison, and an "evolution" process for quality improvement."[0]
[0]: https://research.google/blog/accelerating-scientific-breakth...
One of the researchers is quoted in the Nature article linked here... in the past, when DeepMind published "AlphaTensor" [1][2] in October 2022, it took a single day (!!), see [3], for improvements to the AlphaTensor-based scheme to be discovered. This was then generalized a few months later into a significantly more comprehensive scheme [4]. I do not know whether the more general scheme discovered in [4] made its way back into some improved version of AlphaTensor - but this nonetheless shows that AlphaEvolve may also change, as it becomes absorbed by the community.
[1] Blog: https://deepmind.google/discover/blog/discovering-novel-algo...
[2] Paper: https://www.nature.com/articles/s41586-022-05172-4
[3] arxiv.org/pdf/2210.04045
[4] arxiv.org/abs/2212.01175 Flip graphs for matrix multiplication
(Reposted from here, where I made a mini deep-dive into this: https://x.com/friederrrr/status/1922846803420119410?t=7jZ34P...)
Interestingly, it seems AlphaEvolve has already been in use for a year, and it is just now being publicly shown. The paper also mentions that it uses Gemini 2.0 (Pro and Flash), which means Gemini 2.0 was, in effect, used to help train Gemini 2.5.
I don't know if I would call this the fabled "self-improving feedback loop", but it seems to have some degree of it. It also raises the question of whether AlphaEvolve was being developed for a year or has been in production for a year. By now it makes sense to hold back on sharing what AI research gems you have discovered.
This is an important moment. We now have verifiable evidence that these systems can do new, useful research that has actual value in the real world. That 1% savings is only the start as well. I would expect the gains to compound significantly over time. Also, in a way this process was used to make Gemini 2.5 Pro better, so it's like a baby step towards recursive self-improvement. Not fully automated yet, but there are hints of where this is going.
I'm surprised by how little detail is given about the evolution procedure:
>In AlphaEvolve, the evolutionary database implements an algorithm that is inspired by a combination of the MAP elites algorithm [71] and island-based population models [80, 94].
"inspired by" is doing a lot of heavy lifting in this sentence. How do you choose dimensions of variation to do MAP-elites? How do you combine these two algorithms? How loose is the inspiration? It feels like a lot of the secret sauce is in the answers to these questions, and we get a single paragraph on how the evolution procedure works, which is so vague as to tell us almost nothing.
For the people awaiting the singularity, lines like this read almost like science fiction:
> By suggesting modifications in the standard language of chip designers, AlphaEvolve promotes a collaborative approach between AI and hardware engineers to accelerate the design of future specialized chips.
The paper does not give that many details about the evolution part. Normally, evolutionary algorithms contain some cross-over component where solutions can breed with each other. Otherwise it's better classified as hill climbing / beam search.
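For contrast, a classic one-point crossover on program text might look like this sketch (purely illustrative; nothing in the paper says AlphaEvolve recombines programs this way, though sampling multiple parent programs into the LLM prompt arguably plays a crossover-like role):

    import random

    def crossover(parent_a: str, parent_b: str) -> str:
        # One-point crossover on lines: a prefix of one parent spliced onto
        # a suffix of the other. Without some operator like this, the loop
        # is closer to mutation-only hill climbing.
        lines_a = parent_a.splitlines()
        lines_b = parent_b.splitlines()
        cut_a = random.randint(1, len(lines_a))
        cut_b = random.randint(0, len(lines_b))
        return "\n".join(lines_a[:cut_a] + lines_b[cut_b:])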
Calling it now - RL finally "just works" for any domain where answers are easily verifiable. Verifiability was always a prerequisite, but the difference from prior generations (not just AlphaGo, but any nontrivial RL process prior to roughly mid-2024) is that the reasoning traces and/or intermediate steps can be open-ended with potentially infinite branching, no clear notion of "steps" or nodes and edges in the game tree, and a wide range of equally valid solutions. As long as the quality of the end result can be evaluated cleanly, LLM-based RL is good to go.
As a corollary, once you add in self-play with random variation, the synthetic data problem is solved for coding, math, and some classes of scientific reasoning. No more mode collapse, no more massive teams of PhDs needed for human labeling, as long as you have a reliable metric for answer quality.
This isn't just neat, it's important - as we run out of useful human-generated data, RL scaling is the best candidate to take over where pretraining left off.
Did you see that hallucination in the paper?
It optimized

    initializers.normal(0.0

to

    initializers.normal(0 + 1j * 0,

I thought the results were being reviewed? Anyway, impressive results. That's why OpenAI and Elon were so frightened of Hassabis.
It's hard to stake out a defensible position on bold claims like these because, if they are as presented, it's hard to see how you haven't simply achieved runaway AI.
Philosophically, let's say you talk an old LLM through a new discovery. Thanks to your instruction, the LLM now has access to "new" information not in its training data. It is certainly capable of this. The problem is that this is just laundered human intelligence.
This looks like something that can (and should) be reimplemented open-source. It doesn't look like a particularly daunting project.
Show the training set, and PROVE that the tasks and answers aren't in there. I don't understand why this is not a default first step for proving that this is creating new knowledge.
Interesting to see Terence Tao in the author list. I guess he's fully AI-pilled now. Did he check the math results?
We are entering a new era of evolutionary algorithms and LLMs. Reminds me of the idea behind: https://github.com/DivergentAI/dreamGPT
Why do I get the feeling they are doing the "IBM Watson" thing, where different efforts are being lumped under the same brand name?
Not saying it is that egregious, but it's a slippery slope from "well, it didn't do all these different things out of the box, unsupervised".
Interesting that this wasn't tested on ARC-AGI. Francois has always said he believed program search of this type was the key to solving it. It seems like potentially this approach could do very well.
This is a much better use of an AI than having it write college essays or generate cartoons.
> From the paper, "Notably, for multiplying two 4 × 4 matrices, applying the algorithm of Strassen recursively results in an algorithm with 49 multiplications, which works over any field...AlphaEvolve is the first method to find an algorithm to multiply two 4 × 4 complex-valued matrices using 48 multiplications."
...but Waksman's algorithm from 1970 [1] multiplies two 4 x 4 complex-valued matrices using only 46 multiplications (indeed, it works in any ring admitting division by 2).
Sloppy by DeepMind and by Nature to publish such a claim - did they not ask someone knowledgeable about matrix multiplication to review the work?
This is very neat work! Will be interested in how they make this sort of thing available to the public but it is clear from some of the results they mention that search + LLM is one path to the production of net-new knowledge from AI systems.
Maybe the actual solution to the interpretability/black-box problem is not to ask the LLM to execute a given task, but rather to write deterministic programs that can execute the task.
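A minimal sketch of that pattern, with a hardcoded stand-in for the LLM call so it runs (the generate() helper here is hypothetical, not any real API):

    def generate(prompt: str) -> str:
        # Stand-in for an LLM call; returns a fixed program here.
        return "def solve(x):\n    return sorted(x)"

    def solve_with_program(task_description: str, test_cases) -> str:
        # Ask the model for a program once, then run that program
        # deterministically; only the generation step is a black box.
        source = generate(f"Write a Python function solve(x) that {task_description}")
        namespace = {}
        exec(source, namespace)  # caution: sandbox untrusted code in practice
        for x, expected in test_cases:
            assert namespace["solve"](x) == expected
        return source

    print(solve_with_program("sorts a list", [([3, 1, 2], [1, 2, 3])]))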
I wonder if evolvable hardware [0] is the next step.
In 1996, they optimized an FPGA using a genetic algorithm. It evolved gates that were disconnected from the circuit, yet were required for it to work.
The circuit exploited the minuscule magnetic fields from the disconnected gates rather than the logical connections.
Does this remind anyone else of genetic algorithms?
Is this basically a merge of LLMs with genetic algorithm iteration?
This is Google solving Google-sized problems. I am afraid the rest of the world will look at this and say - "yeah we want to be like Google and adopt this". That is how Kubernetes took over the world.
> Here, the code between <<<<<<< SEARCH and ======= is the exact segment to match in the current program version. The code between ======= and >>>>>>> REPLACE is the new segment that will replace the original one. This allows for targeted updates to specific parts of the code.
Does anybody know how they can guarantee uniqueness of the searched snippet within the code block, or whether that is even possible?
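You probably can't guarantee it up front, but you can detect ambiguity cheaply and reject the diff. A minimal sketch of that policy (my guess at sensible behavior; the paper doesn't say how AlphaEvolve handles non-unique matches):

    def apply_diff(source: str, search: str, replace: str) -> str:
        # Require the SEARCH block to match exactly once; otherwise the
        # edit is ambiguous (or stale) and the model should retry.
        n = source.count(search)
        if n != 1:
            raise ValueError(f"SEARCH block matched {n} times, expected exactly 1")
        return source.replace(search, replace)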
Finally—something directly relevant to my research (https://trishullab.github.io/lasr-web/). Below are my take‑aways from the blog post, plus a little “reading between the lines.”
- One lesson DeepMind drew from AlphaCode, AlphaTensor, and AlphaChip is that large‑scale pre‑training, combined with carefully chosen inductive biases, enables models to solve specialized problems at—or above—human performance.
- These systems still require curated datasets and experts who can hand‑design task‑specific pipelines.
- Conceptually, this work is an improved version of FunSearch (https://github.com/google-deepmind/funsearch/).
- In broad terms, FunSearch (and AlphaEvolve) follow three core design principles:
- Off‑the‑shelf LLMs can both generate code and recall domain knowledge. The “knowledge retrieval” stage may hallucinate, but—because the knowledge is expressed as code—we can execute it and validate the result against a custom evaluation function.
- Gradient descent is not an option for discrete code; a zeroth‑order optimizer—specifically evolutionary search—is required.
- During evolution we bias toward (1) _succinct_ programs and (2) _novel_ programs. Succinctness is approximated by program length; novelty is encouraged via a MAP‑Elites–style "novelty bias," yielding a three‑dimensional Pareto frontier whose axes are _performance, simplicity,_ and _novelty_ (see e.g. OE‑Dreamer: https://claireaoi.github.io/OE-Dreamer/).
Pros
- Any general‑purpose foundation model can be coupled with evolutionary search.
- A domain expert merely supplies a Python evaluation function (with a docstring explaining domain‑specific details). Most scientists I've talked with - astronomers, seismologists, neuroscientists, etc. - already maintain such evaluation functions for their own code.
- The output is an interpretable program; even if it overfits or ignores a corner case, it often provides valuable insight into the regimes where it succeeds.
Cons
- Evolutionary search is compute‑heavy and LLM calls are slow unless heavily optimized. In my projects we need ≈ 60 k LLM calls per iteration to support a reasonable number of islands and populations. In equation discovery we offset cost by making ~99 % of mutations purely random; every extra 1 % of LLM‑generated mutations yields roughly a 10 % increase in high‑performing programs across the population.
- Evaluation functions typically undergo many refinement cycles; without careful curation the search may converge to a useless program that exploits loopholes in the metric.
Additional heuristics make the search practical. If your evaluator is slow, overlap it with LLM calls. To foster diversity, try dissimilar training: run models trained on different data subsets and let them compete. Interestingly, a smaller model (e.g., Llama-3 8B) often outperforms a larger one (Llama-3 70B) simply because it emits shorter programs.
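For concreteness, the kind of evaluation function a domain expert supplies can be as small as the sketch below. This is a hypothetical example for scoring a candidate exp(x) routine; the metric names and scoring are made up, not from the paper:

    import time
    import numpy as np

    def evaluate(candidate_fn) -> dict:
        """Score a candidate implementation of exp(x). Higher is better."""
        rng = np.random.default_rng(0)
        x = rng.uniform(-5.0, 5.0, size=100_000)
        start = time.perf_counter()
        y = candidate_fn(x)
        elapsed = time.perf_counter() - start
        error = float(np.max(np.abs(y - np.exp(x))))
        # Separate metrics let the search keep fast-but-rough and
        # slow-but-accurate candidates alive in different niches.
        return {"accuracy": -np.log10(max(error, 1e-16)), "speed": 1.0 / elapsed}

Note the fixed seed: a candidate could in principle overfit to exactly these inputs, which is the kind of metric loophole the curation point above is about.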
I'm sad not to see any mention of numerical stability. One of the hardest parts of all this automatic optimization of numerical algorithms is ensuring numerical stability. Once we have a strong handle on getting both performance and stability, it will be a delight.
Too bad the code isn't published. I would expect everything from DeepMind to be open source, except the model itself.
I'm surprised I'm not able to find this out - can someone tell me whether AlphaEvolve involves backprop or not?
I honestly have no idea how AlphaEvolve works - does it operate purely at the text level? Meaning, might I be able to come up with something like AlphaEvolve with some EC2 instances and Gemini API access?
Surprised they didn't answer if they tried using AlphaEvolve to improve AlphaEvolve!
AlphaEvolve sounds like the AI version of a genius coder that never sleeps, insane potential!
Interestingly, they improved matrix multiplication, and a paper on arXiv a few days ago [1] also improved it. The only case common to both is <4,5,6> (multiplying a 4x5 matrix by a 5x6 matrix), and both improved it from 93 to 90 multiplications.
It sounds to me like a hyperparameter optimizer (fast evaluator) guided by AI; I wonder if it's related to Google's Vizier
anyone else feel out-evolved yet?
What is an "advanced" algorithm? How do you differentiate this from other algorithms?
It seemed appropriate to use Gemini to make sure my answers were ideal for getting access to the preview.
Would love for AI to kill the leetcode interview
Can someone explain how an "agent" is distinct from a "chatbot"?
I'm reading descriptions of agents and it just seems like the same tech deployed with authority to write and a scheduler
Why does it emphasize math and computer science more in the training stage?
Are problems in math and CS more suitable for training LLMs?
That's 1 year ahead of the ai-2027.com schedule.
Good method to generate synthetic training data, but only works for domains where validation can be scaled up.
Has scifi covered anything after AI? Or do we just feed the beast with Dyson spheres and this is the end point of the intelligent universe?
Why Gemini(s)? Why not LLMs fine-tuned for LARPing as a researcher?
A >2% bump in algorithmic performance is pretty impressive given the search approach.
Packing problems are hard, and it is fun to see new interest in the area given these show up in weird places. =3
I was here to witness the beginning of the end of growth for humanity.
sheeesh
I find it quite profound that there is no mention of generating corresponding code documentation. Without design diagrams, source and commit comments, etc., the resulting code and changes will become incomprehensible and unmaintainable. Unless that is somehow the point?
Software engineering will be completely solved. Even systems like v0 are astounding in their ability to generate code, and they are very primitive compared to what's coming. I get downvoted on HN for this opinion, but it's truly going to happen. Any system that can produce code, test the code, and iterate if needed will eventually outperform humans. Add in reinforcement learning, where they can run the code and train the model when it gets code generation right, and we are on our way to a whole different world.
Maybe this one can stop writing a fucking essay in code comments.
I'm now no longer surprised at just how consistently all the Gemini models overcomplicate coding challenges or just plain get them wrong.
Claude is just consistently spot on: a few salient comments for tricky code, instead of incessantly telling me what it's changed and what I might want to do, making incorrect assumptions when it already has the code or it's something we've discussed, or changing large amounts of unrelated code (e.g. styles). I could go on.
Shame I'm too tight to pay for Claude RN though...
AlphaEvolve is confirming evidence of an intelligence explosion.
The key ingredient for an intelligence explosion is AI accelerating development of AI.
This is it. It’s happening.
From the paper, "Notably, for multiplying two 4 × 4 matrices, applying the algorithm of Strassen recursively results in an algorithm with 49 multiplications, which works over any field...AlphaEvolve is the first method to find an algorithm to multiply two 4 × 4 complex-valued matrices using 48 multiplications."
If you do naive matrix multiplication, you get a sense that you're doing similar work multiple times, but it's hard to quantify just what that duplicated work entails. Compare it to, for example, calculating the size of the union of two sets:
Total size = size(A) + size(B) - size(intersection(A, B))
You have to take out that extra intersection amount because you've counted it twice. What if you could avoid counting it twice in the first place? That's easy, you just iterate over each set once, keeping track of the elements you've already seen.
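The "count once" version in code, just to make the analogy concrete (a trivial sketch):

    def union_size(a, b):
        seen = set()
        for x in a:
            seen.add(x)
        for x in b:
            seen.add(x)  # duplicates are skipped, so nothing is counted twice
        return len(seen)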
Strassen's algorithm keeps track of calculations that are needed later on. It's all reminiscent of dynamic programming.
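For reference, here is Strassen's 2x2 scheme (the standard textbook formulas, not AlphaEvolve's new 48-multiplication algorithm): seven intermediate products shared across the four output entries.

    def strassen_2x2(A, B):
        # A and B are 2x2 matrices as nested tuples; 7 multiplications
        # instead of the naive 8, with products reused across outputs.
        (a, b), (c, d) = A
        (e, f), (g, h) = B
        m1 = (a + d) * (e + h)
        m2 = (c + d) * e
        m3 = a * (f - h)
        m4 = d * (g - e)
        m5 = (a + b) * h
        m6 = (c - a) * (e + f)
        m7 = (b - d) * (g + h)
        return ((m1 + m4 - m5 + m7, m3 + m5),
                (m2 + m4, m1 - m2 + m3 + m6))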
What I find interesting is that the extra savings seem to require complex values. There must be something going on in the complex plane that the naive approach is again over-counting.