AI solves International Math Olympiad problems at silver medal level

ocfnash | 1370 points

So I am extremely hyped about this, but it's not clear to me how much heavy lifting this sentence is doing:

> First, the problems were manually translated into formal mathematical language for our systems to understand.

The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?
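To make the distinction concrete, here is a toy, hypothetical "determine all X" statement in Lean 4 with Mathlib (not one of the actual AlphaProof statements); note that the formal statement already names the answer set, so whoever writes it has committed to the answer before any proving starts:

    import Mathlib

    -- Toy problem: "determine all natural numbers n with n + n = 2 * n".
    -- The answer set (here, all of ℕ) appears explicitly in the statement,
    -- so the formalizer has already fixed the answer; only the proof remains.
    example : {n : ℕ | n + n = 2 * n} = Set.univ := by
      ext n
      simp [two_mul]

The question is exactly whether the {foo} on the right-hand side of the published statements was supplied by the human formalizers or discovered by the system.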

Smaug123 | a month ago

This is certainly impressive, but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.

But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.

It really sounds like the AI solved only a single problem in the 9 hours students get, so it certainly would not be even close to the medals. What's the need to taint an impressive result with an apples-to-oranges comparison?

Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?

necovek | a month ago

This is the real deal. AlphaGeometry solved a very limited set of problems with a lot of brute-force search. This is a much broader method that I believe will have a great impact on the way we do mathematics. They are really implementing a self-feeding pipeline from natural language mathematics to formalized mathematics, where they can train both formalization and proving. In principle this pipeline can also learn basic theory building, like creating auxiliary definitions and lemmas. I really think this is the holy grail of proof assistance and will allow us to formalize most mathematics that we create very naturally. Humans will work post-rigorously and let the machine assist with filling in the details.

golol | a month ago

The lede is a bit buried: they're using Lean!

This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.

Hopefully more humans will write types in Lean and similar systems as a much better way of writing prompts.

Ericson2314 | a month ago

A good brief overview here from Tim Gowers (a Fields Medallist, who participated in the effort), explaining and contextualizing some of the main caveats: https://x.com/wtgowers/status/1816509803407040909

michael_nielsen | a month ago

> ... but whenever IMO is brought up, a caveat should be put out: medals are awarded to 50% of the participants (high school students), with 1:2:3 ratio between gold, silver and bronze. That puts all gold and silver medalists among the top 25% of the participants.

Yes, it is true, but getting onto a country-specific team is itself an arduous journey, and involves brutal winnowing every step of the way, e.g. regional math olympiad, then national math olympiad, etc.

This is then followed by further training specifically meant for this elite bunch, and maybe further eliminations, etc.

Suffice it to say that qualifying for a country-specific team is IMHO a big deal. Getting a gold/silver from amongst them is just plain awesome!

signa11 | a month ago

I'm seriously jealous of the people getting paid to work on this. Sounds great fun and must be incredibly satisfying to move the state of the art forward like that.

fancyfredbot | a month ago

Machines have been better than humans at chess for decades.

Yet no one cares. Everyone's busy watching Magnus Carlsen.

We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.

This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.

cynicalpeace | a month ago

Theorem proving is a single-player game with an insanely big search space; I always thought it would be solved long before AGI.

IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took on the daunting task of formalizing the entirety of mathematics themselves.

This lack of formalization in math papers was what killed any attempt at automation, because AI researchers had to wrestle with the human element of figuring out the author's own notation, implicit knowledge, skipped proof steps...

thrance | a month ago

Tangentially: I found it fascinating to follow along with the solution to Problem 6: https://youtu.be/7h3gJfWnDoc (aquaesulian is a nod to the ancient name of Bath). There’s no advanced math and each step is quite simple; I’d guess at about an 8th-grade level.

Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.

I don’t know what Lean is or how to see the AI’s proofs, but an AI system that can explain such a question on par with the YouTuber above would be fantastic!

Jun8 | a month ago

Once Gemini, the LLM, integrates with AlphaProof and AlphaGeometry 2, it might be able to reliably perform logical reasoning. If that's the case, software development might be revolutionized.

"... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind. https://x.com/demishassabis/status/1816499055880437909

nopinsight | a month ago

> First, the problems were manually translated into formal mathematical language for our systems to understand. In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.

adverbly | a month ago

The kicker with some of these math competition problems: there will be problems that reduce to finding all natural numbers for which some statement is true. These are almost always small numbers, less than 100 in most circumstances.

Which means these problems are trivial to solve if you have a computer: you can simply check all possibilities. That is precisely the reason why calculators aren't allowed.

But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved - roughly 30 minutes per problem. You are not supposed to use brute force, but recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer is not enough for full points. You would have to give adequate justification.
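As a toy illustration of the "check all possibilities" point (a made-up bounded property, not an actual competition problem), this is the kind of claim a proof assistant such as Lean can settle by pure exhaustive computation, with no insight and no human-readable justification involved:

    import Mathlib

    -- For every n < 100, n² is congruent to 0 or 1 mod 4.
    -- `decide` checks all 100 cases mechanically in the kernel.
    example : ∀ n < 100, n * n % 4 = 0 ∨ n * n % 4 = 1 := by decide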

SJC_Hacker | a month ago

Some more context is provided by Tim Gowers on Twitter [1].

Since I think you need an account to read threads now, here's a transcript:

Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.

It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.

However, that statement needs a bit of qualifying.

The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.

If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.

Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.

Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.

As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.

However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.

It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.

So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.

That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.

Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.

It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.

But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.

The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".

However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.

1. https://x.com/wtgowers/status/1816509803407040909?s=46

robinhouston | a month ago

This is quite cool! I've found logical reasoning to be one of the biggest weak points of LLMs; nice to see that an alternative approach works better! I've tried to enlist GPT to help me play an Android game called 4=10, where you solve simple math problems, and GPT was hilariously terrible at it. It would both break the rules I described and make math mistakes, such as claiming 6*5-5+8=10.

I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.

Also, perhaps an LLM could do the translation step that is currently manual?

amarant | a month ago

The problems were first converted into a formal language by humans, so they were only partly solved by the AI.

petters | a month ago

Reading into the details, the system is more impressive than the title. 100% of the algebra and geometry problems were solved. The remaining problems are of the combinatorial type, which ironically more closely resembles software engineering work.

nitrobeast | a month ago

Sometimes I wonder if in 100 years, it's going to be surprising to people that computers had a use before AI...

gallerdude | a month ago

Wow, that's absolutely impressive to hear!

Also, it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.

StefanBatory | a month ago

I'm still unclear whether the system used here is actually reasoning through the process of solving the problem, or brute-forcing solutions, with reasoning only coming in during the mathematical proof of each candidate solution.

Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?

_heimdall | a month ago

This is kind of an ideal use-case for AI, because we can say with absolute certainty whether their solution is correct, completely eliminating the problem of hallucination.

HL33tibCe7 | a month ago

It would be nice if on the page they included detailed descriptions of the proofs it came up with, more information about the capabilities of the system and insights into the training process...

If the data is synthetic and covers a limited class of problems, I would imagine what it's doing mostly reduces to some basic search-pattern heuristics, which would be of more value to understand than just being told it can solve a few problems in three days.

majikaja | a month ago

We need to up the ante: getting human-like performance on any task is not impressive in itself; what matters is superhuman performance, orders of magnitude above. These comparisons with humans in order to create impressive-sounding titles disguise the fact that we are still in the stone age of intelligence.

seydor | a month ago

Coincidentally, I just posted about how well LLMs handle adding long strings of numbers: https://userfriendly.substack.com/p/discover-how-mistral-lar...

zhiQ | a month ago

I'm curious if we'll see a world where computers can solve math problems so easily that we'll be overwhelmed by all the results and stop caring. The role of humans might change to asking the computer interesting questions that we care about.

dan_mctree | a month ago

> As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise.

Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.

0xd1r | a month ago

Except it didn’t. The problem statements were hand-encoded into a formal language by human experts, and even then only one problem was actually solved within the time limit. So, claiming the work was “silver medal” quality is outright fraudulent.

skywhopper | a month ago

> The system was allowed unlimited time; for some problems it took up to three days. The students were allotted only 4.5 hours per exam.

I know speed is just a matter of engineering, but it looks like we still have a ways to go. Hold the gong...

1024core | a month ago

It’s like bringing a rocket launcher to a fist fight but I’d like to use these math language models to find gaps in logic when people are making online arguments. It would be an excellent way to verify who has done their homework.

stonethrowaway | a month ago

See https://en.wikipedia.org/wiki/Automated_Mathematician for an early system that seems similar in some way.

PaulHoule | a month ago

To what extent is the training and structure of AlphaProof tailored specifically to IMO-type problems, which typically have short solutions using combinations of a small handful of specific techniques?

(It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)

nybsjytm | a month ago

So they weren't able to solve the combinatorics problems. I'm not super well versed in competition math, but combinatorics problems always seem to be the most interesting to me.

osti | a month ago

Remember when people thought computers would never be able to beat a human Grand Master at chess? Ohhh, pre-2000 life, how I miss thee.

lo_fye | a month ago

I honestly expected the IOI (International Olympiad in Informatics) to be "beaten" much earlier than the IMO. There's AlphaCode, of course, but as of the latest update I don't think it was quite at "silver medal" level. And available LLMs are probably not even at "honourable mention" level.

I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).

Given how much of a dent LLMs are already making in beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.

[1] https://info.atcoder.jp/entry/llm-abc-rules-en

quirino | a month ago

There doesn’t seem to be much information on how they attempted and failed to solve the combinatorial type problems.

Anyone know any details?

ckcheng | a month ago

Brilliant and so encouraging!

> because of limitations in reasoning skills and training data

One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?

pnjunction | a month ago

Is the score of 28 comparable to the score of 29 here? https://www.kaggle.com/competitions/ai-mathematical-olympiad...

jerb | a month ago

Is it really such a smart thing to train a non-human "entity" to beat humans at math?

djaouen | a month ago

Are all of these specialized models available for use? Like, does it have an API?

I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them

brap | a month ago
[deleted]
| a month ago

Can anyone comment on how different the AI generated proofs are when compared to those of humans? Recent chess engines have had some 'different' ideas.

11101010001100 | a month ago

If the system took 3 days to solve a problem, how different is this approach from a brute-force attempt at the problem with educated guesses? That's not reasoning, in my mind.

imranhou | a month ago

Why is it so hard to make an AI that can translate an informally specified math problem (and Geometry isn't even so informal) into a formal representation?

gowld | a month ago

Why can frontier models achieve a silver medal in the Math Olympiad but also fail to answer "which number is bigger, 9.11 or 9.9"?

sssummer | a month ago

It’s as impressive as, if not more than, AI beating a chess master. But are we, or should we be, really impressed?

quantum_state | a month ago

Is this just Google blowing up their own asses, or is this actually usable with some sane license?

rowanG077 | a month ago

That's great, but does that particular model also know if/when/that it does not know?

c0l0 | a month ago

Oh, the title was changed to International Math Olympiad. I was reading IMO as "in my opinion", haha.

atum47 | a month ago

6 months ago I predicted Algebra would be next after geometry. Nice to see that was right. I thought number theory would come before combinatorics, but this seems to have solved one of those. Excited to dig into how it was done

https://news.ycombinator.com/item?id=39037512

fovc | a month ago

Can someone explain why theorem proving and math problem solving are not far easier problems for computers? Why do they require any "artificial intelligence" at all?

For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.
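For what it's worth, that exact argument is trivial for a proof assistant to check; a minimal sketch in Lean 4 with Mathlib (whose `Even a` is defined as ∃ r, a = r + r) might look like this:

    import Mathlib

    -- Sum of two even integers is even: unpack witnesses x and y with
    -- a = x + x and b = y + y, then supply x + y as the witness for a + b.
    example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
      obtain ⟨x, hx⟩ := ha
      obtain ⟨y, hy⟩ := hb
      exact ⟨x + y, by rw [hx, hy]; ring⟩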

What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?

lumb63 | a month ago

How do they know their formalization of the informal problems into formal ones was correct?

mathinaly | a month ago

Please could someone explain, very simply, what the training data was composed of?

__0x01 | a month ago

Is it one of those "slowly, slowly, then suddenly" things? I hope so.

m3kw9 | a month ago

Can it / did it solve problems that weren't solved yet?

mupuff1234 | a month ago

How long until this tech is integrated into compilers?

amelius | a month ago

> First, the problems were manually translated into formal mathematical language

That is more than half the work of solving them. The headline should read "AI solves the simple part of each IMO problem at silver medal level".

dmitrygr | a month ago
[deleted]
| a month ago

Goalposts at the moon, FUD at "but what if it's obviously fake?".

Real, exact quotes from the top comments at 1 PM EST.

"I want to believe that the computer found it, but I can't find anything to confirm."

"Curing cancer will require new ideas"

"Maybe they used 10% of all of GCP [Google compute]"

refulgentis | a month ago

Why on earth did the "beastie" need the questions translating?

So it failed at the first step (comprehension) and hence I think we can request a better effort next time.

gerdesj | a month ago

This will in a few months change everything forever. Exponential growth incoming soon from DeepMind systems.

badrunaway | a month ago

Like it understands any of it

szundi | a month ago

And yet it thinks 3.11 is greater than 3.9

(probably confused by version numbers)

thoiwer23423 | a month ago

How long before it solves the last two problems?

mik09 | a month ago

This is a fun result for AI, but a very disingenuous way to market it.

IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].

This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.

If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?

Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.

[0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf

lolinder | a month ago

Haha, what a dumb tincan (c) somebody on Twitter right now :D

gyudin | a month ago

Good, now use DiffusER on algebra somehow please

machiaweliczny | a month ago

I read this as “In My Opinion” and really thought this about AI dealing with opinionated people. Nope. HN is still safe. For now…

pmcf | a month ago

In other news today calculator solves math problem.

Sparkyte | a month ago
[deleted]
| a month ago

In 1997, machines defeated a World Chess Champion for the first time, using brute-force "dumb search." Critics noted that while "dumb search" worked for chess, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]

In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while this fancy form of "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]

In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" over cleverly pruned spaces isn't as dumb as the critics would like it to be?

---

[a] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

cs702 | a month ago

[dead]

nsjoz | a month ago

[dead]

ezugwuernesttoc | a month ago

> AI solves International Math Olympiad problems at silver medal level

> In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.

Why not compare with students who are given 3 days to submit an answer? /s

hulitu | a month ago

[dead]

bigbacaloa | a month ago

[dead]

lngnmn2 | a month ago

[dead]

ezugwuernesttoc | a month ago

[dead]

khana | a month ago

[dead]

t3estabc | a month ago

[dead]

teresasovinski | a month ago

[flagged]

kerbaupapua | a month ago

[flagged]

ksoajl | a month ago

[flagged]

highcountess | a month ago

HOLY SHIT. It's amazing

Davidzheng | a month ago

It's bullshit. AlphaGeometry can't even solve Pythagoras' theorem.

They're not open-sourcing anything.

This is a dead end on which no further research can be built.

It violates pretty much every principle of incremental improvement on which science is based. It's here just for hype, and the 300+ comments prove it.

data_maan | a month ago

A lot of words to say second place

NoblePublius | a month ago

No benchmarks of any kind?

iamronaldo | a month ago

This means we may need to remove or replace the Olympiad. It has no practical significance. Winners never contributed to any major scientific breakthroughs.

myspeed | a month ago

Billions of dollars spent building this, gigawatt-hours of energy used to train it. And the best it could do is "silver"?

Got to be kidding me. We are fucked

xyst | a month ago

Parlor tricks. Wake me up when AI can reliably identify which number is circled at the level of my two-year-old.

AyyEye | a month ago

I see DeepMind is still playing around with RL + search algorithms, except now it looks like they're using an LLM to generate state candidates.

I don't really find this that impressive. With enough compute you could just do n-of-10,000 LLM generations to "brute force" a difficult problem and you'll get there eventually.

dinobones | a month ago

IMO problems aren't fundamentally different from chess or other games, in that the answer is already known.

piombisallow | a month ago

I'm actually not that surprised. Maths Olympiads IME have always been 80% preparation, 20% skill - if not more heavily tuned to preparation. It was all about solving as many problems as possible ahead of the papers, and having a good short term memory. Since Olympiads are for kids, the amount of actual fundamental mathematical theorems required is actually not that great.

Sounds perfect for a GPT model, with lots of input training data (problem books and solutions).

rich_sasha | a month ago

This shows a major gap in AI.

The proofs of these problems aren't interesting. They were already known before the AI started work.

What's interesting is how the AI found the proof. The only answer we have is "slurped data into a neural network, matched patterns, and did some brute search".

What were the ideas it brainstormed? What were the dead-end paths? What were the "activations" where the problem seemed similar to a certain piece of input, which led to a guess of a step in the solution?

gowld | a month ago

What was the total energy consumption required to achieve this result (both training and running)?

And how much CO2 was released into Earth's atmosphere?

balls187 | a month ago