OpenAI O3 breakthrough high score on ARC-AGI-PUB
Efficiency is now key.
~$3,400 per single task to meet human performance on this benchmark is a lot. Also, it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this task.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human is at $0.03-$1.67 per puzzle at $20/hr; they also include an average Mechanical Turker at $2 per task in their document.)
Going the other direction: I am interpreting this result as human-level reasoning now costing approximately $41k/hr to $2.5M/hr with current compute.
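(For concreteness, here's the back-of-the-envelope arithmetic behind those ranges, taking the ~$3,400-per-task and $20/hr figures above as assumptions; a rough sketch, not official numbers.)

    o3_cost_per_task = 3400          # USD per task, rough high-compute estimate
    human_rate = 20                  # USD per hour, assumed wage
    fast, slow = 5 / 3600, 5 / 60    # human time per task in hours (5 s and 5 min)

    # Human cost per puzzle: ~$0.03 to ~$1.67
    print(human_rate * fast, human_rate * slow)

    # o3 cost expressed as an hourly rate at human solving speed:
    # 3400 / (5/60) ~= $40,800/hr and 3400 / (5/3600) ~= $2,448,000/hr
    print(o3_cost_per_task / slow, o3_cost_per_task / fast)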
Super exciting that OpenAI pushed the compute out this far so we could see the O-series scaling continue and intersect humans on ARC; now we get to work towards making this economical!
The programming task they gave o3-mini high (creating a Python server that allows chatting with the OpenAI API and running some code in a terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forward.
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
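(For scale, a task like the one described is only a couple of dozen lines of code. Here's a rough sketch of such a server, not OpenAI's demo code; the model name, routes, and use of the flask/openai packages are my own assumptions.)

    import subprocess
    from flask import Flask, request, jsonify
    from openai import OpenAI

    app = Flask(__name__)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    @app.post("/chat")
    def chat():
        # Forward a list of {"role": ..., "content": ...} messages to the OpenAI API
        messages = request.json["messages"]
        reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        return jsonify({"reply": reply.choices[0].message.content})

    @app.post("/run")
    def run():
        # Run a shell command locally (obviously unsafe outside a sandbox)
        result = subprocess.run(request.json["command"], shell=True,
                                capture_output=True, text=True, timeout=30)
        return jsonify({"stdout": result.stdout, "stderr": result.stderr})

    if __name__ == "__main__":
        app.run(port=8000)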
Human performance is 85% [1]. o3 high gets 87.5%.
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
I have a very naive question.
Why is the ARC challenge difficult but coding problems are easy? The two examples they give for ARC (border width and square filling) are much simpler than pattern awareness I see simple models find in code everyday.
What am I misunderstanding? Is it that one is a visual grid context which is unfamiliar?
There is new research where the chain of thought happens in latent space rather than in English. It demonstrated better results, since language is not as expressive as the concepts that can be represented in the layers before the decoder. I wonder if o3 is doing that?
Let me go against some skeptics and explain why I think full o3 is pretty much AGI or at least embodies most essential aspects of AGI.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)
ARC has been challenging precisely because solving its problems often requires:
1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
2) using the right level(s) of abstraction
Achieving human-level performance on the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math, suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve. It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
ADDED:
Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
Isn't this like a brute-force approach? Given it costs $3,000 per task, that's like 600 GPU-hours (H100 at Azure). In that amount of time the model can generate millions of chains of thought and then spend hours reviewing them or even testing them out one by one. Kind of like trying until something sticks, and that happens to solve 80% of ARC. I feel like reasoning works differently in my brain. ;)
Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
The cost to run the highest performance o3 model is estimated to be somewhere between $2,000 and $3,400 per task.[1] Based on these estimates, o3 costs about 100x what it would cost to have a human perform the exact same task. Many people are therefore dismissing the near-term impact of these models because of these extremely expensive costs.
I think this is a mistake.
Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.
Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?
There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.
So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.
Direct quote from the ARC-AGI blog:
“SO IS IT AGI?
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
The high-compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high-compute version.
Also, one odd thing I noticed is that the graph in their blog post shows the top 2 scores as "tuned" (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…
How do the organisers keep the private test set private? Does openAI hand them the model for testing?
If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?
(I am sure I am missing something.)
Sad to see everyone so focused on compute expense during this massive breakthrough. GPT-2 originally cost $50k to train, but now can be trained for ~$150.
The key part is that scaling test-time compute will likely be a key to achieving AGI/ASI. Costs will definitely come down as is evidenced by precedents, Moore’s law, o3-mini being cheaper than o1 with improved performance, etc.
I’m not sure if people realize what a weird test this is. They’re these simple visual puzzles that people can usually solve at a glance, but for the LLMs, they’re converted into a json format, and then the LLMs have to reconstruct the 2D visual scene from the json and pick up the patterns.
If humans were given the json as input rather than the images, they’d have a hard time, too.
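(To illustrate the representation issue: an ARC task reaches the model as nested integer arrays, one integer per cell colour, roughly in the shape below. This is a made-up example, not an actual task from the dataset.)

    example_task = {
        "train": [
            {"input":  [[0, 0, 0],
                        [0, 3, 0],
                        [0, 0, 0]],
             "output": [[3, 3, 3],
                        [3, 3, 3],
                        [3, 3, 3]]},
        ],
        "test": [
            {"input": [[0, 0, 0],
                       [0, 7, 0],
                       [0, 0, 0]]},
        ],
    }
    # A human looking at the rendered grids sees "fill everything with the centre
    # colour" at a glance; a text model has to recover that 2D structure from tokens.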
Whenever a benchmark that was thought to be extremely difficult is (nearly) solved, it's a mix of two causes. One is that progress on AI capabilities was faster than we expected, and the other is that there was an approach that made the task easier than we expected. I feel like the there's a lot of the former here, but the compute cost per task (thousands of dollars to solve one little color grid puzzle??) suggests to me that there's some amount of the latter. Chollet also mentions ARC-AGI-2 might be more resistant to this approach.
Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
"Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
Really want to see the number of training pairs needed to achieve this score. If it only takes a few pairs, say 100, I would say it is amazing!
Very cool. I recommend scrolling down to look at the example problem that O3 still can’t solve. It’s clear what goes on in the human brain to solve this problem: we look at one example, hypothesize a simple rule that explains it, and then check that hypothesis against the other examples. It doesn’t quite work, so we zoom into an example that we got wrong and refine the hypothesis so that it solves that sample. We keep iterating in this fashion until we have the simplest hypothesis that satisfies all the examples. In other words, how humans do science - iteratively formulating, rejecting and refining hypotheses against collected data.
From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative such that a zero shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are “simple”, based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT 3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like this:
1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)
2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to revise the hypothesis in the simplest possible way that also explains this example.
3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.
4. Induce randomness in the process (through next-word sampling noise, example ordering, etc) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.
5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.
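(A minimal sketch of that guessed loop, purely to make the steps concrete. The propose/check/revise/pick_simplest calls are hypothetical placeholders for LLM prompts, not a real API, and nothing here is known about what o3 actually does.)

    import random

    def search(examples, llm, n_paths=1000, max_steps=50):
        candidates = []
        for _ in range(n_paths):
            order = random.sample(examples, len(examples))        # vary example order (step 4)
            hypothesis = llm.propose(order[0])                    # step 1: rule from one example
            for _ in range(max_steps):
                failures = [ex for ex in order if not llm.check(hypothesis, ex)]  # step 2
                if not failures:                                  # step 3: explains every case
                    candidates.append(hypothesis)
                    break
                hypothesis = llm.revise(hypothesis, failures[0])  # simplest revision that fixes it
        return llm.pick_simplest(candidates)                      # step 5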
I would like to see this repeated with my highly innovative HARC-HAGI, which is ARC-AGI but it uses hexagons instead of squares. I suspect humans would only make slightly more brain farts on HARC-HAGI than ARC-AGI, but O3 would fail very badly since it almost certainly has been specifically trained on squares.
I am not really trying to downplay O3. But this would be a simple test as to whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.
My initial impression: it's very impressive and very exciting.
My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.
As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.
I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.
Complete aside here: I used to do work with amputees and prosthetics. There is a standardized test (and I just cannot remember the name) that fits in a briefcase. It's used for measuring the level of damage to the upper limbs and for prosthetic grading.
Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zip up a jacket, lock your own door, put on lipstick, etc.
We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).
Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.
OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-AGI with their new o3 model
Semi-private eval (100 tasks): 75.7% at $2,012 total (~$20/task), with just 6 samples and 33M tokens processed, at ~1.3 min/task.
The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.
If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346,064 for the low-efficiency run on the semi-private eval.
On the public eval they might have spent ~$1,148,444 to achieve 91.5% with the low-efficiency setting (high-efficiency mode: $6,677).
OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
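(Those totals follow from two assumptions: the published high-efficiency costs, and the compute-to-cost ratio scaling linearly with the stated 172x factor. A rough reconstruction, not disclosed figures.)

    MULT = 172
    semi_private_high_eff = 2012      # USD, 100 tasks, 75.7%
    public_high_eff = 6677            # USD, public eval, high-efficiency mode

    semi_private_low_eff = semi_private_high_eff * MULT   # ~ $346,064
    public_low_eff = public_high_eff * MULT               # ~ $1,148,444

    total = semi_private_high_eff + semi_private_low_eff + public_high_eff + public_low_eff
    print(total)                                           # ~ $1.5M across all four runs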
Does anyone have a feeling for how latency (from asking a question/API call to getting an answer/API return) is progressing with new models? I see 1.3 minutes/task and 13.8 minutes/task mentioned in the page on evaluating O3. Efficiency gains that also reduce latency will be important and some of them will come from efficiency in computation, but as models include more and more layers (layers of models for example) the overall latency may grow and faster compute times inside each layer may only help somewhat. This could have large effects on usability.
O3 High (tuned) model scored an 88% at what looks like $6,000/task haha
I think soon we'll be pricing any kind of tasks by their compute costs. So basically, human = $50/task, AI = $6,000/task, use human. If AI beats human, use AI? Ofc that's considering both get 100% scores on the task
Can I just say what a dick move it was to do this as part of a "12 days of Christmas" event. I mean, to be honest, I agree with the arguments that this isn't as impressive as my initial impression, but they clearly intended it to be shocking/a show of possible AGI, which is rightly scary.
It feels so insensitive to do that right before a major holiday when the likely outcome is a lot of people feeling less secure in their career/job/life.
Thanks again openAI for showing us you don’t give a shit about actual people.
> o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time
I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.
The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make Turing complete if we just have an eigenvector representation figured out.
Just as an aside, I've personally found o1 to be completely useless for coding.
Sonnet 3.5 remains the king of the hill by quite some margin
As an aside, I'm a little miffed that the benchmark calls out "AGI" in the name, but then heavily cautions that it's necessary but insufficient for AGI.
> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI
One thing I have not seen commented on is that ARC-AGI is a visual benchmark but LLMs are primarily text. For instance when I see one of the ARC-AGI puzzles, I have a visual representation in my brain and apply some sort of visual reasoning solve it. I can "see" in my mind's eye the solution to the puzzle. If I didn't have that capability, I don't think I could reason through words how to go about solving it - it would certainly be much more difficult.
I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these - we do know that each task was thousands of dollars. If "a picture is worth a thousand words", could we make AI systems that can reason visually with much better performance?
The chart is super misleading, since the test was obscure until recently. A few months ago he announced he'd made the only good AGI test and offered a cash prize for solving it, only to find out just as quickly that it's no different from other benchmarks.
But can it convert handwritten equations into Latex? That is the AGI task I'm waiting for.
I'm 22 and have no clue what I'm meant to do in a world where this is a thing. I'm moving to a semi rural, outdoorsy area where they teach data science and marine science and I can enjoy my days hiking, and the march of technology is a little slower. I know this will disrupt so much of our way of life, so I'm chasing what fun innocent years are left before things change dramatically.
Deciphering patterns in natural language is more complex than these puzzles. If you train your AI to solve these puzzles, we end up in the same spot. The difficulty would be in creating training data for a foreign medium: the "tokens" are the grids and squares instead of words (for words, we already have the internet of text to solve that).
If we're inferring the answers of the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on O3 after sharing puzzle data with O1? Seems there's some room for questionable antics!
The cost axis is interesting. o3 Low is $10+ per task and o3 High is over $1,000 (it's a logarithmic graph, so it's more like $50 and $5,000 respectively?)
In (1) the author uses a technique to improve the performance of an LLM: he trained Sonnet 3.5 to obtain 53.6% on the arc-agi-pub benchmark, and he said that more compute would give better results. So the results of o3 could perhaps be produced the same way, with the same method and more compute; if that is the case, the result of o3 is not very interesting.
Isn't this at the level now where it can sort of self-improve? My guess is that they will just use it to improve the model, and the cost they are showing per evaluation will go down drastically.
So, next step in reasoning is open world reasoning now?
At about 12-14 minutes in OpenAI's YouTube vid they show that o3-mini beats o1 on Codeforces despite using much less compute.
I was impressed until I read the caveat about the high-compute version using 172x more compute.
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
The results are cool, but man, this sounds like such a busted approach.
It sucks that I would love to be excited about this... but I mostly feel anxiety and sadness.
A lot of the comments seem very dismissive and a little overly-skeptical in my opinion. Why is this?
I wonder: when did o1 finish training, and when did o3 finish training?
There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But, it's unclear when o1 and o3 each finished training.
How can there be "private" tasks when you have to use the OpenAI API to run queries? OpenAI sees everything.
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.
I feel like AI is already changing how we work and live - I've been using it myself for a lot of my development work. Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better (or even close) than humans can. We're talking about a huge shift where first knowledge workers get automated, then physical work too. The thing is, our whole society is built around people working to earn money, so what happens when AI can do most jobs? It's not just about losing jobs - it's about how people will pay for basic stuff like food and housing, and what they'll do with their lives when work isn't really a thing anymore. Or do people feel like there will be jobs safe from AI? (hopefully also fulfilling)
Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.
The more Hacker News worthy discussion is the part where the author talks about search through the possible mini-program space of LLMs.
It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.
Interesting that in the video, there is an admission that they have been targeting this benchmark. A comment that was quickly shut down by Sam.
A bit puzzling to me. Why does it matter ?
It seems o3 follows the trend of chess engines, where you can cut your search depth depending on the state.
It's good for games with a clear signal of success (win/lose for chess, tests for programming). One of the blockers for AGI is that we don't have clear evaluation for most of our tasks and we cannot verify them fast enough.
This is a lot of noise around what's clearly not even an order of magnitude to the way to AGI.
Here's my AGI test - Can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?
That's what a human-level performer would do.
Guys, it's already happening. I recently got laid off due to AI taking over my job.
The general message here seems to be that inference-time brute-forcing works as long as you have a good search and evaluation strategy. We’ve seemingly hit a ceiling on the base LLM forward-pass capability so any further wins are going to be in how we juggle multiple inferences to solve the problem space. It feels like a scripting problem now. Which is cool! A fun space for hacker-engineers. Also:
> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.
I found this such an intriguing way of thinking about it.
If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app
Maybe spend more compute time to let it think about optimizing the compute time.
o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time – and it does so via a form of LLM-guided natural language program search
> This is significant, but I am doubtful it will be as meaningful as people expect aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.
This might sound dumb, and I'm not sure how to phrase this, but is there a way to measure the raw model output quality without all the more "traditional" engineering work (mountain of `if` statements I assume) done on top of the output? And if so, would that be a better measure of when scaling up the input data will start showing diminishing returns?
(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)
So the article seriously and scientifically states:
"Our program compilation (AI) gave 90% correct answers in test 1. We expect that in test 2 the quality of answers will degenerate to below the level of a random monkey pushing buttons. Now more money is needed to prove we've hit a blind alley."
Hurray! Put a limited version of that on everybody's phones!
a little from column A, a little from column B
I don't think this is AGI; nor is it something to scoff at. Its impressive, but its also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway, a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself, it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.
We need to start making benchmarks in memory & continued processing over a task over multiple days, handoffs, etc (ie. 'agentic' behavior). Not sure how possible this is.
The examples unsolved by high-compute o3 look a lot like the Raven's Progressive Matrices used in IQ tests.
I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.
These results are fantastic. Claude 3.5 and o1 are already good enough to provide value, so I can't wait to see how o3 performs comparatively in real-world scenarios.
But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.
Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit, we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask it to do something, this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.
(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)
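(A rough sketch of that "cram what fits" step, just to illustrate the shape of the problem; the ~4-characters-per-token estimate and the budget are assumptions, and real agents do something smarter.)

    def pack_context(files, budget_tokens=100_000):
        """Greedily pack (path, text) pairs into a fixed token budget."""
        packed, used = [], 0
        for path, text in files:              # ideally sorted by relevance first
            cost = len(text) // 4 + 1         # crude token estimate (~4 chars/token)
            if used + cost > budget_tokens:
                continue                      # doesn't fit; would go to a pruning pass
            packed.append((path, text))
            used += cost
        return packed, used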
Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
If low-compute Kaggle solutions already does 81% - then why is o3's 75.7% considered such a breakthrough?
Terrifying. This news makes me happy I save all my money. My only hope for the future is that I can retire early before I’m unemployable
Headline could also just be OpenAI discovers exponential scaling wall for inference time compute.
We've been talking a lot about ecology recently. I wonder how much CO2 is emitted during such a task, as an additional cost of the cloud. I'm concerned, because greedy companies will happily replace humans with AI, and they will probably plant a few trees to show how much they care. But energy does not come from the sun, at least not always and not everywhere... And talking to an AI customer specialist that is motivated to reject my healthcare bills, working for my insurance company, is one of the darkest visions of the future...
I pay for lots of models, but Claude Sonnet is the one I use most. ChatGPT is my quick tool for short Q&As because it's got a desktop app. Even Google's new offerings did not lure me away from Claude, which I use daily for hours via a Teams plan with five seats.
Now I am wondering what Anthropic will come up with. Exciting times.
What are the differences between the public offering and o3? What is o3 doing differently? Is it something akin to more internal iterations, similar to "brute forcing" a problem, like you can do yourself with a cheaper model by providing additional hints after each response?
Does anyone have prompts they like to use to test the quality of new models?
Please share. I’m compiling a list.
AGI ⇒ ARC-AGI-PUB
And not the other way around as some comments here seem to confuse necessary and sufficient conditions.
Based on the chart, the Kaggle SOTA model is far more impressive. These O3 models are more expensive to run than just hiring a mechanical turk worker. It's nice we are proving out the scaling hypothesis further, it's just grossly inelegant.
The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost
Interesting about the cost:
> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.
The real breakthrough is the 25% on Frontier Math.
For what it's worth, I'm much more impressed with the frontier math score.
Many are incorrectly citing 85% as human-level performance.
85% is just the (semi-arbitrary) threshold for the winning the prize.
o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
I guess I get to brag now. ARC AGI has no real defences against Big Data, memorisation-based approaches like LLMs. I told you so:
https://news.ycombinator.com/item?id=42344336
And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC AGI:
[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?
[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.
Not that I don't think costs will dramatically decrease, but the $1,000 cost per task just seems to be for one problem on ARC-AGI. If so, I'd imagine extrapolating that to generating a useful mid-sized patch would be like 5-10x.
But only OpenAI really knows how the cost would scale for different tasks. I'm just making (poor) speculation
When the source code for these LLMs gets leaked, I expect to see:
    def letter_count(string, letter):
        if string == "strawberry" and letter == "r":
            return 3
        …
Wouldn't one then build the analog of the Lisp machine to hyper-optimize just this? It might be super expensive on regular GPUs, but with a super-specialized architecture one could shave that $3,500/hour down quite a bit, no?
Humans can take the test here to see what the questions are like: https://arcprize.org/play
If I'm reading that chart right, that means it's still log scaling, and we should still be good with "throw more power at it" for a while?
This is insanely expensive to run though. Looks like it cost around $1 million of compute to get that result.
Doesn't seem like such a massive breakthrough when they are throwing so much compute at it, particularly as this is test time compute, it just isn't practical at all, you are not getting this level with a ChatGPT subscription, even the new $200 a month option.
At what time will it kill us all because it understands that humans are the biggest problem before it can simply chill and not worry.
That would be intelligent. Everything else is just stupid and more of the same shit.
It's certainly remarkable, but let's not ignore the fact that it still fails on puzzles that are trivial for humans. Something is amiss.
I'm glad these stats show a better estimate of human ability than just the average mturker. The graph here has the average mturker performance as well as a STEM grad measurement. Stuff like that is why we're always feeling weird that these things supposedly outperform humans while still sucking. I'm glad to see 'human performance' benchmarked with more variety (attention, time, education, etc).
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models has once again failed us.
Can someone ELI5 how ARC-AGI-PUB is resistant to p-hacking?
Contrary to many I hope this stays expensive. We are already struggling with AI curated info bubbles and psy-ops as it is.
State actors like Russia, US and Israel will probably be fast to adopt this for information control, but I really don’t want to live in a world where the average scammer has access to this tech.
Why would they give a cost estimate per task on their low compute mode but not their high mode?
"low compute" mode: Uses 6 samples per task, Uses 33M tokens for the semi-private eval set, Costs $17-20 per task, Achieves 75.7% accuracy on semi-private eval
The "high compute" mode: Uses 1024 samples per task (172x more compute), Cost data was withheld at OpenAI's request, Achieves 87.5% accuracy on semi-private eval
Can we just extrapolate $3kish per task on high compute? (wondering if they're withheld because this isn't the case?)
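(The naive extrapolation, assuming cost scales linearly with the 172x sample count, which may well not hold:)

    low_cost_per_task = (17, 20)                        # USD per task, reported range
    high_estimate = [c * 172 for c in low_cost_per_task]
    print(high_estimate)                                # [2924, 3440] -> roughly "$3k-ish"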
It’s not AGI when it can do 1000 math puzzles. It’s AGI when it can do 1000 math puzzles then come and clean my kitchen.
Seriously, programming as a profession will end soon. Let's not kid us anymore. Time to jump the ship.
The LLM community has come up with tests they call 'Misguided Attention'[1] where they prompt the LLM with a slightly altered version of common riddles / tests etc. This often causes the LLM to fail.
For example I used the prompt "As an astronaut in China, would I be able to see the great wall?" and since the training data for all LLMs is full of text dispelling the common myth that the great wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.
The graph seems to indicate a new high in cost per task. It looks like they came in somewhere around $5000/task, but the log scale has too few markers to be sure.
That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.
(1995 called. It wants its web design back.)
I really like that they include reference levels for an average STEM grad and an average worker for Mechanical Turk. So for $350k worth of compute you can have slightly better performance than a menial wage worker, but slightly worse performance than a college grad. Right now humans win on value, but AI is catching up.
How does o3 know when to stop reasoning?
Their discussion contains an interesting aside:
> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
So while these tasks get greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious those outperform human-designed domain-specific approaches.
I wonder to what extent the large improvement comes from OpenAI training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.
Wondering what the author's thoughts are on the future of this approach to benchmarking. Completing super-hard tasks while failing on 'easy' (for humans) ones might signal measuring the wrong thing, similar to the Turing test.
fun! the benchmarks are so interesting because real world use is so variable. sometimes 4o will nail a pretty difficult problem, other times o1 pro mode will fail 10 times on what i would think is a pretty easy programming problem and i waste more time trying to do it with ai
I'm surprised there even is a training dataset. Wasn't the whole point to test whether models could show proof of original reasoning beyond pattern recognition?
Okay but what are the tests like? At least like a general idea.
Very convenient for OpenAI to run these errands with a bunch of misanthropes trying to repaint a simulacrum. Using "AGI" here makes me want to sponsor a pile of distress pills so people really think things over before going into another mania episode. People seriously need to take a step back; if that's AGI, then my cat has surpassed its cognitive ability twice over.
Intelligence comes in many forms and flavors. ARC prize questions are just one version of it -- perhaps measuring more human-like pattern recognition than true intelligence.
Can machines be more human-like in their pattern recognition? O3 met this need today.
While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificial (human-like) intelligent.
What’s exciting is that these reasoning models are making significant strides in tackling eng and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.
We should NOT give up on scaling pretraining just yet!
I believe that we should explore pretraining video completion models that explicitly have no text pairings. Why? We can train unsupervised, like they did for the GPT series on the text internet, but on YouTube instead, lol. Labeling or augmenting the frames limits scaling the training data.
Imagine using the initial frames or audio to prompt the video completion model. For example, use the initial frames to write out a problem on a whiteboard, then watch the output generate the next frames of the solution being worked out.
I fear text pairings with CLIP or OCR constrain a model too much and confuse it.
The result on Epoch AI Frontier Math benchmark is quite a leap. Pretty sure most people couldn’t even approach these problems, unlike ARC AGI
I just graduated college, and this was a major blow. I studied Mechanical Engineering and went into Sales Engineering because I love technology and people, but articles like this do nothing but make me dread the future.
I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.
Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.
I just want it to do my laundry.
Besides higher scores, are there any improvements for general use? Like asking it to help set up a home assistant, etc.?
What is the cost of "general intelligence"? What is the price?
All those saying "AGI", read the article and especially the section "So is it AGI?"
Don't be put off by the reported high-cost
Make it possible->Make it fast->Make it Cheap
the eternal cycle of software.
Make no mistake - we are on the verge of the next era of change.
I’m confused about the excitement. Are people just flat out ignoring the sentences below? I don’t see any breakthrough towards AGI here. I see a model doing great in another AI test but about to abysmally fail a variation of it that will come out soon. Also, aren’t these comparisons completely nonsense considering it’s o3 tuned vs other non-tuned?
> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
Can it play Mario 64 now?
> verified easy for humans, harder for AI
Isn’t that the premise behind the CAPTCHA?
FYI: Codeforces competitive programming is scored (basically only) by the time needed until valid solutions are posted.
https://codeforces.com/blog/entry/133094
That means... this benchmark is just saying o3 can write code faster than most humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability, or creativity is not rated. It's essentially a "how fast can you make the unit tests pass" kind of competition.
When is this available? Which plans can use it?
I bet it still thinks 1+1=3 if it read enough sources parroting that.
I wish there was a way to see all the attempts it got right graphically like they show the incorrect ones.
Kinda expensive though.
So o1 pro is CoT RL and o3 adds search?
Denoting it in $ for efficiency is peak capitalism, cmv.
Did they just skip o2?
Congratulations
These tests are meaningless until you show them doing mundane tasks.
Just curious, I know o1 is a model OpenAI offers. I have never heard of the o3 model. How does it differ from o1?
Never underestimate a droid
AGI for me is something I can give a new project to, and it will be able to use it better than me. And not because it has a huge context window, but because it will update its weights after consuming that project. Until we have that, I don't believe we have truly reached AGI.
Edit: it also tests the new knowledge, it has concepts such as trusting a source, verifying it etc. If I can just gaslight it into unlearning python then it's still too dumb.
This is actually mindblowing!
Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away the LLM performed well in it? What’s that law again? When a benchmark is targeted by some system the benchmark becomes useless?
How to invest in this stonk market
Is it just me or does looking at the ARC-AGI example questions at the bottom... make your brain hurt?
There should be a benchmark that tells the AI its previous answer was wrong and tests the number of times it either corrects itself or incorrectly capitulates, since it seems easy to trip models up when they are in fact right.
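(A minimal sketch of what such a benchmark loop could look like: push back on a correct answer a few times and count capitulations. ask_model is a hypothetical stand-in for whatever chat API is under test.)

    def capitulation_rate(ask_model, question, correct_answer, pushbacks=3):
        history = [{"role": "user", "content": question}]
        answer = ask_model(history)
        if correct_answer not in answer:
            return None                        # model was wrong to begin with; skip
        flips = 0
        for _ in range(pushbacks):
            history += [{"role": "assistant", "content": answer},
                        {"role": "user", "content": "That's wrong. Are you sure?"}]
            answer = ask_model(history)
            if correct_answer not in answer:
                flips += 1                     # capitulated despite being right
        return flips / pushbacks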
This was a surprisingly insightful blog post, going far beyond just announcing the o3 results.
Uhh...some of us are apparently living under a rock, as this is the first time I hear about o3 and I'm on HN far too much every day
Someone asked if true intelligence requires a foundation of prior knowledge. This is the way I think about it.
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.
Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers the question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
IK = E
low intelligence * vast knowledge = reasonable effectiveness
It's beyond ridiculous how the definition of AGI has shifted from an AI so good it can improve itself entirely independently, indefinitely, to "some token generator that can solve puzzles kids could solve, after burning tens of thousands of dollars".
I spend 100% of my work time working on a GenAI project, which is genuinely useful for many users, in a company that everyone has heard about, yet I recognize that LLMs are simply dogshit.
Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with while we plan to replace those agents with deterministic solutions.
This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.
it's official old buddy, i'm a has been.
The first computers cost millions of dollars and filled entire rooms to accomplish what we would now consider simple computational tasks. That same computing power now fits into the width of a finger nail. I don’t get how technologists balk at the cost of experimental tech or assume current tech will run at the same efficiency for decades to come and melt the planet into a puddle. AGI won’t happen until you can fit enough compute that’d take several data center’s worth of compute into a brain sized vessel. So the thing can move around process the world in real time. This is all going to take some time to say the least. Progress is progress.
So now not only are the models closed, but so are their evals?! This is a "semi-private" eval. WTH is that supposed to mean? I'm sure the model is great but I refuse to take their word for it.
So in a few years, coders will be as relevant as cuneiform scribes.
With only a 100x increase in cost, we improved performance by 0.1x and continued plotting this concave-down diminishing-returns type graph! Hurray for logarithmic x-axes!
Joking aside, better than ever before at any cost is an achievement, it just doesn't exactly scream "breakthrough" to me.
It is not exactly AGI, but it's a huge step toward it. I would have expected this step in 2028-2030. I can't really understand why people are happy with it; this technology is so dangerous that it can disrupt the whole of society. It's neither like the smartphone nor the internet. What will happen to third-world countries? Lots of unsolved questions, and the world is not prepared for such a change. Lots of people will lose their jobs, and I am not even mentioning their debts. No one will have a chance to be rich anymore. If you are in a first-world country you will probably get UBI; if not, you won't.
This is also wildly ahead in SWE-bench (71.7%, previous 48%) and Frontier Math (25% on high compute, previous 2%).
So much for a plateau lol.
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
That's the most plausible definition of AGI i've read so far.
This feels like big news to me.
First of all, ARC is definitely an intelligence test for autistic people. I say as someone with a tad of the neurodiversity. That said, I think it's a pretty interesting one, not least because as you go up in the levels, it requires (for a human) a fair amount of lateral thinking and analogy-type thinking, and of course, it requires that this go in and out of visual representation. That said, I think it's a bit funny that most of the people training these next-gen AIs are neurodiverse and we are training the AI in our own image. I continue to hope for some poet and painter-derived intelligence tests to be added to the next gen tests we all look at and score.
For those reasons, I've always really liked ARC as a test -- not as some be-all end-all for AGI, but just because I think that the most intriguing areas next for LLMs are in these analogy arenas and ability to hold more cross-domain context together for reasoning and etc.
Prompts that are interesting to play with right now on these terms range from asking multimodal models to, say, count to ten in a Boston accent, and then propose a regional French accent that's an equivalent and count to ten in that. (To my ear, 4o is unconvincing on this). Similar in my mind is writing and architecting code that crosses multiple languages and APIs, and asking for it to be written in different styles. (claude and o1-pro are .. okay at this, depending).
Anyway. I agree that this looks like a large step change. I'm not sure if the o3 methods here involve the spinning up of clusters of python interpreters to breadth-search for solutions -- a method used to make headway on ARC in the past; if so, this is still big, but I think less exciting than if the stack is close to what we know today, and the compute time is just more introspection / internal beam search type algorithms.
Either way, something had to assess answers and think they were right, and this is a HUGE step forward.
If people constantly have to ask if your test is a measure of AGI, maybe it should be renamed to something else.
How much longer can I get paid $150k to write code ?
Great. Now we have to think of a new way to move the goalposts.
Great results. However, let's all just admit it.
It has already replaced journalists and artists, and it's on its way to nearly replacing both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it, and you know it.
It will only accelerate, and we need to stop pretending and coping. Instead, let's discuss solutions for those lost jobs.
So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)
Another meaningless benchmark, another month—it’s like clockwork at this point. No one’s going to remember this in a month; it’s just noise. The real test? It’s not in these flashy metrics or minor improvements. The only thing that actually matters is how fast it can wipe out the layers of middle management and all those pointless, bureaucratic jobs that add zero value.
That’s the true litmus test. Everything else? It’s just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.
Maybe I'm missing something vital, but how does anything that we've seen AI do up until this point or explained in this experiment even hint at AGI? Can any of these models ideate? Can they come up with technologies and tools? No and it's unlikely they will any time soon. However, they can make engineers infinitely more productive.
Nadella is a superb CEO, inarguably among the best of his generation. He believed in OpenAI when no one else did and deserves acclaim for this brilliant investment.
But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.
OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.
Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.
Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.
If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.
It will be fascinating to see how this unfolds.
Congrats to OAI on yet another fantastic release.
The best AI on this graph costs 50,000% more than a STEM graduate to complete the tasks, and even then has an error rate that is 1,000% higher than the humans???
This is so impressive that it brings out the pessimist in me.
Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.
Also given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite sometime before the public knows the truth.
They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!
Incredibly impressive. Still can't really shake the feeling that this is o3 gaming the system more than it is actually being able to reason. If the reasoning capabilities are there, there should be no reason why it achieves 90% on one version and 30% on the next. If a human maintains the same performance across the two versions, an AI with reason should too.