>GPT-5 showed significant improvement in only one benchmark domain: Telecom. The others were somewhat overlooked during the model presentation, so we won't bother with them either.
I work at OpenAI and you can partly blame me for our emphasis on Telecom. While we no doubt highlight the evals that make us look good, let me defend why the emphasis on Telecom isn't unprincipled cherry picking.
Telecom was built after Retail and Airline, and fixes some of their problems. In Retail and Airline, the model is graded against a ground-truth reference solution. Grading against a reference solution makes grading easier, but has the downside that valid alternative solutions can be scored 0 by the automatic grader. This, along with some user-model issues, is partly why Airline and Retail scores stopped climbing with the latest generations of models and are stuck around 60% / 80%. I'd bet you $100 that even a superintelligence would plateau around here too, as getting 100% requires perfectly guessing which valid solution happens to be written down as the reference.
In Telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which can be reached via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling, among other things. So Telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% by brittle grading and other issues.
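To make the contrast concrete, here's a hypothetical sketch of the two grading styles (not the actual tau2-bench code; all names are invented):

```python
# Hypothetical sketch of the two grading styles; function and field
# names are invented, not taken from tau2-bench.

def grade_by_reference(actions: list[str], reference: list[str]) -> bool:
    # Retail/Airline-style: pass only if the agent's actions match the
    # single reference solution, so a valid alternative ordering scores 0.
    return actions == reference

def grade_by_outcome(final_state: dict, expected: dict) -> bool:
    # Telecom-style: pass if the environment ends up in the expected
    # state, regardless of which valid action sequence produced it.
    return all(final_state.get(k) == v for k, v in expected.items())
```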
Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that Telecom is much better than Airline/Retail for measuring tool use.
Incidentally, another thing to keep in mind when looking critically at OpenAI and others reporting scores on these evals is that the evals give no partial credit - so sometimes a very good model that does everything but one thing perfectly ends up with a very poor score. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if your tasks trigger a quirk not present in the eval).
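As a toy illustration with invented numbers: under all-or-nothing grading, a model that nails 4 of 5 steps on every task reports a score of 0%, despite getting 80% of the steps right.

```python
# Invented toy numbers: 10 tasks, 5 required steps each; the model
# botches exactly one step per task.
tasks = [[True, True, True, True, False] for _ in range(10)]

step_accuracy = sum(sum(t) for t in tasks) / (10 * 5)  # 0.8: looks strong
task_score = sum(all(t) for t in tasks) / 10           # 0.0: what the eval reports
print(step_accuracy, task_score)
```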
Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982
I wish they had published the prompt that was given to Claude to improve GPT-5-mini's performance, as well as a before-and-after comparison of a prompt that underwent this transformation.
My take: we have no clue why this works, and the performance could just as easily drop tomorrow.
This is the PR with the changes in case people missed it:
The only problem is that having Claude rewrite the prompt negates some of the efficiency and latency benefits of using mini. For system prompts this obviously doesn't matter, but for continuous user interaction it feels unworkable.
It definitely makes sense that improving formatting and clarity would help performance for these smaller models, but I'm wondering if GPT-5-mini is already smart enough to handle that reformatting itself: rewrite the prompt, then hand it off to another instance of itself.
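A rough sketch of that two-pass idea, assuming the standard OpenAI Python SDK; the model name and the rewrite instruction are my own guesses, not anything from the article:

```python
# Hypothetical two-pass setup: the same small model first rewrites the
# prompt, then answers the rewritten version. The model name and the
# rewrite instruction are assumptions.
from openai import OpenAI

client = OpenAI()

REWRITE = (
    "Rewrite the following prompt for an AI agent: short imperative steps, "
    "constraints before examples, no redundant prose. "
    "Return only the rewritten prompt.\n\n"
)

def self_rewrite_then_answer(prompt: str, model: str = "gpt-5-mini") -> str:
    rewritten = client.responses.create(model=model, input=REWRITE + prompt)
    return client.responses.create(
        model=model, input=rewritten.output_text
    ).output_text
```

Whether the extra pass pays for itself presumably depends on how often the prompt changes, per the latency concern above.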
Overall an awesome article!
Really interesting. What did the original prompt look like? Perhaps the original prompt just wasn't very good? I feel like the changes Claude suggested (except maybe a couple) are already well-known prompt-engineering practices.
My experience as well.
Prompt changes affect output substantially (just search arXiv); the difficult part is finding an optimal structure that yields the best results. It's expensive to do a lot of testing on your own, so for now it mostly comes down to feel and experience. Then you mix in tool calls, other agent calls, and client functions, and it all becomes terribly hard to evaluate.
I'm still puzzled by how the distance between policies can affect the output, and how a simple retry fixes everything.
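For what it's worth, a bare-bones harness for that kind of testing might look like the sketch below; the agent loop, pass check, and variants are all placeholders. Averaging several trials per task also speaks to the retry observation: single runs are noisy.

```python
import random
import statistics

# Placeholders for a real agent loop and outcome check.
def run_agent(system_prompt: str, task: dict) -> dict:
    return {"ok": random.random() < 0.7}

def passed(outcome: dict) -> bool:
    return outcome["ok"]

def score_variant(system_prompt: str, tasks: list[dict], n_trials: int = 4) -> float:
    # Agent evals are noisy, so average several trials per task; a single
    # run mostly measures luck rather than the prompt.
    return statistics.mean(
        statistics.mean(
            passed(run_agent(system_prompt, t)) for _ in range(n_trials)
        )
        for t in tasks
    )

variants = ["You are a support agent. ...", "Follow these steps exactly: ..."]
tasks = [{"id": i} for i in range(20)]
best = max(variants, key=lambda v: score_variant(v, tasks))
```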
Rewriting prompts isn't free of costs. The cost here is that different prompts work for different contexts and don't generalise: the rewritten prompt won't work well for other cases like medical or social advice.
I think this prompt-rewriting technique is part of why "reasoning" models perform well: they know exactly how to rewrite the prompt for a given context.
FWIW, I don't fully trust these benchmarks, because a huge bump like this is unexpected - I'd expect OpenAI to have optimised enough that such gaps wouldn't be left open.
I feel like eventually we'll get LLMs that act the way compilers do now: they'll take a prompt and turn it into an optimized prompt for a bigger LLM.
This sort of thing is well-trodden ground; if it seems exciting to you, check out DSPy.
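For the curious, a minimal sketch of the idea in DSPy; the model name, data, and metric are placeholders, and the optimizer API may differ by version:

```python
# Minimal DSPy sketch: the optimizer searches for better instructions and
# few-shot demos instead of you hand-tuning the prompt. The model, data,
# and metric here are placeholders.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

program = dspy.ChainOfThought("question -> answer")

trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    # ...more labeled examples; the optimizer needs a reasonably sized set
]

def metric(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

optimized = dspy.MIPROv2(metric=metric, auto="light").compile(
    program, trainset=trainset
)
```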
I wonder if it would be possible to improve the benchmark score even further by simply showing Claude the current hardest problems and asking it to improve the prompt, without including any specifics related to those problems.
> Removed verbose explanations mixed with instructions
Is Claude rewriting the generic instructions once, or is it rewriting the core task statement each time? If the latter, I'm not sure how you prevent information leakage: Claude might easily be "solving" some of the tasks and inserting subtle hints about the approach. I think this result is very interesting if it holds after rewriting only the generic instructions, even if the performance boost is lower.
Doesn't saying "check -> action" suggest you're taking _away_ the agentic capabilities, and optimizing for the benchmark, meaning it's no longer a good benchmark for agentic capabilities?
That's like being able to see the test before taking it
Copilot in VSCode seems to do something similar in the form of todo lists.
Have you tried using GPT-5 with high reasoning effort to rewrite the prompt? Why Claude for this vs. some other model?
No before/after prompt.
Into the trash it goes.
DSPy was ahead of its time and is still underutilized.
Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.
You would also be interested in DSPy...
Here is the summary of key improvements made:
1. Structure & Flow
2. AI Agent Optimizations
3. Cognitive Load Reduction
4. Actionable Language