> When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.
This matches my intuition about the coding performance of these models a lot better. I don't think any current coding benchmark accurately measures coding performance.
I would argue almost every popular benchmark quoted by the big LLM companies is tainted.
OAI, xAI, Anthropic, and Google all score incredibly well, then you go to try and write code and it's just okay.
They claim it can do PhD-level reasoning, yet here I am, not trusting it with basic computational thinking.
There are a few things I'm not understanding here.
1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?
2. Are the issues locked after they’re included in the dataset? You’d think they would be immutable for reproducibility.
3. For the agents writing patches, is test running part of their inner-loop validation? If they write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how it could be, unless the tests aren't part of the repo.
So what we need is something like a versioned crowdsourced coding LLM eval dataset.
Every quarter, you have a couple thousand volunteers each provide 2 GitHub issues from the past 3 months that are nontrivial to resolve and have strong test cases. Each volunteer then cross-checks 2 issues from other volunteers. In return, the volunteers get a 1-month free subscription to some AI service.
This dataset is then published as SWE-UberBench-2025-02 or something. People can then only evaluate their coding LLM on datasets published after their training period.
Submitted title was "SWE-Bench tainted by answer leakage; real pass rates significantly lower". Normally we'd replace that with the article title, in keeping with the site guideline ("Please use the original title, unless it is misleading or linkbait; don't editorialize."), but in this case the article title is so generic that this is arguably misleading as well, so I took a representative phrase from the abstract instead. That's preferable, because it's better to use the authors' own representation of their article.
If anyone can find a better title (i.e. more accurate and neutral, preferably using language from the article itself) we can change it again.
> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.
Looking at the benchmark, https://www.swebench.com/, about half of scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?
The solution moving forward has to be private benchmark suites. I could see teams investing in their own sets of programming challenges and periodically re-evaluating models against them - similar to how we would construct sets of live interview questions for candidates and qualitatively assess their ability.
It's vital that the suite isn't leaked, that it's fit for purpose, and that it's manually assessed. These general-purpose, public benchmarks based on questionable metrics are effectively worthless for assessing real programming skill.
Case in point, as others have mentioned here, Claude scores modestly on these benchmarks but vastly better than the alternatives in practice. I don't trust Claude fully but far more than OpenAI models; it's not even close. The IRL performance advantage is not reflected in any of these benchmarks.
My own impression with SoTA models is that they’re very useful for coding, yet they suck ass for solving unique problems (which is the case for every sufficiently large codebase).
Something weird (or at least uncommon) that caught my attention, and that I haven't seen mentioned in the comments, is that they cite the SWE-bench paper's author by first name in the abstract, Carlos et al., and then by last name (as is usually done) in the paper, Jimenez et al.
There's a serious issue with benchmarks.
Instead of resolving it, some leaders are muddying the meaning of benchmarks even further,
such as OpenAI grading their benchmarks based on "how much money they made" or "how easily a model was convinced to hand over fake money".
To quote Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.
Or, as in the case of LLMs and benchmarks: When a benchmark becomes a target, it ceases to be a good benchmark.
> solutions were directly provided in the issue report or the comments
This is fine; many of my real tickets already explain the solution. A good ticket often proposes a solution or at least points to where to start looking.
I was wondering how long this would take to surface. You can tell a surprising amount just by carefully watching how the trainers answer interview questions, which is kinda meta, really.
I found that this paper was submitted to ICLR, but got rejected: https://openreview.net/forum?id=pwIGnH2LHJ
To me the analysis of SWE-Bench is a solid contribution and informative. My guess is that to meet the conference's submission bar they had to come up with their own benchmark (SWE-Bench+), which wasn't thorough enough, and the paper was rejected mainly because of that.
I am shocked—shocked—when a vendor cheats in order to increase their benchmark scores.
I always tell my customers to ignore benchmarks and compare outcomes with their own workloads. Benchmarks are almost completely useless in the real world.
> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.
Is this what Hofstadter means by a strange loop?
You need benchmarks with the following three properties:
1) No known solutions, so there's no "ground truth" dataset to train on
2) Presumably hard to solve
3) But easy to verify a solution if one is provided.
This, of course, is easier to do on the STEM side of things, but how do you automatically test creativity or philosophical aptitude?
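On the STEM side, here's a toy illustration of all three properties (my own sketch, nothing from the article): freshly generated factoring challenges have no published answer key to train on, are hard to solve, and are trivial to verify.

```python
# Toy sketch of an "easy to verify, hard to solve" benchmark item.
# The grader only ever needs the cheap check below, never the solution itself.

def verify(n, claimed_factors):
    """Accept a claimed factorization of n; runs in microseconds."""
    p, q = claimed_factors
    return p > 1 and q > 1 and p * q == n

# The benchmark ships only n (generated fresh each round, so nothing to memorize).
# Toy-sized primes here; a real suite would use far larger ones.
p, q = 104_729, 1_299_709           # the 10,000th and 100,000th primes
n = p * q

print(verify(n, (p, q)))            # True  -- correct solution
print(verify(n, (3, n // 3)))       # False -- bogus solution is caught instantly
```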
Paper from October 2024
Some of the examples in the paper seem to be wrong.
For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop." But if you look at the diff, that's clearly wrong. The try-except block and the running-event-loop check were already there before the patch; the human patch just indented them, making them appear as both - and +, while the AI patch didn't. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.
For django-32517, they claim that the human and AI patches "produce entirely different outputs", but they actually do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. Iterating over a dictionary in Python just gives you its keys, so reversing the dict and reversing its key view yield the same sequence; it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.
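Not from the paper, just a quick way to check that claim yourself (Python 3.8+, where dicts and their key views support `reversed()`):

```python
# Iterating a dict yields its keys, so both spellings walk the same sequence.
d = {"a": 1, "b": 2, "c": 3}

print(list(reversed(d)))          # ['c', 'b', 'a']
print(list(reversed(d.keys())))   # ['c', 'b', 'a']
assert list(reversed(d)) == list(reversed(d.keys()))
```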
Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.