Some critical issues with the SWE-bench dataset

joshwa | 336 points

Some of the examples in the paper seem to be wrong.

For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop". But if you look at the diff, that's clearly wrong. The try-except block and the running-loop check were already there before the patch. The human patch just indented them, making them show up as both - and + lines, while the AI patch didn't touch them. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.
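
To illustrate the structural difference, here's a paraphrased sketch of the two approaches, not the exact Django diff (the real code is `async_unsafe` in `django/utils/asyncio.py`, and it raises `SynchronousOnlyOperation` rather than a plain `RuntimeError`):

```python
import asyncio
import os


def human_style_check(message):
    # Human patch: gate the whole detection block behind the env var, which is
    # why the pre-existing try/except shows up as both - and + in the diff.
    if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
        try:
            event_loop = asyncio.get_event_loop()
        except RuntimeError:
            pass
        else:
            if event_loop.is_running():
                raise RuntimeError(message)


def ai_style_check(message):
    # AI patch (as described above): leave the detection untouched and consult
    # the env var only once a running loop has actually been found.
    try:
        event_loop = asyncio.get_event_loop()
    except RuntimeError:
        pass
    else:
        if event_loop.is_running():
            if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
                raise RuntimeError(message)
```

Hence the efficiency trade-off: with the env var set, the human version skips loop detection entirely; without it, the AI version skips the environment lookup on the fast path.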

For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. Calling `reversed` on a dict (Python 3.8+) walks its keys in reverse insertion order, and `.keys()` is just a view of those same keys, so it doesn't matter whether you call it first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.
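
A quick REPL check of the equivalence, using a plain dict as a stand-in for `self.dict`:

```python
d = {"a": 1, "b": 2, "c": 3}

# Reversing a dict and reversing its keys view yield the same sequence,
# reverse insertion order (requires Python 3.8+ for dict.__reversed__).
assert list(reversed(d)) == ["c", "b", "a"]         # human patch's spelling
assert list(reversed(d.keys())) == ["c", "b", "a"]  # AI patch's spelling
```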

Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.

comex | 19 hours ago

> When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.

This matches my intuition about the coding performance of these models a lot better. I don't think any current coding benchmark accurately measures coding performance.

modeless | 21 hours ago

I would argue almost every popular benchmark quoted by the big LLM companies is tainted.

OpenAI, xAI, Anthropic, and Google all score incredibly well, then you go to write code with their models and it's just okay.

They claim PhD-level reasoning, but here I am not trusting it with basic computational thinking.

bearjaws | 21 hours ago

There are a few things I'm not understanding here.

1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?

2. Are the issues locked after they’re included in the dataset? You’d think they would be immutable for reproducibility.

3. For the agents writing patches, is running the tests part of their inner-loop validation? If they can write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how it could be, unless the tests aren't part of the repo. (See the sketch below.)
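
For what it's worth, and hedged because harnesses vary, SWE-bench-style evaluation typically applies the gold test patch only at scoring time, so the agent's inner loop can only exercise tests that already exist in the repo. A minimal sketch (paths and the test command are illustrative, not the actual harness code):

```python
import subprocess


def evaluate(repo_dir: str, model_patch: str, gold_test_patch: str, test_cmd: str) -> bool:
    """Sketch of a SWE-bench-style harness: the new tests arrive only here."""
    # Agent's proposed fix, produced without ever seeing the held-out tests.
    subprocess.run(["git", "apply", model_patch], cwd=repo_dir, check=True)
    # Gold test patch containing the FAIL_TO_PASS tests, applied only for scoring.
    subprocess.run(["git", "apply", gold_test_patch], cwd=repo_dir, check=True)
    # The instance counts as resolved only if the held-out tests now pass.
    return subprocess.run(test_cmd, cwd=repo_dir, shell=True).returncode == 0
```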

ukFxqnLa2sBSBf6 | 21 hours ago

So what we need is something like a versioned crowdsourced coding LLM eval dataset.

Every quarter, you have a couple thousand volunteers each provide 2 GitHub issues from the past 3 months that are nontrivial to resolve and have strong test cases. Each volunteer then cross-checks 2 issues from other volunteers. In return, the volunteers get a one-month free subscription to some AI service.

This dataset is then published as SWE-UberBench-2025-02 or something. People can then only evaluate their coding LLM on datasets published after their training period.
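
One hypothetical shape for an entry in such a release (every field name here is made up for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class EvalIssue:
    """One crowdsourced entry in a quarterly release like SWE-UberBench-2025-02."""
    repo: str                    # e.g. "owner/project" on GitHub
    issue_url: str               # issue from the last 3 months, nontrivial to resolve
    base_commit: str             # commit a candidate patch must apply to
    test_command: str            # strong test suite that must pass after the patch
    submitted_by: str            # contributing volunteer
    cross_checked_by: list[str] = field(default_factory=list)  # independent reviewers
    release: str = "SWE-UberBench-2025-02"
```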

semi-extrinsic | 21 hours ago

Submitted title was "SWE-Bench tainted by answer leakage; real pass rates significantly lower". Normally we'd replace that with the article title, in keeping with the site guideline ("Please use the original title, unless it is misleading or linkbait; don't editorialize."), but in this case the article title is so generic that using it would arguably be misleading as well, so I took a representative phrase from the abstract instead. That's preferable, because it's better to use the authors' own representation of their article.

If anyone can find a better title (i.e. more accurate and neutral, preferably using language from the article itself) we can change it again.

https://news.ycombinator.com/newsguidelines.html

dang | 19 hours ago

> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Looking at the leaderboard, https://www.swebench.com/, about half of the scored submissions score under 1/3 correct. So either they're not cheating, or they're not cheating effectively?
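
A back-of-the-envelope check using the paper's own numbers, assuming the leaked fraction applies roughly uniformly to a model's successful patches:

```python
# Figures quoted elsewhere in the thread: SWE-Agent+GPT-4 resolves 12.47% of
# instances, and 32.67% of successful patches had the solution leaked in the
# issue report or comments.
reported_rate = 12.47
leaked_fraction = 0.3267

after_removing_leaks = reported_rate * (1 - leaked_fraction)
print(f"{after_removing_leaks:.2f}%")  # ~8.40%
# Still well above the paper's post-filtering 3.97%, so the rest of the drop
# presumably comes from the paper's other filters, not leakage alone.
```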

huac | 21 hours ago

The solution moving forward has to be private benchmark suites. I could see teams investing in their own set of programming challenges and periodically re-evaluating them - similar to how we would construct sets of live interview questions for candidates and qualitatively assess their ability.

It's vital that the suite isn't leaked, that it's fit for purpose, and that it's manually assessed. These general-purpose public benchmarks built on questionable metrics are effectively worthless for assessing real programming skill.

Case in point, as others have mentioned here, Claude scores modestly on these benchmarks but performs vastly better than the alternatives in practice. I don't trust Claude fully, but I trust it far more than OpenAI's models; it's not even close. That real-world performance advantage isn't reflected in any of these benchmarks.

perrygeo | 19 hours ago

My own impression of SoTA models is that they're very useful for coding, yet they're awful at solving unique problems (which every sufficiently large codebase has).

brap | 21 hours ago

Something weird (or at least uncommon) that caught my attention and that I haven't seen mentioned in the comments: they cite the SWE-bench paper's author by first name in the abstract, "Carlos et al.", and then by last name (as is usually done) in the body, "Jimenez et al.".

alalv | 8 hours ago

There's a serious issue with benchmarks.

Instead of resolving it, some leaders are further muddying what their benchmarks even mean, such as OpenAI grading theirs on "how much money they made" or "how easily a model was convinced to hand over fake money".

MattDaEskimo | 21 hours ago

To quote Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.

Or, as in the case of LLMs and benchmarks: When a benchmark becomes a target, it ceases to be a good benchmark.

1024core | 19 hours ago

> solutions were directly provided in the issue report or the comments

This is fine; many of my real tickets already explain the solution. A good ticket often proposes a fix, or at least says where to start looking.

OldGreenYodaGPT | 20 hours ago

I was wondering how long this would take to surface. You can tell a surprising amount just by carefully watching how the trainers answer interview questions, which is kinda meta, really.

ionwake | 20 hours ago

I found that this paper was submitted to ICLR, but got rejected: https://openreview.net/forum?id=pwIGnH2LHJ

To me, the analysis of SWE-Bench is a solid and informative contribution. My guess is that to meet the conference's submission bar they had to come up with their own benchmark (SWE-Bench+), which wasn't thorough enough, and the paper got rejected mainly because of that.

shayanh | 20 hours ago

I am shocked—shocked—when a vendor cheats in order to increase their benchmark scores.

I always tell my customers to ignore benchmarks and compare outcomes with their own workloads. Benchmarks are almost completely useless in the real world.

otterley | 21 hours ago

> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Is this what Hofstadter means by a strange loop?

acc_297 | a day ago

You need benchmarks with the following three properties:

1) No known solutions, so there's no "ground truth" dataset to train on

2) Presumably hard to solve

3) But easy to verify a solution if one is provided.

This, of course, is easier to do on the STEM side of things, but how do you automatically test creativity or philosophical aptitude?
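
A toy illustration of that solve/verify asymmetry, using planted subset-sum instances (the parameters here are small enough to be easy in practice; the point is only that verification stays cheap while the planted answer is never published):

```python
import random


def make_instance(n=40, seed=None):
    """Generate a fresh instance with a planted solution that is never released."""
    rng = random.Random(seed)
    numbers = [rng.randrange(1, 2**48) for _ in range(n)]
    planted = rng.sample(range(n), n // 2)      # known only to the generator
    target = sum(numbers[i] for i in planted)
    return numbers, target                      # publish only the problem


def verify(numbers, target, candidate_indices):
    """Property 3: checking a submitted solution is trivial."""
    idx = set(candidate_indices)
    return idx <= set(range(len(numbers))) and sum(numbers[i] for i in idx) == target
```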

optimalsolver | 21 hours ago

Paper from October 2024

htrp | 20 hours ago