Paper breakdown  ·  AI & Software Engineering

What I learned reading SWE-bench

Can AI actually fix real bugs? Researchers put language models to the test — and the results were humbling.

Published on my learning blog  ·  ICLR 2024 paper by Jimenez et al., Princeton

Most AI coding benchmarks give models self-contained puzzle problems — write a function that reverses a list, solve this algorithm challenge. But real software engineering is nothing like that. A real bug might be hiding across a dozen files in a 400,000-line codebase, and fixing it requires understanding how everything fits together.

This paper introduces SWE-bench — a benchmark built from 2,294 real GitHub issues across 12 popular Python projects like Django, scikit-learn, and matplotlib. Instead of toy problems, models get an actual bug report and the full codebase, and are scored by whether their fix passes real unit tests.
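To make the scoring rule concrete for myself, I wrote a minimal sketch. It assumes scoring is all-or-nothing: a task counts as resolved only if every test tied to the issue passes once the model's patch is applied. The function names, test names, and the dictionary of per-test results are my own hypothetical stand-ins, not the benchmark's actual harness.

```python
def is_resolved(test_results, required_tests):
    """Return True only if every required test passed after the patch.

    test_results: dict mapping test name -> True (passed) / False (failed).
    required_tests: names of the tests the reference fix is expected to pass.
    A test missing from test_results is treated as failed.
    """
    return all(test_results.get(name, False) for name in required_tests)


def percent_resolved(outcomes):
    """Share of tasks resolved, as a percentage (0.0 for an empty list)."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical example: one task fully fixed, one only partially fixed.
required = ["test_parse", "test_edge"]
task_a = is_resolved({"test_parse": True, "test_edge": True}, required)
task_b = is_resolved({"test_parse": True, "test_edge": False}, required)
print(percent_resolved([task_a, task_b]))  # 50.0
```

The point of the sketch is that partial credit does not exist under this assumption: a patch that fixes most of the behavior but breaks one required test scores the same as no patch at all.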

"The best-performing model, Claude 2, was only able to resolve 1.96% of the issues."

The researchers didn't create problems by hand — they scraped GitHub automatically using a 3-stage pipeline: first collecting around 90,000 pull requests, then filtering down to only those that closed a real issue and added tests, and finally running those tests to confirm each fix actually worked. After all that filtering, 90,000 PRs became 2,294 high-quality tasks.
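The funnel described above can be sketched as two filters applied to the scraped pull requests. This is only an illustration of the idea, not the authors' code: the `PullRequest` fields are hypothetical, and the `tests_confirm_fix` flag stands in for the result of actually running the test suite, which a real pipeline would have to compute.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    number: int
    closes_issue: bool       # the PR is linked to an issue it resolves
    adds_tests: bool         # the PR contributes test changes
    tests_confirm_fix: bool  # stand-in for the outcome of running the tests


def attribute_filter(prs):
    """Stage 2: keep PRs that close an issue and add tests."""
    return [pr for pr in prs if pr.closes_issue and pr.adds_tests]


def execution_filter(prs):
    """Stage 3: keep PRs whose tests confirm the fix works."""
    return [pr for pr in prs if pr.tests_confirm_fix]


def build_tasks(scraped_prs):
    """Stage 1 is the scrape itself; stages 2 and 3 narrow it down."""
    return execution_filter(attribute_filter(scraped_prs))


# Hypothetical scraped PRs; only number 1 survives both filters.
scraped = [
    PullRequest(1, closes_issue=True, adds_tests=True, tests_confirm_fix=True),
    PullRequest(2, closes_issue=True, adds_tests=False, tests_confirm_fix=False),
    PullRequest(3, closes_issue=False, adds_tests=True, tests_confirm_fix=True),
    PullRequest(4, closes_issue=True, adds_tests=True, tests_confirm_fix=False),
]
print([pr.number for pr in build_tasks(scraped)])  # [1]

# Survival rate using the figures quoted above (90,000 is approximate).
print(f"{2294 / 90000:.1%}")  # 2.5%
```

Seen this way, only about one scraped pull request in forty made it into the benchmark.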

These aren't small codebases either. A typical task involves a codebase with around 3,010 files and 438,000 lines of code. The model is expected to find the right files, understand how they interact, and produce a correctly formatted patch — all without being told where to look. The gold-standard fixes touch an average of 1.7 files across about 3 functions.

Model performance (% of issues resolved)

Model            Resolved
Claude 2         1.96%
SWE-Llama 13b    0.70%
SWE-Llama 7b     0.70%
ChatGPT-3.5      0.17%
GPT-4            0.00%

These numbers are low, and that is the point. The benchmark is deliberately hard, built to expose the gap between what AI can do on toy problems and what real engineering demands. Even with BM25 retrieval choosing which files to place in the prompt as context, the best model resolved fewer than 2% of issues.
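BM25 is a classic keyword-ranking formula, and it helped me to see how little machinery it needs. Below is a pure-Python sketch of the standard scoring function used to rank files against an issue description. It is not the paper's retrieval code: the tokenizer, the `k1` and `b` values (common defaults), and the file names are all my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return document names sorted by BM25 score, best match first.

    documents: dict mapping a name (e.g. a file path) to its text.
    """
    if not documents:
        return []
    tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokens)
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(t) for t in tokens.values()) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))

    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokens.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates, with a penalty for long documents.
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical mini-repository and issue text.
repo = {
    "models/query.py": "def filter queryset exclude annotate",
    "utils/dates.py": "def parse date timezone",
    "README": "install and usage",
}
issue = "queryset filter raises error when exclude is used"
print(bm25_rank(issue, repo)[0])  # models/query.py
```

Because the ranking is purely lexical, it only works when the issue text shares vocabulary with the file that needs changing, which is one plausible reason retrieval is a weak point in this setup.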

A few findings surprised me. First, more context didn't help: models actually did worse as more retrieved code was packed into the prompt. The extra code distracted them rather than helping them localize the bug. Second, generated patches were consistently simpler than the gold fixes, adding and removing far fewer lines. Models tended to address the surface problem while missing edge cases and style consistency. Third, performance on older versus newer bugs was nearly identical, which argues against the theory that models are just "remembering" solutions from training data. And finally, issues that relied on screenshots to explain the problem were simply out of reach for text-only models, a real gap in multimodal reasoning.

Reading this paper changed how I think about AI coding tools. Tools like GitHub Copilot are great at completing functions or writing boilerplate, but SWE-bench shows they're nowhere near being able to autonomously navigate and fix a production codebase.

The researchers also built SWE-Llama, a fine-tuned open-source model for this specific task. The paper describes it as competitive with Claude 2 in some settings, even though it comes in at only 7B and 13B parameters, which suggests that fine-tuning on the right data matters a great deal. And the benchmark is designed to be extended: new tasks can always be built from fresh GitHub issues that postdate any model's training cutoff, so it won't go stale the way most benchmarks do.
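The "fresh issues" idea boils down to a date filter. As a minimal sketch, assuming each task record carries a creation date (the record layout, task ids, and cutoff date here are hypothetical):

```python
from datetime import date


def fresh_tasks(tasks, training_cutoff):
    """Keep only tasks created strictly after the model's training cutoff."""
    return [task for task in tasks if task["created"] > training_cutoff]


# Hypothetical task records and cutoff date.
tasks = [
    {"id": "task-1", "created": date(2021, 6, 1)},
    {"id": "task-2", "created": date(2023, 9, 15)},
]
print([t["id"] for t in fresh_tasks(tasks, date(2023, 1, 1))])  # ['task-2']
```

Evaluating a model only on the tasks that survive this filter means it cannot have seen the fixes during training.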

Paper: "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" — Jimenez et al., ICLR 2024. Dataset and leaderboard at swebench.com.