Paper breakdown · AI & Software Engineering

What I learned reading SWE-bench

Can AI actually fix real bugs? Researchers put language models to the test, and the results were humbling.

Published on my learning blog · ICLR 2024 paper by Jimenez et al., Princeton

Most coding assignments in our first-year programming courses (like EECS 1011 or EECS 1021) give us self-contained, bite-sized tasks: write a function that reverses a list, or solve a simple algorithm challenge.

But real-world software engineering is nothing like that. A bug in a real system isn't neatly isolated. It might be hiding somewhere across a dozen files in a codebase (a massive, interconnected folder structure) containing over 400,000 lines of code. To fix it, you have to understand how different modules, classes, and libraries interact.

This paper introduces SWE-bench, a benchmark designed to test AI models on 2,294 real GitHub issues pulled from popular Python open-source projects like Django, scikit-learn, and matplotlib. Instead of toy problems, the AI is given a raw, real bug report and the entire codebase. Its job is to find the bug and write a fix, and that fix is then judged by whether it passes the project's actual unit tests (automated scripts that verify if code works, just like the grading scripts that check our university lab submissions).
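As a rough sketch of what "passing the unit tests" means as a grading rule, the Python below marks a task as resolved only if every test tied to that task passes after the model's fix is applied. The names `UnitTestOutcome`, `is_resolved`, and `resolve_rate` are hypothetical and chosen for illustration; this is not the benchmark's actual evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class UnitTestOutcome:
    """Result of one unit test after the model's fix is applied (hypothetical type)."""
    name: str
    passed: bool


def is_resolved(outcomes: list[UnitTestOutcome]) -> bool:
    """A task counts as resolved only if it has tests and every one of them passes."""
    return bool(outcomes) and all(outcome.passed for outcome in outcomes)


def resolve_rate(per_task_outcomes: list[list[UnitTestOutcome]]) -> float:
    """Percentage of tasks whose tests all pass."""
    if not per_task_outcomes:
        return 0.0
    resolved = sum(is_resolved(outcomes) for outcomes in per_task_outcomes)
    return 100.0 * resolved / len(per_task_outcomes)
```

The all-or-nothing rule is the point of the sketch: a fix that passes nine tests out of ten still counts as a failure, which is part of why the scores are so low.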

"The best-performing model at the time, Claude 2, was only able to resolve 1.96% of the issues."

The researchers didn't write these tasks by hand. They built an automated system that scanned GitHub for real pull requests (PRs). They only kept the ones that closed a real, reported bug, included the actual developer's fix, and added new unit tests to make sure the bug wouldn't happen again. After all that filtering, about 90,000 PRs were narrowed down to 2,294 high-quality tasks, simulating a real day in the life of a software engineer.
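The three filtering rules can be sketched as a simple predicate. This is only an illustration under my own assumptions: the `PullRequest` class and its fields are hypothetical, and the real pipeline works on GitHub data and actually runs the tests rather than checking three booleans.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Simplified, hypothetical view of a pull request."""
    number: int
    closes_issue: bool    # linked to a real, reported issue
    has_source_fix: bool  # contains the developer's change to the code
    adds_tests: bool      # adds unit tests covering the bug


def is_task_candidate(pr: PullRequest) -> bool:
    """Keep a PR only if it meets all three criteria described above."""
    return pr.closes_issue and pr.has_source_fix and pr.adds_tests


def filter_candidates(prs: list[PullRequest]) -> list[PullRequest]:
    """Return the PRs that could become benchmark tasks."""
    return [pr for pr in prs if is_task_candidate(pr)]
```

Requiring all three conditions at once is what shrinks the pool so sharply: most PRs fail at least one of them.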

To put this into perspective, a typical first-year university coding project might have 3 or 4 files and a few hundred lines of code. On SWE-bench, the typical task has:

3,010 files on average in the folder.
438,000 lines of code across the entire codebase.
The AI is expected to output a correctly formatted patch file (a special text file that lists exactly which lines of code to add, delete, or edit) to fix the bug, all without any human telling it where to look.
The actual human developer's fix (the "gold standard" fix) edits only 1.7 files and about 3 functions on average.
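To make the idea of a patch file concrete, here is a small sketch that uses Python's standard `difflib` module to produce one for a toy bug. The file name `utils.py` and the `reverse` function are made up for illustration; SWE-bench patches target real repositories and are usually larger.

```python
import difflib

# Buggy version: list.reverse() reverses in place and returns None.
before = [
    "def reverse(items):\n",
    "    return items.reverse()\n",
]

# Fixed version: slicing returns a new, reversed list.
after = [
    "def reverse(items):\n",
    "    return items[::-1]\n",
]


def make_patch(old_lines: list[str], new_lines: list[str], path: str) -> str:
    """Build a unified diff (the usual patch format) between two versions of a file."""
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=f"a/{path}", tofile=f"b/{path}"
    )
    return "".join(diff)


patch = make_patch(before, after, "utils.py")
print(patch)
```

In the printed patch, lines starting with `-` are removed and lines starting with `+` are added. The model has to produce text in this format that applies cleanly to the real files.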

Model performance (% of issues resolved)

Claude 2: 1.96%
SWE-Llama 13b: 0.70%
SWE-Llama 7b: 0.70%
ChatGPT-3.5: 0.17%
GPT-4: 0.00%
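To get a feel for what these percentages mean in absolute terms, here is a back-of-the-envelope calculation. It assumes each rate was measured over all 2,294 tasks, which is my simplification; the counts are approximations, not figures taken from the paper.

```python
TOTAL_TASKS = 2294  # size of the benchmark

# Resolve rates in percent, as listed above.
RESOLVE_RATES = {
    "Claude 2": 1.96,
    "SWE-Llama 13b": 0.70,
    "SWE-Llama 7b": 0.70,
    "ChatGPT-3.5": 0.17,
    "GPT-4": 0.00,
}


def approx_resolved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of resolved issues, assuming the rate covers all tasks."""
    return round(total * rate_percent / 100)


for model, rate in RESOLVE_RATES.items():
    print(f"{model}: about {approx_resolved(rate)} of {TOTAL_TASKS} issues")
```

Under that assumption, Claude 2's 1.96% works out to roughly 45 issues out of 2,294.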

To give models a fighting chance, the researchers used BM25 retrieval, a classic search-engine ranking technique that scans the codebase and feeds only the most relevant files to the AI. Even with that help, the results were humbling: the best model resolved fewer than 2% of the issues.
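For a sense of how BM25 picks "relevant" files, here is a minimal from-scratch sketch of the scoring formula. It is not the authors' retrieval code: the tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the file names in the usage example are all my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query: str, documents: dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    """Score each document against the query with BM25, highest score first."""
    if not documents:
        return []
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A usage example with made-up file contents: calling `bm25_rank("queryset filter raises error on lookup", files)` on a dictionary mapping file paths to their text ranks the files that share the most distinctive words with the bug report first. Note that BM25 only matches words; it has no understanding of what the code does, which may be one reason retrieval alone was not enough.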

For computer engineering students, this is actually exciting news. It suggests that while AI is great at autocomplete and writing simple scripts, the models tested here were nowhere near replacing real engineers who can navigate complex systems.

Key findings that surprised me

More code actually confused the AI: You would think giving the AI more of the codebase would help it. But models actually performed worse as more code was packed into their context. Too much code acted like "noise", distracting the AI from the actual bug.
AI fixes are too simple: Generated fixes added or removed far fewer lines than the real human fixes. The AI would solve the surface bug but miss edge cases or fail to follow the project's code style.
Age of the bug didn't matter: The AI did just as badly on brand new bugs as it did on older ones. This suggests the models weren't just "memorizing" solutions from their training data.
No vision, no fix: Some real bugs are reported using screenshots or system diagrams. Text-only AI models were at a serious disadvantage on these because they lacked multimodal reasoning (the ability to see and understand images alongside text).
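The "AI fixes are too simple" finding comes down to comparing patch sizes. One way to measure that is to count added and removed lines in a unified diff, as in the sketch below. The function name `count_changed_lines` is hypothetical and this is not the authors' analysis script; it reads each hunk header (`@@ -a,b +c,d @@`) so that file header lines like `---` and `+++` are not counted as changes.

```python
import re

# Matches a hunk header such as "@@ -1,3 +1,4 @@"; a missing length means 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def count_changed_lines(patch: str) -> tuple[int, int]:
    """Return (added, removed) line counts for a unified diff."""
    added = removed = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in patch.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: skip everything except the next hunk header.
            match = HUNK_HEADER.match(line)
            if match:
                old_left = int(match.group(1)) if match.group(1) is not None else 1
                new_left = int(match.group(2)) if match.group(2) is not None else 1
            continue
        if line.startswith("+"):
            added += 1
            new_left -= 1
        elif line.startswith("-"):
            removed += 1
            old_left -= 1
        elif line.startswith("\\"):
            continue  # marker like "\ No newline at end of file"
        else:
            # Unchanged context line: present in both versions.
            old_left -= 1
            new_left -= 1
    return added, removed
```

Running this on both the model's patch and the human's patch for the same issue gives two pairs of numbers to compare, which is a simple way to see whether the model's fix is noticeably smaller.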

Reading this paper changed how I think about AI coding tools. Tools like GitHub Copilot or ChatGPT are amazing study buddies for explaining a concept or writing boilerplate code, but they are far from autonomous developers who can safely edit a production codebase.

The researchers also built SWE-Llama, an open-source model fine-tuned for this specific task. It's competitive with Claude 2 in some settings despite being much smaller, which suggests that fine-tuning a model on the right data can matter as much as making the model bigger. The benchmark can also be kept up to date: new tasks can be built from fresh GitHub issues that postdate any model's training cutoff, so it doesn't have to go stale the way most benchmarks do.

Ultimately, SWE-bench shows us that the hard part of software engineering isn't typing out the syntax. It's understanding the architecture, finding where things go wrong, and designing clean solutions. That's exactly what we are in school to learn!

Paper: "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" by Jimenez et al., ICLR 2024. Dataset and leaderboard at swebench.com.