SWE-bench Verified Explained - 2026 Scores, Methodology, Caveats
SWE-bench Verified is the canonical benchmark for coding agents as of 2026. Unlike HumanEval (164 toy functions, now saturated) or MBPP (basic Python, also saturated), SWE-bench tests whether an agent can solve real software engineering problems in real codebases. The task is practical and the evaluation is objective: either the failing test passes without breaking anything else, or it does not.
Origin and Construction
SWE-bench was introduced by Jimenez et al. (Princeton University and University of Chicago), published October 2023. The benchmark draws from 2,294 real GitHub issues across 12 popular Python repositories: Django, SymPy, Astropy, matplotlib, scikit-learn, pytest, requests, Flask, pylint, xarray, Sphinx, and seaborn.
Each task is a real bug report or feature request with an associated failing test. The benchmark records the repository state at the time the issue was open, the issue description, and the test that was added to verify the fix. The agent's job is to produce a git diff that makes the test pass without breaking anything already passing.
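Each task can be pictured as a structured record. The sketch below uses the field names from the public SWE-bench dataset release (assumed here; check the dataset card), with entirely illustrative values rather than a real task:

```python
# A minimal sketch of one SWE-bench task record. Field names follow the
# public dataset release; the values are illustrative, not a real instance.
instance = {
    "instance_id": "example__repo-12345",       # hypothetical task ID
    "repo": "example/repo",                     # hypothetical repository
    "base_commit": "abc123",                    # repo state when the issue was open
    "problem_statement": "TypeError raised when ...",   # the issue text
    "patch": "diff --git a/pkg/core.py ...",    # the gold fix (hidden from the agent)
    "test_patch": "diff --git a/tests/ ...",    # the test added to verify the fix
    "FAIL_TO_PASS": ["tests/test_core.py::test_bug"],   # must flip to passing
    "PASS_TO_PASS": ["tests/test_core.py::test_ok"],    # must stay passing
}

# The agent sees only the repository at base_commit plus the issue text;
# it never sees the gold patch or the verifying test.
agent_visible = {
    k: instance[k]
    for k in ("instance_id", "repo", "base_commit", "problem_statement")
}
```

The split between `instance` and `agent_visible` is the whole point of the benchmark design: the gold fix and its test exist for scoring, not for the agent.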
The repositories were chosen for diversity of domain (web frameworks, scientific computing, plotting, testing tools, linting, documentation) and for having high-quality test suites that make evaluation reliable. All 12 are actively maintained open-source projects used in production.
SWE-bench vs SWE-bench Lite vs SWE-bench Verified
| Version | Tasks | Notes | Status |
|---|---|---|---|
| SWE-bench | 2,294 | Original. Noisy - some tasks unsolvable or under-specified. Scores unstable. | Superseded |
| SWE-bench Lite | 300 | Curated subset, lower noise. Used in 2024 papers. Less authoritative than Verified. | Legacy |
| SWE-bench Verified | 500 | Human-verified by OpenAI (August 2024). Each task confirmed solvable and clear. Canonical. | Use this |
When a paper or model card reports a "SWE-bench" score without naming the version, ask which one: 80% on SWE-bench Lite is not comparable to 74.5% on Verified.
SOTA Progression Timeline
| Date | Score | Model / System | Source |
|---|---|---|---|
| Nov 2023 | 1.9% | Claude 2 (initial paper) | Jimenez et al. 2023 |
| Mar 2024 | 4.8% | Claude 3 Opus | Anthropic model card |
| May 2024 | 12.5% | SWE-agent + GPT-4o | Princeton SWE-agent |
| Oct 2024 | 33.2% | Claude 3.5 Sonnet | Anthropic model card |
| Dec 2024 | 49.0% | Claude 3.5 Sonnet (agentic harness) | Anthropic extended eval |
| Mar 2025 | 64.6% | Claude 4 Sonnet | Anthropic model card |
| Sep 2025 | 71.3% | GPT-5 | OpenAI model card |
| Apr 2026 | 74.5% | Claude 4.5 Opus | Official SWE-bench leaderboard |
All scores on SWE-bench Verified (500 tasks) unless otherwise noted. Scores pre-August 2024 are on original SWE-bench or Lite and are not directly comparable.
How It Is Actually Scored
The evaluation harness gives the agent access to a Docker container running the target repository at the exact commit before the fix. The agent has access to the repository files, the issue description, and a test runner. It is not given the failing test directly - it must discover the test, understand the issue, and produce a patch.
Success requires two things simultaneously: all fail_to_pass tests (the tests that were failing before the fix) must now pass, AND all pass_to_pass tests (tests that were already passing) must still pass. This prevents trivially invalid patches - you cannot delete the failing test or disable all tests to score a pass.
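The pass criterion is a pure function of per-test outcomes: an instance is resolved only if every fail_to_pass test now passes and every pass_to_pass test still passes. A minimal sketch (the function and test names are illustrative, not the harness's actual API):

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """results maps test ID -> whether it passed after applying the patch.
    A test missing from results counts as a failure, so deleting the
    failing test (or the whole suite) cannot produce a pass."""
    return (all(results.get(t, False) for t in fail_to_pass) and
            all(results.get(t, False) for t in pass_to_pass))

# Both conditions must hold simultaneously:
assert is_resolved({"t_bug": True, "t_ok": True}, ["t_bug"], ["t_ok"])
# A patch that fixes the bug but breaks an existing test fails:
assert not is_resolved({"t_bug": True, "t_ok": False}, ["t_bug"], ["t_ok"])
# Deleting the failing test (absent from results) also fails:
assert not is_resolved({"t_ok": True}, ["t_bug"], ["t_ok"])
```

The `results.get(t, False)` default is what closes the "delete the test" loophole described above.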
The evaluation does not check whether the patch is good code - only whether it passes tests. This is both the benchmark's strength (objective, reproducible) and its limitation (a test suite is not a complete specification of correct behavior).
Known Limitations
Python only
All 12 repositories are Python. Performance on JavaScript, Go, Rust, or TypeScript codebases is unknown and potentially very different.
Test flakiness
Some tests pass or fail non-deterministically due to timing, resource availability, or randomness. Flaky tests inflate or deflate scores depending on the run.
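One common mitigation, not part of the official harness, is to rerun each test several times and flag it as flaky when the outcomes disagree:

```python
import random

def classify(run_test, n_runs: int = 5) -> str:
    """Run a zero-argument test callable n_runs times.
    Returns 'pass' or 'fail' if the outcome is stable,
    'flaky' if the runs disagree."""
    outcomes = {bool(run_test()) for _ in range(n_runs)}
    if outcomes == {True}:
        return "pass"
    if outcomes == {False}:
        return "fail"
    return "flaky"

assert classify(lambda: True) == "pass"
assert classify(lambda: False) == "fail"

# A test that fails nondeterministically about half the time:
random.seed(0)
assert classify(lambda: random.random() < 0.5) == "flaky"
```

More runs shrink the chance of mislabeling a flaky test as stable, at the cost of evaluation time, which is exactly the trade-off that makes flakiness hard to eliminate at benchmark scale.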
Contamination risk
These are real public issues. Their solutions exist in the same repository's git history. Any model trained after the issue was closed may have seen the solution.
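A rough contamination screen, an assumption about methodology rather than anything SWE-bench itself performs, is to compare when each issue's fix became public against the model's training data cutoff:

```python
from datetime import date

def possibly_contaminated(fix_public_date: date, training_cutoff: date) -> bool:
    """True if the gold fix was public on or before the model's data
    cutoff, i.e., the solution may appear in the training data via the
    repository's git history."""
    return fix_public_date <= training_cutoff

# Issue fixed long before a 2024 cutoff: the model may have seen it.
assert possibly_contaminated(date(2023, 1, 15), date(2024, 4, 1))
# Issue fixed after the cutoff: contamination via git history is ruled out.
assert not possibly_contaminated(date(2025, 6, 1), date(2024, 4, 1))
```

This is only a screen: it catches the obvious cases but cannot detect contamination via mirrors, blog posts, or discussion of the fix elsewhere on the web.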
Production gap
A patch that passes tests is not necessarily production-ready. It might introduce security vulnerabilities, performance regressions, or violate style conventions that tests do not cover.
Frequently Asked Questions
- Is SWE-bench the best coding benchmark?
- What is the difference between SWE-bench and HumanEval?
- Can small models do well on SWE-bench Verified?
- Is SWE-bench gameable?
- What agentic harness should I compare?
Sources
- [1] Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - arxiv.org/abs/2310.06770 - 2023
- [2] SWE-bench Verified Leaderboard - swebench.com - Captured April 2026
- [3] Anthropic Claude 4.5 Model Card - Captured April 2026
- [4] OpenAI GPT-5 Model Card - Captured April 2026
- [5] SWE-agent GitHub - github.com/princeton-nlp/SWE-agent