SWE-bench Verified Explained: 2026 Methodology, Tiers, Caveats
The canonical coding-agent benchmark, with the methodology footnotes most coverage omits.
SWE-bench Verified is the canonical benchmark for coding agents as of 2026. Unlike HumanEval (164 toy functions, now saturated) or MBPP (basic Python, also saturated), SWE-bench tests whether an agent can solve real software engineering problems in real codebases. The task is practical and the evaluation is objective: either the failing test passes without breaking anything else, or it does not.
Origin and Construction
SWE-bench was created by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan (Princeton University and the University of Chicago) and first posted to arXiv in October 2023. The benchmark draws from 2,294 real GitHub issues across 12 popular Python repositories: Django, SymPy, Astropy, matplotlib, scikit-learn, pytest, requests, Flask, pylint, xarray, Sphinx, and seaborn.
Each task is a real bug report or feature request with an associated failing test. The benchmark records the repository state at the time the issue was open, the issue description, and the test that was added to verify the fix. The agent's job is to produce a git diff that makes the test pass without breaking anything already passing.
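The pieces of a task described above can be pictured as a small record. The sketch below is illustrative only: the class name, the lowercase field names, and the example values are assumptions made for this article, not the official schema (the published dataset uses fields such as `base_commit`, `problem_statement`, `FAIL_TO_PASS`, and `PASS_TO_PASS`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Illustrative container for one benchmark task (not the official schema)."""

    instance_id: str                    # hypothetical identifier
    repo: str                           # e.g. "django/django"
    base_commit: str                    # repository state while the issue was open
    problem_statement: str              # the issue text shown to the agent
    fail_to_pass: tuple[str, ...]       # tests the fix must turn from failing to passing
    pass_to_pass: tuple[str, ...] = ()  # tests that must keep passing


# Made-up values, only to show the shape of a task.
example = TaskInstance(
    instance_id="example-0001",
    repo="django/django",
    base_commit="0000000",
    problem_statement="QuerySet.count() raises on an empty filter.",
    fail_to_pass=("tests/test_count.py::test_empty_filter",),
    pass_to_pass=("tests/test_count.py::test_basic_count",),
)

print(example.repo, len(example.fail_to_pass), len(example.pass_to_pass))
```

The agent sees the repository at `base_commit` plus the `problem_statement`; the two test lists are used only for scoring.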
The repositories were chosen for diversity of domain (web frameworks, scientific computing, plotting, testing and linting tools, documentation) and for having high-quality test suites that make evaluation reliable. All 12 are actively maintained open-source projects used in production.
SWE-bench vs Lite vs Verified
| Version | Tasks | Notes | Status |
|---|---|---|---|
| SWE-bench | 2,294 | Original. Some tasks unsolvable or under-specified. Scores noisy. | Superseded |
| SWE-bench Lite | 300 | Curated subset, lower noise. Used in 2024 papers. Less authoritative than Verified. | Legacy |
| SWE-bench Verified | 500 | Human-verified by OpenAI (Aug 2024). Each task confirmed solvable and clear. Canonical. | Use this |
When a paper or model card says “SWE-bench” without specifying which version, ask which. An 80% on SWE-bench Lite is not comparable to a low-70s on Verified.
SOTA Progression Timeline
| Date | Frontier Tier | Note |
|---|---|---|
| Nov 2023 | Initial paper baseline (around 2%) | Original SWE-bench, before harnesses matured. |
| Mar 2024 | Frontier model native (mid-single digits) | First serious model-card numbers. |
| May 2024 | Strong harness + frontier model (low teens) | Princeton SWE-agent harness improves access. |
| Oct 2024 | Frontier mid-30s | First step-change after Verified subset launch. |
| Dec 2024 | Frontier near 50% | Agentic harness, extended tools. |
| Mar 2025 | Frontier mid-60s | Mainstream model-card claim for top tier. |
| Sep 2025 | Frontier above 70% | First models cross 70%. |
| Apr 2026 | Frontier low-to-mid 70s | Captured from official SWE-bench leaderboard. |
Pre-August 2024 entries are on original SWE-bench or Lite and are not directly comparable to Verified. Tiers are reported instead of single numbers because frontier scores move week to week and depend heavily on the harness.
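Sample size is a further reason to read scores as ranges. A rough sketch, assuming each of the 500 Verified tasks behaves like an independent pass/fail trial (a simplification) and using a hypothetical 72% score:

```python
import math


def standard_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# 0.72 is a made-up score in the "low-to-mid 70s" tier; 500 is the Verified task count.
se = standard_error(0.72, 500)
print(f"standard error: {se * 100:.1f} points")  # about 2.0 points
```

Under that assumption, two systems a point or two apart on Verified are hard to separate on sampling noise alone, before harness differences are considered.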
How It Is Actually Scored
The evaluation harness gives the agent access to a Docker container running the target repository at the exact commit before the fix. The agent has access to the repository files, the issue description, and a test runner. It is not given the failing test directly; it must understand the issue, locate the relevant code, and produce a patch, which is then checked against the held-out test.
Success requires two things at once. First, all FAIL_TO_PASS tests (the tests that were failing before the fix) must now pass. Second, all PASS_TO_PASS tests (tests that were already passing) must still pass. This rules out trivially invalid patches: deleting the failing test or disabling the test suite does not score a pass.
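The two-part rule can be written down in a few lines. This is a minimal sketch of the logic, not the official harness code; the function name, the `results` mapping, and the test names are hypothetical.

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """Return True only if every listed test passed after the patch.

    results maps a test name to True (passed) or False (failed).
    A test missing from results is treated as a failure, so a patch
    that stops a test from running cannot score a pass.
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    intact = all(results.get(test, False) for test in pass_to_pass)
    return fixed and intact


# Hypothetical post-patch test results.
results = {"test_bug": True, "test_old_a": True, "test_old_b": False}

print(is_resolved(results, ["test_bug"], ["test_old_a"]))                # True
print(is_resolved(results, ["test_bug"], ["test_old_a", "test_old_b"]))  # False: regression
print(is_resolved({"test_old_a": True}, ["test_bug"], ["test_old_a"]))   # False: test missing
```

There is no partial credit in this sketch: a task counts as resolved or it does not, and the headline score is the fraction of resolved tasks.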
The evaluation does not check whether the patch is good code, only whether it passes tests. This is both the benchmark's strength (objective, reproducible) and its limitation: a test suite is not a complete specification of correct behaviour.
Known Limitations
Python only
All 12 repositories are Python. Performance on JavaScript, Go, Rust, or TypeScript codebases is unknown and potentially very different.
Test flakiness
Some tests pass or fail non-deterministically due to timing, resource availability, or randomness. Flaky tests inflate or deflate scores depending on the run.
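One way to gauge this effect is to repeat the evaluation and look for tasks whose outcome changes. The sketch below assumes each run is a plain mapping from a task ID to a resolved flag; the helper names and the data are hypothetical. Note that a flipped outcome can also come from sampling randomness in the agent, not only from a flaky test.

```python
from collections import defaultdict


def find_unstable_tasks(runs):
    """Return sorted task IDs whose resolved/unresolved outcome differs across runs."""
    outcomes = defaultdict(set)
    for run in runs:
        for task_id, resolved in run.items():
            outcomes[task_id].add(resolved)
    return sorted(task for task, seen in outcomes.items() if len(seen) > 1)


def resolve_rate(run):
    """Fraction of tasks resolved in a single run."""
    return sum(run.values()) / len(run)


# Two made-up runs over the same three tasks.
runs = [
    {"task-a": True, "task-b": False, "task-c": True},
    {"task-a": True, "task-b": True, "task-c": True},
]

print(find_unstable_tasks(runs))                   # ['task-b']
print([round(resolve_rate(r), 2) for r in runs])   # [0.67, 1.0]
```

Reporting the spread across runs alongside the headline number makes it clear how much of a score difference is attributable to instability.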
Contamination risk
These are real public issues. Their solutions exist in the same repository's git history. Any model trained after the issue was closed may have seen the solution.
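A simple first check is to compare each issue's close date with a model's training cutoff. The sketch below is an illustration under stated assumptions: the task IDs, dates, and cutoff are made up, and a date comparison only flags possible exposure; it cannot show that a model actually memorised a fix.

```python
from datetime import date


def split_by_cutoff(closed_on, training_cutoff):
    """Partition task IDs by whether the issue closed on or before the cutoff.

    closed_on maps a task ID to the date its issue was closed.
    Returns (possibly_seen, closed_after_cutoff), each sorted.
    """
    possibly_seen = sorted(t for t, d in closed_on.items() if d <= training_cutoff)
    closed_after = sorted(t for t, d in closed_on.items() if d > training_cutoff)
    return possibly_seen, closed_after


# Hypothetical close dates and cutoff.
closed_on = {
    "task-a": date(2021, 3, 14),
    "task-b": date(2022, 11, 2),
    "task-c": date(2023, 6, 30),
}
possibly_seen, closed_after = split_by_cutoff(closed_on, date(2023, 1, 1))

print(possibly_seen)  # ['task-a', 'task-b']
print(closed_after)   # ['task-c']
```

For a benchmark built from older public issues, the second bucket can easily be empty for a recent model, which is why this risk cannot be filtered away after the fact.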
Production gap
A patch that passes tests is not necessarily production-ready. It might introduce security vulnerabilities, performance regressions, or violate style conventions that tests do not cover.
How to Interpret a SWE-bench Score
Reader Questions
Q.01 Is SWE-bench the best coding benchmark?
Q.02 What is the difference between SWE-bench and HumanEval?
Q.03 Can small models do well on SWE-bench Verified?
Q.04 Is SWE-bench gameable?
Q.05 What agentic harness should I compare?
Sources
- [1] Jimenez et al., SWE-bench · arxiv.org/abs/2310.06770 · 2023
- [2] SWE-bench Verified Leaderboard · swebench.com · Captured April 2026
- [3] Anthropic Model Cards (referenced for tier ranges) · Captured April 2026
- [4] OpenAI Model Cards (referenced for tier ranges) · Captured April 2026
- [5] SWE-agent GitHub · github.com/princeton-nlp/SWE-agent