SWE-bench Verified Explained - 2026 Scores, Methodology, Caveats
SWE-bench Verified is the canonical benchmark for coding agents as of 2026. Unlike HumanEval (164 toy functions, now saturated) or MBPP (basic Python, also saturated), SWE-bench tests whether an agent can solve real software engineering problems in real codebases. The task is practical and the evaluation is objective: either the failing test passes without breaking anything else, or it does not.
Origin and Construction
SWE-bench was introduced by Jimenez et al. (Princeton University and University of Chicago), published October 2023. The benchmark draws from 2,294 real GitHub issues across 12 popular Python repositories: Django, SymPy, Astropy, matplotlib, scikit-learn, pytest, requests, Flask, pylint, xarray, Sphinx, and seaborn.
Each task is a real bug report or feature request with an associated failing test. The benchmark records the repository state at the time the issue was open, the issue description, and the test that was added to verify the fix. The agent's job is to produce a git diff that makes the test pass without breaking anything already passing.
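Each task can be pictured as a structured record. The sketch below uses the field names from the public SWE-bench dataset release (assumed here; check the dataset card), with entirely illustrative values rather than a real task:

```python
# A minimal sketch of one SWE-bench task record. Field names follow the
# public dataset release; the values are illustrative, not a real instance.
instance = {
    "instance_id": "example__repo-12345",       # hypothetical task ID
    "repo": "example/repo",                     # hypothetical repository
    "base_commit": "abc123",                    # repo state when the issue was open
    "problem_statement": "TypeError raised when ...",   # the issue text
    "patch": "diff --git a/pkg/core.py ...",    # the gold fix (hidden from the agent)
    "test_patch": "diff --git a/tests/ ...",    # the test added to verify the fix
    "FAIL_TO_PASS": ["tests/test_core.py::test_bug"],   # must flip to passing
    "PASS_TO_PASS": ["tests/test_core.py::test_ok"],    # must stay passing
}

# The agent sees only the repository at base_commit plus the issue text;
# it never sees the gold patch or the verifying test.
agent_visible = {
    k: instance[k]
    for k in ("instance_id", "repo", "base_commit", "problem_statement")
}
```

The split between `instance` and `agent_visible` is the whole point of the benchmark design: the gold fix and its test exist for scoring, not for the agent.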
The repositories were chosen for diversity of domain (web frameworks, scientific computing, plotting, testing tools, linting, documentation) and for having high-quality test suites that make evaluation reliable. All 12 are actively maintained open-source projects used in production.
SWE-bench vs SWE-bench Lite vs SWE-bench Verified
| Version | Tasks | Notes | Status |
|---|---|---|---|
| SWE-bench | 2,294 | Original. Noisy - some tasks unsolvable or under-specified. Scores unstable. | Superseded |
| SWE-bench Lite | 300 | Curated subset, lower noise. Used in 2024 papers. Less authoritative than Verified. | Legacy |
| SWE-bench Verified | 500 | Human-verified by OpenAI (August 2024). Each task confirmed solvable and clear. Canonical. | Use this |
When a paper or model card reports a "SWE-bench" score without naming the version, ask which one: 80% on SWE-bench Lite is not comparable to 74.5% on Verified.
SOTA Progression Timeline
| Date | Score | Model / System | Source |
|---|---|---|---|
| Nov 2023 | 1.9% | Claude 2 (initial paper) | Jimenez et al. 2023 |
| Mar 2024 | 4.8% | Claude 3 Opus | Anthropic model card |
| May 2024 | 12.5% | SWE-agent + GPT-4o | Princeton SWE-agent |
| Oct 2024 | 33.2% | Claude 3.5 Sonnet | Anthropic model card |
| Dec 2024 | 49.0% | Claude 3.5 Sonnet (agentic harness) | Anthropic extended eval |
| Mar 2025 | 64.6% | Claude 4 Sonnet | Anthropic model card |
| Sep 2025 | 71.3% | GPT-5 | OpenAI model card |
| Apr 2026 | 74.5% | Claude 4.5 Opus | Official SWE-bench leaderboard |
All scores on SWE-bench Verified (500 tasks) unless otherwise noted. Scores pre-August 2024 are on original SWE-bench or Lite and are not directly comparable.
How It Is Actually Scored
The evaluation harness gives the agent access to a Docker container running the target repository at the exact commit before the fix. The agent has access to the repository files, the issue description, and a test runner. It is not given the failing test directly - it must discover the test, understand the issue, and produce a patch.
Success requires two things simultaneously: all fail_to_pass tests (the tests that were failing before the fix) must now pass, AND all pass_to_pass tests (tests that were already passing) must still pass. This prevents trivially invalid patches - you cannot delete the failing test or disable all tests to score a pass.
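The pass criterion is a pure function of per-test outcomes: an instance is resolved only if every fail_to_pass test now passes and every pass_to_pass test still passes. A minimal sketch (the function and test names are illustrative, not the harness's actual API):

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """results maps test ID -> whether it passed after applying the patch.
    A test missing from results counts as a failure, so deleting the
    failing test (or the whole suite) cannot produce a pass."""
    return (all(results.get(t, False) for t in fail_to_pass) and
            all(results.get(t, False) for t in pass_to_pass))

# Both conditions must hold simultaneously:
assert is_resolved({"t_bug": True, "t_ok": True}, ["t_bug"], ["t_ok"])
# A patch that fixes the bug but breaks an existing test fails:
assert not is_resolved({"t_bug": True, "t_ok": False}, ["t_bug"], ["t_ok"])
# Deleting the failing test (absent from results) also fails:
assert not is_resolved({"t_ok": True}, ["t_bug"], ["t_ok"])
```

The `results.get(t, False)` default is what closes the "delete the test" loophole described above.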
The evaluation does not check whether the patch is good code - only whether it passes tests. This is both the benchmark's strength (objective, reproducible) and its limitation (a test suite is not a complete specification of correct behavior).
Known Limitations
Python only
All 12 repositories are Python. Performance on JavaScript, Go, Rust, or TypeScript codebases is unknown and potentially very different.
Test flakiness
Some tests pass or fail non-deterministically due to timing, resource availability, or randomness. Flaky tests inflate or deflate scores depending on the run.
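One common mitigation, not part of the official harness, is to rerun each test several times and flag it as flaky when the outcomes disagree:

```python
import random

def classify(run_test, n_runs: int = 5) -> str:
    """Run a zero-argument test callable n_runs times.
    Returns 'pass' or 'fail' if the outcome is stable,
    'flaky' if the runs disagree."""
    outcomes = {bool(run_test()) for _ in range(n_runs)}
    if outcomes == {True}:
        return "pass"
    if outcomes == {False}:
        return "fail"
    return "flaky"

assert classify(lambda: True) == "pass"
assert classify(lambda: False) == "fail"

# A test that fails nondeterministically about half the time:
random.seed(0)
assert classify(lambda: random.random() < 0.5) == "flaky"
```

More runs shrink the chance of mislabeling a flaky test as stable, at the cost of evaluation time, which is exactly the trade-off that makes flakiness hard to eliminate at benchmark scale.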
Contamination risk
These are real public issues. Their solutions exist in the same repository's git history. Any model trained after the issue was closed may have seen the solution.
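A rough contamination screen, an assumption about methodology rather than anything SWE-bench itself performs, is to compare when each issue's fix became public against the model's training data cutoff:

```python
from datetime import date

def possibly_contaminated(fix_public_date: date, training_cutoff: date) -> bool:
    """True if the gold fix was public on or before the model's data
    cutoff, i.e., the solution may appear in the training data via the
    repository's git history."""
    return fix_public_date <= training_cutoff

# Issue fixed long before a 2024 cutoff: the model may have seen it.
assert possibly_contaminated(date(2023, 1, 15), date(2024, 4, 1))
# Issue fixed after the cutoff: contamination via git history is ruled out.
assert not possibly_contaminated(date(2025, 6, 1), date(2024, 4, 1))
```

This is only a screen: it catches the obvious cases but cannot detect contamination via mirrors, blog posts, or discussion of the fix elsewhere on the web.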
Production gap
A patch that passes tests is not necessarily production-ready. It might introduce security vulnerabilities, performance regressions, or violate style conventions that tests do not cover.
Frequently Asked Questions
- Is SWE-bench the best coding benchmark?
- What is the difference between SWE-bench and HumanEval?
- Can small models do well on SWE-bench Verified?
- Is SWE-bench gameable?
- What agentic harness should I compare?
Sources
- [1] Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - arxiv.org/abs/2310.06770 - 2023
- [2] SWE-bench Verified Leaderboard - swebench.com - Captured April 2026
- [3] Anthropic Claude 4.5 Model Card - Captured April 2026
- [4] OpenAI GPT-5 Model Card - Captured April 2026
- [5] SWE-agent GitHub - github.com/princeton-nlp/SWE-agent