Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. All scores are cited with source and capture date.
TL;DR
What: 500 real GitHub issues, agent must write a patch that passes tests
Who: Jimenez, Yang, et al., Princeton + U Chicago, 2023
SOTA (Apr 2026): 74.5% - Claude 4.5 Opus
Leaderboard: swebench.com
Last verified April 2026

SWE-bench Verified Explained - 2026 Scores, Methodology, Caveats

SWE-bench Verified is the canonical benchmark for coding agents as of 2026. Unlike HumanEval (164 toy functions, now saturated) or MBPP (basic Python, also saturated), SWE-bench tests whether an agent can solve real software engineering problems in real codebases. The task is practical and the evaluation is objective: either the failing test passes without breaking anything else, or it does not.

Origin and Construction

SWE-bench was created by Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan (Princeton University and the University of Chicago), first published in October 2023. The benchmark draws from 2,294 real GitHub issues across 12 popular Python repositories: Django, SymPy, Astropy, matplotlib, scikit-learn, pytest, requests, Flask, xarray, pylint, sphinx-doc, and seaborn.

Each task is a real bug report or feature request with an associated failing test. The benchmark records the repository state just before the fix was merged, the issue description, and the tests that were added to verify the fix. The agent's job is to produce a git diff that makes those tests pass without breaking anything already passing.
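A single task can be pictured as a small record. The sketch below is illustrative: the field names are modeled on the public dataset schema, but the instance id and values are hypothetical, not a real benchmark entry.

```python
# Illustrative sketch of one SWE-bench task record. Field names are modeled
# on the public dataset schema; the concrete values here are hypothetical.
task = {
    "instance_id": "django__django-00000",    # hypothetical identifier
    "repo": "django/django",
    "base_commit": "deadbeef",                # repo state before the fix
    "problem_statement": "Validator accepts a trailing newline ...",
    "FAIL_TO_PASS": ["test_strips_newline"],  # failing tests the patch must fix
    "PASS_TO_PASS": ["test_accepts_ascii"],   # passing tests the patch must not break
}

def is_well_formed(t: dict) -> bool:
    """Structural check: a task needs a repo snapshot, an issue
    description, and at least one failing test to fix."""
    required = {"repo", "base_commit", "problem_statement",
                "FAIL_TO_PASS", "PASS_TO_PASS"}
    return required <= t.keys() and len(t["FAIL_TO_PASS"]) > 0
```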

The repositories were chosen for diversity of domain (web frameworks, scientific computing, image processing, testing tools, documentation) and for having high-quality test suites that make evaluation reliable. All 12 are actively maintained open-source projects used in production.

SWE-bench vs SWE-bench Lite vs SWE-bench Verified

| Version | Tasks | Notes | Status |
|---|---|---|---|
| SWE-bench | 2,294 | Original. Noisy: some tasks unsolvable or under-specified; scores unstable. | Superseded |
| SWE-bench Lite | 300 | Curated subset, lower noise. Used in 2024 papers. Less authoritative than Verified. | Legacy |
| SWE-bench Verified | 500 | Human-verified by OpenAI (August 2024). Each task confirmed solvable and clear. | Use this |

When a paper or model card says "SWE-bench" without specifying a version, ask which one it means: 80% on SWE-bench Lite is not comparable to 74.5% on Verified.

SOTA Progression Timeline

| Date | Score | Model / System | Source |
|---|---|---|---|
| Nov 2023 | 1.9% | Claude 2 (initial paper) | Jimenez et al. 2023 |
| Mar 2024 | 4.8% | Claude 3 Opus | Anthropic model card |
| May 2024 | 12.5% | SWE-agent + GPT-4o | Princeton SWE-agent |
| Oct 2024 | 33.2% | Claude 3.5 Sonnet | Anthropic model card |
| Dec 2024 | 49.0% | Claude 3.5 Sonnet (agentic harness) | Anthropic extended eval |
| Mar 2025 | 64.6% | Claude 4 Sonnet | Anthropic model card |
| Sep 2025 | 71.3% | GPT-5 | OpenAI model card |
| Apr 2026 | 74.5% | Claude 4.5 Opus | Official SWE-bench leaderboard |

All scores on SWE-bench Verified (500 tasks) unless otherwise noted. Scores pre-August 2024 are on original SWE-bench or Lite and are not directly comparable.

How It Is Actually Scored

The evaluation harness gives the agent access to a Docker container running the target repository at the exact commit before the fix. The agent has access to the repository files, the issue description, and a test runner. It is not given the failing test directly - it must discover the test, understand the issue, and produce a patch.

Success requires two things simultaneously: all fail_to_pass tests (the tests that were failing before the fix) must now pass, AND all pass_to_pass tests (tests that were already passing) must still pass. This prevents trivially invalid patches - you cannot delete the failing test or disable all tests to score a pass.
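That acceptance rule reduces to a simple predicate over per-test outcomes. This is a sketch of the logic, not the official harness code:

```python
def resolved(results: dict[str, bool],
             fail_to_pass: list[str],
             pass_to_pass: list[str]) -> bool:
    """A task is resolved only if every previously failing test now passes
    AND every previously passing test still passes. A test missing from
    the results (e.g. deleted by the patch) counts as a failure."""
    required = fail_to_pass + pass_to_pass
    return all(results.get(test, False) for test in required)

# Deleting the failing test does not help: its result is absent, so the
# task is not counted as resolved.
assert not resolved({"test_old": True}, ["test_new"], ["test_old"])
assert resolved({"test_old": True, "test_new": True},
                ["test_new"], ["test_old"])
```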

The evaluation does not check whether the patch is good code - only whether it passes tests. This is both the benchmark's strength (objective, reproducible) and its limitation (a test suite is not a complete specification of correct behavior).

Known Limitations

Python only

All 12 repositories are Python. Performance on JavaScript, Go, Rust, or TypeScript codebases is unknown and potentially very different.

Test flakiness

Some tests pass or fail non-deterministically due to timing, resource availability, or randomness. Flaky tests inflate or deflate scores depending on the run.
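A common way to flag flakiness is to rerun a test several times and check whether the outcomes disagree. A minimal sketch, not the benchmark's own tooling:

```python
def is_flaky(run_test, n_runs: int = 5) -> bool:
    """Rerun a zero-argument test function and report whether repeated
    runs disagree. Deterministic tests produce a single outcome."""
    outcomes = {bool(run_test()) for _ in range(n_runs)}
    return len(outcomes) > 1

# A deterministic test is never flagged; one whose result alternates is.
toggle = iter([True, False] * 10)
assert not is_flaky(lambda: True)
assert is_flaky(lambda: next(toggle))
```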

Contamination risk

These are real public issues. Their solutions exist in the same repository's git history. Any model trained after the issue was closed may have seen the solution.
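A rough first-pass mitigation is to compare each task's fix date against a model's training-data cutoff; anything older may be contaminated. The dates below are hypothetical examples:

```python
from datetime import date

def possibly_contaminated(fix_merged: date, training_cutoff: date) -> bool:
    """If the issue's fix was merged before the model's training cutoff,
    the ground-truth patch may appear in the training data."""
    return fix_merged <= training_cutoff

# Hypothetical: a fix merged in 2019 vs. a model trained on data through 2023.
assert possibly_contaminated(date(2019, 3, 1), date(2023, 12, 31))
assert not possibly_contaminated(date(2026, 1, 15), date(2023, 12, 31))
```

This only bounds the risk; it cannot prove a specific task was or was not memorized.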

Production gap

A patch that passes tests is not necessarily production-ready. It might introduce security vulnerabilities, performance regressions, or violate style conventions that tests do not cover.

How to Interpret a SWE-bench Score

0-10%
Research baseline. Early agent systems, small models, or ablation studies. Not production-relevant.
10-30%
Emerging capability. Can handle simple bug fixes in familiar code patterns. Requires heavy human review.
30-50%
Serviceable assistant. Useful for routine maintenance tasks. Can draft fixes that often require minor human editing.
50-70%
Productive collaborator. Strong coding agent. Handles a wide range of issues. Human review still recommended.
70-80%
Frontier 2026. Best available as of April 2026. Still fails on complex multi-file refactors and subtle logic.
80%+
Either SOTA or the harness is doing most of the work. Ask: what harness? What tool access? Is this on Verified or Lite?

Frequently Asked Questions

Is SWE-bench the best coding benchmark?
SWE-bench Verified is the most important coding-agent benchmark in 2026 because it measures real engineering tasks. For code-completion benchmarks, LiveCodeBench is more relevant. For pure code generation on isolated functions, HumanEval still exists but is saturated. The right benchmark depends on the task.
What is the difference between SWE-bench and HumanEval?
HumanEval tests single-function Python completion given a docstring. SWE-bench Verified tests multi-file patch generation for real GitHub issues. The agent navigates a real repository, understands the issue, writes a patch that changes multiple files, and must not break any existing tests. The engineering complexity gap is enormous.
Can small models do well on SWE-bench Verified?
Small models (7B-13B parameters) currently score below 5% on SWE-bench Verified. Models in the 30-70B range score 15-30% with strong harnesses. Frontier-class models (Claude 4.5, GPT-5) reach 70-75%. SWE-bench is one of the clearest cases where model scale strongly predicts task performance.
Is SWE-bench gameable?
Partially. A patch that makes fail_to_pass tests pass by stubbing out the tested behavior would score a pass on SWE-bench but would be useless in production. The Verified subset reduces but does not eliminate this risk. The most reliable anti-gaming mechanism is checking whether the patch makes semantic sense, which requires human review.
Does the agentic harness matter when comparing scores?
SWE-bench scores are harness-dependent. The agent's access to tools significantly affects results. Always ask: what harness was used? The official SWE-agent harness (from the Princeton team) is the most common. A score from a custom harness with extended tool access is not comparable to a score from the standard harness.

Sources

1. Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - arxiv.org/abs/2310.06770 - 2023
2. SWE-bench Verified Leaderboard - swebench.com - Captured April 2026
3. Anthropic Claude 4.5 Model Card - Captured April 2026
4. OpenAI GPT-5 Model Card - Captured April 2026
5. SWE-agent GitHub - github.com/princeton-nlp/SWE-agent