Coding-Agent Benchmarks: The 2026 Selection Guide
The coding-benchmark landscape in 2026 is settled enough to recommend a clear default. Quote SWE-bench Verified for engineering, LiveCodeBench for code generation, and Terminal-Bench for shell work. Skip HumanEval and MBPP for frontier comparisons. Pick one or two from the headline trio based on what your agent actually does.
The headline trio
By May 2026 the coding-benchmark landscape has consolidated around three benchmarks that genuinely discriminate between frontier models and that cover three meaningfully different aspects of coding capability. SWE-bench Verified measures real engineering capability: navigate a multi-file repository, understand a GitHub issue, and write a patch that makes the failing test pass without breaking existing tests. LiveCodeBench measures code generation from a spec: write a function that solves a competitive-programming problem within time and memory limits. Terminal-Bench measures shell-agent capability: drive a real Linux container through scripting, sysadmin, debugging, and DevOps tasks.
The three benchmarks are largely orthogonal. A model that scores 75 percent on LiveCodeBench might score 50 percent on SWE-bench Verified, and the inverse pattern also occurs: the two benchmarks measure different things, pure generation skill versus engineering-in-context skill. Adding Terminal-Bench gives a third independent dimension, interactive shell competence, which neither of the first two captures. A serious coding-agent claim quotes at least two of the three; the strongest claims quote all three.
What unifies the three is that they all have programmatic success functions (tests pass, commands succeed, files match expected state) rather than LLM-as-judge scoring. This makes them more reproducible and less subject to evaluator drift than benchmarks that depend on subjective grading. The trade-off is that the success functions can occasionally over-credit (a partial completion that happens to pass the check) or under-credit (a correct solution in an unexpected format), but both error modes are bounded and well-understood.
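To make the contrast concrete, a programmatic success function is just a deterministic check the harness runs after the agent finishes. The sketch below is illustrative rather than any benchmark's actual harness code; the test command, timeout, and file comparison are assumptions.

```python
import subprocess
from pathlib import Path

def tests_pass(repo_dir: str, test_cmd=("pytest", "-q"), timeout_s: int = 1800) -> bool:
    """SWE-bench-style check: credit the patch only if the designated
    test command exits 0 inside the repository."""
    try:
        result = subprocess.run(list(test_cmd), cwd=repo_dir,
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def file_matches(path: str, expected: str) -> bool:
    """Terminal-Bench-style check: credit the task only if the file the
    agent was asked to produce contains the expected content."""
    target = Path(path)
    return target.exists() and target.read_text().strip() == expected.strip()
```

Both checks return a plain boolean, which is what makes this scoring style reproducible: there is no judge model whose behaviour can drift between evaluation runs.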
Benchmark-by-benchmark comparison
The full landscape of coding benchmarks in 2026 includes the headline trio plus several historical or niche benchmarks. The picture below summarises what each measures, the current frontier, the strengths, the weaknesses, and the recommendation.
Use-case-by-use-case selection guide
The right benchmark depends on what your coding agent does. The table below gives the recommended primary and secondary benchmark for each common coding-agent use case. The primary is the headline number to lead with; the secondary is the complementary signal that makes the comparison more complete.
Why HumanEval and MBPP belong in the rear-view mirror
HumanEval and MBPP were the standard code benchmarks from 2021 through 2023, and their saturation in 2024 is what created the need for the headline trio above. Both benchmarks now suffer from three compounding problems. First, they are saturated: frontier models score 96-99 percent on HumanEval and similar levels on MBPP, leaving no discriminating power between top models. Second, they are contaminated: both benchmarks have been documented in pre-training corpora, so high scores partly reflect memorisation rather than capability. Third, they are too easy: the problems are single-function exercises that do not test the multi-file reasoning, repository navigation, or test-suite-aware editing that real engineering work requires.
The right use for HumanEval in 2026 is as a sanity floor for evaluating new small or open-weight models: if your evaluation pipeline reports a frontier-model HumanEval score below 95 percent, the pipeline is misconfigured. As a frontier comparison metric, HumanEval has been replaced by LiveCodeBench in essentially every credible evaluation. MBPP follows the same pattern.
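If you want to encode that sanity floor directly in an evaluation pipeline, a hard assertion is enough. The threshold constant and function below are hypothetical, not part of any published harness.

```python
# Hypothetical sanity gate: HumanEval as a floor for pipeline correctness,
# not as a frontier comparison metric. The 95 percent threshold mirrors the text.
HUMANEVAL_FRONTIER_FLOOR = 0.95

def check_pipeline_sanity(model_name: str, humaneval_pass_at_1: float) -> None:
    """Raise if a supposedly frontier model scores below the floor, which
    usually indicates a misconfigured harness rather than a weak model."""
    if humaneval_pass_at_1 < HUMANEVAL_FRONTIER_FLOOR:
        raise RuntimeError(
            f"{model_name} scored {humaneval_pass_at_1:.0%} on HumanEval; "
            "frontier models should clear 95%, so check stop tokens, "
            "completion truncation, and the execution sandbox."
        )
```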
The wider lesson is that benchmarks have a useful lifespan. HumanEval served the field well for three years; its retirement is a successful evolution rather than a failure. The new headline trio will likely follow a similar arc: expect SWE-bench Verified to need a successor by 2027 or 2028 as the frontier approaches saturation. The pattern is documented further in our piece on what these benchmarks miss.
Harness sensitivity is the hidden variable
Coding-agent benchmark scores depend not just on the underlying model but on the agentic harness that wraps it. The same model can score 20 points apart on SWE-bench Verified depending on whether it runs in a minimal or a strong harness. The strongest published SWE-bench scores use proprietary scaffolding from Anthropic, OpenAI, and well-funded research labs; community submissions using open frameworks (LangGraph, AutoGen, SWE-agent, Aider) typically score 15-25 points lower with the same underlying model.
When citing a coding-benchmark score, the harness matters as much as the model. "Claude 4 reaches 73 percent on SWE-bench Verified" is roughly meaningless without a harness disclosure, because the score depends heavily on the scaffolding. The honest pattern is "[model] in [harness] reaches X percent on [benchmark], according to [source]". See our SWE-bench Verified deep dive for a detailed discussion of harness sensitivity.
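One lightweight way to enforce that pattern is to make the harness and source required fields in whatever record stores a result, so a score without a disclosure cannot even be constructed. The record below is a hypothetical sketch, not an existing schema; the placeholder values are not real results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkClaim:
    """A benchmark result that cannot be stated without its harness and source."""
    model: str
    harness: str       # e.g. an open framework name or "proprietary scaffold"
    benchmark: str     # e.g. "SWE-bench Verified"
    score_pct: float
    source: str        # leaderboard URL or paper reference

    def __str__(self) -> str:
        return (f"{self.model} in {self.harness} reaches {self.score_pct:.1f}% "
                f"on {self.benchmark}, according to {self.source}")

# Placeholder values only; the point is the mandatory fields, not the numbers.
print(BenchmarkClaim("example-model", "example-harness",
                     "SWE-bench Verified", 70.0, "example leaderboard URL"))
```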
Combining the headline trio
There are two practical patterns for combining the headline benchmarks in a single comparison. First, the "triangle" pattern: report SWE-bench Verified, LiveCodeBench overall, and Terminal-Bench overall, all for the same model and harness. The triangle gives a complete picture of engineering, generation, and shell-agent capability. This is the pattern we recommend for new model releases.
Second, the "use-case-weighted" pattern: pick the benchmark that most closely matches the deployment use case as the headline, and quote the others as secondary signals. This is the pattern we recommend for production-deployment decisions: if your agent does engineering work, lead with SWE-bench Verified; if it does code generation, lead with LiveCodeBench; if it does shell work, lead with Terminal-Bench.
Either pattern is more informative than reporting a single benchmark score, which inevitably misses some axis of capability. The combination is what makes the comparison robust.
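Below is a minimal sketch of both reporting patterns, assuming the three scores have already been collected for a single model-and-harness pair; the function names and output formats are illustrative.

```python
TRIANGLE = ("SWE-bench Verified", "LiveCodeBench", "Terminal-Bench")

def triangle_report(scores: dict[str, float]) -> str:
    """Report all three headline scores, refusing to produce a report
    if any leg of the triangle is missing."""
    missing = [b for b in TRIANGLE if b not in scores]
    if missing:
        raise ValueError(f"incomplete triangle, missing: {', '.join(missing)}")
    return " | ".join(f"{b}: {scores[b]:.1f}%" for b in TRIANGLE)

def use_case_weighted_report(scores: dict[str, float], primary: str) -> str:
    """Lead with the benchmark closest to the deployment use case and
    quote the remaining legs as secondary signals."""
    secondary = ", ".join(f"{b}: {scores[b]:.1f}%"
                          for b in TRIANGLE if b != primary and b in scores)
    return f"{primary}: {scores[primary]:.1f}% (headline); secondary: {secondary}"
```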
What about real-world deployment readiness?
A high score on the headline trio is necessary but not sufficient for production deployment readiness. Three additional axes matter that the public benchmarks do not capture. First, latency and cost: an agent that scores 75 percent on SWE-bench Verified but takes 30 minutes per task at $5 per task may not be deployable at production scale. Second, robustness across the long tail: benchmark distributions are biased toward common task types; production agents face tail events the benchmarks under-sample. Third, integration friction: an agent that runs in isolation on benchmark infrastructure may be hard to integrate into a team's actual development workflow.
Public benchmarks measure capability under controlled conditions. Production deployment also requires capability under uncontrolled conditions, at acceptable latency and cost, with manageable integration friction. The honest claim about a coding agent is that benchmark numbers are a floor, not a ceiling, and that production-readiness work begins where benchmark wins end. See our production monitoring guide for the operational complement to benchmark capability.
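The latency-and-cost axis is easy to make concrete with back-of-the-envelope arithmetic. The task volume below is an assumption; the per-task figures mirror the example above.

```python
def deployment_load(tasks_per_day: int, minutes_per_task: float, dollars_per_task: float):
    """Rough daily cost and wall-clock load for an agent at a given task volume."""
    daily_cost = tasks_per_day * dollars_per_task
    agent_hours = tasks_per_day * minutes_per_task / 60
    return daily_cost, agent_hours

cost, hours = deployment_load(tasks_per_day=200, minutes_per_task=30, dollars_per_task=5.0)
print(f"200 tasks/day -> ${cost:,.0f}/day and {hours:.0f} agent-hours of latency budget")
# 200 tasks/day -> $1,000/day and 100 agent-hours of latency budget
```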
Frequently asked questions
- Which coding benchmark should I quote in 2026?
- Why is HumanEval considered saturated?
- What's the difference between LiveCodeBench and SWE-bench Verified?
- Where does Terminal-Bench fit?
- What about BigCodeBench, MBPP, APPS, ClassEval?
- Are there contamination-resistant code benchmarks beyond LiveCodeBench?
Sources
- [1] SWE-bench Verified leaderboard. swebench.com. Accessed May 2026.
- [2] LiveCodeBench project site. livecodebench.github.io.
- [3] Terminal-Bench project site. terminal-bench.dev.
- [4] BigCodeBench project. bigcode-bench.github.io.