HumanEval, MBPP, and LiveCodeBench - Code Benchmarks in 2026
HumanEval was the standard code-generation benchmark from 2021 through 2023. It is now saturated: frontier models score 96-98% pass@1, making it useless for comparing current models. This page explains the history of code benchmarks, why LiveCodeBench solved the contamination problem, and where to look for meaningful coding comparisons in 2026.
HumanEval
HumanEval was created by Mark Chen and colleagues at OpenAI (2021) as a benchmark for code generation. The dataset contains 164 Python programming problems, each consisting of a function signature and docstring. The model must complete the function body. Each problem has an average of 7.7 unit tests for evaluation.
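The task format can be sketched as follows. This is an illustrative problem in the style of HumanEval, not a verbatim dataset item: the model receives the signature and docstring and must produce the body, which is then checked against hidden unit tests.

```python
from typing import List

# Prompt given to the model: signature + docstring only
# (illustrative example in HumanEval's style, not from the dataset).
def rolling_max(numbers: List[int]) -> List[int]:
    """Return the running maximum of the input list.
    >>> rolling_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # --- model-generated completion begins here ---
    result = []
    current = float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Hidden unit tests (HumanEval averages 7.7 per problem);
# the solution passes only if all assertions hold.
assert rolling_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert rolling_max([]) == []
```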
The primary metric is pass@1: the fraction of problems where the model's single generated solution passes all unit tests. The 2021 baseline was approximately 29% (Codex, the first dedicated code model). GPT-4 reached 67% in 2023. By 2024, frontier models were at 85-90%. By April 2026, the best models exceed 98%.
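More generally, pass@k is the probability that at least one of k sampled solutions passes all tests. The HumanEval paper gives an unbiased estimator computed from n generations of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), evaluated as a numerically stable product:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    probability that at least one of k samples drawn from n
    generations (c of which are correct) passes all unit tests.
    """
    if n - c < k:
        # Fewer incorrect samples than k: some correct one is always drawn.
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed as a stable running product.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail

# With a single generation (n=1, k=1), pass@1 is just the pass rate.
assert pass_at_k(1, 1, 1) == 1.0
assert pass_at_k(1, 0, 1) == 0.0
```

For example, with n=10 samples of which c=3 pass, pass@1 evaluates to 0.3, matching the intuitive per-sample pass rate.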
At 98%+ SOTA, HumanEval has no discriminating power between frontier models. A 97% score and a 98% score differ by fewer than 2 problems out of 164. The problems are also simple by 2026 standards: they are single-function implementations of well-defined algorithms, mostly solvable with standard library functions. A capable model cannot be meaningfully ranked on this benchmark.
| Era | Model | pass@1 |
|---|---|---|
| 2021 (launch) | Codex-12B | 28.8% |
| 2022 | InstructGPT (text-davinci-002) | 46.8% |
| 2023-Q1 | GPT-4 | 67.0% |
| 2023-Q4 | Claude 2 | 71.2% |
| 2024-Q2 | Claude 3.5 Sonnet | 92.0% |
| 2025 | Claude 4 Sonnet | 95.3% |
| Apr 2026 | GPT-5 | 98.1% |
Sources: OpenAI model cards, Anthropic model cards, Papers With Code. All figures are 0-shot pass@1.
MBPP
MBPP (Mostly Basic Python Problems) was created by Jacob Austin and colleagues at Google Research (2021). The benchmark contains 974 Python programming problems ranging from simple mathematical calculations to basic data structure operations. Problems are intended to be solvable by beginning programmers.
MBPP has the same saturation problem as HumanEval. Frontier models score 97%+ pass@1. The benchmark was valuable in 2021-2022 for demonstrating that language models could handle basic programming tasks. By 2026 it is only useful for testing smaller models (7B-30B range) where there is still meaningful separation, or for quick regression testing.
LiveCodeBench - The Anti-Contamination Solution
LiveCodeBench was created by Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica (multiple institutions), published in 2024. Its key innovation: it draws problems from LeetCode and AtCoder that were released after a specified date. When evaluating a model, only problems released after the model's training cutoff are used.
This is structural contamination prevention. A model trained with a cutoff of January 2025 is evaluated only on problems published after January 2025. By construction, the model cannot have seen the test problems during training. HumanEval's problems are from 2021 and appear in web crawls; LiveCodeBench's problems are continuously refreshed.
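The cutoff filter can be sketched in a few lines. This is a minimal illustration of the idea, with hypothetical record fields, not LiveCodeBench's actual data schema:

```python
from datetime import date

# Hypothetical problem records: LiveCodeBench tags each problem
# with its public release date on LeetCode or AtCoder.
problems = [
    {"id": "lc-weekly-412", "released": date(2024, 11, 2)},
    {"id": "atcoder-abc-390", "released": date(2025, 3, 15)},
]

def eval_set(problems, model_cutoff: date):
    """Keep only problems released strictly after the model's
    training cutoff, so none can appear in its training data."""
    return [p for p in problems if p["released"] > model_cutoff]

# A model trained with a January 2025 cutoff is scored
# only on the problem published after that date.
subset = eval_set(problems, date(2025, 1, 31))
assert [p["id"] for p in subset] == ["atcoder-abc-390"]
```

The design choice is that contamination is prevented by construction rather than detected after the fact: no decontamination heuristics or overlap scans are needed, only release-date metadata.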
As of April 2026, GPT-5 leads at 74.8% pass@1, with Claude 4.5 Opus at approximately 71.3% and Gemini 2.5 Pro at 69.7%. The 15-20 point gap between frontier models and mid-tier models provides real discrimination. LiveCodeBench is the recommended benchmark for code generation comparisons in 2026.
Where to Look Instead
LiveCodeBench
Anti-contamination by design. LeetCode/AtCoder problems post-training-cutoff. Best for 2026 code generation comparisons.
https://livecodebench.github.io/
SWE-bench Verified
Real GitHub issues, real repositories. Best for evaluating coding agents rather than code-completion models.
benchmarkingagents.com/swe-bench
BigCodeBench
Multi-language code benchmark with diverse task types beyond Python function completion. Better coverage of real-world coding scenarios.
https://bigcode-bench.github.io/
Terminal-Bench
Shell and system tasks for agents that need to work in terminal environments. Relevant for DevOps and SRE agents.
benchmarkingagents.com/agent-benchmarks
Frequently Asked Questions
Is HumanEval still useful in 2026?
Not for comparing frontier models: at 96-98% pass@1 it is saturated. It remains useful for quick regression testing and for smaller models where scores still separate.
What is the best coding benchmark in 2026?
LiveCodeBench for code generation; SWE-bench Verified for coding agents working on real repositories.
How does LiveCodeBench avoid contamination?
It evaluates each model only on LeetCode and AtCoder problems released after that model's training cutoff, so the test problems cannot have appeared in the training data.
What is pass@k?
The probability that at least one of k sampled solutions passes all unit tests. pass@1, the most commonly reported variant, is the single-attempt pass rate.
Sources
- [1] Chen et al., HumanEval - arxiv.org/abs/2107.03374 - 2021
- [2] Austin et al., MBPP - arxiv.org/abs/2108.07732 - 2021
- [3] Jain et al., LiveCodeBench - livecodebench.github.io - 2024
- [4] Papers With Code HumanEval - paperswithcode.com - Captured April 2026