HumanEval, MBPP, and LiveCodeBench - Code Benchmarks in 2026
HumanEval was the standard code-generation benchmark from 2021 through 2023. It is now saturated: frontier models score 96-98% pass@1, making it useless for comparing current models. This page explains the history of code benchmarks, why LiveCodeBench solved the contamination problem, and where to look for meaningful coding comparisons in 2026.
HumanEval
HumanEval was created by Mark Chen and colleagues at OpenAI (2021) as a benchmark for code generation. The dataset contains 164 Python programming problems, each consisting of a function signature and docstring. The model must complete the function body. Each problem has an average of 7.7 unit tests for evaluation.
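The task format can be sketched as follows. This is an illustrative problem in the style of HumanEval, not a verbatim dataset item: the model receives the signature and docstring and must produce the body, which is then checked against hidden unit tests.

```python
from typing import List

# Prompt given to the model: signature + docstring only
# (illustrative example in HumanEval's style, not from the dataset).
def rolling_max(numbers: List[int]) -> List[int]:
    """Return the running maximum of the input list.
    >>> rolling_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # --- model-generated completion begins here ---
    result = []
    current = float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Hidden unit tests (HumanEval averages 7.7 per problem);
# the solution passes only if all assertions hold.
assert rolling_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert rolling_max([]) == []
```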
The primary metric is pass@1: the fraction of problems where the model's single generated solution passes all unit tests. The 2021 baseline was approximately 29% (Codex, the first dedicated code model). GPT-4 reached 67% in 2023. By 2024, frontier models were at 85-90%. By April 2026, the best models exceed 98%.
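More generally, pass@k is the probability that at least one of k sampled solutions passes all tests. The HumanEval paper gives an unbiased estimator computed from n generations of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), evaluated as a numerically stable product:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    probability that at least one of k samples drawn from n
    generations (c of which are correct) passes all unit tests.
    """
    if n - c < k:
        # Fewer incorrect samples than k: some correct one is always drawn.
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed as a stable running product.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail

# With a single generation (n=1, k=1), pass@1 is just the pass rate.
assert pass_at_k(1, 1, 1) == 1.0
assert pass_at_k(1, 0, 1) == 0.0
```

For example, with n=10 samples of which c=3 pass, pass@1 evaluates to 0.3, matching the intuitive per-sample pass rate.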
At 98%+ SOTA, HumanEval has no discriminating power between frontier models. A 97% score and a 98% score differ by fewer than 2 problems out of 164. The problems are also simple by 2026 standards: they are single-function implementations of well-defined algorithms, mostly solvable with standard library functions. A capable model cannot be meaningfully ranked on this benchmark.
| Era | Model | pass@1 |
|---|---|---|
| 2021 (launch) | Codex-12B | 28.8% |
| 2022 | InstructGPT (text-davinci-002) | 46.8% |
| 2023-Q1 | GPT-4 | 67.0% |
| 2023-Q4 | Claude 2 | 71.2% |
| 2024-Q2 | Claude 3.5 Sonnet | 92.0% |
| 2025 | Claude 4 Sonnet | 95.3% |
| Apr 2026 | GPT-5 | 98.1% |
Sources: OpenAI model cards, Anthropic model cards, Papers With Code. All figures are 0-shot pass@1.
MBPP
MBPP (Mostly Basic Python Problems) was created by Jacob Austin and colleagues at Google Research (2021). The benchmark contains 974 Python programming problems ranging from simple mathematical calculations to basic data structure operations. Problems are intended to be solvable by beginning programmers.
MBPP has the same saturation problem as HumanEval. Frontier models score 97%+ pass@1. The benchmark was valuable in 2021-2022 for demonstrating that language models could handle basic programming tasks. By 2026 it is only useful for testing smaller models (7B-30B range) where there is still meaningful separation, or for quick regression testing.
LiveCodeBench - The Anti-Contamination Solution
LiveCodeBench was created by Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica (multiple institutions), published in 2024. Its key innovation: it draws problems from LeetCode and AtCoder that were released after a specified date. When evaluating a model, only problems released after the model's training cutoff are used.
This is structural contamination prevention. A model trained with a cutoff of January 2025 is evaluated only on problems published after January 2025. By construction, the model cannot have seen the test problems during training. HumanEval's problems are from 2021 and appear in web crawls; LiveCodeBench's problems are continuously refreshed.
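The cutoff filter can be sketched in a few lines. This is a minimal illustration of the idea, with hypothetical record fields, not LiveCodeBench's actual data schema:

```python
from datetime import date

# Hypothetical problem records: LiveCodeBench tags each problem
# with its public release date on LeetCode or AtCoder.
problems = [
    {"id": "lc-weekly-412", "released": date(2024, 11, 2)},
    {"id": "atcoder-abc-390", "released": date(2025, 3, 15)},
]

def eval_set(problems, model_cutoff: date):
    """Keep only problems released strictly after the model's
    training cutoff, so none can appear in its training data."""
    return [p for p in problems if p["released"] > model_cutoff]

# A model trained with a January 2025 cutoff is scored
# only on the problem published after that date.
subset = eval_set(problems, date(2025, 1, 31))
assert [p["id"] for p in subset] == ["atcoder-abc-390"]
```

The design choice is that contamination is prevented by construction rather than detected after the fact: no decontamination heuristics or overlap scans are needed, only release-date metadata.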
As of April 2026, GPT-5 leads at 74.8% pass@1, with Claude 4.5 Opus at approximately 71.3% and Gemini 2.5 Pro at 69.7%. The 15-20 point gap between frontier models and mid-tier models provides real discrimination. LiveCodeBench is the recommended benchmark for code generation comparisons in 2026.
Where to Look Instead
LiveCodeBench
Anti-contamination by design. LeetCode/AtCoder problems post-training-cutoff. Best for 2026 code generation comparisons.
https://livecodebench.github.io/
SWE-bench Verified
Real GitHub issues, real repositories. Best for evaluating coding agents rather than code-completion models.
benchmarkingagents.com/swe-bench
BigCodeBench
Multi-language code benchmark with diverse task types beyond Python function completion. Better coverage of real-world coding scenarios.
https://bigcode-bench.github.io/
Terminal-Bench
Shell and system tasks for agents that need to work in terminal environments. Relevant for DevOps and SRE agents.
benchmarkingagents.com/agent-benchmarks
Frequently Asked Questions
Is HumanEval still useful in 2026?
Not for comparing frontier models: at 96-98% pass@1 it is saturated. It remains useful for quick regression testing and for smaller models where scores still separate.
What is the best coding benchmark in 2026?
LiveCodeBench for code generation; SWE-bench Verified for coding agents working on real repositories.
How does LiveCodeBench avoid contamination?
It evaluates each model only on LeetCode and AtCoder problems released after that model's training cutoff, so the test problems cannot have appeared in the training data.
What is pass@k?
The probability that at least one of k sampled solutions passes all unit tests. pass@1, the most commonly reported variant, is the single-attempt pass rate.
Sources
- [1] Chen et al., HumanEval - arxiv.org/abs/2107.03374 - 2021
- [2] Austin et al., MBPP - arxiv.org/abs/2108.07732 - 2021
- [3] Jain et al., LiveCodeBench - livecodebench.github.io - 2024
- [4] Papers With Code HumanEval - paperswithcode.com - Captured April 2026