Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. Scores cited with source and capture date. Affiliate disclosure.
Editorial - Last verified April 2026

What LLM Benchmarks Don't Measure - Contamination, Saturation, Blind Spots

Benchmarks are the best objective measurement tools we have. They are also systematically misleading in ways that the ML community discusses privately and publishes rarely. This page is the public version of those private discussions.

This is not an argument against benchmarks. A benchmark is a claim, not a fact. That distinction changes how you should read every leaderboard, every model card, and every “State of AI” report. Read them with that framing, and they are useful. Read them as facts, and they will mislead you.

Five Honest Problems with the Benchmark Landscape

§ 01

Contamination

Training-data contamination is the most serious structural problem in the benchmark ecosystem. It occurs when test questions from a benchmark appear in a model's training data. The model then solves benchmark problems by recalling memorised answers rather than by reasoning through them.

Documented cases: MMLU test questions have been found verbatim in Common Crawl, the primary web-crawl dataset used in most pre-training corpora. Researchers at Allen AI (2024) found that approximately 5-10% of MMLU test questions appear with high similarity in standard pre-training datasets. HumanEval problems are near-duplicates of LeetCode solutions that appear extensively in GitHub and Stack Overflow crawls, both of which are common pre-training sources. SWE-bench issues use real GitHub issues whose solutions are in the same repository's git history - any model trained on GitHub data after the fix was committed may have seen the solution.

The contamination problem is structural, not a matter of bad actors. Pre-training corpora are assembled from web crawls that include a huge fraction of the public internet. Public benchmarks, by definition, appear on the internet. The solution is either: (1) new benchmarks with test questions that cannot have appeared in training data (LiveCodeBench's approach: only use problems released after the model's training cutoff), or (2) test sets that are never made public (the ARC Prize private test set approach).
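Contamination studies typically quantify overlap by matching word-level n-grams between test questions and training documents. A minimal sketch of that style of check (the 13-gram default and the function names here are illustrative assumptions, not any specific study's implementation):

```python
def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams; 13-token windows are a common choice in contamination studies."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(test_question: str, training_doc: str, n: int = 13) -> float:
    """Fraction of the test question's n-grams that also appear in a training document.
    High overlap suggests the question (or a near-duplicate) was in the training data."""
    q = ngrams(test_question, n)
    if not q:
        return 0.0
    return len(q & ngrams(training_doc, n)) / len(q)
```

At corpus scale this runs against hashed n-gram indexes of the full pre-training data rather than document-by-document comparison, but the principle is the same.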

MMLU contamination details ->
§ 02

Saturation

A benchmark saturates when frontier models all score above approximately 90%, making it unable to discriminate between them. Saturation is not a failure mode - it is a success story: it means the field has genuinely mastered the capability the benchmark was designed to measure.

The problem is not saturation itself but the continued citation of saturated benchmarks as if they provide meaningful signal. MMLU saturated in 2024. HumanEval saturated in 2024. HellaSwag saturated in 2023. WinoGrande saturated in 2023. These are facts, not opinions.

The saturation timeline is accelerating. BIG-Bench Hard was designed in 2022 to be hard for frontier models. By April 2026, SOTA is 94.3%, and the benchmark is on track to saturate within 12 months. GPQA-Diamond, which had substantial headroom in 2024, is now at 78.4% SOTA with all frontier models between 71-78%; at current rates it will saturate within 18-24 months.

The practical consequence: any model comparison that relies primarily on MMLU, HumanEval, or HellaSwag is useless for distinguishing 2026 frontier models. The comparison table that a vendor publishes with a 94% MMLU score is comparing noise against noise.
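Why "noise against noise": on a finite test set, an accuracy score carries sampling error. A rough normal-approximation sketch (the 1,000-question test-set size is illustrative):

```python
import math

def ci_halfwidth(score: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for an accuracy score on n questions."""
    return z * math.sqrt(score * (1 - score) / n_questions)

# Two models at 94.0% and 94.8% on a 1,000-question benchmark:
# each score carries roughly +/-1.5 points of sampling error,
# so the 0.8-point gap between them is indistinguishable from noise.
hw = ci_halfwidth(0.94, 1000)  # ~0.0147, i.e. about 1.5 percentage points
```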

Current benchmark status list ->
§ 03

Self-Reporting Conflicts

Model cards are published by the same company that built the model. This is a structural conflict of interest that creates several documented problems.

First, benchmark selection bias. A company can choose to publish scores only on benchmarks where its model performs well. A model card that includes MMLU (where all frontier models score 92-94%) but omits GPQA-Diamond or SWE-bench Verified is making an implicit claim about relative performance that may not hold. Look for what is absent from a model card, not just what is present.

Second, methodology omissions. Publishing a benchmark score without specifying N-shot setting, CoT flag, temperature, maximum output tokens, and test set version makes the score unverifiable and uncomparable. "We scored 94.2% on MMLU" is not a meaningful claim if you do not specify whether it was 0-shot or 5-shot, with or without CoT, on the full test set or a subset.

Third, independent replications frequently diverge. Allen AI's OLMES framework, which standardises evaluation methodology, has found that independently replicating vendor-reported scores often yields different numbers - sometimes higher (suggesting the vendor used conservative settings), more often lower (suggesting the vendor used favorable settings). The gap is usually 1-5%, but for benchmarks where models cluster tightly, this matters.

The benchmark-literate reader's heuristic: treat vendor-published scores as an upper bound on performance under maximally favorable conditions. Cross-reference with independent leaderboards (HuggingFace, Papers With Code, LMSYS Arena) before accepting the claim.

Independent benchmark sources ->
§ 04

Methodology Gaming

"Best-of-16 with CoT and tool use" is not comparable to "greedy 0-shot." This is the simplest and most routinely ignored problem in benchmark reporting.

Benchmark scores are a function of the model AND the evaluation setup. A model evaluated with 5-shot CoT, best-of-3 sampling, and access to a Python executor will score dramatically higher on coding benchmarks than the same model evaluated 0-shot greedy without tools. Both are valid evaluation setups for different purposes. They are not comparable numbers.

Documented examples: Several frontier models in 2024-2025 published SWE-bench scores using agentic harnesses with extended tool access and multi-step reasoning. These scores are not comparable to single-agent baseline scores from the original paper. GPT-4o's "86% on MMLU" is 5-shot with CoT; a competitor's "85% on MMLU" may be 0-shot without CoT. The 1-point gap is noise; the methodology gap is real.

The benchmark-literate reader's checklist: before comparing two scores, verify that they used the same N-shot, the same CoT setting, the same temperature, the same max tokens, the same test set version, and the same success criterion. If any of these differ, the comparison is invalid.
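The checklist above can be mechanised. A sketch, assuming a hypothetical EvalConfig record (the field set mirrors the checklist; nothing here is a real library's API):

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EvalConfig:
    """Settings that must match before two benchmark scores are comparable."""
    n_shot: int            # 0-shot, 1-shot, 5-shot...
    cot: bool              # chain-of-thought prompting on or off
    temperature: float
    max_tokens: int
    test_set_version: str  # e.g. "MMLU" vs "MMLU-Pro"
    success_criterion: str # e.g. "exact_match", "pass@1"

def mismatches(a: EvalConfig, b: EvalConfig) -> list:
    """Settings that differ between two runs; any non-empty result invalidates the comparison."""
    return [f.name for f in fields(EvalConfig) if getattr(a, f.name) != getattr(b, f.name)]
```

A model card that published this record alongside every score would make the score verifiable; most do not.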

Glossary: N-shot, CoT, pass@k ->
§ 05

The Real-World Gap

A 74.5% SWE-bench Verified score is an impressive engineering achievement. A 74.5% score on "can this agent actually ship production PRs that a senior engineer would approve without modification" would be a different, much harder claim.

Benchmark tasks are designed to be evaluable - they have clear, computable success criteria. Real-world tasks are messier. The 12 Python repositories in SWE-bench Verified do not represent the diversity of real engineering organizations. The issues were selected from a specific time range and do not include the kinds of ambiguous, under-specified feature requests that fill real engineering backlogs. The test suite coverage is not 100%.

The gap extends to all categories of benchmark. GPQA-Diamond measures PhD-level scientific reasoning under controlled conditions (multiple choice, no time pressure, English language, single-domain questions). Real scientific reasoning involves ambiguity, multi-step problem setup, cross-domain synthesis, and communication to non-expert audiences. A model with 78% GPQA-Diamond is impressive; it is not a PhD-equivalent scientist.

Chatbot Arena measures human preference in English-language open-ended chat. A high Arena Elo does not predict whether the model will perform well in a specific vertical (medical, legal, educational), with a specific user population (non-native English speakers, elderly users, users with specific access needs), or in a specific product context (chatbot with a constrained system prompt and a specific persona).

Benchmarks are the best objective measurement tools we have. Read them as claims with specific, narrow scope, not as certificates of general capability.

Agent benchmark limitations ->

What Benchmarks Don't Measure (But Should)

Long-context reliability

Not needle-in-haystack (find this sentence in 128K tokens). Real multi-document reasoning: given 20 research papers, synthesise the evidence on X. Current benchmarks test whether models can find one piece of information in a long context. They do not test whether models can reason coherently across multiple documents while tracking contradictions, source credibility, and evidence weight.

Tool-use under failure

What happens when the API returns a 500 error? What happens when the search tool returns no results? What happens when the code execution environment crashes? Benchmarks test best-case tool use. Production agents face failure modes that benchmarks do not include.

Cost per correct answer

A model that costs 10x more per inference call to achieve 5% higher accuracy is usually worse than the cheaper model for production deployment. Benchmarks report accuracy; cost-adjusted accuracy is rarely reported and almost never comparable across model families.
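Cost-adjusted accuracy is a one-line calculation, which makes its absence from leaderboards notable. A sketch with assumed prices (the per-call figures are hypothetical):

```python
def cost_per_correct(accuracy: float, cost_per_call_usd: float) -> float:
    """Expected spend per correct answer: each correct answer costs 1/accuracy calls."""
    return cost_per_call_usd / accuracy

# Hypothetical comparison:
cheap = cost_per_correct(0.80, 0.01)   # $0.0125 per correct answer
pricey = cost_per_correct(0.85, 0.10)  # ~$0.1176 per correct answer
# 5 extra points of accuracy here costs nearly 10x more per correct answer.
```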

Latency tail (p99, not p50)

A model with 800ms median latency but 12 second p99 latency is unacceptable for real-time applications. Benchmarks measure accuracy; latency tail is almost never reported and varies dramatically across providers, hardware, and load conditions.
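The p50/p99 gap only shows up once you compute percentiles over a real distribution of call latencies. A sketch with synthetic numbers (the 800ms / 12s figures mirror the example above and are assumptions, not measurements):

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]

# Synthetic workload: most calls cluster around 800 ms, ~1.5% stall at 12 s.
random.seed(0)
latencies_ms = [random.gauss(800, 100) for _ in range(985)] + [12_000.0] * 15
p50 = percentile(latencies_ms, 50)  # ~800 ms: the median looks healthy
p99 = percentile(latencies_ms, 99)  # 12,000 ms: what the unlucky user experiences
```

Averaging or reporting only the median hides exactly the behaviour that makes a real-time deployment fail.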

Persona steering under pressure

Can the model maintain a specific persona (medical professional, customer service representative, educational tutor) across a long conversation where users actively try to break the persona? Not tested in any standard benchmark.

Safety in agentic loops

Does the agent ask for confirmation before taking irreversible actions (deleting files, sending emails, making purchases)? Does it stop when it detects an action would cause harm? These behaviors matter enormously for production agentic deployment and are not captured by capability benchmarks.

How to Read a Benchmark Score Responsibly

A 6-item checklist. Before accepting any benchmark claim:

  1. Capture date: When was this score recorded? A model card published in Q1 2025 may cite GPQA-Diamond scores that are now a year stale and have been superseded by newer evaluations.
  2. N-shot setting: 0-shot, 1-shot, or 5-shot? CoT or not? A 5-shot CoT score can be 5-15% higher than a 0-shot no-CoT score on the same benchmark. Verify the setting matches what you care about.
  3. Harness (if agentic): What tools did the agent have access to? Was it the standard harness or a custom setup? An agent with extended tool access and multi-step reasoning is not comparable to a baseline agent.
  4. Official leaderboard vs vendor claim: Does the score appear on the benchmark's official leaderboard (swebench.com, arcprize.org, lmsys.org) or only in the vendor's model card? Official leaderboard scores are independently verified.
  5. Test set version: Is this SWE-bench, SWE-bench Lite, or SWE-bench Verified? MMLU or MMLU-Pro? ARC-AGI or ARC-AGI-2? Version differences make scores incomparable.
  6. Contamination note: For knowledge and coding benchmarks, is there known contamination? HumanEval, MMLU, and MBPP all have documented training-data overlap. A score on a contaminated benchmark is an upper bound, not a ground-truth capability estimate.

The Close

Benchmarks are not bad. The alternative - no standardised measurement, purely subjective impression - is worse. Benchmarks give us shared reference points, comparable history, and falsifiable claims. MMLU genuinely captured something real about knowledge breadth in 2020-2023. HumanEval genuinely captured something real about code generation in 2021-2023. SWE-bench Verified is capturing something real about software engineering agents in 2024-2026.

The critique is not “benchmarks are bad.” The critique is: benchmarks are claims with specific, narrow scope. A model with 94% MMLU is not a model that knows everything. A model with 74.5% SWE-bench Verified is not a model that can do all software engineering. A model with Elo 1401 on Chatbot Arena is not the best model for your specific use case.

Read them as claims. Check the methodology. Cross-reference with independent sources. And build your own evals for the specific task your system actually needs to perform. That last step - custom evaluation on your real workflow - is the only way to know whether a model is good for your use case. Everything on the leaderboards tells you the model is capable in a general sense. Only your own eval tells you whether it is capable for your specific deployment.

Frequently Asked Questions

Can I trust vendor-published benchmark scores?
Treat vendor-published scores as a starting point, not a final answer. Model cards are produced by the same company that built the model. Check for: benchmark selection bias (only publishing where the model performs well), methodology omissions (N-shot and CoT not specified), and compare against independent leaderboards (Papers With Code, HuggingFace Open LLM Leaderboard v2).
What is training-data contamination?
Training-data contamination occurs when test questions from a benchmark appear in a model's training data. The model may then 'solve' benchmark problems by recalling memorised answers rather than reasoning through them. Documented cases: MMLU questions appear verbatim in Common Crawl; HumanEval problems are near-duplicates of LeetCode solutions; SWE-bench issues have solutions in public git history.
Is MMLU still useful in 2026?
MMLU is useful for historical comparisons with pre-2024 models. It is not useful for comparing current frontier models against each other, because all frontier models score 92-94% and the variance is within noise. Use MMLU-Pro for 2026 frontier comparisons. MMLU is also the most contaminated benchmark in the field.
Why do Chatbot Arena and MMLU disagree on model ranking?
They measure different things. MMLU measures knowledge breadth across academic domains. Chatbot Arena measures human preference in open-ended conversation. A model can score high on MMLU while ranking lower on Arena if it is factually accurate but cold or unhelpful in tone.
How do I verify a benchmark score independently?
Check the official benchmark leaderboard first (swebench.com, arcprize.org, lmsys.org). Cross-reference with Papers With Code which aggregates independently reported scores. Check independent researchers (EleutherAI, Allen AI, Epoch AI) who frequently publish independent evaluations. Note that methodology must match for comparison to be valid.