Benchmark Contamination: How It Happens, Where It Bites, How to Read Around It
The contamination problem is older than the LLM benchmark race, but the LLM era has made it pervasive: MMLU questions appear verbatim in Common Crawl, HumanEval problems echo LeetCode-style solutions reproduced across the web, and MBPP problems have circulated in public discussion for years. The structural fixes are now well understood (rolling cutoff, private test set, expert-curated rotation), and the right way to read any benchmark score is to cross-check it against a contamination-resistant equivalent.
What contamination is and why it matters
Benchmark contamination is the presence of test-set items, or near-duplicates of them, in a model's pre-training corpus. When test items leak into training, the model can score correctly through memorisation rather than through the underlying capability the benchmark is supposed to measure. The result is benchmark scores that overstate genuine capability and that mislead evaluators who treat the scores as honest signals.
The problem matters because the entire enterprise of benchmark-based model comparison depends on the assumption that benchmarks measure capability. A benchmark that has been contaminated does not measure capability; it measures memorisation. Two models can have very different scores on a contaminated benchmark not because they have different capabilities but because they were trained on different corpora that contained different fractions of the test items. The score difference is benchmark noise, not capability signal.
The contamination problem is older than the LLM benchmark race. Image classification benchmarks faced similar issues in the 2010s; specific test images appeared in scraped training corpora, inflating scores. The LLM era has made the problem worse because pre-training corpora are larger and more diverse, and benchmarks are released into the same web that pre-training data is scraped from. The structural problem is essentially unavoidable for any benchmark with publicly available test items; the only fix is to design the benchmark to be contamination-resistant from the start.
Three pathways for contamination
Contamination happens through three distinct pathways, each with different detectability and different remediation strategies.
First, direct verbatim leak. The test items (and sometimes their answers) appear in pre-training data because the benchmark dataset was published years before model training. MMLU is the canonical example: the test questions were released in 2020, and every major LLM since 2022 has been trained on web crawls that include the MMLU paper, the test set, and various downstream discussions. Direct verbatim leak is the easiest pathway to detect (search the pre-training corpus for the literal test text) and the hardest to remediate without retraining the model from scratch.
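If you do have corpus access, the literal check is straightforward. Below is a minimal sketch assuming plain-text corpus shards on disk and a list of test-item strings; the paths, normalisation choices, and file layout are illustrative assumptions, not any vendor's actual pipeline.

```python
# Minimal sketch of a direct-leak check: scan pre-training corpus shards for
# verbatim (case/whitespace-normalised) copies of benchmark test items.
# File paths and item format are assumptions for illustration.
import re
from pathlib import Path

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_verbatim_leaks(test_items: list[str], corpus_files: list[Path]) -> dict[str, list[str]]:
    """Return, for each test item, the corpus files that contain it verbatim."""
    needles = {item: normalise(item) for item in test_items}
    hits: dict[str, list[str]] = {item: [] for item in test_items}
    for path in corpus_files:
        haystack = normalise(path.read_text(errors="ignore"))
        for item, needle in needles.items():
            if needle in haystack:
                hits[item].append(str(path))
    return hits

# Example usage with hypothetical paths:
# leaks = find_verbatim_leaks(test_questions, sorted(Path("corpus_shards/").glob("*.txt")))
# contaminated = {q: files for q, files in leaks.items() if files}
```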
Second, near-duplicate leak. Paraphrased or stylistically variant versions of test items appear in training data. HumanEval problems are largely paraphrases of LeetCode-style algorithmic exercises that appear in countless coding tutorials, GitHub repos, and Stack Overflow threads. The model has not memorised the exact HumanEval prompt, but it has seen many similar problems and learned the patterns. Near-duplicate leak is harder to detect and harder to remediate; the underlying patterns may be necessary training signal even if the specific items are problematic.
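Near-duplicate detection needs a fuzzier comparison than substring search. A common approach is token n-gram overlap; the sketch below uses Jaccard similarity over 8-grams, with the n-gram size and the 0.6 threshold as illustrative assumptions rather than established standards.

```python
# Minimal sketch of a near-duplicate check: compare token n-gram overlap (Jaccard
# similarity) between a test item and candidate training documents.
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Token n-grams over a lowercased, punctuation-stripped version of the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two n-gram sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def near_duplicates(test_item: str, documents: list[str], threshold: float = 0.6) -> list[int]:
    """Indices of documents whose n-gram overlap with the test item exceeds the threshold."""
    item_grams = ngrams(test_item)
    return [i for i, doc in enumerate(documents) if jaccard(item_grams, ngrams(doc)) >= threshold]
```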
Third, indirect leak. Training data includes discussions of the benchmark, its questions, or its solutions, even if the literal text differs. A model trained on web data including academic discussions of MMLU, HumanEval evaluation methodology, or comparison tables of benchmark scores has been exposed to contextual signal about the benchmark even without seeing the test items themselves. This is the hardest contamination pathway to quantify; it shades into legitimate training on the underlying field knowledge that the benchmark is testing.
Documented contaminated benchmarks
Contamination has been documented for several widely cited benchmarks. The table below summarises the cases where the evidence is strongest and the impact on score interpretation is clearest.

| Benchmark | Contamination pathway | Contamination-resistant alternative |
| --- | --- | --- |
| MMLU | Direct verbatim leak: test set public since 2020, present in web crawls along with the paper and downstream discussion | MMLU-Pro, HLE |
| HumanEval | Near-duplicate leak: LeetCode-style exercises reproduced across tutorials, GitHub repos, and Stack Overflow | LiveCodeBench |
| SWE-bench | Publicly available test items absorbed into training corpora | SWE-bench Verified (structural defences added to the original) |
The pattern: most pre-2024 benchmarks with publicly available test items have meaningful contamination concerns by 2026. The fix has been to either replace the benchmark with a contamination-resistant successor (MMLU to MMLU-Pro and HLE; HumanEval to LiveCodeBench) or to add structural defences to the original (SWE-bench to SWE-bench Verified). The pattern will continue: any current benchmark with publicly available test items will face contamination concerns within 18-24 months as the test items propagate through training corpora.
Six approaches to contamination resistance
Six structural approaches have emerged for building contamination-resistant benchmarks. Each has trade-offs; the best contemporary benchmarks combine two or three.
The most robust approaches are structural: rolling cutoff (LiveCodeBench) and private test set (GAIA). These eliminate contamination by construction rather than mitigating it after the fact. The expert-curated rotation approach (HLE, GPQA-Diamond) is also strong but requires expensive curation work that limits how often the test set can be refreshed.
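The rolling-cutoff idea is simple enough to express directly: score a model only on test items published after its training-data cutoff, so those items cannot have been in the pre-training corpus. The sketch below illustrates the filter; the TestItem fields and the example cutoff date are assumptions for illustration, not LiveCodeBench's actual schema.

```python
# Minimal sketch of a rolling-cutoff filter: keep only test items that post-date
# the model's training-data cutoff, so they cannot appear in its corpus.
from dataclasses import dataclass
from datetime import date

@dataclass
class TestItem:
    problem_id: str
    published: date  # date the problem first appeared publicly

def contamination_free_subset(items: list[TestItem], training_cutoff: date) -> list[TestItem]:
    """Return the items published strictly after the model's training-data cutoff."""
    return [item for item in items if item.published > training_cutoff]

# Example usage with a hypothetical cutoff:
# eval_items = contamination_free_subset(all_items, training_cutoff=date(2025, 3, 1))
```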
Detecting contamination indirectly
Most evaluators do not have direct access to a model's training corpus and cannot run the literal leak-detection check (searching the corpus for test items). Three indirect detection signals are available.
First, score gaps between conceptually equivalent benchmarks. If a model scores much higher on one benchmark than on a contamination-resistant equivalent, contamination is a candidate explanation. The HumanEval (95-99 percent) versus LiveCodeBench (72-78 percent) gap is the clearest current example: same underlying capability (code generation from spec), very different scores, and the difference is largely explained by contamination on HumanEval.
Second, score patterns across model generations. A benchmark where every new model generation shows large score gains, while conceptually equivalent benchmarks show smaller gains, is a contamination warning sign. The benchmark may be measuring how well each new pre-training corpus has absorbed the test items rather than how much each new model has improved capability.
Third, the fine-tune-on-test probe. A model fine-tuned on data that includes the benchmark's test items will score much higher on the benchmark than the same base model without the fine-tuning. The size of this gap is a contamination signal: if fine-tuning on test items produces a 20-point lift, contamination is a substantial concern; if it produces only a 1-2 point lift, the benchmark is more contamination-resistant than expected. Some research papers run this check explicitly as a contamination probe.
How to read benchmark scores in light of contamination
Three practical rules for reading benchmark scores honestly. First, prefer contamination-resistant benchmarks for headline claims. Quote MMLU-Pro or HLE rather than MMLU, LiveCodeBench rather than HumanEval, GAIA rather than older assistant benchmarks. The replacement pattern is well established for the major capability axes.
Second, cross-check between benchmarks. Quote two benchmarks measuring overlapping capability and check whether they tell the same story. Same-story scores are likely measuring genuine capability; widely divergent scores are a warning that one of the benchmarks may be contaminated or measuring a different slice than you think.
Third, treat contamination as one source of noise among several. Contamination is not the only reason benchmark scores can mislead: harness sensitivity, evaluation methodology, sample-size noise, and score-format choices all contribute. The honest 2026 approach to model evaluation is to use multiple benchmarks, multiple harnesses where possible, and multiple model versions to triangulate capability rather than to over-rely on any single score.
The future of contamination resistance
The contamination problem will not be solved; it will be managed. Every benchmark with publicly available test items eventually contaminates as the items propagate through training corpora. The structural fixes (rolling cutoff, private test set, expert-curated rotation) extend the useful life of a benchmark but do not make it permanent. The right expectation is that benchmarks have a 2-5 year useful life followed by gradual contamination, and that the field will continue to release successor benchmarks every 18-24 months to replace the worst-contaminated ones.
The 2026 generation of benchmarks (MMLU-Pro, GPQA-Diamond, HLE, LiveCodeBench, SWE-bench Verified, GAIA, OSWorld) will themselves age. Expect MMLU-Pro to lose discriminating power by 2027-2028; expect GPQA-Diamond to be replaced by GPQA-Plat or equivalent in the same window. The pattern is not failure; it is the field iterating toward better measurements as old measurements become unreliable. The right discipline is to track the evolution and to update the benchmarks you cite as their successors emerge.
Sources
- [1] Sainz, O. et al. (2023). NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. arXiv:2310.18018. Survey of contamination across benchmarks.
- [2] Golchin, S. and Surdeanu, M. (2023). Time Travel in LLMs: Tracing Data Contamination in Large Language Models. arXiv:2308.08493. A detection methodology paper.
- [3] Jain, N. et al. (2024). LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv:2403.07974. The rolling-cutoff design rationale.
- [4] Mialon, G. et al. (2023). GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983. Private test-set design rationale.