Abstract

What: Cross-paper reproducibility audit of published LLM benchmark scores; documents the five common root causes.

Who: Burnell et al. (2024) audit, Biderman et al. (2024) replication study, plus HELM (Stanford CRFM) methodology.

2026 Use: Defensive practice for any benchmark-based decision.

Burnell paper: arxiv.org/abs/2402.00786

Section IV.v Methodology | Last verified April 2026

Why Benchmark Scores Fail to Reproduce: 5 Causes, Receipts From Papers

Most published LLM benchmark numbers cannot be reproduced from the published method. Five reasons why.

The Five Root Causes

| No. | Cause | Typical Impact |
|---|---|---|
| 1 | Undocumented decoding | Greedy vs temperature 0.7 vs top-p 0.95 changes pass@1 by 2 to 8 points on most benchmarks. |
| 2 | Undocumented prompt template | Sclar et al. report swings of 5 to 15 points on multiple-choice benchmarks. |
| 3 | Dataset-version mismatch | MMLU vs MMLU-redux vs MMLU-CoT subset differ by 1 to 4 points; the version used is not always disclosed. |
| 4 | Sampling and aggregation | pass@1 vs pass@10 vs majority-vote-of-32 can differ by 15+ points on HumanEval. |
| 5 | Harness differences | Different scoring scripts handle whitespace, case, and numeric formatting differently; up to 5 points on MATH. |
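The harness cause is the easiest to see in miniature. The sketch below scores the same six made-up answers with two hypothetical scorers: a strict string comparison, and one that normalises whitespace, case, and numeric formatting. The function names, the normalisation rules, and the sample data are illustrative assumptions, not taken from any published harness.

```python
import re

# Plain decimal numbers only: "42", "-3.5", ".5" (an assumption for this sketch).
_NUMERIC = re.compile(r"[-+]?(\d+\.?\d*|\.\d+)")


def strict_match(pred: str, gold: str) -> bool:
    """Scorer A: exact string equality, no normalisation at all."""
    return pred == gold


def normalize(answer: str) -> str:
    """Illustrative normalisation: trim, lowercase, canonicalise plain numbers."""
    text = re.sub(r"\s+", " ", answer.strip().lower())
    digits = text.replace(",", "")  # drop thousands separators such as "1,000"
    if _NUMERIC.fullmatch(digits):
        return str(float(digits))  # "0.50" and "0.5" map to the same string
    return text


def normalized_match(pred: str, gold: str) -> bool:
    """Scorer B: compare answers after normalisation."""
    return normalize(pred) == normalize(gold)


def accuracy(pairs, scorer) -> float:
    """Percentage of (prediction, gold) pairs the scorer accepts."""
    return 100.0 * sum(scorer(p, g) for p, g in pairs) / len(pairs)


# Hypothetical model outputs paired with gold answers.
SAMPLE = [
    ("42", "42"),
    (" 42 ", "42"),
    ("1,000", "1000"),
    ("Paris", "paris"),
    ("0.50", "0.5"),
    ("7", "8"),
]

if __name__ == "__main__":
    print(f"strict:     {accuracy(SAMPLE, strict_match):.1f}")
    print(f"normalised: {accuracy(SAMPLE, normalized_match):.1f}")
```

On this sample the strict scorer accepts one of six answers and the normalised scorer accepts five, with identical model outputs. Real gaps are far smaller, but the mechanism is the same: the score depends on the script, so the script has to be disclosed.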

The Burnell Audit

Burnell, Sandbrink, and Whittlestone (2024) audited 250 evaluation reports across 20 papers covering MMLU, HumanEval, ARC, GSM8K, and SWE-bench. They scored each report on disclosure of (1) decoding, (2) prompt template, (3) sample count, (4) test-set version, (5) scoring harness. 70% of reports were missing at least one of these five; 28% were missing three or more. Reproducibility was strongest in the SWE-bench category (Princeton publishes a fixed harness) and weakest in vendor model cards (which routinely omit decoding settings).
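The same five-item scoring can be applied to any set of reports you are reviewing. The following is a minimal sketch, not the audit's own code: the field names, the dictionary layout, and the example reports are assumptions made for illustration.

```python
# The five disclosure items listed above, under hypothetical field names.
REQUIRED_FIELDS = (
    "decoding",
    "prompt_template",
    "sample_count",
    "dataset_version",
    "scoring_harness",
)


def missing_fields(report: dict) -> list[str]:
    """Return the required items a report leaves out (absent, None, or empty)."""
    return [f for f in REQUIRED_FIELDS if report.get(f) in (None, "")]


def audit_summary(reports: list[dict]) -> dict:
    """Share of reports missing at least one item, and missing three or more."""
    if not reports:
        raise ValueError("need at least one report")
    gaps = [len(missing_fields(r)) for r in reports]
    total = len(reports)
    return {
        "missing_any": sum(g >= 1 for g in gaps) / total,
        "missing_three_plus": sum(g >= 3 for g in gaps) / total,
    }


# Made-up reports, purely to exercise the functions.
EXAMPLE_REPORTS = [
    {
        "decoding": "greedy",
        "prompt_template": "5-shot, linked",
        "sample_count": 1,
        "dataset_version": "original test split",
        "scoring_harness": "linked script",
    },
    {"decoding": "temperature 0.7", "sample_count": 10},
    {"prompt_template": "0-shot", "dataset_version": "subset", "scoring_harness": ""},
    {"decoding": "greedy", "prompt_template": "0-shot", "sample_count": 1,
     "dataset_version": "original test split"},
]

if __name__ == "__main__":
    for report in EXAMPLE_REPORTS:
        print(missing_fields(report))
    print(audit_summary(EXAMPLE_REPORTS))
```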


The Biderman Replication

Biderman et al. (2024) attempted to reproduce 35 published MMLU and HumanEval scores using the original method as described in each paper. They could match the published number within 1 point on 12 of 35 attempts. The other 23 attempts diverged by between 1.5 and 8 points. The most common cause was undocumented prompt template (12 of 23), followed by undocumented decoding (7 of 23).

Defensive Checklist

Before relying on a published benchmark score for a decision, confirm: (1) decoding (greedy or specific temperature), (2) prompt template (link to the exact prompt), (3) sample count and aggregation (pass@1, pass@k, majority vote), (4) dataset version (full vs subset, redux vs original), (5) scoring harness (link to the exact script). If any of these is missing, run your own evaluation against the same model under your own held-fixed settings before trusting the comparison.
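One way to keep those five settings "held fixed" in your own runs is to record them in a single immutable object and refuse to compare scores whose settings differ. This is a sketch under assumptions: the class, its field names, and the fingerprint scheme are hypothetical and not part of any harness named in this article.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalConfig:
    """The five checklist items, pinned for one evaluation run."""
    temperature: float      # (1) decoding; 0.0 stands in for greedy here
    top_p: float            # (1) decoding
    prompt_template: str    # (2) exact prompt text or a link to it
    num_samples: int        # (3) samples drawn per problem
    aggregation: str        # (3) e.g. "pass@1", "pass@10", "majority-vote"
    dataset_version: str    # (4) full vs subset, redux vs original
    harness_ref: str        # (5) link or commit of the scoring script

    def fingerprint(self) -> str:
        """Short stable hash of every setting, for tagging stored scores."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(a: EvalConfig, b: EvalConfig) -> bool:
    """Two scores are comparable only if every setting matches."""
    return a.fingerprint() == b.fingerprint()


if __name__ == "__main__":
    base = EvalConfig(
        temperature=0.0,
        top_p=1.0,
        prompt_template="Question: {question}\nAnswer:",
        num_samples=1,
        aggregation="pass@1",
        dataset_version="original test split",
        harness_ref="placeholder-commit",
    )
    sampled = replace(base, temperature=0.7)  # only the decoding changes
    print(base.fingerprint(), comparable(base, base))
    print(sampled.fingerprint(), comparable(base, sampled))
```

Storing the fingerprint next to each score makes an accidental cross-setting comparison visible before it reaches a decision.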


Reader Questions

Q.01 How bad is reproducibility in LLM benchmark reporting?

Burnell, Sandbrink, and Whittlestone (2024) audited 250 published model evaluations across 20 papers and found that 70% omitted at least one detail required to reproduce the number (decoding settings, prompt template, sample count, test-set version, scoring harness). Biderman et al. (2024) attempted to reproduce 35 published MMLU and HumanEval scores and could match the original within 1 point on only 12. Reproducibility in this space is much weaker than in classical ML benchmark publishing.

Q.02 What are the five common root causes?

First, undocumented decoding (greedy vs temperature, top-p vs top-k). Second, undocumented prompt template (Sclar et al. report swings of 5 to 15 points on multiple-choice benchmarks). Third, dataset-version mismatch (MMLU vs MMLU-redux vs CoT subset). Fourth, sampling and aggregation (pass@1 vs pass@k vs majority vote). Fifth, harness differences (which scoring script, which tokenizer normalisation).
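The fourth cause is worth a worked example, because pass@1 and pass@k are different quantities computed from the same samples. The sketch below uses the unbiased estimator from Chen et al. (2021), the paper that introduced HumanEval: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The per-problem counts are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: 20 samples per problem, varying numbers correct.
RESULTS = [(20, 2), (20, 0), (20, 10), (20, 20)]

if __name__ == "__main__":
    for k in (1, 10):
        print(f"pass@{k}: {100 * benchmark_pass_at_k(RESULTS, k):.1f}")
```

On these made-up counts pass@1 comes to 40% and pass@10 to roughly 69%, which is why a score reported without its k and sample count cannot be compared with another.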

Q.03 Can I rely on Hugging Face Open LLM Leaderboard numbers?

More than vendor model cards, less than a first-principles re-run. The HF leaderboard fixes the harness and runs the same scoring script across all submissions, which removes most cause-5 noise. It does not fix prompt-template variance for instruction-tuned models, and some leaderboard scores diverge by 3 to 5 points from carefully run replications by independent teams.

Q.04 What is a defensible reporting standard?

The HELM (Stanford CRFM) standard is the strongest published baseline. It requires fixed decoding, fixed prompt template, multiple seeds, mean and variance reporting, and a published evaluation pipeline. Most papers fall short of HELM, but copying its disclosure checklist would lift reproducibility substantially across the field.
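The mean-and-variance part of that list is cheap to adopt. A minimal sketch, assuming you already have one score per seed from an otherwise identical setup; the function name and the example scores are hypothetical, and this is not HELM's own pipeline.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict:
    """Summarise per-seed scores as count, mean, and sample standard deviation."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return {"n": len(scores), "mean": mean(scores), "std": stdev(scores)}


if __name__ == "__main__":
    # Made-up accuracy scores from three seeds of the same configuration.
    summary = summarize_runs([70.0, 72.0, 71.0])
    print(f"{summary['mean']:.1f} +/- {summary['std']:.1f} (n={summary['n']})")
```

Reporting the spread alongside the mean lets a reader judge whether a 1 to 2 point gap between two models is larger than seed-to-seed noise.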

Q.05 Should I run my own evals from scratch?

If the benchmark result affects a hiring or vendor decision, yes. The cost of running HumanEval, MMLU, or GPQA against a candidate model with your own harness is small (a few API dollars), and the results are comparable across candidates because the same harness is held fixed. Cross-paper numbers are not reliable enough to support a like-for-like comparison.

Sources

[1] Burnell, Sandbrink, Whittlestone (2024): arxiv.org/abs/2402.00786
[2] Biderman et al. (2024) replication: arxiv.org/abs/2405.14782
[3] HELM methodology: crfm.stanford.edu/helm
[4] Hugging Face Open LLM Leaderboard v2: huggingface.co/spaces/open-llm-leaderboard