Prompt Template Variance: Same Model, Same Benchmark, 10+ Point Swing
Surface-form choice is a confound that can swamp the model differences reported on published benchmarks.
The Finding
Sclar et al. (2024) held the model and the benchmark fixed, varied only the prompt template (separator characters, newline patterns, label format), and measured accuracy. In the paper's headline case, LLaMA-2-13B in a few-shot setting, accuracy on a single task varied between 6.2% and 82.4% from template choice alone, a spread of about 76 points. On larger and instruction-tuned models, the swing narrowed but stayed in the 5 to 15 point range. The effect was not curated for shock value; it generalises across model families and benchmark types.
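To make "varied only the prompt template" concrete, here is a minimal sketch that renders one multiple-choice item under every combination of a few formatting axes. The axes, the names (`SEPARATORS`, `FIELD_GAPS`, `LABEL_STYLES`, `render_prompt`), and the sample question are assumptions made for illustration; they are not the format grammar used in the paper.

```python
from itertools import product

# Hypothetical formatting axes, chosen for illustration only.
SEPARATORS = [": ", " - ", ":\n"]          # text between a field name and its value
FIELD_GAPS = ["\n", "\n\n"]                # whitespace between consecutive fields
LABEL_STYLES = ["{L}.", "({L})", "{l})"]   # how answer options are labelled


def format_label(style: str, index: int) -> str:
    """Render the label for the option at position `index` in the given style."""
    letter = "ABCDEFGH"[index]
    return style.format(L=letter, l=letter.lower())


def render_prompt(question, options, separator, gap, label_style) -> str:
    """Render the same item under one specific surface form."""
    lines = [f"Question{separator}{question}"]
    lines += [f"{format_label(label_style, i)} {opt}" for i, opt in enumerate(options)]
    # Strip trailing whitespace so the prompt ends right after the answer cue.
    lines.append(f"Answer{separator}".rstrip())
    return gap.join(lines)


# Every combination of the axes is one "template"; the content never changes.
TEMPLATES = list(product(SEPARATORS, FIELD_GAPS, LABEL_STYLES))

question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Earth", "Mars"]
prompts = [render_prompt(question, options, *t) for t in TEMPLATES]

print(len(prompts))   # number of distinct surface forms for one item
print(prompts[0])
```

Even three small axes multiply into 18 surface forms for the same question, which is why a single reported score covers only one point in a large space of formats.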
Why This Matters For Reading Benchmarks
Two papers reporting MMLU scores for "the same model" can differ by 10 to 15 points even when both are honest. One uses the official MMLU template; the other uses a slightly different format. The number printed in the paper is not a property of the model; it is a property of the model under the specific prompt template the paper authors used. Cross-paper comparisons that ignore template choice are not making the comparison they claim.
Defensive Practice
Three things to do. First, when comparing two models on the same benchmark, run both under the same template (do not trust the paper's number). Second, when reporting a single model number, report a mean and variance across several plausible templates, not a single point estimate. Third, prefer benchmarks with constrained scoring (HumanEval code execution, MATH expression match) over benchmarks with surface-form extraction (MMLU letter match) when prompt-template noise would dominate the comparison.
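The first two practices can be sketched in a few lines. This assumes you already have per-template accuracies for each model from your own evaluation runs. The template names and all numbers below are made-up placeholders, not results from any paper, and `summarize` and `paired_gaps` are hypothetical helper names.

```python
from statistics import mean, stdev


def summarize(acc_by_template: dict[str, float]) -> dict[str, float]:
    """Report a distribution over templates instead of a single point estimate."""
    scores = list(acc_by_template.values())
    return {
        "mean": mean(scores),
        "std": stdev(scores) if len(scores) > 1 else 0.0,  # sample std deviation
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


def paired_gaps(model_a: dict[str, float], model_b: dict[str, float]) -> dict[str, float]:
    """Compare two models only on templates that both were actually run under."""
    shared = sorted(model_a.keys() & model_b.keys())
    return {t: model_a[t] - model_b[t] for t in shared}


# Placeholder accuracies (illustrative only, not measured results).
model_a = {"colon_newline": 0.62, "dash_double_newline": 0.55, "paren_labels": 0.68}
model_b = {"colon_newline": 0.60, "dash_double_newline": 0.58, "paren_labels": 0.61}

summary_a = summarize(model_a)
gaps = paired_gaps(model_a, model_b)

print(f"model A: {summary_a['mean']:.3f} +/- {summary_a['std']:.3f} "
      f"(spread {summary_a['spread']:.3f})")
for template, gap in gaps.items():
    print(f"{template}: A - B = {gap:+.3f}")
```

If the sign of the paired gap flips between templates, as it does for these placeholder numbers, the ranking of the two models is not established by the benchmark at that level of precision.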
Q.01 What did Sclar et al. find?
Q.02 Is this a few-shot artefact or a base-model artefact?
Q.03 How do I defend against template variance?
Q.04 Does this affect agent benchmarks?
Q.05 Which benchmark numbers are most affected?
Sources
- [1] Sclar et al. (2024): arxiv.org/abs/2310.11324
- [2] HELM methodology: crfm.stanford.edu/helm
- [3] Lu et al. (2022), a precursor on few-shot prompt order sensitivity: arxiv.org/abs/2104.08786