Abstract
What: Empirical finding that benchmark scores swing by up to 76 points from prompt-template choice alone.
Who: Sclar, Choi, Tsvetkov, Suhr (ICLR 2024).
2026 Use: Defensive practice for any cross-paper benchmark comparison.
Paper: arxiv.org/abs/2310.11324
Section IV.iv Methodology | Last verified April 2026

Prompt Template Variance: Same Model, Same Benchmark, 10+ Point Swing

Surface-form choice is a confound that swamps model differences across most published benchmarks.

I

The Finding

The Sclar et al. paper held the model and the benchmark fixed, varied only the prompt template (separator characters, newline patterns, label format), and measured accuracy. In the paper's headline result, accuracy moved by up to 76 points on LLaMA-2-13B in a few-shot setting, from template choice alone. On larger and instruction-tuned models, the swing narrowed but stayed in the 5 to 15 point range. The effect was not curated for shock value; it generalises across model families and benchmark types.
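To make "varied only the prompt template" concrete, here is a minimal sketch of how a set of semantically equivalent templates could be enumerated. The three axes and their values are illustrative assumptions, not the grammar used in the paper, and `build_templates` and `render` are hypothetical helper names.

```python
from itertools import product

# Hypothetical formatting axes, loosely modelled on the kinds of variation
# named above (separator characters, newline patterns, label format).
SEPARATORS = [": ", " - ", ":\n"]
FIELD_BREAKS = ["\n", "\n\n", " "]
LABEL_STYLES = [str.title, str.upper, str.lower]


def build_templates():
    """Return one format string per combination of the axes above."""
    templates = []
    for sep, brk, style in product(SEPARATORS, FIELD_BREAKS, LABEL_STYLES):
        q_label = style("question")
        a_label = style("answer")
        # Doubled braces keep a literal {question} placeholder in the result.
        templates.append(f"{q_label}{sep}{{question}}{brk}{a_label}{sep}")
    return templates


def render(template, question):
    """Fill a template with a concrete question."""
    return template.format(question=question)


templates = build_templates()
example = render(templates[0], "What is 2 + 2?")
```

Every template here asks the same question and differs only in surface form. A sensitivity run would score the same model once per template and compare the resulting accuracies.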

II

Why This Matters For Reading Benchmarks

Two papers reporting MMLU scores for "the same model" can differ by 10 to 15 points even when both are honest. One uses the official MMLU template; the other uses a slightly different format. The number printed in the paper is not a property of the model; it is a property of the model under the specific prompt template the paper authors used. Cross-paper comparisons that ignore template choice are not making the comparison they claim.

III

Defensive Practice

Three things to do. First, when comparing two models on the same benchmark, run both under the same template (do not trust the paper's number). Second, when reporting a single model number, report a mean and variance across several plausible templates, not a single point estimate. Third, prefer benchmarks with constrained scoring (HumanEval code execution, MATH expression match) over benchmarks with surface-form extraction (MMLU letter match) when prompt-template noise would dominate the comparison.
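The first two practices can be sketched in a few lines. This assumes you already have one accuracy figure per template for each model; the template names and the numbers below are placeholders for illustration, not measurements, and `summarise` is a hypothetical helper.

```python
from statistics import mean, stdev


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def summarise(per_template_accuracy):
    """Collapse per-template accuracies into mean, sample std dev, and range."""
    scores = list(per_template_accuracy.values())
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample std dev; needs at least two templates
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


# Placeholder numbers for illustration only, not measurements.
model_a = {"template_1": 0.62, "template_2": 0.55, "template_3": 0.70}
model_b = {"template_1": 0.60, "template_2": 0.58, "template_3": 0.66}

summary_a = summarise(model_a)

# Paired comparison: the gap under the same template, template by template.
paired_gaps = {name: model_a[name] - model_b[name] for name in model_a}
```

In this placeholder data the sign of the gap changes between templates, which is the situation where a single reported number per model would mislead.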

Reader Questions
Q.01 What did Sclar et al. find?
Sclar, Choi, Tsvetkov, Suhr (ICLR 2024) ran a single model against several benchmarks under many semantically equivalent prompt templates: different separator characters, different newline patterns, different surface-form formats. They observed swings of up to 76 accuracy points on the same model and benchmark from prompt-template choice alone. The paper concluded that benchmark scores depend on the prompt template to a degree that invalidates most cross-paper comparisons.
Q.02 Is this a few-shot artefact or a base-model artefact?
Both. The original paper focused on few-shot prompts but the effect persists in zero-shot. Even canonical template choices like 'Answer: A' vs 'A.' vs 'The answer is A' shift accuracy by several points. Larger models are slightly more robust but the effect is still present at frontier scale.
Q.03 How do I defend against template variance?
Two approaches. First, evaluate the same model under several templates and report mean and variance, not a single number. The HELM benchmark from Stanford CRFM does this by default. Second, use a structured prompt format with constrained decoding (logit biasing or JSON-mode), which removes most surface-form variance at the cost of a small accuracy drop.
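The second approach can be illustrated without any particular inference library. The sketch below assumes your stack exposes next-token log-probabilities as a plain dict; that dict shape, the `constrained_choice` helper, and the example values are all hypothetical. It restricts the choice to the allowed answer labels, which is the effect logit biasing aims for.

```python
import math

ALLOWED_LABELS = ("A", "B", "C", "D")


def constrained_choice(token_logprobs, allowed=ALLOWED_LABELS):
    """Pick the allowed label with the highest log-probability.

    `token_logprobs` is a hypothetical dict mapping candidate next tokens
    to log-probabilities. Tokens outside `allowed` are ignored, so filler
    such as "The" can never be selected as the answer.
    """
    scored = {label: token_logprobs.get(label, -math.inf) for label in allowed}
    best = max(scored, key=scored.get)
    if scored[best] == -math.inf:
        raise ValueError("no allowed label present in token_logprobs")
    return best


# Illustrative values only: the unconstrained top token is "The",
# but the best allowed label is "B".
example_logprobs = {"The": -0.4, "A": -2.1, "B": -1.7, " Answer": -3.0}
unconstrained = max(example_logprobs, key=example_logprobs.get)
constrained = constrained_choice(example_logprobs)
```

Because the answer is read from the label distribution rather than parsed out of free text, the surface form of the completion no longer enters the score.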
Q.04 Does this affect agent benchmarks?
Yes, sometimes worse. Agent benchmarks add tool-call templates, system-prompt format, and inter-step message format on top of the basic prompt template. Sclar-style sensitivity analyses on agent benchmarks (Yang et al. 2024 follow-up) find swings of 5 to 15 points from system-prompt choice alone.
Q.05 Which benchmark numbers are most affected?
Multiple-choice benchmarks with letter-answer extraction (MMLU, ARC, BBH) are the most affected because the answer is extracted by string match on a short surface form. Generation benchmarks with strict scoring (HumanEval, MATH) are less affected because the scoring runs the code or evaluates the math, not the surface form. Free-text benchmarks judged by LLM-as-judge sit in between.
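A small sketch shows why string-match extraction is fragile. Both extractors below are hypothetical stand-ins, not the scoring code of any named benchmark; the completions are the surface forms mentioned in Q.02.

```python
import re

# Strict: the completion must be exactly one letter A-D.
STRICT = re.compile(r"^[A-D]$")
# Lenient: tolerate a few common surface forms around the letter.
LENIENT = re.compile(
    r"^(?:the answer is|answer:)?\s*\(?([A-D])\)?\.?$",
    re.IGNORECASE,
)


def extract_strict(completion):
    """Return the letter only if the completion is exactly that letter."""
    text = completion.strip()
    return text if STRICT.match(text) else None


def extract_lenient(completion):
    """Return the letter if it appears in one of the tolerated forms."""
    match = LENIENT.match(completion.strip())
    return match.group(1).upper() if match else None


completions = ["A", "A.", "Answer: A", "The answer is A"]
strict_hits = sum(extract_strict(c) == "A" for c in completions)
lenient_hits = sum(extract_lenient(c) == "A" for c in completions)
```

All four completions carry the same answer, yet the strict extractor credits only one of them while the lenient one credits all four. Execution-based scoring avoids this class of error because it never parses the surface form.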

Sources

  [1] Sclar et al. (2024): arxiv.org/abs/2310.11324
  [2] HELM methodology: crfm.stanford.edu/helm
  [3] Lu et al. (2022), prompt-sensitivity precursor: arxiv.org/abs/2104.08786
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
