Methodology: How We Verify Benchmark Scores
Per-benchmark source URLs, the six-question rubric we apply to every score cell, what is in scope, what is out of scope, and the corrections process. Read this before quoting any number from this site.
Source URLs per benchmark
Every benchmark cited on this site traces back to its official leaderboard or repository and its source paper. Refresh cadence is monthly for cluster-head benchmarks (the ones the site has the strongest visibility on in search) and quarterly for the rest. Saturated and reference-only benchmarks are quoted only for historical comparison.
| Benchmark | Leaderboard | Source | Cadence | Risk profile | Notes |
|---|---|---|---|---|---|
| MMLU | paperswithcode.com/dataset/mmlu | Hendrycks et al., 2020 | Reference (saturated) | High (saturated + contaminated) | 57-domain multiple-choice knowledge test. Saturated across frontier models in 2026 at 92 to 94 percent. Documented contamination: test questions appear in Common Crawl. Quoted on this site for historical comparison only. |
| MMLU-Pro | huggingface.co/spaces/TIGER-Lab/MMLU-Pro | Wang et al., 2024 | Quarterly | Medium | 10-choice extension of MMLU, CoT required, harder distractors. Used as the 2026 frontier knowledge benchmark. Less saturated than MMLU, but contamination risk is inherited from overlap with MMLU sources. |
| GPQA | github.com/idavidrein/gpqa | Rein et al., 2023 | Quarterly | Low (expert-written, hard to find online) | Graduate-level science questions written by domain PhDs. Quoted as the parent set; the Diamond subset is the comparison cell on this site. |
| GPQA-Diamond | github.com/idavidrein/gpqa | Rein et al., 2023 | Monthly | Low | 198-question subset, the hardest tier of GPQA. The canonical 2026 reasoning benchmark for frontier comparison. Frontier SOTA in the 71 to 78 percent range across the Claude, GPT and Gemini families as of April 2026. |
| ARC-AGI | arcprize.org | Chollet, 2019 | Quarterly | Low (private holdout) | Visual abstraction grid puzzles. The original benchmark is largely public; the Prize set has a held-out private subset used to verify claimed scores. |
| ARC-AGI-2 | arcprize.org | Chollet, 2025 | Quarterly | Low (private holdout) | Harder successor designed to push beyond saturation of the original ARC-AGI. Private test set; scores verified by the ARC Prize team. |
| HumanEval | paperswithcode.com/sota/code-generation-on-humaneval | Chen et al., 2021 (OpenAI Codex paper) | Reference (saturated) | High (saturated + contaminated) | 164 Python function-completion problems. Saturated; near-duplicates of LeetCode and Stack Overflow content appear in pre-training corpora. Quoted on this site for historical context only. |
| MBPP | paperswithcode.com/sota/code-generation-on-mbpp | Austin et al., 2021 | Reference (saturated) | High (saturated + contaminated) | 974 entry-level Python tasks. Saturated. Same contamination dynamic as HumanEval. |
| LiveCodeBench | livecodebench.github.io | Jain et al., 2024 | Monthly | Low | Anti-contamination coding benchmark: only includes problems released after the evaluated model's training cutoff. Active rotation. The current 2026 frontier coding benchmark for cross-model comparison. |
| SWE-bench Verified | www.swebench.com | Jimenez et al., 2023; Verified subset, 2024 | Monthly | Medium (public git history; the Verified subset reduces but does not eliminate the risk) | 500 human-verified GitHub issues from 12 Python repositories. The canonical 2026 coding-agent benchmark. Frontier SOTA in the low to mid 70s as of April 2026. Harness-dependent; the standard harness is the official SWE-agent. |
| AgentBench | github.com/THUDM/AgentBench | Liu et al., 2023 | Quarterly | Low to medium | 1,091 tasks across eight environments (operating system, database, knowledge graph, card game, household, web shopping, web browsing, lateral thinking). Multi-environment agentic evaluation. |
| WebArena | webarena.dev | Zhou et al., 2024 | Quarterly | Low (sandboxed apps) | 812 tasks across a self-hosted GitLab, a Reddit-style forum, an e-commerce site, and a content management system. Sandboxed: the apps run on infrastructure controlled by the benchmark team. |
| OSWorld | os-world.github.io | Xie et al., 2024 | Quarterly | Low (VM-isolated) | 369 computer-use tasks across LibreOffice, GIMP, Chrome, and VS Code in a virtual machine. Evaluates desktop control: open files, edit spreadsheets, configure settings. SOTA around 38 percent as of April 2026. |
| Terminal-Bench | github.com/stanford-crfm/terminal-bench | Stanford CRFM, 2024 | Quarterly | Low (Docker-isolated) | Around 300 shell and DevOps tasks. Docker-isolated; the success criterion is file-system state. |
| Tau-Bench | github.com/sierra-research/tau-bench | Sierra Research (Yao et al.), 2024 | Quarterly | Medium | Around 200 tool-augmented natural-language tasks (retail and airline customer service). Tests agent reliability on multi-turn tool use with realistic ambiguity. |
| Chatbot Arena | lmsys.org | Chiang et al., 2024 (LMSYS) | Continuous (live Elo) | Low (live preference data, not a fixed test set) | Crowd-sourced pairwise preference: users compare two anonymised model outputs side by side. Measures English-language open-ended preference, not narrow capability. A distinct ranking signal from MMLU or SWE-bench. |
| HELM | crfm.stanford.edu/helm/ | Stanford CRFM, Liang et al., 2022 | Quarterly | Low (aggregator) | Holistic Evaluation of Language Models. Aggregates many sub-benchmarks under a single methodology. Useful as a meta-reference when comparing models across a wider suite than this site indexes individually. |
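The source table above is effectively typed data. A minimal sketch of the row shape, assuming a TypeScript codebase like the one mentioned later in this page; the type and field names are illustrative, not the site's actual schema:

```typescript
// Hypothetical row shape for the source table above.
// Field names are illustrative assumptions, not the site's real schema.
type Cadence = "monthly" | "quarterly" | "continuous" | "reference";
type Risk = "low" | "low-medium" | "medium" | "high";

interface BenchmarkSource {
  name: string;
  leaderboard: string; // official leaderboard or repository URL
  source: string;      // source paper citation
  cadence: Cadence;
  risk: Risk;
  saturated: boolean;  // if true, quoted for historical comparison only
}

const gpqaDiamond: BenchmarkSource = {
  name: "GPQA-Diamond",
  leaderboard: "github.com/idavidrein/gpqa",
  source: "Rein et al., 2023",
  cadence: "monthly",
  risk: "low",
  saturated: false,
};
```

Holding the table as a typed constant means the cadence and risk vocabularies are checked at compile time rather than drifting per page.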
The six-question rubric
Every benchmark cell on this site is run through these six questions before publication. The rubric is also surfaced on the home page so readers can apply it to scores quoted elsewhere on the web.
1. What is the capture date? Frontier benchmarks move weekly. A 2024 score is historical, not current. Every cell on this site is dated to the month it was captured.
2. What N-shot, what CoT, what temperature? 0-shot greedy, 5-shot CoT, and best-of-16 with extended tool use are different tests. Quote the setting alongside the score or do not quote it. Cells without a setting are flagged as methodology-incomplete.
3. Which test set version? MMLU and MMLU-Pro are different tests with similar names. SWE-bench, SWE-bench Lite and SWE-bench Verified are different tests. GPQA and GPQA-Diamond are different tests. The names are easy to confuse.
4. Vendor model card or independent leaderboard? First-party model cards report the setting that flatters the vendor's model. The HuggingFace Open LLM Leaderboard v2 and Papers With Code are independent. Both have failure modes. Where vendor and independent disagree, the gap is shown and the independent source is preferred.
5. Public test set or held-out? Public test sets carry contamination risk. MMLU questions appear in Common Crawl; HumanEval problems are near-duplicates of LeetCode solutions; SWE-bench issues have solutions in public git history. The ARC Prize and LiveCodeBench use held-out problems to control for this.
6. What harness, what tools, what scaffold? Agentic scores are harness-dependent. Best-of-16 with extended tool access is not the same number as greedy single-shot. The standard SWE-bench harness is the official SWE-agent; deviations from that are flagged on cells citing custom scaffolds.
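Applied mechanically, the six questions reduce to a completeness check over a score cell. A minimal sketch under that reading; the `ScoreCell` shape and gap labels are hypothetical, not the site's actual data model:

```typescript
// Hypothetical score-cell record; each field mirrors one rubric question.
interface ScoreCell {
  capturedMonth?: string;                  // Q1: e.g. "2026-04"
  evalSetting?: string;                    // Q2: e.g. "0-shot greedy"
  testSetVersion?: string;                 // Q3: e.g. "SWE-bench Verified"
  reporter?: "vendor" | "independent";     // Q4
  testSetAccess?: "public" | "held-out";   // Q5
  harness?: string;                        // Q6: e.g. "official SWE-agent"
}

// Returns the rubric questions a cell fails to answer; a non-empty
// result means the cell is methodology-incomplete.
function rubricGaps(cell: ScoreCell): string[] {
  const gaps: string[] = [];
  if (!cell.capturedMonth) gaps.push("capture date");
  if (!cell.evalSetting) gaps.push("N-shot/CoT/temperature");
  if (!cell.testSetVersion) gaps.push("test set version");
  if (!cell.reporter) gaps.push("vendor vs independent");
  if (!cell.testSetAccess) gaps.push("public vs held-out");
  if (!cell.harness) gaps.push("harness/scaffold");
  return gaps;
}
```

A cell carrying only a capture date, for example, would come back with five gaps and be held from publication until they are filled.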
In scope
- Public benchmark SOTA scores: knowledge, coding, reasoning, agentic, multimodal, preference.
- Methodology breakdown per benchmark: N-shot, CoT, temperature, max tokens, test set version, harness.
- Saturation flags on benchmarks where frontier models cluster above 90 percent.
- Contamination notes on benchmarks where test questions are known to appear in pre-training corpora.
- Vendor-versus-independent leaderboard divergence where it exceeds the noise floor.
- Practitioner stack: LLM-as-judge, RAG eval, custom eval design, production monitoring, eight evaluation platforms.
Out of scope
- Proprietary in-house evals at vendor labs that are not publicly replicable.
- Pre-publication frontier model claims that have not landed on an official leaderboard or peer-reviewed paper.
- Enterprise eval contract terms, SLA performance numbers, or vendor-specific support latency.
- Fine-tuning workflows beyond the evaluation surface; this site does not cover model training.
- Agent runtime infrastructure (LangGraph or AutoGen execution semantics, retry logic, memory architecture).
- Specific recommendations on which frontier model to use for which production workload; that requires custom evaluation on the user's own data.
Score-citation discipline
Any quoted SOTA is treated as vendor-reported until a leaderboard verifies it. Every score cell explains who reported the number, where it was reported, when it was captured, and under what evaluation setup. Where the official leaderboard publishes a verified score, the site quotes the leaderboard. Where only a vendor model card is available, the cell is flagged as vendor-reported and links directly to the model card, plus a "Verify on leaderboard" link to the official benchmark site so readers can cross-check.
The site does not independently re-run benchmark suites. Independent replication of frontier benchmarks (SWE-bench Verified, MMLU-Pro, GPQA-Diamond) requires non-trivial infrastructure: Docker containers per repository, harnessed test runners, evaluation budget across thousands of inference calls, and standardised methodology to make scores comparable. That work is out of scope here. Where independent leaderboards (HuggingFace Open LLM Leaderboard v2, Papers With Code, the ARC Prize verified-claim list, the LMSYS Chatbot Arena live Elo) publish their own numbers, those numbers are preferred over vendor model-card claims.
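The preference order described above (verified leaderboard first, then an independent aggregator, then the vendor model card) can be sketched as a simple selection rule. The names here are illustrative, not the site's actual code:

```typescript
// Illustrative source-preference rule: prefer a verified leaderboard
// score, then an independent aggregator, and only then fall back to the
// vendor model card (which the site flags as vendor-reported).
type SourceKind = "leaderboard" | "independent" | "vendor";

interface ReportedScore {
  kind: SourceKind;
  score: number; // percent
  url: string;
}

function preferredScore(reports: ReportedScore[]): ReportedScore | undefined {
  const order: SourceKind[] = ["leaderboard", "independent", "vendor"];
  for (const kind of order) {
    const hit = reports.find((r) => r.kind === kind);
    if (hit) return hit;
  }
  return undefined;
}
```

If only a vendor number and an independent number exist, the independent one wins, which matches the stated preference for independent sources over model-card claims.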
Saturation tagging
A benchmark is tagged saturated when frontier models cluster above 90 percent and the variance between them is within statistical noise. Saturation is not a failure mode in itself; it is a signal that the field has mastered the capability the benchmark was designed to measure. The problem is continuing to cite saturated benchmarks as if they discriminate between current frontier models.
As of May 2026, the site treats the following benchmarks as saturated for frontier-comparison purposes: MMLU, HumanEval, MBPP, HellaSwag, WinoGrande. These pages quote scores for historical comparison only and link readers to the current frontier benchmark for that capability (MMLU-Pro, LiveCodeBench, GPQA-Diamond, ARC-AGI-2, SWE-bench Verified).
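The tagging rule above can be sketched numerically. The 90 percent threshold is the site's own; the noise-floor value used here is an assumption for illustration:

```typescript
// Tag a benchmark saturated when frontier scores all sit above 90 percent
// and the spread between them is within a noise floor (assumed 2 points;
// the site's actual noise model is not specified here).
function isSaturated(frontierScores: number[], noiseFloor = 2): boolean {
  if (frontierScores.length === 0) return false;
  const min = Math.min(...frontierScores);
  const max = Math.max(...frontierScores);
  return min > 90 && max - min <= noiseFloor;
}
```

By this rule, MMLU's 92 to 94 percent frontier cluster is saturated, while GPQA-Diamond's 71 to 78 percent range is not.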
Contamination tagging
Contamination risk is flagged per benchmark in the source table above. The risk profile depends on three factors: whether the test set is public, whether it overlaps with standard pre-training corpora (Common Crawl, GitHub, Stack Overflow, ArXiv, Wikipedia), and whether the benchmark team operates a held-out test set that vendors do not see.
High-risk benchmarks (MMLU, HumanEval, HellaSwag): test sets are fully public and known to overlap pre-training corpora. Allen AI found approximately 5 to 10 percent of MMLU test questions appear with high similarity in standard pre-training data. HumanEval problems are near-duplicates of LeetCode solutions widely indexed in pre-training.
Low-risk benchmarks (ARC-AGI, ARC-AGI-2, LiveCodeBench): use a held-out test set (the ARC Prize private subset) or admit only problems released after the model's training cutoff (LiveCodeBench), preventing contamination by construction.
Medium-risk benchmarks (SWE-bench Verified): test data is public, but the construction (real GitHub issues with their patches in git history) is hard to avoid in pre-training. The Verified subset reduces this risk by curating issues where the patch reasoning is the harder part of the task, not the patch retrieval.
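The three factors above combine into a rough tiering rule. This mapping is an illustrative reading of the table, not an official taxonomy:

```typescript
// Rough contamination tiering from the three factors described above:
// public test set, pre-training corpus overlap, and a held-out subset.
// The mapping is an illustrative reading, not an official taxonomy.
function contaminationRisk(opts: {
  publicTestSet: boolean;
  overlapsPretrainingCorpora: boolean;
  hasHeldOutSet: boolean;
}): "low" | "medium" | "high" {
  // A held-out or non-public test set caps the risk regardless of overlap.
  if (opts.hasHeldOutSet || !opts.publicTestSet) return "low";
  // Fully public plus known corpus overlap is the worst case (MMLU, HumanEval).
  return opts.overlapsPretrainingCorpora ? "high" : "medium";
}
```

SWE-bench Verified sits in the medium tier by a different route: its overlap is structural (git history), which is why the table carries a qualifier rather than a clean tier.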
Refresh cadence
Monthly: the cluster-head benchmarks (SWE-bench Verified, MMLU-Pro, GPQA-Diamond, AgentBench, WebArena, LiveCodeBench) get a leaderboard SOTA check in the first business week of each month. If a new frontier score has landed since the previous check, the relevant page rolls forward.
Quarterly: full-suite verification of all 17 indexed benchmarks. Every cell is re-checked against the official leaderboard. Saturation status is re-evaluated. Capture dates are refreshed. The next quarterly pass closes in August 2026.
The verification date is held in a single constant (LAST_VERIFIED_DATE) in src/lib/schema.ts. Footer text, schema dateModified, and visible "Last verified" labels all read from that single source. This is a deliberate design choice so cosmetic-refresh leaks (rolling a date forward without doing the underlying verification work) are structurally prevented.
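A minimal sketch of that single-source pattern; the constant name matches the text above, but the value and the consuming helpers are hypothetical:

```typescript
// Single source of truth for the verification date, as described above.
// Consumers derive their display strings from this constant rather than
// hard-coding dates, so no date can roll forward independently of it.
export const LAST_VERIFIED_DATE = "2026-05-01"; // illustrative value

// Hypothetical consumers: footer text and schema.org dateModified both
// read from the one constant.
export const footerText = `Last verified ${LAST_VERIFIED_DATE}`;
export const schemaDateModified = LAST_VERIFIED_DATE;
```

Updating the constant is then the only way to move any visible date, which is the structural guarantee the paragraph above describes.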
Corrections process
1. Email editor@benchmarkingagents.com with the specific URL, the cell, the disputed claim, and the source you believe is correct.
2. The target response window is five business days. The acknowledgment states whether the correction is accepted and which refresh pass will land it.
3. Substantive corrections (mis-quoted score, wrong methodology setting, missing capture date, broken citation) trigger a page-level dateModified roll-forward and a short corrigendum note at the bottom of the page.
4. Cosmetic corrections (typo, broken hyperlink, formatting glitch) roll silently in the next refresh pass without a dateModified update.
5. Disputed-source claims (you say the vendor reported X percent, the leaderboard says Y percent) get a row in the methodology gap section of the affected benchmark page, showing both numbers and the citations.
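The branching in the process above reduces to a small decision rule. A sketch with hypothetical names; only what the process states is encoded:

```typescript
// Whether a correction rolls dateModified forward, per the process above.
// Only substantive corrections trigger a roll-forward; cosmetic ones roll
// silently, and disputed-source claims get a methodology-gap row instead
// (their dateModified handling is not specified, so it is left false here).
type Correction = "substantive" | "cosmetic" | "disputed-source";

function rollsDateModified(kind: Correction): boolean {
  return kind === "substantive";
}
```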
Limitations
Benchmark scores measure narrow slices of capability under controlled conditions. They do not predict real-world agent performance, real-world reasoning, or real-world helpfulness. A 74.5 percent SWE-bench Verified score is an impressive engineering achievement; it is not a claim that the agent can ship production pull requests that a senior engineer would approve without modification.
Scores drift between verification windows. A score quoted on this site is correct as of its capture date, not as of the moment you read it. For decisions that depend on the absolute latest number, always click through to the official leaderboard link in the source table above.
Saturation and contamination are structural problems with the public-benchmark ecosystem, not artefacts of any single benchmark team. The most reliable signal for production model selection is custom evaluation on a golden dataset assembled from your own use case. The /custom-evals page covers how to build one.