Section I · Last verified April 2026 · 17 benchmarks indexed

The 2026 LLM Benchmark Reference: 17 Benchmarks with Capture-Dated Scores

A benchmark is a claim, not a fact. Every entry below carries category, source, and capture date.

The field has changed significantly since 2023. MMLU, HumanEval, and MBPP are saturated; frontier models cluster near the ceiling on all three, which makes them useless for comparing current models against each other. The action has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, Humanity's Last Exam, and SWE-bench Verified. Each benchmark below is tagged with status and tier.

Table 1 · Master Index

Benchmark	Category	Year	Tasks / items	2026 frontier tier	Status
MMLU-Pro	Knowledge	2024	12,000	high-80s	Active
MMLU	Knowledge	2020	15,908	low-90s	Saturated
MMMU	Multimodal	2024	11,550	high-70s to low-80s	Active
Humanity's Last Exam	Knowledge	2025	3,000	below 20%	Active
HumanEval	Coding	2021	164	high-90s	Saturated
MBPP	Coding	2021	974	high-90s	Saturated
LiveCodeBench	Coding	2024	Ongoing	mid-70s	Active
SWE-bench Verified	Coding / Agentic	2024	500	low-to-mid 70s	Active
GPQA-Diamond	Reasoning	2023	198	high-70s	Active
ARC-AGI-2	Reasoning	2025	400+	mid-60s	Active
BIG-Bench Hard	Reasoning	2022	6,511	low-90s	Approaching saturation
WebArena	Agentic	2023	812	high-40s	Active
AgentBench	Agentic	2023	1,091	low-50s	Active
OSWorld	Agentic	2024	369	high-30s	Active
Chatbot Arena Elo	Preference	2023	2M+ votes	1380 to 1400+	Active
MathVista	Multimodal	2024	6,141	mid-80s	Active
DROP	Reasoning	2019	96,567	low-90s	Saturated
Tiers captured April 2026. They reflect typical frontier-model performance bands, not specific model+score pairings (which move week to week). For point-in-time scores, follow the source link on each row to the official leaderboard.
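
One way to keep rows like those in Table 1 machine-readable is a small record type that carries the tier band and its capture date together. The sketch below is illustrative only: the class name, field names, and helper are assumptions, and the capture day (the 1st) is a placeholder because the table states only "April 2026". A `source` field could be added in the same way.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkEntry:
    name: str
    category: str
    year: int
    tier: str        # performance band as free text, e.g. "high-80s"
    status: str      # "Active", "Saturated", or "Approaching saturation"
    captured: date   # when the tier band was recorded


# Placeholder day: the table gives the month and year only.
CAPTURED = date(2026, 4, 1)

# Three rows copied from Table 1.
ENTRIES = [
    BenchmarkEntry("MMLU-Pro", "Knowledge", 2024, "high-80s", "Active", CAPTURED),
    BenchmarkEntry("MMLU", "Knowledge", 2020, "low-90s", "Saturated", CAPTURED),
    BenchmarkEntry("GPQA-Diamond", "Reasoning", 2023, "high-70s", "Active", CAPTURED),
]


def usable_for_ranking(entries):
    """Return the names of entries whose status is exactly 'Active'."""
    return [e.name for e in entries if e.status == "Active"]


print(usable_for_ranking(ENTRIES))
```

Storing the capture date on every record makes it harder to quote a tier without the date it was observed.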

Section II

Knowledge Benchmarks

Knowledge benchmarks test breadth of factual understanding across academic domains. MMLU (2020) was the defining benchmark of the 2020-2023 era and is now saturated; frontier models cluster in the low 90s, so it no longer separates current models. MMLU-Pro addresses saturation by expanding each question from four answer choices to ten and shifting toward reasoning-heavy questions, where chain-of-thought prompting typically helps.

Humanity's Last Exam (HLE; Center for AI Safety and Scale AI, 2025) is the current frontier challenge: 3,000 questions written by domain experts and designed to remain hard for several years. Current SOTA sits below 20%, leaving substantial headroom. MMMU adds multimodal understanding; its questions require image interpretation alongside text. For 2026 comparisons, the recommended stack is MMLU-Pro for text knowledge, MMMU for multimodal knowledge, and HLE for frontier limits.

Editor's verdict · Most useful in 2026: MMLU-Pro. Cite MMLU only for historical comparison pre-2024.

Section III

Coding Benchmarks

Code benchmarks evolved rapidly from 2021 to 2026. HumanEval (OpenAI, 2021), with 164 Python function-completion tasks, was the standard for years. It is now saturated; top models score in the high 90s on pass@1. MBPP (Google, 2021), with 974 basic Python problems, is similarly exhausted. Neither benchmark belongs in a serious 2026 comparison.
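
For readers unfamiliar with pass@1: it is the k = 1 case of pass@k, the probability that at least one of k sampled solutions passes the tests. The sketch below uses the unbiased estimator from the paper that introduced HumanEval (Chen et al., 2021). The function names are illustrative, and this page does not state how any particular vendor computed its reported scores.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass, averaged over all problems.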

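LiveCodeBench (2024) addresses contamination by continuously collecting problems from competitive-programming sites (LeetCode, AtCoder, and Codeforces) and evaluating each model only on problems released after its training cutoff. That makes contamination structurally unlikely rather than just hopefully absent. SWE-bench Verified (500 human-validated tasks drawn from real GitHub issues) tests whether a model can resolve an issue in an existing repository. For 2026 code comparisons, use LiveCodeBench and SWE-bench Verified. The gap between SWE-bench Verified scores and HumanEval scores reveals how different code completion is from software engineering.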
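
The date-based idea can be sketched in a few lines: keep only problems published after a model's training cutoff. This is a minimal illustration, not LiveCodeBench's own tooling. The record layout, problem ids, and dates are hypothetical.

```python
from datetime import date


def released_after_cutoff(problems, training_cutoff):
    """Keep problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problem records; ids and dates are made up for illustration.
problems = [
    {"id": "p1", "released": date(2025, 3, 10)},
    {"id": "p2", "released": date(2025, 9, 2)},
    {"id": "p3", "released": date(2026, 1, 15)},
]

# Hypothetical training cutoff for the model under evaluation.
cutoff = date(2025, 6, 30)

eligible = released_after_cutoff(problems, cutoff)
print([p["id"] for p in eligible])
```

Because the eligible set depends on the cutoff, two models with different cutoffs are scored on different problem subsets unless the comparison is restricted to problems released after the later cutoff.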

Editor's verdict · Most useful in 2026: LiveCodeBench (anti-contamination), SWE-bench Verified (real engineering).

Section IV

Reasoning Benchmarks

Reasoning is the benchmark frontier in 2026. GPQA (Rein et al., 2023) tests graduate-level, "Google-proof" scientific reasoning: its questions are written by domain PhDs and designed to be hard to answer even with unrestricted web search. GPQA-Diamond, the 198-question hardest subset, is the canonical reference. Frontier models cluster in the high 70s as of April 2026.

ARC-AGI-2 (Chollet, 2025) is the successor to the original Abstraction and Reasoning Corpus. ARC-AGI was cracked in late 2024 when reasoning-mode frontier models exceeded the human baseline. ARC-AGI-2 introduces significantly harder visual abstraction tasks that resist the techniques that broke the original. Current SOTA sits in the mid 60s. BIG-Bench Hard, with frontier models in the low 90s, is approaching saturation.

Editor's verdict · Most useful in 2026: GPQA-Diamond, ARC-AGI-2, Humanity's Last Exam. All three retain real headroom.

Section V

Agentic Benchmarks

Agentic benchmarks measure multi-step task completion with tool use, a fundamentally different capability from question answering. WebArena (2023) provides 812 tasks across real web applications running in sandboxed browser environments. The agent must navigate, fill forms, search, and complete tasks without human assistance. Current SOTA sits in the high 40s, with substantial room for improvement and a real performance gap between frontier models.

AgentBench (2023) covers 8 agent environments, including OS tasks, game playing, database management, and web browsing, totalling 1,091 tasks. OSWorld (2024) focuses on computer-use agents that control desktop applications. SWE-bench Verified, listed under Coding above but fundamentally agentic, is the most widely cited agentic benchmark.

Editor's verdict · Most useful in 2026: SWE-bench Verified (coding agents), WebArena (web agents), OSWorld (computer use).

Section VI

Multimodal Benchmarks

Multimodal benchmarks have grown in importance as frontier models gained vision capability. MMMU (Yue et al., 2024) is the primary knowledge benchmark for multimodal models: college-exam questions that require understanding images alongside text, across 30 subjects. MathVista (2024) tests mathematical reasoning about visual content (geometry diagrams, data charts, scientific plots). ChartQA and DocVQA cover document understanding.

MMMU has become a standard inclusion in frontier model cards in 2026 because it captures a capability dimension that text-only benchmarks entirely miss. Top scores cluster in the high 70s to low 80s, with the gap between models narrowing.

Editor's verdict · Most useful in 2026: MMMU for multimodal knowledge, MathVista for visual math reasoning.

Section VII

Human Preference Benchmarks

LMSYS Chatbot Arena collects blind pairwise human votes and fits a Bradley-Terry model to them, reporting each model's strength on an Elo-style rating scale. With over 2 million votes as of April 2026, it is the most statistically robust human-preference signal available. The top three frontier models sit in a tight band (roughly 1380 to just over 1400), where many per-category gaps fall within statistical noise.
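
To make the rating mechanics concrete, the sketch below fits Bradley-Terry strengths from a list of pairwise votes with a simple iterative update and maps them onto an Elo-style scale. It is a toy illustration under stated assumptions, not the Arena's actual pipeline: ties are ignored, every model is assumed to have at least one win, the mean rating of 1000 is an arbitrary anchor, and the model names and votes are made up.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """votes: list of (winner, loser) pairs. Returns model -> strength."""
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum over opponents: games played / combined current strength.
            denom = sum(
                pair_counts[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models
                if o != m
            )
            updated[m] = wins[m] / denom
        # Rescale so the strengths average to 1 (the scale is arbitrary).
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength, mean_rating=1000.0):
    """Map strengths to an Elo-style scale: 400 points per factor of 10."""
    raw = {m: 400.0 * math.log10(s) for m, s in strength.items()}
    shift = mean_rating - sum(raw.values()) / len(raw)
    return {m: r + shift for m, r in raw.items()}


def expected_win_rate(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Toy data: model_a is preferred in 3 of 4 head-to-head votes.
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
ratings = to_elo_scale(fit_bradley_terry(votes))
print(round(expected_win_rate(ratings["model_a"], ratings["model_b"]), 3))
```

Under the same formula, a 20-point rating gap (for example 1400 versus 1380) corresponds to an expected win rate of only about 53%, which is why small gaps at the top of the leaderboard are hard to separate from noise.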

Chatbot Arena measures what humans prefer in open-ended conversation, a different dimension from correctness, coding ability, or agentic capability. A model can rank highly in Arena while performing only modestly on SWE-bench, and vice versa. Both are valid measures of different capabilities.

Editor's verdict · Most useful in 2026: Chatbot Arena for preference, MT-Bench for instruction-following quality.

Section VIII

Saturated · Historical Reference

Benchmarks here are still cited in 2026 but no longer discriminate frontier models. They remain useful as historical context, not as ranking criteria.

GLUE (2018)

Saturated 2019. Human baseline near 87%; top models hit 90.8% by 2020.

SuperGLUE (2019)

Saturated 2021. Human baseline 89.8%; models surpassed it in 2021.

MMLU (2020)

Saturated 2024. Frontier in the low 90s. Use MMLU-Pro instead.

HellaSwag (2019)

Saturated 2023. Frontier above 95%. Designed for 2019-era models.

WinoGrande (2019)

Saturated 2023. Frontier above 90%. Not useful for 2026 comparisons.

HumanEval (2021)

Saturated 2024. Frontier in the high 90s. Use LiveCodeBench instead.

MBPP (2021)

Saturated 2024. Top models above 97%. Same problem as HumanEval.

DROP (2019)

Saturated. Frontier in the low 90s. Originally tested reading comprehension that requires discrete reasoning over paragraphs.

Section IX

Reader Questions

Q.01 Which benchmark should I trust for frontier model comparisons in 2026?

Use MMLU-Pro for knowledge (not MMLU, which is saturated). Use GPQA-Diamond for reasoning. Use SWE-bench Verified for coding and agentic capability. Use LMSYS Chatbot Arena for human preference. Avoid HumanEval, MMLU, HellaSwag, and MBPP for current comparisons because all are saturated.

Q.02 How often do benchmarks get updated?

Public benchmark test sets are typically static; updating them would invalidate historical comparisons. What changes is the leaderboard, as new models submit scores. LMSYS Chatbot Arena updates continuously. Papers With Code aggregates scores as they are published. Some benchmarks release new versions (MMLU to MMLU-Pro, ARC-AGI to ARC-AGI-2) when the original saturates.

Q.03 What does saturated mean for a benchmark?

A benchmark is saturated when frontier models all score in a tight range above 90%, leaving the benchmark unable to discriminate between them. Saturated benchmarks remain useful as historical baselines but stop being useful for ranking current frontier models.
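
The definition above can be written as a simple check. This is a sketch only: the 90% floor comes from the definition, but the text does not quantify "tight range", so the 5-point `max_spread` default is an assumption, and the score lists in the usage lines are hypothetical.

```python
def is_saturated(frontier_scores, floor=90.0, max_spread=5.0):
    """True if every score is above `floor` and all lie within `max_spread` points.

    `max_spread` is an assumed threshold; the definition only says "tight range".
    """
    if not frontier_scores:
        raise ValueError("need at least one score")
    lowest, highest = min(frontier_scores), max(frontier_scores)
    return lowest > floor and (highest - lowest) <= max_spread


# Hypothetical score sets for illustration.
print(is_saturated([91.2, 92.0, 93.1]))  # all above 90, spread under 2 points
print(is_saturated([72.0, 78.5, 75.1]))  # well below the floor
```

A benchmark where one model scores 91 and another 99.5 would not count as saturated under this check, since it still separates the two.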

Q.04 Why don't all frontier models report every benchmark?

Vendors selectively publish benchmarks where they perform well. A model card that includes MMLU but omits SWE-bench Verified is a signal worth noting. Some benchmarks, such as Chatbot Arena, require expensive human evaluation that vendors cannot self-report. When a company skips a standard benchmark, it is reasonable to ask why.

Sources

[1] HuggingFace Open LLM Leaderboard v2 · huggingface.co · Captured April 2026
[2] Papers With Code MMLU · paperswithcode.com · Captured April 2026
[3] SWE-bench Official Leaderboard · swebench.com · Captured April 2026
[4] ARC Prize / ARC-AGI-2 · arcprize.org · Captured April 2026
[5] LMSYS Chatbot Arena · lmsys.org · Captured April 2026
[6] LiveCodeBench · livecodebench.github.io · Captured April 2026