Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Section I | Last verified April 2026 | 17 benchmarks indexed

The 2026 LLM Benchmark Reference: 17 Benchmarks with Capture-Dated Scores

A benchmark is a claim, not a fact. Every entry below carries category, source, and capture date.

The field has changed significantly since 2023. MMLU, HumanEval, and MBPP are saturated; frontier models cluster near the ceiling on all three, which makes them useless for comparing current models against each other. The action has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, Humanity's Last Exam, and SWE-bench Verified. Each benchmark below is tagged with status and tier.

Table 1 · Master Index
Benchmark | Category | Year | Tasks | 2026 Tier | Status
MMLU-Pro | Knowledge | 2024 | 12,000 | high-80s | Active
MMLU | Knowledge | 2020 | 15,908 | low-90s | Saturated
MMMU | Multimodal | 2024 | 11,550 | high-70s to low-80s | Active
Humanity's Last Exam | Knowledge | 2025 | 3,000 | below 20% | Active
HumanEval | Coding | 2021 | 164 | high-90s | Saturated
MBPP | Coding | 2021 | 974 | high-90s | Saturated
LiveCodeBench | Coding | 2024 | Ongoing | mid-70s | Active
SWE-bench Verified | Coding / Agentic | 2024 | 500 | low-to-mid 70s | Active
GPQA-Diamond | Reasoning | 2023 | 198 | high-70s | Active
ARC-AGI-2 | Reasoning | 2025 | 400+ | mid-60s | Active
BIG-Bench Hard | Reasoning | 2022 | 6,511 | low-90s | Approaching saturation
WebArena | Agentic | 2023 | 812 | high-40s | Active
AgentBench | Agentic | 2023 | 1,091 | low-50s | Active
OSWorld | Agentic | 2024 | 369 | high-30s | Active
Chatbot Arena Elo | Preference | 2023 | 2M+ votes | 1380 to 1400+ | Active
MathVista | Multimodal | 2024 | 6,141 | mid-80s | Active
DROP | Reasoning | 2019 | 96,567 | low-90s | Saturated
Tiers captured April 2026. They reflect typical frontier-model performance bands, not specific model+score pairings (which move week to week). For point-in-time scores, follow the source link on each row to the official leaderboard.
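
As a minimal sketch of the convention that every entry carries a category, source, and capture date, the index can be modelled as a small record type. The class name, field names, and the `rankable` helper are hypothetical and not part of any leaderboard's API; the four rows are copied from Table 1 and the source list on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkEntry:
    name: str
    category: str
    tier: str      # typical frontier band, not a model+score pairing
    status: str    # "Active", "Saturated", or "Approaching saturation"
    source: str    # where the band was read from
    captured: str  # capture month, e.g. "2026-04"


# Illustrative subset of Table 1 (hypothetical structure, values from this page).
INDEX = [
    BenchmarkEntry("MMLU", "Knowledge", "low-90s", "Saturated",
                   "paperswithcode.com", "2026-04"),
    BenchmarkEntry("LiveCodeBench", "Coding", "mid-70s", "Active",
                   "livecodebench.github.io", "2026-04"),
    BenchmarkEntry("SWE-bench Verified", "Coding / Agentic", "low-to-mid 70s",
                   "Active", "swebench.com", "2026-04"),
    BenchmarkEntry("ARC-AGI-2", "Reasoning", "mid-60s", "Active",
                   "arcprize.org", "2026-04"),
]


def rankable(entries):
    """Return names of benchmarks that still discriminate between frontier models."""
    return [e.name for e in entries if e.status == "Active"]


print(rankable(INDEX))
```

Keeping the capture month on each record, rather than once per page, makes it harder to quote a band without its date.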
Section II

Knowledge Benchmarks

Knowledge benchmarks test breadth of factual understanding across academic domains. MMLU (2020) was the defining benchmark of the 2020-2023 era and is now saturated; frontier models cluster in the low 90s, which leaves too little spread for current comparisons. MMLU-Pro addresses saturation by expanding each question from four answer choices to ten and adding more reasoning-heavy items, so chain-of-thought prompting matters far more than it did on MMLU.

Humanity's Last Exam (Center for AI Safety and Scale AI, 2025) is the current frontier challenge: 3,000 questions by domain experts, designed to remain hard for several years. Current SOTA sits below 20%, leaving substantial headroom. MMMU adds multimodal understanding; its questions require image interpretation alongside text. For 2026 comparisons, the recommended stack is MMLU-Pro for text knowledge, MMMU for multimodal knowledge, and HLE for frontier limits.

Editor's verdict · Most useful in 2026: MMLU-Pro. Cite MMLU only for historical comparison pre-2024.
Section III

Coding Benchmarks

Code benchmarks evolved rapidly from 2021 to 2026. HumanEval (OpenAI, 2021), with 164 Python function-completion tasks, was the standard for years. It is now saturated; top models score in the high 90s on pass@1. MBPP (Google, 2021), with 974 basic Python problems, is similarly exhausted. Neither benchmark belongs in a serious 2026 comparison.
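
For readers unfamiliar with pass@1: it is the fraction of problems solved by a single sampled completion. A common way to compute it (and pass@k more generally) is the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k). The sketch below assumes per-problem counts are already available; the function names and the sample counts in the usage lines are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem, k=1):
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Hypothetical results: three problems, ten samples each.
results = [(10, 10), (10, 9), (10, 0)]
print(round(benchmark_score(results, k=1), 3))
```

With k=1 the estimator reduces to c/n per problem, which is why a benchmark with only 164 tasks moves in steps of roughly 0.6 points per task.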

LiveCodeBench (2024) addresses contamination by continuously collecting problems from LeetCode, AtCoder, and Codeforces and tagging each with its release date, so a model can be scored only on problems published after its training cutoff. That makes contamination structurally unlikely rather than just hopefully absent. For 2026 code comparisons, use LiveCodeBench and SWE-bench Verified. The gap between SWE-bench Verified scores and HumanEval scores reveals how different code completion is from software engineering.
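
The date-based filtering idea can be sketched in a few lines. This is an illustration of the principle only, not LiveCodeBench's actual code: the `Problem` type, the `post_cutoff` helper, the problem IDs, and all dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    platform: str
    released: date


def post_cutoff(problems, training_cutoff):
    """Keep problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem pool and cutoff date.
POOL = [
    Problem("p1", "LeetCode", date(2025, 3, 10)),
    Problem("p2", "AtCoder", date(2025, 9, 2)),
    Problem("p3", "LeetCode", date(2026, 1, 15)),
]
cutoff = date(2025, 6, 30)

eval_set = post_cutoff(POOL, cutoff)
print([p.problem_id for p in eval_set])
```

Because the cutoff differs per model, two models may be scored on different problem subsets; comparisons are cleanest on the window that postdates both cutoffs.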

Editor's verdict · Most useful in 2026: LiveCodeBench (anti-contamination), SWE-bench Verified (real engineering).
Section IV

Reasoning Benchmarks

Reasoning is the benchmark frontier in 2026. GPQA (Rein et al., 2023) tests graduate-level, "Google-proof" scientific reasoning: questions are written by domain experts and designed so that skilled non-experts cannot answer them reliably even with unrestricted web access. GPQA-Diamond, the 198-question hardest subset, is the canonical reference. Frontier models cluster in the high 70s as of April 2026.

ARC-AGI-2 (Chollet, 2025) is the successor to the original Abstraction and Reasoning Corpus. ARC-AGI was cracked in late 2024 when reasoning-mode frontier models exceeded the human baseline. ARC-AGI-2 introduces significantly harder visual abstraction tasks that resist the techniques that broke the original. Current SOTA sits in the mid 60s. BIG-Bench Hard is approaching saturation.

Editor's verdict · Most useful in 2026: GPQA-Diamond, ARC-AGI-2, Humanity's Last Exam. All three retain real headroom.
Section V

Agentic Benchmarks

Agentic benchmarks measure multi-step task completion with tool use, a fundamentally different capability from question answering. WebArena (2023) provides 812 tasks across real web applications running in sandboxed browser environments. The agent must navigate, fill forms, search, and complete tasks without human assistance. Current SOTA sits in the high 40s, with substantial room for improvement and a real performance gap between frontier models.

AgentBench (2023) covers 8 agent environments including OS tasks, game playing, database management, and web browsing, totalling 1,091 tasks. OSWorld (2024) focuses on computer-use agents that control desktop applications. SWE-bench Verified (listed in Coding above but fundamentally agentic) is the most widely cited of these benchmarks.

Editor's verdict · Most useful in 2026: SWE-bench Verified (coding agents), WebArena (web agents), OSWorld (computer use).

Section VI

Multimodal Benchmarks

Multimodal benchmarks have grown in importance as frontier models gained vision capability. MMMU (Yue et al., 2024) is the primary knowledge benchmark for multimodal models: college-exam questions that require understanding images alongside text, across 30 subjects. MathVista (2024) tests mathematical reasoning about visual content (geometry diagrams, data charts, scientific plots). ChartQA and DocVQA cover document understanding.

MMMU has become a standard inclusion in frontier model cards in 2026 because it captures a capability dimension that text-only benchmarks entirely miss. Top scores cluster in the high 70s to low 80s, with the gap between models narrowing.

Editor's verdict · Most useful in 2026: MMMU for multimodal knowledge, MathVista for visual math reasoning.
Section VII

Human Preference Benchmarks

LMSYS Chatbot Arena uses blind pairwise human voting to fit Bradley-Terry ratings, reported on an Elo-style scale. Over 2 million votes as of April 2026 make it the most statistically robust human-preference signal available. The top three frontier models sit in a tight band (roughly 1380 to just over 1400) where many query-category gaps fall within statistical noise.
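
To see why a gap of about 20 rating points is small, it helps to convert it into a head-to-head win rate. The sketch below assumes the standard Elo logistic curve (base 10, scale 400) and uses a rough normal approximation for how many decisive votes would be needed to separate that win rate from a coin flip. Both function names are hypothetical, ties are ignored, and this does not reproduce the Arena's own fitting or confidence-interval procedure.

```python
import math


def expected_win_rate(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def votes_to_resolve(win_rate, z=1.96):
    """Rough count of decisive votes needed before win_rate differs from 50%
    at the given z-score (normal approximation, null variance 0.25)."""
    edge = abs(win_rate - 0.5)
    if edge == 0:
        raise ValueError("a 50% win rate can never be separated from a tie")
    return math.ceil((z * 0.5 / edge) ** 2)


# Ratings taken from the band quoted above; everything else is an assumption.
p = expected_win_rate(1400, 1380)
print(f"win rate {p:.3f}, votes needed ~{votes_to_resolve(p)}")
```

A 20-point gap corresponds to a win rate only a few points above 50%, so within a narrow query category, where far fewer votes exist for any given pair, the ordering can plausibly flip.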

Chatbot Arena measures what humans prefer in open-ended conversation, a different dimension from correctness, coding ability, or agentic capability. A model can rank highly in Arena while performing only modestly on SWE-bench, and vice versa. Both are valid measures of different capabilities.

Editor's verdict · Most useful in 2026: Chatbot Arena for preference, MT-Bench for instruction-following quality.

Section VIII

Saturated · Historical Reference

Benchmarks here are still cited in 2026 but no longer discriminate frontier models. They remain useful as historical context, not as ranking criteria.

GLUE (2018)
Saturated 2019. Human baseline near 87%; top models hit 90.8% by 2020.
SuperGLUE (2019)
Saturated 2021. Human baseline 89.8%; models surpassed it in 2021.
MMLU (2020)
Saturated 2024. Frontier in the low 90s. Use MMLU-Pro instead.
HellaSwag (2019)
Saturated 2023. Frontier above 95%. Designed for 2019-era models.
WinoGrande (2019)
Saturated 2023. Frontier above 90%. Not useful for 2026 comparisons.
HumanEval (2021)
Saturated 2024. Frontier in the high 90s. Use LiveCodeBench instead.
MBPP (2021)
Saturated 2024. Top models above 97%. Same problem as HumanEval.
DROP (2019)
Approaching saturation; frontier in the low 90s. Originally tested reading comprehension that requires discrete reasoning over paragraphs.
Section IX

Reader Questions

Q.01 Which benchmark should I trust for frontier model comparisons in 2026?
Use MMLU-Pro for knowledge (not MMLU, which is saturated). Use GPQA-Diamond for reasoning. Use SWE-bench Verified for coding and agentic capability. Use LMSYS Chatbot Arena for human preference. Avoid HumanEval, MMLU, HellaSwag, and MBPP for current comparisons because all are saturated.
Q.02 How often do benchmarks get updated?
Public benchmark test sets are typically static. What changes is the leaderboard. LMSYS Chatbot Arena updates continuously. Papers With Code aggregates scores as they are published. Some benchmarks release new versions (MMLU to MMLU-Pro, ARC-AGI to ARC-AGI-2) when the original saturates.
Q.03 What does saturated mean for a benchmark?
A benchmark is saturated when frontier models all score in a tight range above 90%, leaving the benchmark unable to discriminate between them. Saturated benchmarks remain useful as historical baselines but stop being useful for ranking current frontier models.
Q.04 Why don't all frontier models publish results on every benchmark?
Vendors selectively publish benchmarks where they perform well. A model card that includes MMLU but omits SWE-bench Verified is a signal worth noting. Some benchmarks require expensive human evaluation that vendors cannot self-report. When a company skips a standard benchmark, ask why.
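
The saturation rule from Q.03 (all frontier scores above 90% and in a tight range) can be written as a simple check. This is a minimal sketch: the 90-point floor comes from the answer above, while the 3-point `max_spread` threshold for "tight", the function name, and the example scores are assumptions chosen for illustration.

```python
def is_saturated(frontier_scores, floor=90.0, max_spread=3.0):
    """True when every frontier score clears the floor and the scores
    sit within max_spread points of each other."""
    if not frontier_scores:
        raise ValueError("need at least one score")
    lowest, highest = min(frontier_scores), max(frontier_scores)
    return lowest > floor and (highest - lowest) <= max_spread


# Hypothetical score sets, in percent.
print(is_saturated([91.2, 92.0, 92.8]))  # tight cluster above 90
print(is_saturated([68.0, 74.5, 77.1]))  # wide spread, well below the floor
```

Both conditions matter: a benchmark where one model scores 95 and another 80 still discriminates, even though the top score is high.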

Sources

[1] HuggingFace Open LLM Leaderboard v2 · huggingface.co · Captured April 2026
[2] Papers With Code MMLU · paperswithcode.com · Captured April 2026
[3] SWE-bench Official Leaderboard · swebench.com · Captured April 2026
[4] ARC Prize / ARC-AGI-2 · arcprize.org · Captured April 2026
[5] LMSYS Chatbot Arena · lmsys.org · Captured April 2026
[6] LiveCodeBench · livecodebench.github.io · Captured April 2026
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
