Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. Scores cited with source and capture date. Affiliate disclosure.
Last verified April 2026 - 17 benchmarks

The 2026 LLM Benchmark Reference - 17 Benchmarks with Current Scores

A benchmark is a claim, not a fact. Every table on this page carries the source, capture date, and known limitations. This is the most comprehensive, dated, source-linked benchmark index on the open web for 2026.

The field has changed significantly since 2023. MMLU, HumanEval, and MBPP are saturated - frontier models score 92-98% on all three, making them useless for comparing current models. The action has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, Humanity's Last Exam, and SWE-bench Verified. Each benchmark below is tagged with its current status.

| Benchmark | Category | Year | Tasks | SOTA | SOTA Model | Captured | Status |
|---|---|---|---|---|---|---|---|
| MMLU-Pro | Knowledge | 2024 | 12,000 | 86.1% | GPT-5 | Apr 2026 | Active |
| MMLU | Knowledge | 2020 | 15,908 | 93.4% | GPT-5 | Apr 2026 | Saturated |
| MMMU | Multimodal | 2024 | 11,550 | 81.2% | GPT-5 | Apr 2026 | Active |
| Humanity's Last Exam | Knowledge | 2025 | 3,000 | 18.4% | GPT-5 | Apr 2026 | Active |
| HumanEval | Code | 2021 | 164 | 98.1% | GPT-5 | Apr 2026 | Saturated |
| MBPP | Code | 2021 | 974 | 97.4% | Claude 4.5 Opus | Apr 2026 | Saturated |
| LiveCodeBench | Code | 2024 | Ongoing | 74.8% | GPT-5 | Apr 2026 | Active |
| SWE-bench Verified | Code/Agentic | 2024 | 500 | 74.5% | Claude 4.5 Opus | Apr 2026 | Active |
| GPQA-Diamond | Reasoning | 2023 | 198 | 78.4% | GPT-5 | Apr 2026 | Active |
| ARC-AGI-2 | Reasoning | 2025 | 400+ | 65.8% | GPT-5 | Apr 2026 | Active |
| BIG-Bench Hard | Reasoning | 2022 | 6,511 | 94.3% | GPT-5 | Apr 2026 | Approaching saturation |
| WebArena | Agentic | 2023 | 812 | 47.2% | Claude 4.5 Opus | Apr 2026 | Active |
| AgentBench | Agentic | 2023 | 1,091 | 54.3% | GPT-5 | Apr 2026 | Active |
| OSWorld | Agentic | 2024 | 369 | 38.1% | Claude 4.5 Opus | Apr 2026 | Active |
| Chatbot Arena Elo | Human preference | 2023 | 2M+ votes | 1401 | GPT-5 | Apr 2026 | Active |
| MathVista | Multimodal | 2024 | 6,141 | 84.7% | GPT-5 | Apr 2026 | Active |
| DROP | Reasoning | 2019 | 96,567 | 94.1% | GPT-5 | Apr 2026 | Saturated |
All scores captured April 2026. Sources: vendor model cards, Papers With Code, HuggingFace Open LLM Leaderboard v2, official leaderboards. Benchmark names link to primary sources.

Knowledge Benchmarks

Knowledge benchmarks test breadth of factual understanding across academic domains. MMLU (2020) was the defining benchmark of the 2020-2023 era and is now saturated - frontier models score 92-94%, making it useless for current comparisons. MMLU-Pro addresses the saturation with 10-choice questions and mandatory chain-of-thought (CoT) reasoning, which together demand more deliberate problem solving. Humanity's Last Exam (HLE; Scale AI, 2025) is the current frontier challenge: 3,000 questions written by domain experts, designed to remain hard for several years. Current SOTA is around 18-20%, leaving enormous headroom.
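As a concrete illustration of how a 10-choice CoT benchmark is graded, here is a minimal sketch: the model reasons freely, a regex pulls the final answer letter, and exact-match accuracy is computed. The prompt convention and extraction pattern are illustrative assumptions, not MMLU-Pro's official harness.

```python
import re

def extract_choice(completion: str) -> str | None:
    """Pull the last 'answer is (X)' letter (A-J) from a chain-of-thought completion.
    The regex is an illustrative assumption, not MMLU-Pro's official extractor."""
    matches = re.findall(r"answer is \(?([A-J])\)?", completion, re.IGNORECASE)
    return matches[-1].upper() if matches else None

def accuracy(completions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy over the extracted answer letters."""
    hits = sum(extract_choice(c) == g for c, g in zip(completions, gold))
    return hits / len(gold)

print(accuracy(["The area doubles, so the answer is (C)."], ["C"]))  # 1.0
```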

MMMU adds multimodal understanding - questions require image interpretation alongside text. Its college-exam-style questions span 30 subjects and capture a capability dimension that text-only benchmarks miss entirely. For 2026 model comparisons, the recommended stack is: MMLU-Pro for text knowledge, MMMU for multimodal knowledge, and HLE for frontier reasoning limits.

Most useful in 2026: MMLU-Pro. Cite MMLU only for historical comparison pre-2024.

Code Benchmarks

Code benchmarks evolved rapidly through 2021-2026. HumanEval (OpenAI, 2021) with 164 Python function-completion tasks was the standard for years. It is now saturated: top models exceed 98% pass@1. MBPP (Google, 2021) with 974 basic Python problems is similarly exhausted. Neither benchmark should appear in a serious 2026 model comparison.
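Those pass@1 numbers come from the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): generate n samples per task, count the c that pass, and compute the probability that a random draw of k contains at least one pass. A minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of
    k samples passes, given c of n generated samples passed the tests."""
    if n - c < k:  # every size-k draw must include a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=196, k=1))  # 0.98 -> a "98% pass@1" headline number
```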

LiveCodeBench (2024) solved the contamination problem by drawing from LeetCode and AtCoder problems released after each model's training cutoff. This makes contamination structurally impossible rather than just unlikely. For 2026 code comparisons, use LiveCodeBench and SWE-bench Verified. SWE-bench Verified tests real-world software engineering capability - can the model write a patch for a real GitHub issue that makes failing tests pass? The gap between SWE-bench Verified scores and HumanEval scores reveals how different code completion is from software engineering.
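The cutoff filter itself is simple; the sketch below shows the idea. The record fields and dates are hypothetical stand-ins, not LiveCodeBench's actual schema; the real pipeline tags each problem with its public release date and scores a model only on problems newer than its stated training cutoff.

```python
from datetime import date

# Hypothetical problem records - the field names are illustrative,
# not LiveCodeBench's actual schema.
problems = [
    {"id": "lc-3105",   "source": "leetcode", "released": date(2025, 11, 2)},
    {"id": "abc-371-f", "source": "atcoder",  "released": date(2024, 6, 15)},
]

def uncontaminated(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems published after the model's training cutoff,
    so they cannot have appeared in its training data."""
    return [p for p in problems if p["released"] > training_cutoff]

eval_set = uncontaminated(problems, training_cutoff=date(2025, 9, 30))
print([p["id"] for p in eval_set])  # ['lc-3105']
```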

Most useful in 2026: LiveCodeBench (anti-contamination), SWE-bench Verified (real engineering).

Reasoning Benchmarks

Reasoning is the benchmark frontier in 2026. GPQA (Rein et al., 2023) tests graduate-level, "Google-proof" scientific reasoning: questions written by domain PhDs, designed so that even skilled non-experts cannot reliably answer them with full web access. GPQA-Diamond, the 198-question hardest subset, is the canonical reference. Frontier models now score 71-78%, with GPT-5 reaching 78.4% as of April 2026.

ARC-AGI-2 (Chollet, 2025) is the successor to the original Abstraction and Reasoning Corpus. ARC-AGI was cracked in December 2024 when OpenAI's o3 scored 87.5% (above the 85% human baseline). ARC-AGI-2 introduces significantly harder visual abstraction tasks that resist the techniques that broke the original. SOTA is around 65.8% as of April 2026, with substantial headroom. BIG-Bench Hard is approaching saturation at 94.3%.

Most useful in 2026: GPQA-Diamond, ARC-AGI-2, Humanity's Last Exam. All have real headroom.

Agentic Benchmarks

Agentic benchmarks measure multi-step task completion with tool use - a fundamentally different capability from question answering. WebArena (2023) provides 812 tasks across real web applications (Reddit, GitLab, e-commerce) running in sandboxed browser environments. The agent must navigate, fill forms, search, and complete tasks without human assistance. SOTA is around 47%, suggesting substantial room for improvement and a real performance gap between frontier models.
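Under the hood, web-agent benchmarks like this one score an observe-act loop rather than a single answer. The sketch below shows the shape of that loop; the `env` and `model` interfaces are hypothetical placeholders, not WebArena's actual API.

```python
# Shape of an agentic evaluation episode. The env/model interfaces are
# hypothetical placeholders, not WebArena's actual API.
def run_episode(env, model, max_steps: int = 30) -> bool:
    obs = env.reset()  # e.g. page HTML or accessibility tree plus the task goal
    for _ in range(max_steps):
        action = model.act(obs)       # e.g. click(id=42) or type("search query")
        obs, done = env.step(action)  # environment applies the action
        if done:
            break
    return env.task_succeeded()       # scored on outcome, not on any single answer

# Benchmark score = fraction of tasks where run_episode(...) returns True.
```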

AgentBench (2023) covers eight agent environments - OS tasks, game playing, database management, and web browsing among them - for 1,091 tasks in total. OSWorld (2024) focuses on computer-use agents: can the model control a desktop application to complete a task? SWE-bench Verified (listed under Code above, but fundamentally agentic) is the most widely cited agentic benchmark.

Most useful in 2026: SWE-bench Verified (coding agents), WebArena (web agents), OSWorld (computer use).

See agent benchmarks deep dive ->

Multimodal Benchmarks

Multimodal benchmarks have grown in importance as frontier models gained vision capability. MMMU (Yue et al., 2024) is the primary knowledge benchmark for multimodal models - college-exam questions that require understanding images alongside text, across 30 subjects. MathVista (2024) tests mathematical reasoning about visual content: geometry diagrams, data charts, and scientific plots. ChartQA and DocVQA cover document understanding.

MMMU has become a standard inclusion in frontier model cards in 2026 because it captures a capability dimension that text-only benchmarks entirely miss. GPT-5 leads at 81.2% as of April 2026, but the gap between models is narrowing.

Most useful in 2026: MMMU for multimodal knowledge, MathVista for visual math reasoning.

Human Preference Benchmarks

LMSYS Chatbot Arena uses blind pairwise human voting to fit a Bradley-Terry model, reported on an Elo scale. It has collected over 2 million votes as of April 2026, making it the most statistically robust human-preference signal available. GPT-5 leads at Elo 1401, with Claude 4.5 Opus at 1389 and Gemini 2.5 Pro at 1371. The gap between the top three is within statistical noise for many query categories.
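The rating scale is the standard Elo logistic. The Arena actually fits a Bradley-Terry model over all votes at once rather than updating ratings online, but an Elo-style sketch conveys the same intuition, including why a 12-point gap at the top is nearly a coin flip:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 4.0):
    """One pairwise vote: score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    (Illustrative online update; the Arena fits Bradley-Terry over all votes.)"""
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

# With the April 2026 ratings cited above:
print(elo_expected(1401, 1389))  # ~0.517 - a 12-point lead is nearly a coin flip
```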

Chatbot Arena measures what humans prefer in open-ended conversation - a different dimension from correctness, coding ability, or agentic capability. A model can rank highly in Arena while performing poorly on SWE-bench, and vice versa. Both are valid measures of different capabilities.

Most useful in 2026: Chatbot Arena for preference, MT-Bench for instruction-following quality.

Human vs automated evaluation deep dive ->

Saturated Benchmarks - Historical Reference Only

GLUE (2018) - Saturated 2019. Human baseline ~87%; top models hit 90.8% by 2020.
SuperGLUE (2019) - Saturated 2021. Human baseline 89.8%; models surpassed it in 2021.
MMLU (2020) - Saturated 2024. Frontier models 92-94%. Use MMLU-Pro instead.
HellaSwag (2019) - Saturated 2023. Frontier models 95%+. Designed for 2019-era models.
WinoGrande (2019) - Saturated 2023. Frontier models 90%+. Not useful for 2026 comparisons.
HumanEval (2021) - Saturated 2024. Frontier models 96-98%. Use LiveCodeBench instead.
MBPP (2021) - Saturated 2024. Top models 97%+. Same problem as HumanEval.
DROP (2019) - Saturated. Frontier models 94%+. Originally a reading-comprehension benchmark (discrete reasoning over paragraphs).

Frequently Asked Questions

Which benchmark should I trust for frontier model comparisons in 2026?
Use MMLU-Pro for knowledge (not MMLU, which is saturated at 92-94% for frontier models). Use GPQA-Diamond for reasoning. Use SWE-bench Verified for coding and agentic capability. Use LMSYS Chatbot Arena for human preference. Avoid HumanEval, MMLU, HellaSwag, and MBPP for current comparisons - all are saturated.
How often do benchmarks get updated?
Public benchmark test sets are typically static (updating them would invalidate historical comparisons). What changes is the leaderboard - new models submit scores. LMSYS Chatbot Arena updates continuously. Papers With Code aggregates scores as they are published. Some benchmarks release new versions (MMLU to MMLU-Pro, ARC-AGI to ARC-AGI-2) when the original saturates.
What does saturated mean for a benchmark?
A benchmark is saturated when frontier models all score above approximately 90%, making the benchmark unable to discriminate between them. MMLU, HumanEval, HellaSwag, and MBPP are all saturated. Saturated benchmarks are still useful as historical baselines but useless for ranking current models against each other.
Why don't all frontier models publish all benchmarks?
Model vendors selectively publish benchmarks where they perform well. A model card that includes MMLU but omits SWE-bench Verified is a signal worth noting. Some benchmarks require expensive human evaluation (LMSYS Arena) that vendors cannot self-report. When a company skips a standard benchmark, it is reasonable to ask why.
Agent benchmarks ->
What benchmarks miss ->
MMLU deep dive ->
SWE-bench deep dive ->
Human vs automated ->

Sources

1. HuggingFace Open LLM Leaderboard v2 - huggingface.co/spaces/open-llm-leaderboard - Captured April 2026
2. Papers With Code MMLU - paperswithcode.com - Captured April 2026
3. SWE-bench Official Leaderboard - swebench.com - Captured April 2026
4. ARC Prize / ARC-AGI-2 - arcprize.org - Captured April 2026
5. LMSYS Chatbot Arena - lmsys.org - Captured April 2026
6. LiveCodeBench - livecodebench.github.io - Captured April 2026