The 2026 LLM Benchmark Reference: 17 Benchmarks with Capture-Dated Scores
A benchmark is a claim, not a fact. Every entry below carries category, source, and capture date.
The field has changed significantly since 2023. MMLU, HumanEval, and MBPP are saturated; frontier models cluster near the ceiling on all three, which makes them useless for comparing current models against each other. The action has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, Humanity's Last Exam, and SWE-bench Verified. Each benchmark below is tagged with status and tier.
| Benchmark | Category | Year | Tasks | Frontier tier (April 2026) | Status |
|---|---|---|---|---|---|
| MMLU-Pro | Knowledge | 2024 | 12,000 | high-80s | Active |
| MMLU | Knowledge | 2020 | 15,908 | low-90s | Saturated |
| MMMU | Multimodal | 2024 | 11,550 | high-70s to low-80s | Active |
| Humanity's Last Exam | Knowledge | 2025 | 3,000 | below 20% | Active |
| HumanEval | Coding | 2021 | 164 | high-90s | Saturated |
| MBPP | Coding | 2021 | 974 | high-90s | Saturated |
| LiveCodeBench | Coding | 2024 | Ongoing | mid-70s | Active |
| SWE-bench Verified | Coding / Agentic | 2024 | 500 | low-to-mid 70s | Active |
| GPQA-Diamond | Reasoning | 2023 | 198 | high-70s | Active |
| ARC-AGI-2 | Reasoning | 2025 | 400+ | mid-60s | Active |
| BIG-Bench Hard | Reasoning | 2022 | 6,511 | low-90s | Approaching saturation |
| WebArena | Agentic | 2023 | 812 | high-40s | Active |
| AgentBench | Agentic | 2023 | 1,091 | low-50s | Active |
| OSWorld | Agentic | 2024 | 369 | high-30s | Active |
| Chatbot Arena Elo | Preference | 2023 | 2M+ votes | 1380 to 1400+ | Active |
| MathVista | Multimodal | 2024 | 6,141 | mid-80s | Active |
| DROP | Reasoning | 2019 | 96,567 | low-90s | Saturated |

Tiers captured April 2026. They reflect typical frontier-model performance bands, not specific model+score pairings (which move week to week). For point-in-time scores, follow the source link on each row to the official leaderboard.
Knowledge Benchmarks
Knowledge benchmarks test breadth of factual understanding across academic domains. MMLU (2020) was the defining benchmark of the 2020-2023 era and is now saturated; frontier models cluster in the low 90s, making it useless for current comparisons. MMLU-Pro addresses saturation by expanding each question from four answer choices to ten and by shifting toward reasoning-heavy questions, which are typically evaluated with chain-of-thought prompting.
Humanity's Last Exam (HLE; Center for AI Safety and Scale AI, 2025) is the current frontier challenge: 3,000 questions by domain experts, designed to remain hard for several years. Current SOTA sits below 20%, leaving substantial headroom. MMMU adds multimodal understanding; its questions require image interpretation alongside text. For 2026 comparisons, the recommended stack is MMLU-Pro for text knowledge, MMMU for multimodal knowledge, and HLE for frontier limits.
Coding Benchmarks
Code benchmarks evolved rapidly from 2021 to 2026. HumanEval (OpenAI, 2021), with 164 Python function-completion tasks, was the standard for years. It is now saturated; top models score in the high 90s on pass@1. MBPP (Google, 2021), with 974 basic Python problems, is similarly exhausted. Neither benchmark belongs in a serious 2026 comparison.
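For readers unfamiliar with the metric, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per task, count the c samples that pass the unit tests, and estimate the chance that at least one of k randomly drawn samples passes. The sketch below is a minimal illustration of that formula; the function names and the sample counts in the usage note are illustrative assumptions, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples generated, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds one (n, c) pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n per task, so a benchmark-level pass@1 is simply the mean fraction of passing samples. For example, `benchmark_pass_at_k([(10, 10), (10, 5)], k=1)` averages a fully solved task with a half-solved one.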
LiveCodeBench (2024) addresses contamination by drawing problems from LeetCode, AtCoder, and Codeforces and tagging each with its release date, so a model can be scored only on problems published after its training cutoff. That makes contamination structurally unlikely rather than just hopefully absent. For 2026 code comparisons, use LiveCodeBench and SWE-bench Verified. Frontier models score in the high 90s on HumanEval but only in the low-to-mid 70s on SWE-bench Verified, a gap that shows how different function completion is from repository-level software engineering.
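The date-based contamination control can be sketched in a few lines. This is an illustration of the idea only: the `Problem` record, its field names, and the `contamination_free` helper are hypothetical and do not reflect LiveCodeBench's actual schema or tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    """Hypothetical problem record; field names are assumptions."""
    problem_id: str
    source: str      # e.g. "leetcode" or "atcoder"
    released: date   # date the problem was first published


def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up problems, purely for illustration.
pool = [
    Problem("p1", "leetcode", date(2025, 3, 1)),
    Problem("p2", "atcoder", date(2025, 9, 15)),
    Problem("p3", "leetcode", date(2026, 1, 10)),
]
eligible = contamination_free(pool, training_cutoff=date(2025, 6, 30))
```

Because each model has its own cutoff, two models may be scored on different problem subsets; comparing them fairly means restricting both to problems released after the later of the two cutoffs.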
Reasoning Benchmarks
Reasoning is the benchmark frontier in 2026. GPQA (Rein et al., 2023) tests graduate-level scientific reasoning with "Google-proof" questions: they are written by domain experts in biology, physics, and chemistry and designed so that skilled non-experts struggle to answer them even with unrestricted web access. GPQA-Diamond, the 198-question hardest subset, is the canonical reference. Frontier models cluster in the high 70s as of April 2026.
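A small question count limits how finely a benchmark can separate models that cluster together. As a rough sketch, treating each question as an independent pass/fail trial gives a binomial standard error of about 3 percentage points for a model scoring 78% on 198 questions. The helper names below are illustrative, and the independence assumption is a simplification (both models answer the same questions, so a paired test would be more precise).

```python
from math import sqrt


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_is_distinguishable(acc_a: float, acc_b: float, n_questions: int,
                           z: float = 1.96) -> bool:
    """Rough check: does the gap exceed z standard errors of the difference?

    Assumes the two scores are independent samples, which is only an
    approximation when both models are run on the same question set.
    """
    se_diff = sqrt(accuracy_standard_error(acc_a, n_questions) ** 2
                   + accuracy_standard_error(acc_b, n_questions) ** 2)
    return abs(acc_a - acc_b) > z * se_diff
```

Under these assumptions, a 3-point gap between two models in the high 70s on a 198-question set is not distinguishable from sampling noise, which is one reason to read clustered scores as a band rather than a ranking.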
ARC-AGI-2 (Chollet, 2025) is the successor to the original Abstraction and Reasoning Corpus. ARC-AGI was cracked in late 2024 when reasoning-mode frontier models exceeded the human baseline. ARC-AGI-2 introduces significantly harder visual abstraction tasks that resist the techniques that broke the original. Current SOTA sits in the mid 60s. BIG-Bench Hard is approaching saturation.
Agentic Benchmarks
Agentic benchmarks measure multi-step task completion with tool use, a fundamentally different capability from question answering. WebArena (2023) provides 812 tasks across real web applications running in sandboxed browser environments. The agent must navigate, fill forms, search, and complete tasks without human assistance. Current SOTA sits in the high 40s, with substantial room for improvement and a real performance gap between frontier models.
AgentBench (2023) covers 8 agent environments including OS tasks, game playing, database management, and web browsing, totalling 1,091 tasks. OSWorld (2024) focuses on computer-use agents that control desktop applications. SWE-bench Verified (listed in Coding above but fundamentally agentic) is the most widely cited agentic benchmark.
Multimodal Benchmarks
Multimodal benchmarks have grown in importance as frontier models gained vision capability. MMMU (Yue et al., 2024) is the primary knowledge benchmark for multimodal models: college-exam questions that require understanding images alongside text, across 30 subjects. MathVista (2024) tests mathematical reasoning about visual content (geometry diagrams, data charts, scientific plots). ChartQA and DocVQA cover document understanding.
MMMU has become a standard inclusion in frontier model cards in 2026 because it captures a capability dimension that text-only benchmarks entirely miss. Top scores cluster in the high 70s to low 80s, with the gap between models narrowing.
Human Preference Benchmarks
LMSYS Chatbot Arena uses blind pairwise human voting to fit a Bradley-Terry model, with the resulting strengths reported as Elo-style ratings. Over 2 million votes as of April 2026 make it the most statistically robust human-preference signal available. The top three frontier models sit in a tight band (from the 1380s into the low 1400s) where many query-category gaps fall within statistical noise.
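To make the ranking method concrete, here is a minimal sketch of fitting a Bradley-Terry model to pairwise votes and mapping the fitted strengths onto an Elo-style scale. It is not the Arena's actual pipeline: it ignores ties and confidence intervals, uses a simple fixed-count iterative update, assumes every model has at least one win, and the function names, the anchor of 1000, and the example votes are all assumptions made for illustration.

```python
from collections import defaultdict
from math import log10


def fit_bradley_terry(votes: list[tuple[str, str]], iterations: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Uses the classic iterative update: strength = wins / sum over opponents
    of games_played / (own strength + opponent strength).
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # games played per unordered pair of models
    for winner, loser in votes:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
    models = sorted({m for vote in votes for m in vote})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                games.get(frozenset((m, other)), 0.0) / (strength[m] + strength[other])
                for other in models if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        # Rescale so the strengths average to 1; only ratios matter.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength: dict[str, float], anchor: float = 1000.0) -> dict[str, float]:
    """Map strengths to ratings where a 400-point gap means 10:1 odds."""
    mean_log = sum(log10(s) for s in strength.values()) / len(strength)
    return {m: anchor + 400.0 * (log10(s) - mean_log) for m, s in strength.items()}


# Made-up votes between three hypothetical models.
example_votes = (
    [("A", "B")] * 6 + [("B", "A")] * 4
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
ratings = to_elo_scale(fit_bradley_terry(example_votes))
```

On this scale only rating differences carry meaning: a gap of d points corresponds to an expected win probability of 1 / (1 + 10^(-d / 400)) for the higher-rated model, so gaps of a few points imply near coin-flip preferences.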
Chatbot Arena measures what humans prefer in open-ended conversation, a different dimension from correctness, coding ability, or agentic capability. A model can rank highly in Arena while performing only modestly on SWE-bench, and vice versa. Both are valid measures of different capabilities.
Saturated · Historical Reference
Benchmarks here are still cited in 2026 but no longer discriminate frontier models. They remain useful as historical context, not as ranking criteria.
Reader Questions
Q.01 Which benchmark should I trust for frontier model comparisons in 2026?
Q.02 How often do benchmarks get updated?
Q.03 What does saturated mean for a benchmark?
Q.04 Why don't developers publish every benchmark for every frontier model?
Sources
- [1] HuggingFace Open LLM Leaderboard v2 · huggingface.co · Captured April 2026
- [2] Papers With Code MMLU · paperswithcode.com · Captured April 2026
- [3] SWE-bench Official Leaderboard · swebench.com · Captured April 2026
- [4] ARC Prize / ARC-AGI-2 · arcprize.org · Captured April 2026
- [5] LMSYS Chatbot Arena · lmsys.org · Captured April 2026
- [6] LiveCodeBench · livecodebench.github.io · Captured April 2026