[Chart: SOTA progression, 2023 through Q1 2026. Sources: Papers With Code, HuggingFace Open LLM Leaderboard v2, official leaderboards. Captured April 2026.]
AI Benchmarking 2026 - Measure Models, Agents, and Your Own Evals
A reference for how AI is measured in 2026, and how to measure your own agents. Every benchmark score on this site carries a capture date, N-shot setting, CoT flag, and a link to the primary source. No vendor affiliation. No affiliate-gated pages.
Three Layers of AI Measurement
Public Model Benchmarks
The leaderboards everyone quotes: MMLU, GPQA-Diamond, HumanEval, ARC-AGI. We cover 17 benchmarks with current 2026 SOTA scores, saturation dates, and contamination notes.
Agent Benchmarks
The newer, less-settled category. SWE-bench Verified, WebArena, AgentBench, Terminal-Bench, OSWorld, Tau-Bench. Scores are still moving fast and the methodology itself is disputed.
Your Own Evals
How teams actually measure their workflows. Braintrust, Langfuse, LangSmith, Arize, Humanloop, HoneyHive, DeepEval. A neutral, honest comparison that no vendor can publish.
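Stripped of vendor branding, a custom eval is just a dataset, a task function, and one or more scorers run in a loop. The sketch below is a minimal, framework-agnostic version in Python; Example, DATASET, run_model, and score_exact are illustrative placeholders, not the API of any tool named above.

```python
# Minimal, framework-agnostic eval loop: dataset -> task -> scorer -> report.
# All names here (Example, DATASET, run_model, score_exact) are illustrative
# placeholders, not the API of any specific eval tool.
from dataclasses import dataclass

@dataclass
class Example:
    input: str        # prompt sent to the model
    expected: str     # reference answer used by the scorer

DATASET = [
    Example("What is the capital of France?", "Paris"),
    Example("2 + 2 = ?", "4"),
]

def run_model(prompt: str) -> str:
    """Placeholder: echoes the prompt. Replace with a real model client call."""
    return prompt

def score_exact(output: str, expected: str) -> float:
    """Simplest possible scorer: normalized exact match (0.0 or 1.0)."""
    return float(output.strip().lower() == expected.strip().lower())

def run_eval() -> float:
    scores = [score_exact(run_model(ex.input), ex.expected) for ex in DATASET]
    return sum(scores) / len(scores)   # mean score across the dataset

if __name__ == "__main__":
    print(f"accuracy: {run_eval():.2%}")
```

Every hosted tool in the comparison adds tracing, versioned datasets, and dashboards on top of this loop; the loop itself is the part you own.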
Frontier Snapshot - April 2026
Top 8 frontier models across 6 canonical benchmarks, with the best score in each column marked; each model name links to its full benchmark page.
| Model | MMLU-Pro | HumanEval | GPQA-Diamond | SWE-bench Verified | ARC-AGI-2 | Arena Elo |
|---|---|---|---|---|---|---|
| Claude 4.5 Opus | 85.2% | 97.4% | 76.3% | 74.5% | 61.2% | 1389 |
| GPT-5 | 86.1% | 98.1% | 78.4% | 71.3% | 65.8% | 1401 |
| Gemini 2.5 Pro | 83.7% | 96.8% | 74.1% | 68.9% | 59.4% | 1371 |
| Claude 4 Sonnet | 81.4% | 95.3% | 71.2% | 64.6% | 54.7% | 1342 |
| Llama 4 Maverick | 79.8% | 93.7% | 66.3% | 58.2% | 48.1% | 1318 |
| Grok 4 | 82.3% | 96.1% | 72.8% | 60.4% | 52.6% | 1355 |
| Mistral Large 3 | 76.4% | 91.2% | 61.7% | 49.3% | 41.2% | 1287 |
| DeepSeek V3 | 78.9% | 92.8% | 64.5% | 53.1% | 44.8% | 1301 |
Captured April 2026. Sources: vendor model cards, HuggingFace Open LLM Leaderboard v2, Papers With Code, LMSYS Chatbot Arena. HumanEval pass@1, 0-shot. MMLU-Pro 5-shot CoT. GPQA-Diamond 0-shot CoT. SWE-bench Verified end-to-end. ARC-AGI-2 official leaderboard. See /what-these-benchmarks-miss for contamination notes.
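The Arena Elo column aggregates pairwise human votes rather than graded answers. LMSYS's current pipeline fits a Bradley-Terry-style model, but the classic Elo update below illustrates the underlying idea; treat it as a simplified sketch of pairwise-preference aggregation, not the leaderboard's exact fitting procedure.

```python
# Simplified Elo update from pairwise votes (a sketch of the idea behind
# arena-style leaderboards, not LMSYS's exact fitting procedure).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one pairwise vote: the winner gains rating, the loser loses it."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: a 1401-rated model beats a 1389-rated model in one human vote.
print(update(1401.0, 1389.0, a_won=True))
```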
Benchmark Cards
MMLU-Pro
Multi-domain knowledge, 10-option multiple choice, CoT required. Successor to the saturated MMLU.
SWE-bench Verified
500 real GitHub issues. Can the agent write the patch that makes tests pass?
HumanEval
164 Python functions. Mostly saturated - frontier models score >97%. Use LiveCodeBench instead.
GPQA-Diamond
198 PhD-level science questions. Graduate-level, Google-proof. Approaching saturation.
ARC-AGI-2
Visual abstraction puzzles, harder successor to original ARC-AGI. Real headroom.
WebArena
Web navigation agent tasks. Real browser environments, 812 tasks.
Chatbot Arena
2M+ human pairwise votes. Crowdsourced human preference, not capability.
Humanity's Last Exam
3,000 expert-written questions. Designed to remain hard for years.
BIG-Bench Hard
23 hard reasoning tasks that resisted earlier models. Now approaching saturation.
What Most Listicles Miss
The benchmark landscape in 2026 has three systemic problems that most coverage never mentions. First, contamination: MMLU test questions have been found verbatim in Common Crawl; HumanEval problems are near-duplicates of LeetCode solutions that appear in pre-training data. A 94% MMLU score might reflect memorisation as much as reasoning.
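Contamination audits usually start with something as blunt as n-gram overlap between benchmark items and a pre-training corpus. The sketch below flags a question if it shares any 8-gram span with a training document; it is a toy illustration of the idea, not the method used by any specific audit.

```python
# Toy contamination check: flag a benchmark question if it shares any 8-gram
# with a training document. Real audits use web-scale corpora, dedup pipelines,
# and fuzzier matching; this only illustrates the idea.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, training_docs: list[str], n: int = 8) -> bool:
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)
```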
Second, saturation: MMLU, HumanEval, and MBPP no longer discriminate frontier models. The field has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam - but most comparison sites still quote the saturated versions because the numbers are larger and more familiar.
Third, methodology opacity: "best-of-16 with CoT and tool use" is not comparable to "greedy 0-shot." Scores without methodology footnotes are unfalsifiable claims. Every table on this site documents the evaluation setup.
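One concrete example of why the footnotes matter: "pass@1" can mean a single greedy sample or the unbiased estimator over n samples introduced in the HumanEval paper (Chen et al., 2021), and the two are not interchangeable. A sketch of the estimator:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: samples drawn per problem, c: samples that passed the tests, k: budget.
    Returns the probability that at least one of k samples passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# The same model looks very different under different reporting choices:
# 6 passing samples out of 16 gives pass@1 = 0.375 but pass@16 = 1.0.
print(pass_at_k(16, 6, 1), pass_at_k(16, 6, 16))
```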
Eval Tools - Quick View
| Tool | Type | Best for | Free tier |
|---|---|---|---|
| Braintrust | Cloud | CI integration, developer experience | Yes |
| Langfuse | OSS + Cloud | Self-hosting, cost-conscious teams | Generous |
| LangSmith | Cloud | LangChain users | Limited |
| Arize Phoenix | OSS + Cloud | Production monitoring, tracing | Yes (OSS) |
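Whatever tool you pick, the common pattern behind "CI integration" is the same: run a fixed eval set on every commit and fail the build if a metric regresses below a threshold. A minimal, vendor-neutral sketch in pytest style; the my_evals module and the 0.85 floor are hypothetical, not any product's API.

```python
# Vendor-neutral CI gate: fail the build when eval accuracy drops below a
# floor. run_eval() is the framework-agnostic loop sketched earlier, imported
# from a hypothetical my_evals module; the 0.85 floor is illustrative.

ACCURACY_FLOOR = 0.85

def test_eval_accuracy_does_not_regress():
    from my_evals import run_eval   # hypothetical wrapper around your eval loop
    accuracy = run_eval()
    assert accuracy >= ACCURACY_FLOOR, (
        f"eval accuracy {accuracy:.2%} fell below the {ACCURACY_FLOOR:.0%} floor"
    )
```

Run it with pytest in CI; the hosted tools add history, diffing, and dashboards around the same assertion.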