Agent Benchmark Leaderboard 2026: AgentBench, SWE-bench, GAIA
17 benchmarks. 8 eval tools. Every score dated, sourced, and annotated. No vendor capture.
Independent reference for ML engineers choosing, defending, and shipping models. Covers public leaderboards (MMLU-Pro, GPQA-Diamond, ARC-AGI-2), agent benchmarks (SWE-bench Verified, WebArena, Tau-Bench, OSWorld, Terminal-Bench), and the practitioner stack. Every number carries capture date, N-shot setting, and source.
Which model wins for YOUR use case?
Pick what you're actually trying to build. We'll show the benchmark that matters, the May 2026 leaderboard, and the citation for every number.
Every entry above is verified against the cited primary source and re-verified monthly; the last sweep was 25 May 2026. Disagreement with vendor marketing is expected and is a feature.
Three Layers of AI Measurement
Most listicles cover the first layer in isolation. We thread all three: public benchmarks, agent benchmarks, and your own eval pipeline.
Public Model Benchmarks
The leaderboards everyone quotes. We cover 17 benchmarks across knowledge, coding, reasoning, and multimodal, with current 2026 SOTA, saturation status, and contamination notes per row.
Agent Benchmarks
The newer, less-settled frontier. Multi-step task completion in real environments. Methodology is disputed and scores are still moving fast. We document the strengths and gaming vectors of each.
Your Own Evals
How teams actually measure the workflow they ship. Custom golden datasets, LLM-as-judge, online monitoring, regression tracking. Honest comparison of the practitioner stack, no vendor capture.
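A minimal sketch of regression tracking against a custom golden dataset, assuming exact-match grading and a 2-point tolerance (both are illustrative choices, not something this site prescribes). `GoldenCase`, `pass_rate`, `has_regressed`, and `toy_model` are hypothetical names; in practice `toy_model` would be replaced by a call to the model under test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str


def pass_rate(cases: list[GoldenCase], model: Callable[[str], str]) -> float:
    """Fraction of golden cases where the model output matches exactly."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(model(c.prompt).strip() == c.expected for c in cases)
    return passed / len(cases)


def has_regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True when the candidate drops more than `tolerance` below the baseline."""
    return candidate < baseline - tolerance


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


golden = [
    GoldenCase("2+2", "4"),
    GoldenCase("capital of France", "Paris"),
    GoldenCase("largest planet", "Jupiter"),
]

rate = pass_rate(golden, toy_model)
print(f"pass rate: {rate:.2f}")
print("regressed:", has_regressed(baseline=0.90, candidate=rate))
```

Exact match is the simplest grader. An LLM-as-judge setup would swap the equality check for a scoring call, but the baseline-versus-candidate comparison would stay the same shape.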
Coverage Map
We index 17 benchmarks across six categories. Each entry carries category, status (active, saturated, deprecated), source paper, official leaderboard, and a capture-dated SOTA snapshot. Click any category to read the full reference.
Open the full reference →
MMLU, MMLU-Pro, MMMU, HLE
HumanEval, MBPP, LiveCodeBench
GPQA, ARC-AGI, BIG-Bench Hard
SWE-bench, WebArena, Tau-Bench
MMMU, MathVista, ChartQA
Chatbot Arena, MT-Bench
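The fields each entry carries could be modelled roughly as below. This is a sketch, not the site's actual schema: the class and field names are assumptions, the paper and leaderboard values are placeholders, and the scores are dummy values rather than real results. Only the status values (active, saturated, deprecated) come from the description above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    SATURATED = "saturated"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class SotaSnapshot:
    score: float
    captured_on: date  # a snapshot without a capture date is not usable


@dataclass(frozen=True)
class BenchmarkEntry:
    name: str
    category: str
    status: Status
    source_paper: str
    leaderboard_url: str
    sota: SotaSnapshot


def still_discriminating(entries: list[BenchmarkEntry]) -> list[str]:
    """Names of benchmarks that are neither saturated nor deprecated."""
    return [e.name for e in entries if e.status is Status.ACTIVE]


# Placeholder values throughout; the scores are NOT real results.
placeholder = SotaSnapshot(score=0.0, captured_on=date(2026, 5, 25))
entries = [
    BenchmarkEntry("MMLU", "knowledge", Status.SATURATED,
                   "<source paper>", "https://example.com/mmlu", placeholder),
    BenchmarkEntry("MMLU-Pro", "knowledge", Status.ACTIVE,
                   "<source paper>", "https://example.com/mmlu-pro", placeholder),
]

print(still_discriminating(entries))
```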
The 2026 Frontier, In Tiers
Ranks change weekly; absolute scores change quarterly. We summarise tiers and trends rather than freezing a leaderboard that will be wrong by next month.
Per-Benchmark Deep Dives
Domain-Specific Benchmarks
Eval Methodology and Tools
Per-Framework and Per-Workload
Read Every Score Like a Reviewer
A six-question rubric we apply to every benchmark cell on this site. Borrow it for your own reading.
- 01
What is the capture date?
Frontier benchmarks move weekly. A 2024 score is historical, not current. Every cell here is dated.
- 02
What N-shot, what CoT?
0-shot, 5-shot, and CoT-required runs are different tests. Quote the setting alongside the score, or do not quote the score at all.
- 03
Which test set version?
MMLU vs MMLU-Pro, SWE-bench vs Lite vs Verified, GPQA vs Diamond. The names are similar; the tests are not.
- 04
Vendor card or third party?
First-party model cards optimise for their own model. Third-party aggregators such as Hugging Face and Papers With Code are independent. Both kinds of source have failure modes.
- 05
Public test set?
If yes, ask about contamination. MMLU questions appear in Common Crawl. HumanEval problems echo LeetCode. Verified subsets reduce the issue but do not eliminate it.
- 06
What harness, what tools?
Agentic scores depend on the harness. Best-of-16 with extended tools is not comparable to greedy single-shot.
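One way to apply the rubric mechanically is to attach the evaluation setting to every score and refuse to compare two cells whose settings differ. The sketch below assumes hypothetical names (`ScoreSetting`, `ScoreCell`, `setting_mismatches`), hypothetical models and harness, and made-up scores; it illustrates the idea rather than any real leaderboard's data model.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class ScoreSetting:
    benchmark_version: str  # e.g. "SWE-bench Verified", not just "SWE-bench"
    n_shot: int
    chain_of_thought: bool
    samples: int            # 1 = single attempt, 16 = best-of-16
    harness: str


@dataclass(frozen=True)
class ScoreCell:
    model: str
    score: float
    captured_on: date
    source: str             # "vendor" or "third_party"
    setting: ScoreSetting


def setting_mismatches(a: ScoreCell, b: ScoreCell) -> list[str]:
    """Names of the setting fields on which two score cells disagree."""
    return [
        f.name
        for f in fields(ScoreSetting)
        if getattr(a.setting, f.name) != getattr(b.setting, f.name)
    ]


# Hypothetical models, harness, and scores, for illustration only.
greedy = ScoreCell(
    "model-a", 0.50, date(2026, 5, 1), "third_party",
    ScoreSetting("SWE-bench Verified", 0, False, 1, "harness-x"),
)
best_of_16 = ScoreCell(
    "model-b", 0.60, date(2026, 5, 1), "vendor",
    ScoreSetting("SWE-bench Verified", 0, True, 16, "harness-x"),
)

diff = setting_mismatches(greedy, best_of_16)
print("comparable" if not diff else f"not comparable, differs on: {diff}")
```

An empty list means the two cells were produced under the same setting. Anything else names exactly what a reviewer should ask about before reading the score gap as a model difference.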
The Practitioner Stack
Eight evaluation platforms reviewed independently. Open source, cloud, hybrid. Honest where the open-source option is good enough, honest where it is not. No vendor wrote this comparison.
Read the full review →
What Most Listicles Miss
The benchmark landscape in 2026 has three systemic problems that most coverage ignores. First, contamination. MMLU test questions appear verbatim in Common Crawl. HumanEval problems have near-duplicate LeetCode solutions in pre-training data. A 94% score on a saturated benchmark might reflect memorisation as much as reasoning, and there is no clean way to tell from the leaderboard.
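A toy illustration of how verbatim contamination can be detected, assuming whitespace tokenisation and an 8-token window (both arbitrary choices for this sketch). The function names and example strings are hypothetical. Real decontamination pipelines work against indexed corpora at a much larger scale, and a check like this cannot catch paraphrased leaks.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the test item's n-grams that also occur in the corpus document."""
    item = ngrams(test_item, n)
    if not item:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item & ngrams(corpus_doc, n)) / len(item)


# Hypothetical test question and corpus documents.
question = "Which planet in the solar system has the largest mass of all"
leaked = ("forum dump: which planet in the solar system has the "
          "largest mass of all answer jupiter")
clean = "gas giants are much heavier than rocky planets in most planetary systems"

print(overlap_fraction(question, leaked))  # verbatim copy: every n-gram matches
print(overlap_fraction(question, clean))   # no shared 8-token window
```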
Second, saturation. MMLU, HumanEval, and MBPP no longer discriminate frontier models. The field has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam. Most comparison sites still quote the saturated versions because the numbers are larger and more familiar to their readers.
Third, methodology opacity. “Best-of-16 with chain-of-thought and tool use” is not comparable to “greedy zero-shot.” Scores published without methodology are unfalsifiable claims. Every table on this site documents the evaluation setup so the comparison stays honest.
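A back-of-the-envelope calculation shows why the sampling budget matters. The sketch assumes attempts are independent and that the harness can always pick out a successful attempt when one exists. Both assumptions are idealised, so the result is an upper bound on what best-of-k can buy, and the 30% single-attempt rate is a made-up figure.

```python
def best_of_k_success(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


# Same hypothetical model, same per-attempt success rate, different budgets.
for k in (1, 4, 16):
    print(f"best-of-{k}: {best_of_k_success(0.30, k):.3f}")
```

Under these assumptions a model that solves 30% of tasks in a single attempt clears more than 99% of them at best-of-16, which is why a headline score without its sampling budget cannot be compared with a greedy one.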