Lead Article | Last verified April 2026 | 42 sources | 17 benchmarks indexed

Agent Benchmark Leaderboard 2026: AgentBench, SWE-bench, GAIA

Q: Which benchmarks matter in 2026?

For frontier model comparisons in 2026: MMLU-Pro (not plain MMLU, which is saturated), GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam for reasoning. For coding: LiveCodeBench and SWE-bench Verified. For agentic capability: SWE-bench Verified, WebArena, and Terminal-Bench. For human preference: LMSYS Chatbot Arena.

17 benchmarks. 8 eval tools. Every score dated, sourced, and annotated. No vendor capture.

Independent reference for ML engineers choosing, defending, and shipping models. Covers public leaderboards (MMLU-Pro, GPQA-Diamond, ARC-AGI-2), agent benchmarks (SWE-bench Verified, WebArena, Tau-Bench, OSWorld, Terminal-Bench), and the practitioner stack. Every number carries capture date, N-shot setting, and source.


Task → benchmark → leader

Which model wins for YOUR use case?

Pick what you're actually trying to build. We'll show the benchmark that matters, the May 2026 leaderboard, and the citation for every number.


Every entry above is verified against the cited primary source. Re-verified monthly — last sweep 25 May 2026. Disagreement with vendor marketing is expected and a feature.

Section I

Three Layers of AI Measurement

Most listicles cover the first layer in isolation. We thread all three: public benchmarks, agent benchmarks, and your own eval pipeline.

Movement I · Knowledge / Coding / Reasoning

Public Model Benchmarks

The leaderboards everyone quotes. We cover 17 benchmarks across knowledge, coding, reasoning, and multimodal, with current 2026 SOTA, saturation status, and contamination notes per row.

MMLU-Pro · GPQA-Diamond · ARC-AGI-2 · HumanEval · LiveCodeBench


Movement II · Agentic

Agent Benchmarks

The newer, less-settled frontier. Multi-step task completion in real environments. Methodology is disputed and scores are still moving fast. We document the strengths and gaming vectors of each.

SWE-bench Verified · WebArena · Tau-Bench · OSWorld · Terminal-Bench


Movement III · Practitioner

Your Own Evals

How teams actually measure the workflow they ship. Custom golden datasets, LLM-as-judge, online monitoring, regression tracking. Honest comparison of the practitioner stack, no vendor capture.

Braintrust · Langfuse · LangSmith · Arize Phoenix · DeepEval
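
To make the custom golden dataset and regression tracking ideas concrete, here is a minimal sketch in plain Python. It does not use the API of any of the tools named above. The dataset contents, the `stub_model` stand-in, and the 0.02 tolerance are all hypothetical choices for illustration; a real pipeline would call an actual model and would usually need a softer scorer than exact match.

```python
from typing import Callable

# Hypothetical golden dataset: (input, expected output) pairs for one workflow.
GOLDEN = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("HTTP status for Not Found", "404"),
]


def run_eval(model: Callable[[str], str], dataset=GOLDEN) -> float:
    """Exact-match accuracy of `model` on the golden dataset, from 0.0 to 1.0."""
    hits = sum(model(prompt).strip() == expected for prompt, expected in dataset)
    return hits / len(dataset)


def regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the current score is below the baseline by more than the tolerance."""
    return current < baseline - tolerance


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; answers two of the three items correctly."""
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "unknown")
```

Calling `regressed(run_eval(stub_model), baseline=1.0)` flags a regression here, because the stub gets only two of three items right. In practice the baseline would be the score stored from the last accepted release.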


Section II

Coverage Map

We index 17 benchmarks across six categories. Each entry carries category, status (active, saturated, deprecated), source paper, official leaderboard, and a capture-dated SOTA snapshot. Click any category to read the full reference.


Knowledge (05): MMLU, MMLU-Pro, MMMU, HLE

Coding (04): HumanEval, MBPP, LiveCodeBench

Reasoning (05): GPQA, ARC-AGI, BIG-Bench Hard

Agentic (06): SWE-bench, WebArena, Tau-Bench

Multimodal (04): MMMU, MathVista, ChartQA

Preference (03): Chatbot Arena, MT-Bench

Section III

The 2026 Frontier, In Tiers

Ranks change weekly; absolute scores change quarterly. We summarise tiers and trends rather than freezing a leaderboard that will be wrong by next month.

Tier | Strength Profile | Trend
Frontier Tier A | Reasoning + Coding + Agentic | ↑ rising
Frontier Tier B | Reasoning + Coding | ↑ rising
Frontier Tier C | Knowledge + Multimodal | → stable
Open-weight Tier A | Coding + Reasoning | ↑ rising
Open-weight Tier B | General | → stable

Captured April 2026. Tiers reflect aggregate performance across MMLU-Pro, GPQA-Diamond, SWE-bench Verified, ARC-AGI-2, and Chatbot Arena. Specific model rankings change frequently; refer to the per-benchmark pages and primary leaderboards for current numbers.


Expanded Reference

Per-Benchmark Deep Dives

AgentBench · WebArena · GAIA · Tau-Bench · Tau-Bench Retail and Airline · OSWorld · Terminal-Bench · MMLU-Pro · LiveCodeBench · Chatbot Arena · Humanity's Last Exam · MATH · BIG-Bench Hard · AppWorld · AssistantBench · AndroidWorld · Mind2Web · BFCL v3 (Function-Calling) · MMMU

Industry Domain

Domain-Specific Benchmarks

LegalBench (162 legal-reasoning tasks) · MedQA + MultiMedQA (medical) · RepoBench (multi-file code)

Methodology and Tooling

Eval Methodology and Tools

pass@1 vs pass@k vs pass^k · Prompt template variance · Why benchmark scores fail to reproduce · Safety eval suites · Inspect (UK AISI) · OpenAI Evals · Stanford HELM · BrowserGym · Agent eval cost calculator · Frontier models on benchmarks (table)

Frameworks & Selection Guides

Per-Framework and Per-Workload

LangGraph Benchmarks · AutoGen / AG2 Benchmarks · CrewAI Benchmarks · OpenAI Agents SDK · DSPy Benchmarks · Coding-Agent Benchmarks Compared · Browser-Agent Benchmarks Compared · RAG Benchmarks Compared · Tool-Use Benchmarks Compared · Reasoning Benchmarks Compared · Cost Per Eval Reference · Benchmark Contamination

Section IV · Editorial Method

Read Every Score Like a Reviewer

A six-question rubric we apply to every benchmark cell on this site. Borrow it for your own reading.

01
What is the capture date?
Frontier benchmarks move weekly. A 2024 score is historical, not current. Every cell here is dated.
02
What N-shot, what CoT?
0-shot, 5-shot, and CoT-required runs are different tests. Quote the setting alongside the score or do not quote it.
03
Which test set version?
MMLU vs MMLU-Pro, SWE-bench vs Lite vs Verified, GPQA vs Diamond. The names are similar; the tests are not.
04
Vendor card or third party?
First-party model cards optimise for their own model. HuggingFace and Papers With Code are independent. Both have failure modes.
05
Public test set?
If yes, ask about contamination. MMLU questions appear in Common Crawl. HumanEval problems echo LeetCode. Verified subsets reduce the issue but do not eliminate it.
06
What harness, what tools?
Agentic scores depend on the harness. Best-of-16 with extended tools is not comparable to greedy single-shot.
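
The six questions above can be turned into a mechanical check before two scores are placed side by side. The sketch below is one possible encoding, not the schema this site uses: the `ScoreRecord` fields, the `comparability_issues` helper, and the 90-day gap threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScoreRecord:
    """One benchmark cell plus the metadata the rubric asks for (hypothetical schema)."""
    model: str
    benchmark: str                 # question 03: exact test set version
    score: float                   # percentage, 0-100
    captured: date                 # question 01
    n_shot: int                    # question 02
    chain_of_thought: bool         # question 02
    source: str                    # question 04: "vendor" or "third_party"
    public_test_set: bool          # question 05
    harness: Optional[str] = None  # question 06
    samples_per_task: int = 1      # question 06: 1 means greedy single-shot


def comparability_issues(a: ScoreRecord, b: ScoreRecord,
                         max_gap_days: int = 90) -> list[str]:
    """Return the reasons two scores should not be compared directly."""
    issues = []
    if a.benchmark != b.benchmark:
        issues.append("different test set version")
    if (a.n_shot, a.chain_of_thought) != (b.n_shot, b.chain_of_thought):
        issues.append("different N-shot or CoT setting")
    if a.harness != b.harness or a.samples_per_task != b.samples_per_task:
        issues.append("different harness or sampling budget")
    if a.source != b.source:
        issues.append("vendor card vs third party")
    if abs((a.captured - b.captured).days) > max_gap_days:
        issues.append("capture dates far apart")
    return issues
```

An empty list does not prove two scores are comparable; it only means none of the recorded fields disagree. Contamination (question 05) is carried as a flag rather than checked, because it cannot be settled from metadata alone.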


Section V

The Practitioner Stack

Eight evaluation platforms reviewed independently. Open source, cloud, hybrid. Honest where the open-source option is good enough, honest where it is not. No vendor wrote this comparison.


Tool | Type | Best For | Free Tier
Braintrust | Cloud | CI integration, developer workflow | Yes
Langfuse | OSS + Cloud | Self-hosting, cost-conscious teams | Generous
LangSmith | Cloud | LangChain ecosystem | Limited
Arize Phoenix | OSS + Cloud | Production tracing and monitoring | OSS, free

Section VI · Editorial

What Most Listicles Miss

The benchmark landscape in 2026 has three systemic problems that most coverage ignores. First, contamination. MMLU test questions appear verbatim in Common Crawl. HumanEval problems closely resemble LeetCode problems whose solutions sit in pre-training data. A 94% score on a saturated benchmark might reflect memorisation as much as reasoning, and the leaderboard alone gives no clean way to tell.
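
One common way to probe for verbatim contamination is word n-gram overlap between a test item and a candidate training document. The sketch below is a simplified illustration of that idea, not the procedure used by any particular lab or by this site. The window size of 8 words and the function names are assumptions; real decontamination pipelines differ in tokenisation, normalisation, and thresholds.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text, as a set."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A ratio near 1.0 suggests the item appears verbatim in the document. A low ratio rules out only exact copies: paraphrased or translated leaks pass this check, which is part of why contamination is hard to read off a leaderboard.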

Second, saturation. MMLU, HumanEval, and MBPP no longer discriminate frontier models. The field has moved to MMLU-Pro, GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam. Most comparison sites still quote the saturated versions because the numbers are larger and more familiar to their readers.

Third, methodology opacity. “Best-of-16 with chain-of-thought and tool use” is not comparable to “greedy zero-shot.” Scores published without methodology are unfalsifiable claims. Every table on this site documents the evaluation setup so the comparison stays honest.
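
To see how far sampling budget alone can move a number, consider the standard unbiased pass@k estimator introduced with HumanEval: draw n samples per task, count c correct, and estimate the chance that at least one of k samples passes. The sketch below assumes "best-of-16" is scored as success when any of the 16 samples passes; harnesses that pick one answer with a selector or verifier will land somewhere lower.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per task, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a task where 4 of 16 samples are correct, `pass_at_k(16, 4, 1)` is 0.25 while `pass_at_k(16, 4, 16)` is 1.0. The model is the same and the task is the same; only the reporting convention changed.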


Section VII

Reader Questions

Q.01 What is AI benchmarking?

AI benchmarking is the practice of measuring model or agent performance on standardised test sets to enable objective comparison. Benchmarks range from knowledge tests (MMLU, GPQA) to coding tasks (HumanEval, SWE-bench) to agentic challenges (WebArena, OSWorld). Every benchmark score should be read as a claim with a specific methodology, not a universal fact.

Q.02 Which benchmarks matter in 2026?

For frontier model comparisons: MMLU-Pro (not plain MMLU, which is saturated), GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam. For coding: LiveCodeBench and SWE-bench Verified. For agentic capability: SWE-bench Verified, WebArena, and Terminal-Bench. For human preference: LMSYS Chatbot Arena.

Q.03 Are benchmark scores reliable?

Benchmark scores are useful but require critical reading. Key questions: When was this captured? What N-shot and CoT settings? Is the score from the official leaderboard or a vendor model card? Is the test set public (contamination risk)? MMLU, HumanEval, and MBPP have documented training-data overlap concerns.

Q.04 What is the difference between an eval and a benchmark?

A benchmark is a standardised public test set used to compare models across the field. An eval is any measurement of model quality. It may use a public benchmark, a custom golden dataset, LLM-as-judge scoring, or human annotation. Public benchmarks are one kind of eval; custom evals, built for specific workflows, are another.

Q.05 Should I trust vendor-published benchmark scores?

Treat vendor-published scores as a starting point, not a final answer. Model cards are produced by the same company that built the model. Methodology details are often omitted or buried. Independent replications on Papers With Code or HuggingFace's Open LLM Leaderboard v2 are more reliable, though not immune to issues.


#1 Claude Opus 4.7	79.4%
#2 GPT-5	74.9%
#3 Claude Sonnet 4.6	72.5%
#4 Gemini 2.5 Pro	63.8%
#5 Claude Sonnet 3.7	49%
