Abstract

Headline three: GPQA-Diamond (graduate science), ARC-AGI-2 (pattern), HLE (frontier breadth)

Math-specific: MATH (broad) and AIME (hard)

Multi-task sanity: BIG-Bench Hard (saturating)

For scores: each benchmark's official leaderboard; this page does not reprint per-model numbers

Section II.xvi · Benchmark Comparison | Reviewed 2026

Reasoning Benchmarks: The 2026 Selection Guide

Reasoning is several different capabilities: graduate-domain scientific reasoning, abstract pattern reasoning, frontier-difficulty knowledge-and-reasoning, mathematical reasoning. Different benchmarks measure different slices. GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam are the three headline benchmarks; quote two together for serious frontier claims.

Reasoning is several capabilities

Like tool-use and RAG, reasoning is not a single capability but a cluster of related ones. Graduate-domain scientific reasoning (answering the kind of question a PhD physicist or biologist would work on) is different from abstract pattern reasoning (recognising structural similarity across novel examples). Both are different from mathematical reasoning (multi-step computation and proof). Different benchmarks measure different slices, and the right choice depends on which slice you want to evaluate.

The 2026 reasoning-benchmark landscape has settled around three headline benchmarks: GPQA-Diamond for graduate scientific reasoning, ARC-AGI-2 for abstract pattern reasoning, and Humanity's Last Exam for frontier-difficulty knowledge-and-reasoning across many domains. Math-specific benchmarks (MATH, AIME) add coverage of mathematical reasoning. BIG-Bench Hard remains a useful multi-task sanity check but is less central than it was in 2023.

The headline pattern: GPQA-Diamond is the closest of the three to saturation, but it remains the most discriminating test of graduate scientific reasoning. ARC-AGI-2 has the largest reported gap to human performance and is the cleanest test of pattern-generalisation capability. HLE is the broadest expert-difficulty test and has substantial headroom. Together, the three give a complete reasoning portfolio. For current per-model scores on any of them, read the benchmark's own leaderboard rather than a reprinted figure.

Benchmark-by-benchmark comparison

The full picture of reasoning benchmarks in 2026 spans the headline three plus several specialised options. The summary below lays out what each benchmark measures, its main limitation, and our recommendation.

| Benchmark | What it measures | Note | Recommend |
| --- | --- | --- | --- |
| GPQA-Diamond | 198 graduate-level Google-proof questions in physics, biology, chemistry | Three sciences only; saturating | Yes (graduate reasoning) |
| ARC-AGI-2 | Visual abstract-reasoning puzzles | Visual format limits applicability | Yes (pattern reasoning) |
| Humanity's Last Exam | ~2,500 expert-authored frontier questions across many domains | English-centric; small-N per domain | Yes (frontier knowledge) |
| MATH | 12,500 high-school + competition mathematics problems | Broadly saturated; narrow | Math-specific only |
| AIME | American Invitational Math Exam problems | Math-specific; small N | Math-specific only |
| BIG-Bench Hard | 23-task subset of BIG-Bench focused on hard reasoning | Saturating; less discriminating | Sanity check |
| GPQA (full) | Full ~448-question GPQA benchmark | Less discriminating than Diamond | Diamond subset preferred |

We deliberately do not reprint a current-frontier score column. Cross-paper benchmark numbers carry a several-point comparability margin and the official boards lag the current frontier, so any single reprinted percentage would be unreliable. Take each score from the benchmark's own published results under a disclosed harness, or use the per-task picker on the homepage.
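One way to make the comparability caveat concrete is to carry the source and harness alongside every score you record, and to refuse to interpret small gaps between numbers produced under different harnesses. The sketch below is illustrative only: the `ReportedScore` fields, the `comparable_gap` helper, and the 3-point default margin are assumptions chosen for the example, not values taken from any benchmark's documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReportedScore:
    """A published score plus the context needed to compare it (hypothetical schema)."""
    benchmark: str
    model: str
    accuracy_pct: float
    source: str   # where the number was published, e.g. the official leaderboard
    harness: str  # prompt format, sampling settings, tool access, and so on


def comparable_gap(a: ReportedScore, b: ReportedScore,
                   margin_pts: float = 3.0) -> Optional[float]:
    """Return the gap a - b in points, or None if the gap is not interpretable.

    margin_pts is an assumed placeholder for the "several-point" margin.
    """
    if a.benchmark != b.benchmark:
        raise ValueError("scores come from different benchmarks")
    gap = a.accuracy_pct - b.accuracy_pct
    # Same harness: the gap is directly comparable.
    # Different harness: only trust gaps larger than the margin.
    if a.harness == b.harness or abs(gap) > margin_pts:
        return gap
    return None
```

With this shape, a 2-point gap between two scores from different papers comes back as `None`, while the same gap measured under one disclosed harness is returned as a number.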

Use-case-by-use-case selection guide

The right benchmark depends on the reasoning slice. Graduate-domain reasoning, pattern recognition, frontier breadth, and math-specific reasoning are different evaluation targets. The table below maps common reasoning-evaluation use cases to the benchmarks that best answer them.

| Use case | Primary benchmark | Secondary |
| --- | --- | --- |
| Graduate-level scientific reasoning | GPQA-Diamond | HLE |
| Pattern recognition / generalisation | ARC-AGI-2 | BIG-Bench Hard |
| Frontier difficulty across many domains | Humanity's Last Exam | GPQA-Diamond |
| Mathematics specifically | AIME (hard) / MATH (broad) | GPQA-Diamond physics |
| Reasoning-model differentiation | HLE (largest gap) | ARC-AGI-2 |
| Multi-task reasoning sanity check | BIG-Bench Hard | MMLU-Pro |
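The use-case mapping above can be encoded as a small lookup for use in an evaluation script. This is a minimal sketch: the `PICKER` dictionary and `pick_benchmarks` function are hypothetical names invented here, and the entries simply restate the use-case rows.

```python
# Hypothetical lookup restating the use-case mapping: use case -> (primary, secondary).
PICKER: dict[str, tuple[str, str]] = {
    "graduate-level scientific reasoning": ("GPQA-Diamond", "HLE"),
    "pattern recognition / generalisation": ("ARC-AGI-2", "BIG-Bench Hard"),
    "frontier difficulty across many domains": ("Humanity's Last Exam", "GPQA-Diamond"),
    "mathematics specifically": ("AIME (hard) / MATH (broad)", "GPQA-Diamond physics"),
    "reasoning-model differentiation": ("HLE", "ARC-AGI-2"),
    "multi-task reasoning sanity check": ("BIG-Bench Hard", "MMLU-Pro"),
}


def pick_benchmarks(use_case: str) -> tuple[str, str]:
    """Return (primary, secondary) for a use case, matching case-insensitively."""
    key = use_case.strip().lower()
    if key not in PICKER:
        known = ", ".join(sorted(PICKER))
        raise KeyError(f"unknown use case {use_case!r}; expected one of: {known}")
    return PICKER[key]


primary, secondary = pick_benchmarks("Pattern recognition / generalisation")
print(primary, "then", secondary)
```

Raising on an unknown use case, rather than falling back to a default, keeps a typo from silently selecting the wrong benchmark.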

GPQA-Diamond and graduate-level reasoning

GPQA-Diamond is the 198-question diamond subset of the GPQA (Google-Proof QA) benchmark, focused on the highest-quality, most-discriminating questions in physics, biology, and chemistry. Each question was written by a domain expert (typically a PhD-level researcher in the relevant field) and validated to be Google-resistant: in the original paper, skilled non-experts with unrestricted web access spent more than 30 minutes per question on average and still reached only about 34 percent accuracy. The questions require domain knowledge and multi-step reasoning that genuine expertise enables.

The original GPQA paper reports headline expert-human accuracy of roughly 65 percent for PhD-level domain experts working alone; note that this figure is measured across the paper's wider question set rather than on the 198 Diamond questions alone. The strongest current models have reported results above that human baseline on the Google-proof multiple-choice format. Read that carefully: it does not mean models are better scientists, only that they score higher on this narrow format. For exact current numbers, consult the benchmark's published results.

GPQA-Diamond is approaching saturation faster than HLE. Frontier reasoning-focused models now cluster closely on it, so its discriminating power for top-tier comparison is narrowing. Within a couple of years the benchmark will likely lose much of that power and the field will shift to harder successors; HLE is the most natural one.
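A rough way to see why closely clustered scores stop being informative is to compare the gap between two models with the sampling noise implied by the question count. The sketch below uses a simple unpaired two-proportion standard error, which treats the two runs as independent samples. That is an assumption made for brevity, since both models answer the same questions and a paired test would be more precise. The accuracies are hypothetical, not real model scores.

```python
import math


def gap_standard_error(acc_a: float, acc_b: float, n: int) -> float:
    """Standard error of (acc_a - acc_b) for two independent runs over n questions."""
    return math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)


def gap_is_distinguishable(acc_a: float, acc_b: float, n: int, z: float = 1.96) -> bool:
    """True if the gap exceeds z standard errors (about 95% confidence for z = 1.96)."""
    return abs(acc_a - acc_b) > z * gap_standard_error(acc_a, acc_b, n)


# Hypothetical accuracies three points apart.
print(gap_is_distinguishable(0.85, 0.88, n=198))   # False: within noise on 198 questions
print(gap_is_distinguishable(0.85, 0.88, n=2500))  # True: resolvable on ~2,500 questions
```

Under these assumptions, a three-point gap near the top of the scale is not resolvable on 198 questions but would be on a set the size of HLE.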

ARC-AGI-2 and the pattern-reasoning headroom

ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus from François Chollet, released in 2025. The benchmark contains visual abstract-reasoning puzzles: given a few input-output examples of a pattern transformation, the model must apply the transformation to a new input. The puzzles test pattern recognition, generalisation, and abstract reasoning in ways that are difficult to memorise from training data because the underlying patterns are novel.
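To make the task format concrete, here is a toy puzzle laid out in the train/test structure used by the public ARC task files, where each grid is a list of rows of small integers standing for colours. The puzzle itself, the candidate rules, and the helper names are invented for illustration; real tasks are far harder and a fixed list of candidate rules is not how they get solved.

```python
Grid = list[list[int]]

# Toy task invented for illustration; each cell value stands for a colour.
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 4], [5, 0]], "output": [[4, 0], [0, 5]]}],
}


def identity(grid: Grid) -> Grid:
    """Candidate rule: leave the grid unchanged."""
    return [list(row) for row in grid]


def mirror_left_right(grid: Grid) -> Grid:
    """Candidate rule: reverse every row."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(rule, examples) -> bool:
    """A rule is accepted only if it reproduces every demonstration exactly."""
    return all(rule(ex["input"]) == ex["output"] for ex in examples)


def solve(task, candidate_rules):
    """Apply the first rule consistent with the demonstrations to the test inputs."""
    for rule in candidate_rules:
        if fits_all_examples(rule, task["train"]):
            return [rule(t["input"]) for t in task["test"]]
    return None


predictions = solve(task, [identity, mirror_left_right])
expected = [t["output"] for t in task["test"]]
print(predictions == expected)  # True: exact grid match on the held-out input
```

The point of the sketch is the evaluation shape: the rule must be inferred from a handful of demonstrations and is credited only for an exact match on the unseen grid.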

ARC-AGI-2 reports one of the largest gaps to human performance among current major reasoning benchmarks, which makes it the clearest test of where current models are still weak relative to humans. The benchmark's value is precisely this gap: it identifies a real capability deficit that other benchmarks are saturating away. For the current model-versus-human standings, see the ARC Prize leaderboard rather than a reprinted figure.

ARC-AGI-2 has been resistant to brute-force scaling: frontier model improvements on other benchmarks have not translated proportionally to ARC-AGI-2 gains. This is consistent with Chollet's framing of the benchmark as testing intelligence-as-skill-acquisition rather than intelligence-as-task-performance. The benchmark's headroom and difficulty pattern make it one of the most interesting reasoning benchmarks to watch through 2027 and 2028.

Humanity's Last Exam: the broad frontier test

Humanity's Last Exam covers expert-curated frontier-difficulty questions across mathematics, physics, chemistry, biology, computer science, classical languages, philosophy, and other expert domains. With around 2,500 questions and substantial headroom for current models, it is the broadest frontier-difficulty knowledge-and-reasoning benchmark currently in active use.

HLE's most distinctive feature for reasoning evaluation is the gap it tends to show between reasoning-focused models and general models, which is wider than on MMLU-Pro or GPQA-Diamond. This makes HLE one of the cleanest single benchmarks for differentiating reasoning-specialised models from general-purpose models. Take the actual scores from its published results.

For the headline reasoning claim about a frontier model in 2026, we recommend quoting GPQA-Diamond (graduate scientific reasoning, near-saturation discriminator) and HLE (broad frontier reasoning, large headroom) together. ARC-AGI-2 is the right addition for claims about generalisation specifically. The combination gives the most complete reasoning picture available.

Math benchmarks and BIG-Bench Hard

Math-specific benchmarks (MATH, AIME) are reasoning benchmarks in the narrow sense: they test multi-step mathematical reasoning. MATH (12,500 problems) is the broader benchmark, ranging from basic algebra to competition-level problems, and is broadly saturated for frontier models. AIME (American Invitational Mathematics Examination) is the harder benchmark, but the sample is small (15 problems per exam, 30 per year across AIME I and II), so noise is consequential. Quote AIME for hard math claims; quote MATH for broader math sanity checks; take the numbers from the relevant published results.
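To put a number on the small-sample concern: on a 15-problem exam a single problem is worth 1/15, about 6.7 percentage points. The sketch below computes a 95% Wilson score interval for a single-run accuracy at several question counts. The 80% accuracy is a hypothetical value chosen for the example, and the interval assumes questions are independent draws, which is only an approximation for a real exam.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical 80% accuracy at the sizes discussed on this page:
# one AIME exam, one AIME year, GPQA-Diamond, and HLE.
for total in (15, 30, 198, 2500):
    correct = round(0.8 * total)
    lo, hi = wilson_interval(correct, total)
    print(f"N={total:>5}: {correct}/{total} correct, 95% interval {lo:.1%} to {hi:.1%}")
```

The interval narrows sharply as the question count grows, which is why a single-year AIME score is better quoted alongside repeated runs or several exam years.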

BIG-Bench Hard is a 23-task subset of the original BIG-Bench evaluation suite, focused on tasks where 2022-era models scored below human baseline. The subset includes logical deduction, causal reasoning, formal fallacies, and other reasoning-flavoured tasks. Many BBH tasks are now partially saturated by 2026 frontier models; the benchmark remains useful as a multi-task reasoning sanity check but is less discriminating than the headline three for frontier comparisons. Quote BBH when multi-task breadth is the question; for frontier reasoning specifically, prefer GPQA-Diamond or HLE.

Editor's verdict: Quote GPQA-Diamond for graduate scientific reasoning, ARC-AGI-2 for pattern reasoning generalisation, HLE for frontier breadth across many domains. Add MATH or AIME for math-specific claims. The combination gives the most complete reasoning picture available in 2026.

Reader Questions

Q.01 Which reasoning benchmark should I quote in 2026?

For graduate-level scientific reasoning, GPQA-Diamond. For abstract pattern reasoning, ARC-AGI-2. For frontier-difficulty knowledge-and-reasoning, Humanity's Last Exam. For multi-task reasoning, BIG-Bench Hard. For mathematics specifically, MATH or AIME. The right choice depends on the kind of reasoning you want to evaluate; quote two together for serious frontier comparisons.

Q.02 What is GPQA-Diamond and why is it important?

GPQA-Diamond is the diamond subset (198 questions) of the GPQA (Google-Proof QA) benchmark, focused on the highest-quality, most-discriminating questions in physics, biology, and chemistry. The questions are written by domain experts to be Google-resistant: skilled non-experts with unrestricted web access, spending more than 30 minutes per question on average, still answered most of them incorrectly in the original paper. It is one of the most discriminating graduate-science reasoning benchmarks in active use; for current per-model standings, consult the benchmark's official results rather than a reprinted score.

Q.03 What is ARC-AGI-2 and what does it measure?

ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus from François Chollet, released in 2025. It contains visual abstract-reasoning puzzles that test pattern recognition, generalisation, and abstract reasoning in ways that are difficult to game with memorisation. It remains one of the benchmarks with the largest reported gap between models and humans, which is its value: it identifies a capability deficit other benchmarks are saturating away. For current standings see the ARC Prize leaderboard.

Q.04 How does Humanity's Last Exam compare to GPQA-Diamond?

GPQA-Diamond tests graduate-level reasoning in three sciences (physics, biology, chemistry); HLE tests frontier-difficulty reasoning across mathematics, sciences, classical languages, philosophy, and other expert domains. GPQA-Diamond is more focused; HLE is broader and has more headroom. Quote both for serious frontier-reasoning claims, taking each score from the benchmark's own published results under a disclosed harness.

Q.05 Are math benchmarks (MATH, AIME) reasoning benchmarks?

Yes, with caveats. MATH (12,500 competition-math problems) and AIME (American Invitational Mathematics Examination) are reasoning benchmarks in the sense that they require multi-step mathematical reasoning. They are also somewhat narrower: a model can do well on MATH or AIME by being strong on competitive-mathematics-style problems specifically. MATH is broadly saturated for frontier models; AIME's per-year sample is small, so noise is consequential. For broader reasoning evaluation, GPQA-Diamond and HLE are more representative.

Q.06 What about BIG-Bench Hard?

BIG-Bench Hard is a 23-task subset of the original BIG-Bench evaluation suite, focused on tasks where 2022-era models scored below human baseline. It covers logical deduction, causal reasoning, formal fallacies, and other reasoning-flavoured tasks. Much of the subset is now partially saturated by frontier models, so it remains useful as a multi-task reasoning sanity check but is less frequently the headline number for frontier comparisons.


Sources

[1] Rein, D. et al. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022.
[2] Chollet, F. ARC-AGI-2 project site. arcprize.org. Accessed May 2026.
[3] Phan, L. et al. (2025). Humanity's Last Exam. lastexam.ai.
[4] Hendrycks, D. et al. (2021). Measuring Mathematical Problem Solving with the MATH Dataset. arXiv:2103.03874.
[5] Suzgun, M. et al. (2022). Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv:2210.09261. The BIG-Bench Hard paper.