Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
Headline three: GPQA-Diamond (graduate science), ARC-AGI-2 (pattern), HLE (frontier breadth)
Math-specific: MATH (broad) and AIME (hard)
Multi-task sanity: BIG-Bench Hard (saturating)
2026 frontier: GPQA-Diamond ~75%; ARC-AGI-2 25-35%; HLE 22-26%
Section II.xvi · Benchmark Comparison · Last verified April 2026

Reasoning Benchmarks: The 2026 Selection Guide

Reasoning is several different capabilities: graduate-domain scientific reasoning, abstract pattern reasoning, frontier-difficulty knowledge-and-reasoning, and mathematical reasoning. Different benchmarks measure different slices. GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam are the three headline benchmarks; quote two of them together for serious frontier claims.

01

Reasoning is several capabilities

Like tool use and RAG, reasoning is not a single capability but a cluster of related ones. Graduate-domain scientific reasoning (answering the kinds of questions a PhD physicist or biologist is trained to answer) is different from abstract pattern reasoning (recognising structural similarity across novel examples). Both are different from mathematical reasoning (multi-step computation and proof). Different benchmarks measure different slices, and the right choice depends on which slice you want to evaluate.

The 2026 reasoning-benchmark landscape has settled around three headline benchmarks: GPQA-Diamond for graduate scientific reasoning, ARC-AGI-2 for abstract pattern reasoning, and Humanity's Last Exam for frontier-difficulty knowledge-and-reasoning across many domains. Math-specific benchmarks (MATH, AIME) supplement these with mathematical reasoning specifically. BIG-Bench Hard remains a useful multi-task sanity check but is less central than it was in 2023.

The headline pattern: GPQA-Diamond is closer to saturation (frontier ~75 percent) and is the most discriminating for graduate scientific reasoning. ARC-AGI-2 has the largest headroom (frontier 25-35 percent vs human 80-90 percent) and is the cleanest test of pattern-generalisation capability. HLE is the broadest expert-difficulty test with substantial headroom. Together, the three give a complete reasoning portfolio.

02

Benchmark-by-benchmark comparison

The full picture of reasoning benchmarks in 2026 spans the headline three plus several specialised options. The summary below lays out what each measures, the current frontier, the strengths, the weaknesses, and the recommendation.

| Benchmark | What it measures | 2026 frontier | Note | Recommendation |
|---|---|---|---|---|
| GPQA-Diamond | 198 graduate-level Google-proof questions in physics, biology, chemistry | ~75% (frontier reasoning models) | Three sciences only; saturating | Yes (graduate reasoning) |
| ARC-AGI-2 | Visual abstract-reasoning puzzles | 25-35% (vs human 80-90%) | Visual format limits applicability | Yes (pattern reasoning) |
| Humanity's Last Exam | ~2,500 expert-authored frontier questions across many domains | 22-26% (reasoning models lead) | English-centric; small N per domain | Yes (frontier knowledge) |
| MATH | 12,500 high-school and competition mathematics problems | Saturating (~95%+) | Saturated; narrow | Math-specific only |
| AIME | American Invitational Mathematics Examination problems | Strong reasoning models 80%+ | Math-specific; small N | Math-specific only |
| BIG-Bench Hard | 23-task subset of BIG-Bench focused on hard reasoning | Partially saturated | Saturating; less discriminating | Sanity check |
| GPQA (full) | Full ~448-question GPQA benchmark | Higher than Diamond subset | Less discriminating than Diamond | Prefer the Diamond subset |
03

Use-case-by-use-case selection guide

The right benchmark depends on the reasoning slice. Graduate-domain reasoning, pattern recognition, frontier breadth, and math-specific reasoning are different evaluation targets. The table below maps common reasoning-evaluation use cases to the benchmarks that best answer them.

| Use case | Primary benchmark | Secondary |
|---|---|---|
| Graduate-level scientific reasoning | GPQA-Diamond | HLE |
| Pattern recognition / generalisation | ARC-AGI-2 | BIG-Bench Hard |
| Frontier difficulty across many domains | Humanity's Last Exam | GPQA-Diamond |
| Mathematics specifically | AIME (hard) / MATH (broad) | GPQA-Diamond physics |
| Reasoning-model differentiation | HLE (largest gap) | ARC-AGI-2 |
| Multi-task reasoning sanity check | BIG-Bench Hard | MMLU-Pro |
04

GPQA-Diamond and graduate-level reasoning

GPQA-Diamond is the 198-question diamond subset of the GPQA (Google-Proof QA) benchmark, focused on the highest-quality, most-discriminating questions in physics, biology, and chemistry. Each question was written by a domain expert (typically a PhD-level researcher in the relevant field) and validated to be Google-resistant: a non-expert with internet access cannot solve it within 30 minutes. The questions require domain knowledge and multi-step reasoning that genuine expertise enables.

Frontier models in 2026 score around 75 percent on GPQA-Diamond, which is above expert-human accuracy (around 65 percent for graduate-level domain experts working alone). This crossing of the expert-human baseline happened in mid-2025 and is one of the field's clearest examples of LLM capability exceeding domain-expert performance on a focused benchmark. The result should be read carefully: it does not mean LLMs are better scientists; it means LLMs outperform domain experts on this specific 198-question Google-proof multiple-choice format.

GPQA-Diamond is approaching saturation faster than HLE. Frontier reasoning-focused models cluster within 5-8 points of each other, and the discriminating power for top-tier comparison is narrowing. Within 18-24 months, the benchmark will likely lose its current discriminating power and the field will shift to harder successors; HLE is the most natural one, though a harder GPQA-style follow-up is also conceivable.
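To make that narrowing concrete, it helps to put error bars on a 198-question score. A minimal back-of-envelope sketch (our arithmetic, using a normal-approximation binomial interval, not any leaderboard's official methodology):

```python
import math

def binomial_ci(score: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for an accuracy
    score measured over n independent questions."""
    se = math.sqrt(score * (1 - score) / n)
    return score - z * se, score + z * se

# GPQA-Diamond: 198 questions, a frontier model at 75%.
low, high = binomial_ci(0.75, 198)
print(f"75% on 198 questions -> 95% CI [{low:.1%}, {high:.1%}]")
# -> roughly [69.0%, 81.0%]: two models 5 points apart have
# overlapping intervals, which is why the top-tier cluster is
# hard to rank from single scores.
```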

05

ARC-AGI-2 and the pattern-reasoning headroom

ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus from François Chollet, released in 2025. The benchmark contains visual abstract-reasoning puzzles: given a few input-output examples of a pattern transformation, the model must apply the transformation to a new input. The puzzles test pattern recognition, generalisation, and abstract reasoning in ways that are difficult to memorise from training data because the underlying patterns are novel.
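For readers who have not seen an ARC task: the public ARC releases distribute each task as JSON with "train" and "test" lists of input/output integer grids, and scoring is exact-match on the predicted grid. A toy illustration of that interface (the task and solver below are ours and deliberately trivial; real ARC-AGI-2 transformations are far harder than a transpose):

```python
# Toy task in the public ARC JSON layout: "train"/"test" lists of
# {"input", "output"} integer grids (cell values 0-9). The hidden
# rule in this made-up task is simply "transpose the grid".
task = {
    "train": [
        {"input": [[1, 2]], "output": [[1], [2]]},
        {"input": [[3, 0, 4]], "output": [[3], [0], [4]]},
    ],
    "test": [{"input": [[5, 6]], "output": [[5], [6]]}],
}

def solver(train, test_input):
    # A real solver must infer the rule from the train pairs; this
    # one hard-codes the answer to show the interface only.
    return [list(col) for col in zip(*test_input)]

def score_task(task, solver):
    # ARC scoring is all-or-nothing per test pair: the predicted
    # grid must match the hidden output cell-for-cell.
    return all(solver(task["train"], p["input"]) == p["output"]
               for p in task["test"])

print(score_task(task, solver))  # True
```

The exact-match, all-or-nothing scoring is part of why scores are low: partial pattern understanding earns nothing.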

Frontier models in 2026 score around 25-35 percent on the public ARC-AGI-2 set; humans score around 80-90 percent. The 50-point gap is the largest gap to human performance among current major reasoning benchmarks, which makes ARC-AGI-2 the clearest test of where current LLMs are still weak relative to humans. The benchmark's value is precisely this gap: it identifies a real capability deficit that other benchmarks are saturating away.

ARC-AGI-2 has been resistant to brute-force scaling: frontier model improvements on other benchmarks have not translated proportionally to ARC-AGI-2 gains. This is consistent with Chollet's framing of the benchmark as testing intelligence-as-skill-acquisition rather than intelligence-as-task-performance. The benchmark's headroom and difficulty pattern make it one of the most interesting reasoning benchmarks to watch through 2027 and 2028.

06

Humanity's Last Exam: the broad frontier test

Humanity's Last Exam covers expert-curated frontier-difficulty questions across mathematics, physics, chemistry, biology, computer science, classical languages, philosophy, and other expert domains. With around 2,500 questions and frontier scores of 22-26 percent as of May 2026, it is the broadest frontier-difficulty knowledge-and-reasoning benchmark currently in active use.

HLE's most distinctive feature for reasoning evaluation is the gap between reasoning-focused models and general models. The 6-8 point gap between reasoning-model and general-model scores is much larger on HLE than on MMLU-Pro (1-2 points) or GPQA-Diamond (3-4 points). This makes HLE the cleanest single benchmark for differentiating reasoning-specialised models from general-purpose models.
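Whether a gap of that size is signal or sampling noise is easy to check at HLE's scale. A back-of-envelope two-proportion test (our arithmetic, with illustrative scores of 25 and 18 percent and questions treated as independent):

```python
import math

def two_prop_z(p1: float, p2: float, n: int) -> float:
    """z statistic for the gap between two accuracy scores,
    each measured on the same n independent questions."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (p1 - p2) / se

# Illustrative HLE-scale comparison: reasoning model 25%,
# general model 18%, N ~ 2,500 questions.
print(f"z = {two_prop_z(0.25, 0.18, 2500):.1f}")
# -> z ~ 6.0, far past the 1.96 threshold: a 7-point gap at
# this N is not sampling noise.
```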

For the headline reasoning claim about a frontier model in 2026, we recommend quoting GPQA-Diamond (graduate scientific reasoning, near-saturation discriminator) and HLE (broad frontier reasoning, large headroom) together. ARC-AGI-2 is the right addition for claims about generalisation specifically. The combination gives the most complete reasoning picture available.

07

Math benchmarks and BIG-Bench Hard

Math-specific benchmarks (MATH, AIME) are reasoning benchmarks in the narrow sense: they test multi-step mathematical reasoning. MATH (12,500 problems) is the broader benchmark, ranging from basic algebra to competition-level problems; frontier models in 2026 score over 95 percent on MATH, which means the benchmark is essentially saturated. AIME (American Invitational Mathematics Examination) is the harder benchmark; strong reasoning-focused models score 80-plus percent on AIME problems, but the sample size is small (15 problems per exam, two exams per year) and noise is consequential. Quote AIME for hard-math claims; quote MATH for broader math sanity checks.
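The small-N caveat is worth quantifying, because it changes how AIME numbers should be quoted:

```python
import math

problems = 15  # one AIME exam
print(f"one problem = {100 / problems:.1f} points")  # 6.7 points

# At a true 80% solve rate, the standard deviation of a
# single-exam score from question sampling alone:
sd = math.sqrt(0.80 * 0.20 / problems)
print(f"single-exam std: {sd:.1%}")  # ~10.3%
```

A single-exam AIME score therefore carries roughly ±10-point sampling noise; average across exam years and multiple runs before quoting one.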

BIG-Bench Hard is a 23-task subset of the original BIG-Bench evaluation suite, focused on tasks where 2022-era models scored below human baseline. The subset includes logical deduction, causal reasoning, formal fallacies, and other reasoning-flavoured tasks. Many BBH tasks are now partially saturated by 2026 frontier models; the benchmark remains useful as a multi-task reasoning sanity check but is less discriminating than the headline three for frontier comparisons. Quote BBH when multi-task breadth is the question; for frontier reasoning specifically, prefer GPQA-Diamond or HLE.

Editor's verdict: Quote GPQA-Diamond for graduate scientific reasoning, ARC-AGI-2 for pattern-reasoning generalisation, HLE for frontier breadth across many domains. Add MATH or AIME for math-specific claims. The combination gives the most complete reasoning picture available in 2026.
Reader Questions
Q.01 Which reasoning benchmark should I quote in 2026?
For graduate-level scientific reasoning, GPQA-Diamond. For abstract pattern reasoning, ARC-AGI-2. For frontier-difficulty knowledge-and-reasoning, Humanity's Last Exam. For multi-task reasoning, BIG-Bench Hard. For mathematics specifically, MATH or AIME. The right choice depends on the kind of reasoning you want to evaluate; quote two together for serious frontier comparisons.
Q.02 What is GPQA-Diamond and why is it important?
GPQA-Diamond is the diamond subset (198 questions) of the GPQA (Google-Proof QA) benchmark, focused on the highest-quality, most-discriminating questions in physics, biology, and chemistry. The questions are written by domain experts to be Google-resistant: a non-expert with internet access cannot solve them in 30 minutes. Frontier models in 2026 score around 75 percent on GPQA-Diamond; the gap to expert-human performance (around 65 percent for graduate-level domain experts) has now closed and reversed for the strongest models.
Q.03 What is ARC-AGI-2 and what does it measure?
ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus from François Chollet, released in 2025. It contains visual abstract-reasoning puzzles that test pattern recognition, generalisation, and abstract reasoning in ways that are difficult to game with memorisation. Frontier models in 2026 score around 25-35 percent on the public set; humans score around 80-90 percent. The benchmark is the cleanest test of pattern-reasoning generalisation in the field, and the headroom to human performance is substantial.
Q.04 How does Humanity's Last Exam compare to GPQA-Diamond?
GPQA-Diamond tests graduate-level reasoning in three sciences (physics, biology, chemistry); HLE tests frontier-difficulty reasoning across mathematics, the sciences, classical languages, philosophy, and other expert domains. GPQA-Diamond is more focused; HLE is broader. Current frontier scores are ~75 percent on GPQA-Diamond and ~22-26 percent on HLE, which means GPQA-Diamond is closer to saturation and HLE has more headroom. Quote both for serious frontier-reasoning claims.
Q.05 Are math benchmarks (MATH, AIME) reasoning benchmarks?
Yes, with caveats. MATH (12,500 competition-math problems) and AIME (American Invitational Mathematics Examination) are reasoning benchmarks in the sense that they require multi-step mathematical reasoning. They are also somewhat narrower: a model can do well on MATH or AIME by being strong on competitive-mathematics-style problems specifically. For broader reasoning evaluation, GPQA-Diamond and HLE are more representative; for mathematics-specific claims, MATH and AIME are the canonical benchmarks.
Q.06 What about BIG-Bench Hard?
BIG-Bench Hard is a 23-task subset of the original BIG-Bench evaluation suite, focused on tasks where 2022-era models scored below human baseline. It covers logical deduction, causal reasoning, formal fallacies, and other reasoning-flavoured tasks. The subset has been partially saturated by 2026 frontier models; many tasks now exceed human baseline. BIG-Bench Hard remains useful as a multi-task reasoning sanity check but is less frequently the headline number for frontier comparisons.
Related: GPQA and ARC-AGI Deep Dive · Humanity's Last Exam · MMLU-Pro · Full Benchmark Reference · Benchmark Contamination · Chatbot Arena · What Benchmarks Miss

Sources

[1] Rein, D. et al. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022.
[2] Chollet, F. ARC-AGI-2 project site. arcprize.org. Accessed May 2026.
[3] Phan, L. et al. (2025). Humanity's Last Exam. lastexam.ai.
[4] Hendrycks, D. et al. (2021). Measuring Mathematical Problem Solving with the MATH Dataset. arXiv:2103.03874.
[5] Suzgun, M. et al. (2022). Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv:2210.09261. (The BIG-Bench Hard paper.)
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.

Book a 30-min scoping call · Digital Signet →

30 minutes, free, independent. · 1-page action plan within 48h. · Honest if not the right fit.