Reasoning Benchmarks: The 2026 Selection Guide
Reasoning is several different capabilities: graduate-domain scientific reasoning, abstract pattern reasoning, frontier-difficulty knowledge-and-reasoning, and mathematical reasoning. Different benchmarks measure different slices. GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam are the three headline benchmarks; quote at least two of them together for serious frontier claims.
Reasoning is several capabilities
Like tool-use and RAG, reasoning is not a single capability but a cluster of related ones. Graduate-domain scientific reasoning (asking the same question a PhD physicist or biologist would ask) is different from abstract pattern reasoning (recognising structural similarity across novel examples). Both are different from mathematical reasoning (multi-step computation and proof). Different benchmarks measure different slices, and the right choice depends on which slice you want to evaluate.
The 2026 reasoning-benchmark landscape has settled around three headline benchmarks: GPQA-Diamond for graduate scientific reasoning, ARC-AGI-2 for abstract pattern reasoning, and Humanity's Last Exam for frontier-difficulty knowledge-and-reasoning across many domains. Math-specific benchmarks (MATH, AIME) supplement these with mathematical reasoning specifically. BIG-Bench Hard remains a useful multi-task sanity check but is less central than it was in 2023.
The headline pattern: GPQA-Diamond is closer to saturation (frontier ~75 percent) and is the most discriminating for graduate scientific reasoning. ARC-AGI-2 has the largest headroom (frontier 25-35 percent vs human 80-90 percent) and is the cleanest test of pattern-generalisation capability. HLE is the broadest expert-difficulty test with substantial headroom. Together, the three give a complete reasoning portfolio.
Benchmark-by-benchmark comparison
The full picture of reasoning benchmarks in 2026 spans the headline three plus several specialised options. The summary below lays out what each measures, the current frontier score, its strengths and weaknesses, and a recommendation.
Use-case-by-use-case selection guide
The right benchmark depends on the reasoning slice. Graduate-domain reasoning, pattern recognition, frontier breadth, and math-specific reasoning are different evaluation targets. The table below maps common reasoning-evaluation use cases to the benchmarks that best answer them.
GPQA-Diamond and graduate-level reasoning
GPQA-Diamond is the 198-question diamond subset of the GPQA (Google-Proof QA) benchmark, focused on the highest-quality, most-discriminating questions in physics, biology, and chemistry. Each question was written by a domain expert (typically a PhD-level researcher in the relevant field) and validated to be Google-resistant: a non-expert with internet access cannot solve it within 30 minutes. The questions require domain knowledge and multi-step reasoning that genuine expertise enables.
Frontier models in 2026 score around 75 percent on GPQA-Diamond, which is above expert-human accuracy (around 65 percent for graduate-level domain experts working alone). This crossing of the expert-human baseline happened in mid-2025 and is one of the field's clearest examples of LLM capability exceeding domain-expert performance on a focused benchmark. The result should be read carefully: it does not mean LLMs are better scientists; it means LLMs outperform domain experts on this specific 198-question Google-proof multiple-choice format.
GPQA-Diamond is approaching saturation faster than HLE. Frontier reasoning-focused models cluster within 5-8 points of each other, so the benchmark's discriminating power for top-tier comparison is narrowing. Within 18-24 months, it will likely lose that power and the field will shift to harder successors. HLE is the most natural successor; a harder GPQA variant (a GPQA-Diamond-Hard or GPQA-Platinum) is also conceivable.
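A rough illustration of why that clustering matters: with only 198 questions, binomial sampling error alone is sizeable, so scores a few points apart are hard to distinguish. A minimal sketch using the normal-approximation interval (the exact bounds are illustrative, not figures from the GPQA paper):

```python
import math

def binomial_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a benchmark score."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# GPQA-Diamond has 198 questions; at a ~75% score the 95% CI is roughly
# +/- 6 points, wider than the 5-8 point spread between frontier models.
lo, hi = binomial_ci(0.75, 198)
print(f"95% CI: {lo:.3f} - {hi:.3f}")  # roughly 0.690 - 0.810
```

The same arithmetic is why single-run scores on small benchmarks should be reported with intervals or averaged over multiple runs.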
ARC-AGI-2 and the pattern-reasoning headroom
ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus from François Chollet, released in 2025. The benchmark contains visual abstract-reasoning puzzles: given a few input-output examples of a pattern transformation, the model must apply the transformation to a new input. The puzzles test pattern recognition, generalisation, and abstract reasoning in ways that are difficult to memorise from training data because the underlying patterns are novel.
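A toy sketch of the task format, with a made-up transformation (not an actual ARC-AGI-2 puzzle), may make the setup concrete:

```python
# ARC-style task: a few input -> output grid pairs demonstrate a hidden
# transformation; the solver must infer it and apply it to a test input.
Grid = list[list[int]]

train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),  # hidden rule: rotate 180 degrees
    ([[5, 0, 5]], [[5, 0, 5]]),            # palindromic grid maps to itself
]

def rotate_180(grid: Grid) -> Grid:
    """The hidden rule for this toy task: rotate the grid 180 degrees."""
    return [row[::-1] for row in grid[::-1]]

# A solver is judged on whether its inferred rule reproduces the training
# pairs and then produces the correct output for a held-out test input.
assert all(rotate_180(x) == y for x, y in train_pairs)
print(rotate_180([[7, 8], [9, 0]]))  # [[0, 9], [8, 7]]
```

Real ARC-AGI-2 transformations are far less guessable than a rotation, which is exactly what makes the benchmark resistant to memorisation.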
Frontier models in 2026 score around 25-35 percent on the public ARC-AGI-2 set; humans score around 80-90 percent. This gap of roughly 50 points is the largest gap to human performance among current major reasoning benchmarks, which makes ARC-AGI-2 the clearest test of where current LLMs are still weak relative to humans. The benchmark's value is precisely this gap: it identifies a real capability deficit that other benchmarks are saturating away.
ARC-AGI-2 has been resistant to brute-force scaling: frontier model improvements on other benchmarks have not translated proportionally to ARC-AGI-2 gains. This is consistent with Chollet's framing of the benchmark as testing intelligence-as-skill-acquisition rather than intelligence-as-task-performance. The benchmark's headroom and difficulty pattern make it one of the most interesting reasoning benchmarks to watch through 2027 and 2028.
Humanity's Last Exam: the broad frontier test
Humanity's Last Exam covers expert-curated frontier-difficulty questions across mathematics, physics, chemistry, biology, computer science, classical languages, philosophy, and other expert domains. With around 2,500 questions and frontier scores in the low 20s percent in May 2026, it is the broadest frontier-difficulty knowledge-and-reasoning benchmark currently in active use.
HLE's most distinctive feature for reasoning evaluation is the gap between reasoning-focused models and general models. The 6-8 point gap between reasoning-model and general-model scores is much larger on HLE than on MMLU-Pro (1-2 points) or GPQA-Diamond (3-4 points). This makes HLE the cleanest single benchmark for differentiating reasoning-specialised models from general-purpose models.
For the headline reasoning claim about a frontier model in 2026, we recommend quoting GPQA-Diamond (graduate scientific reasoning, near-saturation discriminator) and HLE (broad frontier reasoning, large headroom) together. ARC-AGI-2 is the right addition for claims about generalisation specifically. The combination gives the most complete reasoning picture available.
Math benchmarks and BIG-Bench Hard
Math-specific benchmarks (MATH, AIME) are reasoning benchmarks in the narrow sense: they test multi-step mathematical reasoning. MATH (12,500 problems) is the broader benchmark, ranging from basic algebra to competition-level problems; frontier models in 2026 score over 95 percent on MATH, which means the benchmark is essentially saturated. AIME (American Invitational Mathematics Examination) is the harder benchmark; strong reasoning-focused models score 80-plus percent on AIME problems, but the per-year sample size is small (15-30 problems per AIME) and noise is consequential. Quote AIME for hard math claims; quote MATH for broader math sanity checks.
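A back-of-the-envelope sketch of that noise, assuming a hypothetical true score of 80 percent on a single 15-problem AIME sitting:

```python
import math

# One AIME sitting has 15 problems, so each problem is worth ~6.7 points
# of accuracy: a single lucky or unlucky answer moves the headline score
# more than most frontier-model gaps.
n = 15
per_problem = 100 / n
se = math.sqrt(0.80 * 0.20 / n) * 100  # std. error at an 80% true score, in points

print(f"one problem = {per_problem:.1f} points; std. error ~ {se:.1f} points")
```

This is why AIME results are best reported as averages over both yearly sittings (or multiple sampled runs) rather than a single 15-problem score.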
BIG-Bench Hard is a 23-task subset of the original BIG-Bench evaluation suite, focused on tasks where 2022-era models scored below human baseline. The subset includes logical deduction, causal reasoning, formal fallacies, and other reasoning-flavoured tasks. Many BBH tasks are now partially saturated by 2026 frontier models; the benchmark remains useful as a multi-task reasoning sanity check but is less discriminating than the headline three for frontier comparisons. Quote BBH when multi-task breadth is the question; for frontier reasoning specifically, prefer GPQA-Diamond or HLE.
Sources
- [1] Rein, D. et al. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022.
- [2] Chollet, F. ARC-AGI-2 project site. arcprize.org. Accessed May 2026.
- [3] Phan, L. et al. (2025). Humanity's Last Exam. lastexam.ai.
- [4] Hendrycks, D. et al. (2021). Measuring Mathematical Problem Solving with the MATH Dataset. arXiv:2103.03874.
- [5] Suzgun, M. et al. (2022). Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv:2210.09261. The BIG-Bench Hard paper.