Reasoning Benchmarks: The 2026 Selection Guide
Reasoning is several different capabilities: graduate-domain scientific reasoning, abstract pattern reasoning, frontier-difficulty knowledge-and-reasoning, mathematical reasoning. Different benchmarks measure different slices. GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam are the three headline benchmarks; quote two together for serious frontier claims.
Reasoning is several capabilities
Like tool-use and RAG, reasoning is not a single capability but a cluster of related ones. Graduate-domain scientific reasoning (answering the kind of question that only a PhD physicist or biologist could be expected to answer) is different from abstract pattern reasoning (recognising structural similarity across novel examples). Both are different from mathematical reasoning (multi-step computation and proof). Different benchmarks measure different slices, and the right choice depends on which slice you want to evaluate.
The 2026 reasoning-benchmark landscape has settled around three headline benchmarks: GPQA-Diamond for graduate scientific reasoning, ARC-AGI-2 for abstract pattern reasoning, and Humanity's Last Exam for frontier-difficulty knowledge-and-reasoning across many domains. Math-specific benchmarks (MATH, AIME) supplement these with mathematical reasoning specifically. BIG-Bench Hard remains a useful multi-task sanity check but is less central than it was in 2023.
The headline pattern: GPQA-Diamond is the closest of the three to saturation, but it remains the most targeted test of graduate scientific reasoning. ARC-AGI-2 has the largest reported gap to human performance and is the cleanest test of pattern-generalisation capability. HLE is the broadest expert-difficulty test and has substantial headroom. Together, the three cover most of the reasoning portfolio, with the math-specific benchmarks filling in mathematical reasoning. For current per-model scores on any of them, read the benchmark's own leaderboard rather than a reprinted figure.
Benchmark-by-benchmark comparison
The full picture of reasoning benchmarks in 2026 spans the headline three plus several specialised options. The summary below lays out what each measures, its strengths, its weaknesses, and the recommendation.
We deliberately do not reprint a current-frontier score column. Cross-paper benchmark numbers carry a several-point comparability margin and the official boards lag the current frontier, so any single reprinted percentage would be unreliable. Take each score from the benchmark's own published results under a disclosed harness, or use the per-task picker on the homepage.
Use-case-by-use-case selection guide
The right benchmark depends on the reasoning slice. Graduate-domain reasoning, pattern recognition, frontier breadth, and math-specific reasoning are different evaluation targets. The table below maps common reasoning-evaluation use cases to the benchmarks that best answer them.
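The use-case mapping described in this guide can be sketched as a small lookup. This is a minimal illustration, not the guide's original table: the use-case labels and the `benchmarks_for` helper are hypothetical names chosen for the example, and the benchmark choices simply restate the recommendations made in the sections of this guide.

```python
# Hypothetical lookup from a reasoning-evaluation use case to the benchmarks
# this guide recommends for it. Labels are illustrative, not an official taxonomy.
RECOMMENDED = {
    "graduate scientific reasoning": ["GPQA-Diamond"],
    "abstract pattern generalisation": ["ARC-AGI-2"],
    "broad frontier reasoning": ["Humanity's Last Exam"],
    "headline frontier claim": ["GPQA-Diamond", "Humanity's Last Exam"],
    "hard mathematical reasoning": ["AIME"],
    "broad math sanity check": ["MATH"],
    "multi-task breadth": ["BIG-Bench Hard"],
}


def benchmarks_for(use_case: str) -> list[str]:
    """Return the recommended benchmarks for a use case (case-insensitive)."""
    key = use_case.strip().lower()
    if key not in RECOMMENDED:
        known = ", ".join(sorted(RECOMMENDED))
        raise KeyError(f"unknown use case {use_case!r}; known: {known}")
    # Return a copy so callers cannot mutate the shared table.
    return list(RECOMMENDED[key])


if __name__ == "__main__":
    for use_case in RECOMMENDED:
        print(f"{use_case}: {', '.join(benchmarks_for(use_case))}")
```

Keeping the mapping as data rather than branching logic makes it easy to revise when a benchmark saturates and a successor takes its place.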
GPQA-Diamond and graduate-level reasoning
GPQA-Diamond is the 198-question diamond subset of the GPQA (Google-Proof QA) benchmark, focused on the highest-quality, most-discriminating multiple-choice questions in physics, biology, and chemistry. Each question was written by a domain expert (typically a PhD-level researcher in the relevant field) and validated to be Google-resistant: skilled non-experts with unrestricted internet access, spending on the order of 30 minutes per question, still answer most of them incorrectly. Answering them takes domain knowledge and multi-step reasoning of the kind that genuine expertise supplies.
The original GPQA paper's headline figure for expert-human accuracy is roughly 65 percent, for graduate-level domain experts working alone. That figure is reported for the broader question set; the diamond subset is filtered toward questions that experts answered correctly, so expert accuracy on it is higher by construction. The strongest current models have reported results above the 65 percent baseline on this specific 198-question Google-proof multiple-choice format. Read that carefully: it does not mean models are better scientists, only that they score higher on this narrow format. For exact current numbers, consult the benchmark's published results.
GPQA-Diamond is approaching saturation faster than HLE. Frontier reasoning-focused models now cluster closely on it, so its discriminating power for top-tier comparison is narrowing. Within a couple of years the benchmark will likely lose much of that power and the field will shift to harder successors; HLE is the most natural one.
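The narrowing discriminating power can be made concrete with a rough noise estimate. The sketch below treats each model's accuracy on 198 questions as a binomial proportion and asks whether a gap between two models exceeds sampling noise. The accuracies are made-up placeholders, not reported scores, and the function names are illustrative. It also assumes the two scores are independent, which is a simplification: both models answer the same questions, so a paired test would be the more careful choice.

```python
import math

N_QUESTIONS = 198  # size of the GPQA-Diamond subset


def standard_error(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_z_score(acc_a: float, acc_b: float, n: int = N_QUESTIONS) -> float:
    """z-score of the gap between two accuracies, treated as independent samples."""
    se = math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)
    if se == 0.0:
        raise ValueError("both accuracies are 0 or 1; the z-score is undefined")
    return (acc_a - acc_b) / se


if __name__ == "__main__":
    # Hypothetical accuracies chosen only to illustrate the arithmetic.
    for acc_a, acc_b in [(0.88, 0.85), (0.88, 0.70)]:
        z = gap_z_score(acc_a, acc_b)
        verdict = "distinguishable" if abs(z) > 1.96 else "within noise"
        print(f"{acc_a:.2f} vs {acc_b:.2f}: z = {z:.2f} ({verdict} at ~95%)")
```

Under these assumptions, a gap of a few points between two high-scoring models falls inside sampling noise on a set this small, which is why closely clustered frontier scores are hard to rank.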
ARC-AGI-2 and the pattern-reasoning headroom
ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus from François Chollet and the ARC Prize team, released in 2025. The benchmark contains visual abstract-reasoning puzzles: given a few input-output examples of a pattern transformation, the model must apply the transformation to a new input. The puzzles test pattern recognition, generalisation, and abstract reasoning in ways that are difficult to memorise from training data because the underlying patterns are novel.
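The task format is easier to see in miniature. The sketch below uses a made-up toy task in the ARC style (it is not taken from ARC-AGI-2): grids are lists of small integers, a few demonstration pairs define a hidden transformation, and a solver must produce the output for a held-out input. The candidate list and the `solve` helper are hypothetical. Real ARC-AGI-2 puzzles are designed so that no short, fixed list of transformations covers them, which is exactly why enumeration like this does not scale.

```python
from typing import Callable, Optional

Grid = list[list[int]]

# A made-up task: each output is the input mirrored left-to-right.
train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 0, 2], [3, 0, 0]], [[2, 0, 0], [0, 0, 3]]),
]
test_input: Grid = [[4, 0, 0], [0, 5, 0]]


def flip_horizontal(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return [list(row) for row in reversed(grid)]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


# Hypothetical candidate pool; a real solver cannot rely on a fixed list.
CANDIDATES: dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}


def solve(pairs: list[tuple[Grid, Grid]], query: Grid) -> Optional[tuple[str, Grid]]:
    """Return the first candidate consistent with every demonstration pair."""
    for name, transform in CANDIDATES.items():
        if all(transform(inp) == out for inp, out in pairs):
            return name, transform(query)
    return None  # no candidate explains the demonstrations


if __name__ == "__main__":
    print(solve(train_pairs, test_input))
```

Scoring is exact match on the predicted grid, so a transformation that is almost right earns nothing.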
ARC-AGI-2 reports one of the largest gaps to human performance among current major reasoning benchmarks, which makes it the clearest test of where current models are still weak relative to humans. The benchmark's value is precisely this gap: it identifies a real capability deficit that other benchmarks are saturating away. For the current model-versus-human standings, see the ARC Prize leaderboard rather than a reprinted figure.
ARC-AGI-2 has been resistant to brute-force scaling: frontier model improvements on other benchmarks have not translated proportionally to ARC-AGI-2 gains. This is consistent with Chollet's framing of the benchmark as testing intelligence-as-skill-acquisition rather than intelligence-as-task-performance. The benchmark's headroom and difficulty pattern make it one of the most interesting reasoning benchmarks to watch through 2027 and 2028.
Humanity's Last Exam: the broad frontier test
Humanity's Last Exam covers expert-curated frontier-difficulty questions across mathematics, physics, chemistry, biology, computer science, classical languages, philosophy, and other expert domains. With around 2,500 questions and substantial headroom for current models, it is the broadest frontier-difficulty knowledge-and-reasoning benchmark currently in active use.
HLE's most distinctive feature for reasoning evaluation is the gap it tends to show between reasoning-focused models and general models, which is wider than on MMLU-Pro or GPQA-Diamond. This makes HLE one of the cleanest single benchmarks for differentiating reasoning-specialised models from general-purpose models. Take the actual scores from its published results.
For the headline reasoning claim about a frontier model in 2026, we recommend quoting GPQA-Diamond (graduate scientific reasoning, near-saturation discriminator) and HLE (broad frontier reasoning, large headroom) together. ARC-AGI-2 is the right addition for claims about generalisation specifically. The combination gives the most complete reasoning picture available.
Math benchmarks and BIG-Bench Hard
Math-specific benchmarks (MATH, AIME) are reasoning benchmarks in the narrow sense: they test multi-step mathematical reasoning. MATH (12,500 problems, of which 5,000 form the test split) is the broader benchmark, covering competition-style problems from prealgebra through precalculus that are graded on a final answer rather than a written proof, and it is broadly saturated for frontier models. AIME (American Invitational Mathematics Examination) is the harder benchmark, but the sample size is small (15 problems per exam, 30 per year across AIME I and AIME II), so a single problem moves the score by several points and noise is consequential. Quote AIME for hard math claims; quote MATH for broader math sanity checks; take the numbers from the relevant published results.
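The effect of sample size on noise can be checked with a confidence interval. The sketch below computes a Wilson score interval for the same accuracy measured on problem sets of different sizes. The scores are hypothetical placeholders chosen to give 80 percent in each case, and the interval treats problems as independent trials, which is an approximation.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half


def interval_width(correct: int, total: int) -> float:
    low, high = wilson_interval(correct, total)
    return high - low


if __name__ == "__main__":
    # Hypothetical results: 80% accuracy on sets of very different sizes.
    cases = [
        ("one 15-problem exam", 12, 15),
        ("30 problems", 24, 30),
        ("12,500 problems", 10000, 12500),
    ]
    for label, correct, total in cases:
        low, high = wilson_interval(correct, total)
        print(f"{label:>20}: {correct}/{total}  interval = [{low:.3f}, {high:.3f}]")
```

The interval on a 15-problem exam spans tens of percentage points, while on a set of 12,500 problems it spans roughly one to two, which is the practical reason to report AIME results with the number of problems and runs alongside the score.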
BIG-Bench Hard is a 23-task subset of the original BIG-Bench evaluation suite, focused on tasks where 2022-era models scored below human baseline. The subset includes logical deduction, causal reasoning, formal fallacies, and other reasoning-flavoured tasks. Many BBH tasks are now partially saturated by 2026 frontier models; the benchmark remains useful as a multi-task reasoning sanity check but is less discriminating than the headline three for frontier comparisons. Quote BBH when multi-task breadth is the question; for frontier reasoning specifically, prefer GPQA-Diamond or HLE.
Sources
- [1] Rein, D. et al. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022.
- [2] Chollet, F. ARC-AGI-2 project site. arcprize.org. Accessed May 2026.
- [3] Phan, L. et al. (2025). Humanity's Last Exam. lastexam.ai.
- [4] Hendrycks, D. et al. (2021). Measuring Mathematical Problem Solving with the MATH Dataset. arXiv:2103.03874.
- [5] Suzgun, M. et al. (2022). Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv:2210.09261. The BIG-Bench Hard paper.