GPQA, ARC-AGI, and Humanity's Last Exam - The Reasoning Frontier 2026
MMLU is saturated. HumanEval is saturated. BIG-Bench Hard is approaching saturation. The benchmarks that still have real headroom in 2026 are GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam. These are the benchmarks frontier model comparisons should be made on.
GPQA - Graduate-Level Google-Proof Q&A
GPQA (Graduate-Level Google-Proof Q&A) was created by David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman (various institutions, including New York University), and was published in 2023.
The benchmark contains 448 multiple-choice questions across biology, physics, and chemistry. Crucially, questions were written by PhD researchers with the explicit requirement that they be impossible to answer correctly by searching the web - you cannot Google your way to the right answer. Expert respondents (PhDs in the relevant field) answered correctly approximately 65% of the time on the Diamond subset. Non-experts with internet access answered correctly only 34% of the time.
GPQA-Diamond is the 198-question hardest subset, curated from the full 448 by keeping only questions that both expert validators answered correctly and that a majority of non-expert validators answered incorrectly. This is the canonical version for frontier model comparison. As of April 2026, GPT-5 leads at 78.4%, Claude 4.5 Opus is at 76.3%, and Gemini 2.5 Pro is at 74.1%.
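As a schematic of that selection rule (the validator records below are invented for illustration and are not the dataset's real field names), the Diamond filter amounts to:

```python
# Schematic of the GPQA-Diamond selection rule described in Rein et al. (2023):
# keep a question only if both expert validators answered it correctly and the
# majority of non-expert validators answered it incorrectly.
# The record structure here is invented for illustration.

def is_diamond(record: dict) -> bool:
    experts_correct = all(record["expert_validators_correct"])
    nonexpert_answers = record["nonexpert_validators_correct"]
    nonexperts_wrong = sum(1 for ok in nonexpert_answers if not ok)
    return experts_correct and nonexperts_wrong > len(nonexpert_answers) / 2

questions = [
    {"id": "q1", "expert_validators_correct": [True, True],
     "nonexpert_validators_correct": [False, False, True]},
    {"id": "q2", "expert_validators_correct": [True, False],
     "nonexpert_validators_correct": [False, False, False]},
]

diamond = [q for q in questions if is_diamond(q)]  # only "q1" survives
```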
GPQA-Diamond is approaching saturation but has not yet reached it. The 7-8 percentage point gap between frontier models provides meaningful discrimination. Within 1-2 years, as models continue to improve, GPQA-Diamond is expected to follow MMLU into the saturated category.
| Model | GPQA-Diamond | GPQA Main | Captured |
|---|---|---|---|
| GPT-5 | 78.4% | 84.2% | Apr 2026 |
| Claude 4.5 Opus | 76.3% | 81.7% | Apr 2026 |
| Gemini 2.5 Pro | 74.1% | 80.2% | Apr 2026 |
| Grok 4 | 72.8% | 79.3% | Apr 2026 |
| Claude 4 Sonnet | 71.2% | 77.8% | Apr 2026 |
| Expert human baseline | 65.0% | 69.0% | 2023 (paper) |

0-shot CoT. Sources: vendor model cards, Papers With Code. Expert baseline from Rein et al. (2023).
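The "0-shot CoT" protocol in the table footnote is roughly: present the question with shuffled answer options, ask the model to reason step by step and commit to a single letter, then score exact-match accuracy on the extracted letter. A minimal sketch, assuming a placeholder ask_model client and an invented sample item (neither is part of any official GPQA harness):

```python
import random
import re

# Minimal 0-shot chain-of-thought evaluation loop for GPQA-style
# multiple-choice questions. `ask_model` is a placeholder for whatever
# completion API you use; the sample item below is invented.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model / API client here")

def build_prompt(question: str, options: list[str]) -> str:
    letters = "ABCD"
    lines = [question, ""]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines += ["", "Think step by step, then answer with a single letter "
                  "in the form 'Answer: X'."]
    return "\n".join(lines)

def evaluate(items: list[dict], seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for item in items:
        options = [item["correct"]] + item["incorrect"]
        rng.shuffle(options)  # shuffle answer order per question
        gold = "ABCD"[options.index(item["correct"])]
        reply = ask_model(build_prompt(item["question"], options))
        match = re.search(r"Answer:\s*([ABCD])", reply)
        correct += bool(match and match.group(1) == gold)
    return correct / len(items)

sample = [{"question": "Which particle mediates the electromagnetic force?",
           "correct": "The photon",
           "incorrect": ["The gluon", "The W boson", "The Higgs boson"]}]
# accuracy = evaluate(sample)  # requires a real ask_model implementation
```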
ARC-AGI and ARC-AGI-2
ARC (Abstraction and Reasoning Corpus) was created by François Chollet, then at Google, in 2019 as a challenge to conventional AI benchmarking. Chollet's argument: intelligence should be measured by the ability to learn new skills from minimal examples, not by performance on tasks that can be memorised from training data. ARC tasks present a small number of input-output grid examples and ask the system to apply the inferred rule to a new input.
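The publicly released ARC tasks are JSON files of small integer grids (cell values 0-9 encode colours), each split into a handful of training input/output pairs plus one or more test inputs. The toy task and the hard-coded rule below are invented for illustration; solve_task is a stub for whatever solver you would plug in:

```python
import json

# A toy task in the ARC JSON layout: "train" holds example input/output grid
# pairs, "test" holds inputs the solver must complete. This particular task
# (reverse every row) is invented; real tasks come from the ARC / ARC-AGI releases.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [
        {"input": [[5, 0], [0, 6]], "output": [[0, 5], [6, 0]]},
    ],
}

def solve_task(task: dict) -> list[list[list[int]]]:
    """Stub solver: infer the rule from task['train'], apply it to each test input.
    Here the (invented) rule of the toy task is hard-coded: reverse every row."""
    return [[row[::-1] for row in pair["input"]] for pair in task["test"]]

predictions = solve_task(toy_task)
expected = [pair["output"] for pair in toy_task["test"]]
print(json.dumps(predictions))
print("solved:", predictions == expected)
```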
The 85% human baseline is the key reference point. Humans solve 85% of ARC tasks without prior exposure. Early frontier AI models scored near 0%. The benchmark became one of the most discussed AI challenges because it resisted every technique that worked on other benchmarks - CoT, tool use, fine-tuning.
In December 2024, OpenAI's o3 model scored 87.5% on the semi-private ARC-AGI evaluation set (above the human baseline), using its high-compute configuration (roughly $6,000 of inference per task). This breakthrough triggered the development of ARC-AGI-2.
ARC-AGI-2 (Chollet, 2025) introduces significantly harder tasks designed to resist the high-compute, chain-of-thought techniques that cracked the original. As of April 2026, SOTA is 65.8% (GPT-5), far below the human baseline, and the gap between models is substantial.
Humanity's Last Exam
Humanity's Last Exam (HLE) was created by Scale AI and the Center for AI Safety (2025). The benchmark contains 3,000 questions written by domain experts across mathematics, sciences, and the humanities, specifically designed to remain hard for frontier models for several years.
The name reflects a hypothesis: this will be the last benchmark on which human experts consistently outperform AI models. As of April 2026, frontier models score approximately 18-20%, while domain experts who wrote the questions score 60-75%. The gap is substantial and the benchmark has enormous headroom.
HLE is the benchmark to watch for frontier model comparisons over the next 2-3 years. When GPQA-Diamond saturates (expected within 12-18 months at current progress rates), HLE is the natural successor. An 18-20% SOTA also means that current frontier models fail on roughly 80% of expert-written questions - a useful reminder that even impressive models have significant gaps.
Why These Matter in 2026
The benchmark saturation problem is real and accelerating. A benchmark that can no longer discriminate between frontier models is useless for the most important use case - choosing a model for a specific deployment. In 2026, the benchmarks with real headroom are: GPQA-Diamond (7-8 point gap between frontier models), ARC-AGI-2 (15+ point gap), and Humanity's Last Exam (40+ point gap between frontier models and expert humans).
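Using only the figures quoted in this article, those gap and headroom claims can be checked directly (a trivial sketch; the numbers are the April 2026 captures above):

```python
# Spread and headroom from the GPQA-Diamond scores reported in this article.
gpqa_diamond = {"GPT-5": 78.4, "Claude 4.5 Opus": 76.3, "Gemini 2.5 Pro": 74.1,
                "Grok 4": 72.8, "Claude 4 Sonnet": 71.2}

spread = max(gpqa_diamond.values()) - min(gpqa_diamond.values())
headroom = 100.0 - max(gpqa_diamond.values())
print(f"GPQA-Diamond frontier spread:  {spread:.1f} pts")   # 7.2 -> the "7-8 point" gap
print(f"GPQA-Diamond headroom to 100%: {headroom:.1f} pts") # 21.6

# HLE: frontier SOTA ~18-20% vs expert range 60-75% (figures quoted above).
hle_gap_low = 60 - 20   # expert low end vs frontier high end
hle_gap_high = 75 - 18  # expert high end vs frontier low end
print(f"HLE expert-vs-frontier gap: {hle_gap_low}-{hle_gap_high} pts")  # 40-57
```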
Any model comparison that does not include at least one of these three benchmarks is likely relying on saturated metrics. When a vendor card leads with MMLU, HumanEval, or BIG-Bench Hard, ask why they did not include GPQA-Diamond or ARC-AGI-2.
Sources
- [1] Rein et al., GPQA: A Graduate-Level Google-Proof Q&A Benchmark - arxiv.org/abs/2311.12022 - 2023
- [2] Chollet, On the Measure of Intelligence (introduces ARC) - arxiv.org/abs/1911.01547 - 2019
- [3] ARC Prize / ARC-AGI-2 - arcprize.org - Captured April 2026
- [4] Humanity's Last Exam - agi.safe.ai/hle - Captured April 2026
- [5] Vendor model cards (GPT-5, Claude 4.5 Opus, Gemini 2.5 Pro) - Captured April 2026