Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. Scores are cited with source and capture date.
TL;DR - The Reasoning Frontier
GPQA-Diamond SOTA: 78.4% (GPT-5)
ARC-AGI-2 SOTA: 65.8% (GPT-5)
HLE SOTA: 18.4% (GPT-5)
Last verified April 2026

GPQA, ARC-AGI, and Humanity's Last Exam - The Reasoning Frontier 2026

MMLU is saturated. HumanEval is saturated. BIG-Bench Hard is approaching saturation. The benchmarks that still have real headroom in 2026 are GPQA-Diamond, ARC-AGI-2, and Humanity's Last Exam. These are the benchmarks frontier model comparisons should be made on.

GPQA - Graduate-Level Google-Proof Q&A

GPQA (Graduate-Level Google-Proof Q&A) was created by David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman (various institutions including New York University), published in 2023.

The benchmark contains 448 multiple-choice questions across biology, physics, and chemistry. Crucially, questions were written by PhD researchers with the explicit requirement that they be impossible to answer correctly by searching the web - you cannot Google your way to the right answer. Expert respondents (PhDs in the relevant field) answered correctly approximately 65% of the time on the Diamond subset. Non-experts with internet access answered correctly only 34% of the time.
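Scoring a multiple-choice set like GPQA reduces to exact match against an answer key. A minimal sketch, with invented placeholder data rather than real GPQA items; production harnesses also handle prompting and extracting the answer letter from model output:

```python
# Minimal accuracy scoring for a GPQA-style 4-option multiple-choice set.
# The answer key and predictions below are placeholders, not real GPQA data.

def score(predictions, answer_key):
    """Return the fraction of exact-match correct answers."""
    assert len(predictions) == len(answer_key)
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

answer_key = ["A", "C", "B", "D"]     # gold choices
predictions = ["A", "C", "D", "D"]    # model outputs after answer extraction
print(f"accuracy: {score(predictions, answer_key):.1%}")  # accuracy: 75.0%
```

With four options, random guessing scores 25%, so the 34% non-expert baseline is barely above chance.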

GPQA-Diamond is the 198-question hardest subset, curated from the full 448 by keeping questions that both expert validators answered correctly but that a majority of non-expert validators answered incorrectly. This is the canonical version for frontier model comparison. As of April 2026, GPT-5 leads at 78.4%, Claude 4.5 Opus is at 76.3%, and Gemini 2.5 Pro is at 74.1%.

GPQA-Diamond is approaching saturation but has not yet reached it. The 7-8 percentage point gap between frontier models provides meaningful discrimination. Within 1-2 years, as models continue to improve, GPQA-Diamond is expected to follow MMLU into the saturated category.

Model                   GPQA-Diamond   GPQA Main   Captured
GPT-5                   78.4%          84.2%       Apr 2026
Claude 4.5 Opus         76.3%          81.7%       Apr 2026
Gemini 2.5 Pro          74.1%          80.2%       Apr 2026
Grok 4                  72.8%          79.3%       Apr 2026
Claude 4 Sonnet         71.2%          77.8%       Apr 2026
Expert human baseline   65.0%          69.0%       2023 (paper)
0-shot CoT. Sources: vendor model cards, Papers With Code. Expert baseline from Rein et al. 2023 paper.

ARC-AGI and ARC-AGI-2

ARC (Abstraction and Reasoning Corpus) was created by Francois Chollet at Google (2019) as a challenge to conventional AI benchmarking. Chollet's argument: intelligence should be measured by the ability to learn new skills from minimal examples, not by performance on tasks that can be memorised from training data. ARC tasks present a small number of input-output grid examples and ask the system to apply the inferred rule to a new input.
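An ARC task can be represented as a handful of train input-output grid pairs plus a test input. The toy task below is invented for illustration (the hidden rule is a simple colour substitution, which real ARC tasks go far beyond - symmetry, counting, object movement), with a solver hand-written for just that rule family:

```python
# Toy ARC-style task: grids are lists of lists of ints (colours 0-9).
# This task and its rule are invented; real ARC rules are far more varied.

task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": {"input": [[0, 1], [1, 1]]},
}

def induce_colour_map(pairs):
    """Infer a per-cell colour substitution consistent with all train pairs."""
    mapping = {}
    for pair in pairs:
        for row_in, row_out in zip(pair["input"], pair["output"]):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("not a pure colour substitution")
    return mapping

mapping = induce_colour_map(task["train"])          # {1: 2, 0: 0}
prediction = [[mapping.get(c, c) for c in row] for row in task["test"]["input"]]
print(prediction)  # [[0, 2], [2, 2]]
```

The hard part, of course, is that a real solver cannot assume the rule family in advance - it must search a vast space of possible transformations from two or three examples, which is exactly what the benchmark is designed to test.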

The 85% human baseline is the key reference point. Humans solve 85% of ARC tasks without prior exposure. Early frontier AI models scored near 0%. The benchmark became one of the most discussed AI challenges because it resisted every technique that worked on other benchmarks - CoT, tool use, fine-tuning.

In December 2024, OpenAI's o3 model scored 87.5% on the semi-private ARC-AGI evaluation set (above the human baseline), using a high-compute configuration with inference costs estimated in the thousands of dollars per task. This breakthrough triggered the development of ARC-AGI-2.

ARC-AGI-2 (Chollet, 2025) introduces significantly harder tasks designed to resist the high-compute, chain-of-thought techniques that cracked the original. As of April 2026, SOTA is 65.8% (GPT-5), far below the human baseline, and the gap between models is substantial.

Humanity's Last Exam

Humanity's Last Exam (HLE) was created by Scale AI and the Center for AI Safety (2025). The benchmark contains 3,000 questions written by domain experts across mathematics, sciences, and the humanities, specifically designed to remain hard for frontier models for several years.

The name reflects a hypothesis: this will be the last benchmark on which human experts consistently outperform AI models. As of April 2026, frontier models score approximately 18-20%, while domain experts who wrote the questions score 60-75%. The gap is substantial and the benchmark has enormous headroom.

HLE is the benchmark to watch for frontier model comparisons over the next 2-3 years. When GPQA-Diamond saturates (expected within 12-18 months at current progress rates), HLE is the natural successor. The 18% SOTA also means that current frontier models fail on more than 80% of expert-written questions - a useful reminder that even impressive models have significant gaps.

Why These Matter in 2026

The benchmark saturation problem is real and accelerating. A benchmark that can no longer discriminate between frontier models is useless for the most important use case - choosing a model for a specific deployment. In 2026, the benchmarks with real headroom are: GPQA-Diamond (7-8 point gap between frontier models), ARC-AGI-2 (15+ point gap), and Humanity's Last Exam (50+ point gap between frontier models and expert humans).

Any model comparison that does not include at least one of these three benchmarks is likely relying on saturated metrics. When a vendor card leads with MMLU, HumanEval, or BIG-Bench Hard, ask why they did not include GPQA-Diamond or ARC-AGI-2.

Frequently Asked Questions

What does ARC-AGI actually measure?
ARC-AGI measures the ability to induce abstract rules from visual examples and apply them to new cases. Each task shows a few input-output grid pairs, and the model must identify the underlying transformation rule and apply it to a new input. Humans find these tasks straightforward (average 85%). ARC-AGI-2 is a harder version designed to resist the techniques that cracked the original in December 2024.
Why is HLE called Humanity's Last Exam?
Humanity's Last Exam is named for the hypothesis that it will be the last benchmark where human experts consistently outperform frontier AI. The 3,000 questions were written by domain experts specifically to be hard for AI. Current SOTA is around 18%, but this is expected to rise over the next 2-3 years as models improve.
What does a GPQA-Diamond score mean?
GPQA-Diamond tests graduate-level scientific reasoning across biology, chemistry, and physics. A 70% GPQA-Diamond score means the model correctly answers 70% of questions designed to defeat web search. The expert human baseline is approximately 65% on GPQA-Diamond, so frontier models now exceed typical PhD-level performance on this specific benchmark.
Is ARC-AGI a test of AGI?
ARC-AGI measures a specific capability (visual abstract reasoning with minimal training examples) that its creator argues is more aligned with general intelligence than knowledge recall. Scoring 85% on ARC-AGI does not mean a system is generally intelligent. The benchmark is valuable for measuring this narrow but challenging capability.

Sources

  1. Rein et al., GPQA - arxiv.org/abs/2311.12022 - 2023
  2. Chollet, On the Measure of Intelligence (ARC) - arxiv.org/abs/1911.01547 - 2019
  3. ARC Prize / ARC-AGI-2 - arcprize.org - Captured April 2026
  4. Humanity's Last Exam - agi.safe.ai/hle - Captured April 2026
  5. Vendor model cards (GPT-5, Claude 4.5 Opus, Gemini 2.5 Pro) - Captured April 2026