Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: ~2,500 expert-authored frontier-difficulty questions across mathematics, sciences, humanities, languages
Who: Phan et al., CAIS + Scale AI, 2025 (lastexam.ai)
2026 Tier: Frontier 22-26% (reasoning-focused models lead)
Leaderboard: lastexam.ai
Section I.vii · Frontier Knowledge | Last verified May 2026

Humanity's Last Exam: The Knowledge Benchmark Built to Last

The benchmark designed to outlast the current frontier. 2,500 expert-authored questions across mathematics, sciences, humanities, and classical languages where the strongest 2026 models score 22-26 percent. The kind of headroom that means three to four years of useful discrimination ahead.

01

What Humanity's Last Exam is

Humanity's Last Exam (HLE), released by the Center for AI Safety and Scale AI in January 2025, is a frontier-difficulty knowledge benchmark built explicitly to outlast current models. The 2,500-question test set was curated through Scale AI's expert network: domain specialists in mathematics, physics, chemistry, biology, computer science, classical languages, history, philosophy, and other graduate-level fields each contributed questions that they believed were hard for someone with PhD-level training in the field.

The launch frontier scored under 5 percent. By May 2026 the frontier has reached the low-to-mid 20s, which is real progress but well short of saturation. The benchmark's headline goal, named in the launch paper, is to mark the point at which AI systems exceed expert-human capability on closed-domain knowledge tasks. We are not there yet; HLE has at least 2-3 years of useful headroom on current trajectories.

What distinguishes HLE from earlier knowledge benchmarks is the curation method. MMLU and MMLU-Pro pulled questions from existing exams and textbooks, which exposed the questions to pre-training corpora. GPQA pioneered Google-resistant graduate questions in three sciences. HLE extends this approach to a wider domain set and applies a more rigorous expert-curation process. Each question was reviewed by at least two experts in the relevant field; questions that any reviewer flagged as Google-resolvable or memorisation-likely were rejected.

02

Domain coverage

HLE's six domain clusters span the full undergraduate-to-PhD knowledge range with a tilt toward technical fields. Mathematics is the largest single domain; classical languages is the most distinctive (no other knowledge benchmark of comparable size includes ancient Greek or Latin translation).

| Domain cluster | What it covers |
| --- | --- |
| Mathematics | Proof-based problems, algebra, analysis, combinatorics. The largest single domain. Strong models score in the high 20s; the rest score below 15. |
| Physics, chemistry, biology | Graduate-level material across the three traditional sciences. Overlaps in spirit with GPQA-Diamond but with deeper specialisation. |
| Computer science | Theoretical CS, advanced algorithms, complexity theory, proof techniques. Distinct from coding benchmarks (LiveCodeBench, HumanEval), which test generation. |
| Classical languages | Translation and analysis of Greek, Latin, and other classical languages. Tests language depth rather than fluency in a way modern-language benchmarks do not. |
| History and humanities | Specific historical facts, philosophical arguments, art history. Tests knowledge depth across niche fields. |
| Other specialised | Ecology, geology, linguistics, music theory, niche engineering. The long tail; small per-category sample sizes mean per-category scores are noisy. |

Per-domain scores vary widely. Strong models score in the high 20s on mathematics and computer science, the high teens on classical languages, and below 10 on niche specialised fields where sample sizes are small. The honest read on a model's HLE score is the per-domain breakdown plus the overall, not the overall alone.

03

Question format and scoring

HLE questions come in two formats. Multiple-choice questions follow the same convention as MMLU-Pro (5 to 10 options, exact-match scoring). Short-answer questions require the model to produce a free-form answer that is then graded against an expert-provided rubric. The mix is roughly 70 percent multiple choice, 30 percent short answer; the short-answer fraction is higher in mathematics and humanities, where multiple-choice formats can leak information.

Short-answer scoring uses a structured rubric: the expert who authored the question specifies the acceptable answer or set of answers, and graders (humans or LLM-as-judge with verification) check submissions against the rubric. The grading is the most operationally complex part of the benchmark; the public leaderboard uses an LLM-as-judge fallback for short-answer questions, which introduces a small amount of noise but is necessary for scale. The CAIS team periodically audits judge agreement against expert re-grading and reports the inter-rater reliability.
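
To make the two grading paths concrete, here is a minimal sketch of how a single question might be scored. The data structures, function names, and the judge callable are illustrative assumptions, not the CAIS or Scale AI grading code.

```python
# Illustrative sketch only: field names, rubric layout, and the judge callable
# are assumptions for demonstration, not the benchmark's actual grading code.
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    domain: str                 # e.g. "mathematics", "classical_languages"
    fmt: str                    # "multiple_choice" or "short_answer"
    answer_key: str             # correct option label for multiple choice, e.g. "C"
    rubric: list[str] = field(default_factory=list)  # acceptable short answers


def grade_multiple_choice(q: Question, model_choice: str) -> bool:
    # Exact-match scoring, MMLU-Pro style: the chosen option label must equal the key.
    return model_choice.strip().upper() == q.answer_key.strip().upper()


def grade_short_answer(q: Question, model_answer: str, judge) -> bool:
    # Rubric-based grading: check the free-form answer against the expert's
    # acceptable answers, falling back to a (human or LLM) judge for paraphrases.
    normalised = model_answer.strip().lower()
    if any(normalised == acceptable.strip().lower() for acceptable in q.rubric):
        return True
    return judge(question=q, answer=model_answer, rubric=q.rubric)
```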

The headline score is the unweighted mean accuracy across all 2,500 questions. Per-domain scores and per-format scores (multiple choice vs short answer) are also reported. As with all knowledge benchmarks, per-category breakdown is more informative than the overall number.
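
In code, that aggregation is just a handful of unweighted means over a per-question results table. A sketch, assuming each record carries the question's domain, format, and a pass/fail flag (the field layout is illustrative):

```python
from collections import defaultdict

# records: one (domain, fmt, correct) tuple per graded question.
def hle_scores(records: list[tuple[str, str, bool]]) -> dict:
    overall = sum(correct for _, _, correct in records) / len(records)

    by_domain: dict[str, list[bool]] = defaultdict(list)
    by_format: dict[str, list[bool]] = defaultdict(list)
    for domain, fmt, correct in records:
        by_domain[domain].append(correct)
        by_format[fmt].append(correct)

    return {
        "overall": overall,  # the headline score: unweighted mean accuracy
        "per_domain": {d: sum(v) / len(v) for d, v in by_domain.items()},
        "per_format": {f: sum(v) / len(v) for f, v in by_format.items()},
    }
```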

04

SOTA progression Jan 2025 to May 2026

HLE scores have risen quickly since launch in relative terms but remain low in absolute terms. The frontier has gone from under 5 percent to over 20 percent in 16 months, an order-of-magnitude faster relative improvement than MMLU-Pro's saturation-era curve. This reflects two things: launch difficulty was deliberately pitched so high that starting scores were very low, and the gap between general models and reasoning-focused models is larger on HLE than on easier benchmarks.

| Date | Tier | Note |
| --- | --- | --- |
| Jan 2025 | Launch, frontier under 5% | CAIS + Scale AI release; deliberately set to be hard for current models. |
| Apr 2025 | Frontier reaches 8% | First reasoning-focused models post higher scores than general models. |
| Sep 2025 | Frontier crosses 12% | Strong CoT-enabled scaffolds with verification loops. |
| Jan 2026 | Frontier at 18% | Reasoning-focused models open clearer gap over general models. |
| May 2026 | Frontier 22-26% (reasoning models lead) | Real headroom remains; saturation is years away. |

05

The reasoning-model gap

HLE is the knowledge benchmark where the gap between reasoning-focused models (specialised for multi-step problem solving with extended chain-of-thought) and general models is most visible. The strongest reasoning models score 24-26 percent in May 2026; the strongest non-reasoning models score 18-20. The 6-8 point gap is much larger than the equivalent gap on MMLU-Pro (typically 1-2 points) or GPQA-Diamond (typically 3-4 points).

This is a useful signal. Many HLE questions require multi-step reasoning rather than single-step recall: a mathematical question that requires constructing a proof, a classical-language question that requires parsing a complex sentence, a philosophy question that requires evaluating an argument structure. Reasoning-focused models have the architectural and training advantage on these formats, and HLE rewards them disproportionately.

When citing HLE scores, always disclose the reasoning configuration: chain-of-thought enabled or not, extended-thinking budget if applicable, single-shot or multi-attempt. A reasoning model with extended CoT and a non-reasoning model in single-shot mode are not directly comparable; the gap can be 10+ points in either direction depending on configuration.
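
One low-effort way to enforce that discipline is to store the configuration next to the score whenever an HLE number is recorded or quoted. The schema below is an illustrative sketch, not a standard reporting format, and the model names in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HLEScoreCitation:
    model: str
    overall_pct: float
    cot_enabled: bool                       # chain-of-thought on or off
    thinking_budget_tokens: Optional[int]   # extended-thinking budget, if any
    attempts: int                           # 1 = single-shot, >1 = multi-attempt
    capture_date: str                       # when the leaderboard was read
    source: str = "lastexam.ai"


# Hypothetical example: two results that should not be compared as if equivalent.
a = HLEScoreCitation("reasoning-model-x", 25.1, True, 32_000, 1, "2026-05-01")
b = HLEScoreCitation("general-model-y", 19.4, False, None, 1, "2026-05-01")
```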

06

Strengths and limits

Strengths: expert-curated, contamination-resistant by construction, per-domain reporting, well-instrumented for short-answer grading, large headroom remaining (saturation is years away). HLE is the cleanest frontier-difficulty knowledge benchmark in active use in 2026.

Limits: 2,500 questions is small relative to MMLU-Pro's 12,000, so per-domain noise is real. The expert-curation process is expensive and limits how often the test set can be refreshed. Short-answer LLM-as-judge grading introduces a small amount of noise. The benchmark is English-centric (with a partial exception for classical languages); multilingual capability is not directly measured. The domain distribution is biased toward technical fields where expert curation is more tractable; humanities and social sciences are under-represented relative to their academic prevalence.

The most important caveat is that HLE measures one thing very well (frontier-difficulty knowledge) and other things not at all (general capability, agent capability, real-world workflow competence). A model with a strong HLE score is not necessarily a stronger assistant or engineer than a model with a weaker HLE score; it is more likely to be a strong specialised reasoner. The right use is alongside MMLU-Pro (breadth), GPQA-Diamond (graduate-level reasoning in sciences), SWE-bench Verified (engineering), GAIA (assistant work), and at least one preference signal like Chatbot Arena.

07

When to use HLE in 2026

Quote HLE when the question is "how strong is this model on frontier-difficulty knowledge and reasoning across many domains". It is the clearest discriminator currently available between frontier reasoning-focused models. It is the wrong benchmark for breadth (use MMLU-Pro), for engineering (use SWE-bench Verified), for agent capability (use GAIA, OSWorld, Tau-Bench), or for preference (use Chatbot Arena).

Quote the overall score, the reasoning-vs-non-reasoning configuration, and the per-domain breakdown if claiming domain-specific competence. Treat sub-2-point differences as noise; treat sub-5-point differences as suggestive but not conclusive. The benchmark's value is the slope of progress, not any individual snapshot. See our reasoning-benchmark comparison for the wider landscape.
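
The 2-point threshold is roughly what the sample size implies. Treating the 2,500 questions as independent Bernoulli trials (an assumption, since questions cluster by domain), the normal-approximation 95 percent interval around a mid-20s score is about plus or minus 1.7 points, so sub-2-point gaps sit inside measurement noise. A back-of-the-envelope check:

```python
import math

def accuracy_ci_halfwidth(p: float, n: int = 2500, z: float = 1.96) -> float:
    # Normal-approximation half-width of a 95% CI for accuracy p over n questions.
    return z * math.sqrt(p * (1 - p) / n)

half = 100 * accuracy_ci_halfwidth(0.24)   # ~1.7 percentage points
print(round(half, 1))                      # -> 1.7

# 24.0% vs 25.5% overlaps within this band: treat as noise.
# 24.0% vs 30.0% clears it comfortably: a real difference.
```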

Editor's verdict: Humanity's Last Exam is the clearest frontier-difficulty knowledge benchmark in 2026. Real headroom (frontier in the low-to-mid 20s), real expert curation, real reasoning-model differentiation. Quote it for any serious model comparison; expect it to remain useful for years.
Reader Questions
Q.01 What is Humanity's Last Exam?
Humanity's Last Exam (HLE) is a frontier-difficulty knowledge benchmark released in January 2025 by Center for AI Safety (CAIS) and Scale AI. It contains roughly 2,500 multiple-choice and short-answer questions across mathematics, physics, chemistry, biology, computer science, ecology, classical languages, ancient history, philosophy, and other graduate-level domains. Each question was authored by a domain expert and is intended to be hard for someone with PhD-level training in the field. The benchmark explicitly targets the territory where MMLU, MMLU-Pro, and GPQA-Diamond have lost discrimination.
Q.02 Why is HLE called Humanity's Last Exam?
The naming reflects the underlying premise: this is intended to be the last hard, closed-ended knowledge exam of its kind that needs to be set for frontier models. The launch frontier scored under 5 percent. The expectation in the launch paper was that solving HLE would mark the point at which AI systems exceed expert-human capability on closed-domain knowledge questions. By May 2026 the frontier has reached the low-to-mid 20s, which is meaningful progress but well short of saturation, suggesting the benchmark has at least 2-3 years of useful headroom remaining.
Q.03 How does HLE differ from MMLU-Pro and GPQA-Diamond?
MMLU-Pro tests broad undergraduate-to-graduate knowledge across 14 fields; current frontier scores around 88 percent. GPQA-Diamond tests Google-resistant graduate questions in physics, biology, chemistry; current frontier scores around 75 percent. HLE tests genuinely frontier-difficulty questions across a wider domain range; current frontier scores around 22 percent. The three benchmarks form a difficulty hierarchy: MMLU-Pro for breadth, GPQA-Diamond for graduate-level reasoning in three specific sciences, HLE for true expert-level questions across a broader domain set. Quote all three for serious frontier comparisons.
Q.04 What kinds of questions does HLE include?
Questions range from technical mathematics (proof-based problems, advanced algebra) to obscure history (specific dates, lesser-known figures), specialised biology (specific molecular pathways, taxonomy), classical languages (translation of Greek or Latin passages), philosophy (specific arguments and counter-arguments), and computer science (advanced algorithm complexity, theoretical CS). The breadth is intentional: the benchmark is testing whether models have expert-level depth across many domains, not whether they have memorised a specific corpus.
Q.05 Is HLE contamination-resistant?
Mostly. The questions are newly authored and curated through Scale AI's expert network rather than scraped from existing test sets. The CAIS team rotates a fraction of the test set periodically and holds a private validation split. However, the underlying domain knowledge (e.g. classical Greek grammar, advanced algebra theorems) is well represented in pre-training data, so a model that knows the field well will perform partly through genuine knowledge and partly through prior exposure. The contamination risk is structurally low but not zero. HLE is the cleanest knowledge benchmark in active use; see /benchmark-contamination for the wider issue.
Q.06 What is the current HLE SOTA and where will it go?
As of May 2026 the frontier on HLE sits around 22-26 percent overall. The strongest reasoning-focused models with chain-of-thought enabled score 24-26 percent; the strongest non-reasoning models score 18-20 percent. This is a meaningful capability signal: the gap between reasoning-style models (specialised for multi-step problem solving) and general models is more visible on HLE than on any other current knowledge benchmark. We expect the frontier to reach 40-50 percent by 2028 if current trajectories continue; saturation (above 90 percent) is unlikely before 2030.
Related: MMLU-Pro · GPQA and ARC-AGI · Reasoning Benchmarks Compared · Chatbot Arena · Full Benchmark Reference · Benchmark Contamination · What Benchmarks Miss

Sources

  1. Phan, L. et al. (2025). Humanity's Last Exam. Center for AI Safety and Scale AI. Project: lastexam.ai.
  2. Center for AI Safety project page. safe.ai. Accessed May 2026.
  3. Scale AI Research Blog: HLE methodology and expert-curation process. scale.com/research.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.

Book a 30-min scoping call: Digital Signet →

30 minutes, free, independent. · 1-page action plan within 48h. · Honest if not the right fit.