Humanity's Last Exam: The Knowledge Benchmark Built to Last
The benchmark designed to outlast the current frontier: 2,500 expert-authored questions across mathematics, the sciences, humanities, and classical languages on which the strongest 2026 models score 22-26 percent. That is enough headroom for two to three years of useful discrimination ahead.
What Humanity's Last Exam is
Humanity's Last Exam (HLE), released by the Center for AI Safety and Scale AI in January 2025, is a frontier-difficulty knowledge benchmark built explicitly to outlast current models. The 2,500-question test set was curated through Scale AI's expert network: domain specialists in mathematics, physics, chemistry, biology, computer science, classical languages, history, philosophy, and other graduate-level fields each contributed questions that they believed were hard for someone with PhD-level training in the field.
At launch, frontier models scored under 5 percent. By May 2026 the frontier has reached the low 20s, real progress but well short of saturation. The benchmark's headline goal, named in the launch paper, is to mark the point at which AI systems exceed expert-human capability on closed-domain knowledge tasks. We are not there yet; HLE has at least two to three years of useful headroom on current trajectories.
What distinguishes HLE from earlier knowledge benchmarks is the curation method. MMLU and MMLU-Pro pulled questions from existing exams and textbooks, which exposed the questions to pre-training corpora. GPQA pioneered Google-resistant graduate questions in three sciences. HLE extends this approach to a wider domain set and applies a more rigorous expert-curation process. Each question was reviewed by at least two experts in the relevant field; questions that any reviewer flagged as Google-resolvable or memorisation-likely were rejected.
Domain coverage
HLE's six domain clusters span the full undergraduate-to-PhD knowledge range with a tilt toward technical fields. Mathematics is the largest single domain; classical languages is the most distinctive (no other knowledge benchmark of comparable size includes ancient Greek or Latin translation).
Per-domain scores vary widely. Strong models score in the high 20s on mathematics and computer science, the high teens on classical languages, and below 10 percent on niche specialised fields where sample sizes are small. The honest read on a model's HLE score is the per-domain breakdown alongside the overall, not the overall alone.
Question format and scoring
HLE questions come in two formats. Multiple-choice questions follow a convention similar to MMLU-Pro's (up to 10 options, exact-match scoring). Short-answer questions require the model to produce a free-form answer that is then graded against an expert-provided rubric. The mix is roughly a quarter multiple choice and three quarters short answer; the short-answer fraction is higher in mathematics and the humanities, where multiple-choice formats can leak information.
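The two grading paths above can be sketched as follows. This is a minimal illustration, not the official pipeline: the function names are hypothetical, and real short-answer grading normalises far more aggressively before falling back to an LLM judge.

```python
def grade_multiple_choice(prediction: str, answer: str) -> bool:
    """Exact-match on the chosen option letter, case-insensitive."""
    return prediction.strip().upper() == answer.strip().upper()


def grade_short_answer(prediction: str, accepted: list[str]) -> bool:
    """Check a free-form answer against the rubric's accepted answers.

    Only whitespace and case are normalised here; the real pipeline
    handles equivalent formulations and may defer to an LLM judge.
    """
    norm = " ".join(prediction.lower().split())
    return any(norm == " ".join(a.lower().split()) for a in accepted)
```

For example, `grade_multiple_choice(" b ", "B")` is a match, and `grade_short_answer("Euler's  Formula", ["euler's formula"])` passes after normalisation.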
Short-answer scoring uses a structured rubric: the expert who authored the question specifies the acceptable answer or set of answers, and graders (humans or LLM-as-judge with verification) check submissions against the rubric. The grading is the most operationally complex part of the benchmark; the public leaderboard uses an LLM-as-judge fallback for short-answer questions, which introduces a small amount of noise but is necessary for scale. The CAIS team periodically audits judge agreement against expert re-grading and reports the inter-rater reliability.
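A judge-agreement audit of the kind described above amounts to comparing the LLM judge's correct/incorrect calls against expert re-grades on a sample and reporting a chance-corrected agreement statistic. A common choice is Cohen's kappa; this sketch assumes binary grades and is not the team's actual audit code.

```python
def cohens_kappa(judge: list[bool], expert: list[bool]) -> float:
    """Inter-rater agreement between an LLM judge and expert re-grading,
    corrected for the agreement two independent graders would reach by chance."""
    n = len(judge)
    observed = sum(a == b for a, b in zip(judge, expert)) / n
    p_judge = sum(judge) / n
    p_expert = sum(expert) / n
    # Chance agreement: both say "correct" or both say "incorrect".
    expected = p_judge * p_expert + (1 - p_judge) * (1 - p_expert)
    return (observed - expected) / (1 - expected)
```

A kappa near 1.0 means the LLM judge is a faithful stand-in for expert grading; values much below that suggest the short-answer scores carry meaningful grading noise.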
The headline score is the unweighted mean accuracy across all 2,500 questions. Per-domain scores and per-format scores (multiple choice vs short answer) are also reported. As with all knowledge benchmarks, per-category breakdown is more informative than the overall number.
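The aggregation is simple enough to sketch directly: an unweighted mean over all questions, plus per-domain and per-format means. The record layout below is illustrative, not the benchmark's actual data schema.

```python
from collections import defaultdict


def hle_report(records):
    """records: iterable of (domain, fmt, correct) tuples,
    where correct is a bool. Returns the headline unweighted mean
    plus per-domain and per-format breakdowns."""
    by_domain, by_format = defaultdict(list), defaultdict(list)
    all_results = []
    for domain, fmt, correct in records:
        by_domain[domain].append(correct)
        by_format[fmt].append(correct)
        all_results.append(correct)
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "overall": mean(all_results),
        "by_domain": {d: mean(xs) for d, xs in by_domain.items()},
        "by_format": {f: mean(xs) for f, xs in by_format.items()},
    }
```

With four records, two per domain, the overall score is the plain fraction correct while the breakdowns expose exactly the per-category variation the text recommends inspecting.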
SOTA progression Jan 2025 to May 2026
HLE scores have moved fast since launch but remain low in absolute terms. The frontier has gone from under 5 percent to over 20 percent in 16 months, a far steeper relative improvement than MMLU-Pro showed over a comparable stretch of its saturation curve. This reflects two things: the benchmark was deliberately calibrated so that launch-era models scored very low, and the gap between general models and reasoning-focused models is larger on HLE than on easier benchmarks.
The reasoning-model gap
HLE is the knowledge benchmark where the gap between reasoning-focused models (specialised for multi-step problem solving with extended chain-of-thought) and general models is most visible. The strongest reasoning models score 24-26 percent in May 2026; the strongest non-reasoning models score 18-20 percent. The 6-8 point gap is much larger than the equivalent gap on MMLU-Pro (typically 1-2 points) or GPQA-Diamond (typically 3-4 points).
This is a useful signal. Many HLE questions require multi-step reasoning rather than single-step recall: a mathematical question that requires constructing a proof, a classical-language question that requires parsing a complex sentence, a philosophy question that requires evaluating an argument structure. Reasoning-focused models have the architectural and training advantage on these formats, and HLE rewards them disproportionately.
When citing HLE scores, always disclose the reasoning configuration: chain-of-thought enabled or not, extended-thinking budget if applicable, single-shot or multi-attempt. A reasoning model with extended CoT and a non-reasoning model in single-shot mode are not directly comparable; the gap can be 10+ points in either direction depending on configuration.
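One way to make that disclosure routine is to never pass a bare number around. The record below is a hypothetical convention, not a format the benchmark prescribes; the field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HLEResult:
    """A citable HLE score with its reasoning configuration attached."""
    model: str
    score: float                              # overall accuracy, 0-100
    cot_enabled: bool                         # chain-of-thought on or off
    thinking_budget_tokens: Optional[int]     # extended-thinking budget, if any
    attempts: int                             # 1 = single-shot


def cite(r: HLEResult) -> str:
    """Render the score with its configuration so it cannot be quoted bare."""
    cfg = "CoT" if r.cot_enabled else "no CoT"
    if r.thinking_budget_tokens:
        cfg += f", {r.thinking_budget_tokens}-token budget"
    cfg += f", {r.attempts} attempt(s)"
    return f"{r.model}: {r.score:.1f}% on HLE ({cfg})"
```

For instance, `cite(HLEResult("example-model", 24.3, True, 32000, 1))` yields a string that carries the configuration alongside the number, which is the comparison discipline the paragraph above asks for.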
Strengths and limits
Strengths: expert-curated, contamination-resistant by construction, per-domain reporting, well-instrumented for short-answer grading, large headroom remaining (saturation is years away). HLE is the cleanest frontier-difficulty knowledge benchmark in active use in 2026.
Limits: 2,500 questions is small relative to MMLU-Pro's 12,000, so per-domain noise is real. The expert-curation process is expensive and limits how often the test set can be refreshed. Short-answer LLM-as-judge grading introduces a small amount of noise. The benchmark is English-centric (with a partial exception for classical languages); multilingual capability is not directly measured. The domain distribution is biased toward technical fields where expert curation is more tractable; humanities and social sciences are under-represented relative to their academic prevalence.
The most important caveat is that HLE measures one thing very well (frontier-difficulty knowledge) and other things not at all (general capability, agent capability, real-world workflow competence). A model with a strong HLE score is not necessarily a stronger assistant or engineer than a model with a weaker HLE score; it is more likely to be a strong specialised reasoner. The right use is alongside MMLU-Pro (breadth), GPQA-Diamond (graduate-level reasoning in sciences), SWE-bench Verified (engineering), GAIA (assistant work), and at least one preference signal like Chatbot Arena.
When to use HLE in 2026
Quote HLE when the question is "how strong is this model on frontier-difficulty knowledge and reasoning across many domains". It is the clearest discriminator currently available between frontier reasoning-focused models. It is the wrong benchmark for breadth (use MMLU-Pro), for engineering (use SWE-bench Verified), for agent capability (use GAIA, OSWorld, Tau-Bench), or for preference (use Chatbot Arena).
Quote the overall score, the reasoning-vs-non-reasoning configuration, and the per-domain breakdown if claiming domain-specific competence. Treat sub-2-point differences as noise; treat sub-5-point differences as suggestive but not conclusive. The benchmark's value is the slope of progress, not any individual snapshot. See our reasoning-benchmark comparison for the wider landscape.
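The noise thresholds above follow from the test-set size. Under a simple binomial model (assuming independent questions, which understates the true uncertainty), the standard error of an accuracy estimate near 25 percent on 2,500 questions is below 1 point, so a roughly 2-point difference between two independent scores is the edge of statistical distinguishability:

```python
import math


def accuracy_se(p: float, n: int = 2500) -> float:
    """Standard error of an accuracy estimate p on n questions,
    under a simple binomial model with independent items."""
    return math.sqrt(p * (1 - p) / n)


def difference_threshold(p: float, n: int = 2500, z: float = 1.96) -> float:
    """Approximate 95% threshold for the gap between two independent
    scores near p to be distinguishable from sampling noise."""
    return z * math.sqrt(2) * accuracy_se(p, n)
```

At p = 0.25 and n = 2,500 the standard error is about 0.87 points and the two-model threshold about 2.4 points, which is why sub-2-point gaps should be read as noise.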