Abstract
MMLU (saturated): frontier in the low 90s
MMLU-Pro: frontier in the high 80s
MMMU (multimodal): frontier in the high 70s to low 80s
Captured: April 2026
Section II.i · Knowledge Benchmarks | Last verified April 2026

MMLU, MMLU-Pro, and MMMU: The Knowledge Benchmarks in 2026

The benchmark that defined an era, the successor that fixed it, and the multimodal sibling that extended it.

MMLU (Massive Multitask Language Understanding) was the defining benchmark of the 2020 to 2023 era. By 2026 it is saturated and should not be used for frontier model comparisons. MMLU-Pro is the current standard. MMMU extends the concept to multimodal understanding. This page explains all three, when to use each, and what the contamination concerns mean in practice.

Section II.i.1

Original MMLU

MMLU was created by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (UC Berkeley, Columbia University, University of Chicago, and UIUC) and published in 2020. The benchmark contains 15,908 multiple-choice questions across 57 subjects, from elementary mathematics to professional law and medicine. Each question has four answer choices. The standard evaluation is 5-shot: five solved examples from the same subject precede the question being scored.
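
A minimal sketch of how a 5-shot, four-choice prompt can be assembled. The header sentence is modelled on the template commonly used with MMLU, but treat its exact wording as an assumption; `Item`, `format_item`, `build_prompt`, and the toy arithmetic questions are hypothetical names and data introduced for illustration only.

```python
from dataclasses import dataclass

LETTERS = "ABCD"  # four answer choices per question


@dataclass
class Item:
    question: str
    choices: list[str]  # exactly four options
    answer: str         # one of "A".."D"


def format_item(item: Item, include_answer: bool) -> str:
    """Render one question; leave the answer blank for the item being scored."""
    lines = [item.question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, item.choices)]
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject: str, shots: list[Item], target: Item) -> str:
    """Header, then the solved examples, then the unsolved target question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_item(shot, include_answer=True) for shot in shots]
    blocks.append(format_item(target, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)


# Toy data (hypothetical, not taken from the benchmark).
shots = [
    Item(f"What is {n} + {n}?", [str(2 * n - 1), str(2 * n), str(2 * n + 1), str(2 * n + 2)], "B")
    for n in range(1, 6)
]
target = Item("What is 7 + 7?", ["13", "14", "15", "16"], "B")
prompt = build_prompt("elementary mathematics", shots, target)
print(prompt)
```

The model's next token after the final `Answer:` is compared with the gold letter. Small changes to this template can shift scores, which is one reason reported MMLU numbers are not always comparable across evaluation harnesses.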

MMLU was transformative because it established a broad standardised benchmark when evaluation was fragmented. GPT-3.5 scored around 70% in 2022. GPT-4 scored 86.4% in 2023, triggering widespread coverage. That 86.4% figure is still quoted in 2026 as if it represents the frontier. It does not. As of April 2026, the best frontier models have moved into the low 90s.

Saturation verdict: MMLU is saturated. Frontier models cluster in a tight band in the low 90s, less than two percentage points apart. A gap that small is hard to separate from differences in prompt format, answer extraction, and evaluation harness, even on a 15,908-question test. The benchmark can no longer discriminate between frontier models. Cite it only when comparing with pre-2024 models for historical analysis.

Section II.i.2

MMLU-Pro

MMLU-Pro was created by Yubo Wang, Xueguang Ma, Ge Zhang, et al. (TIGER-Lab, University of Waterloo and others) and published in 2024. It redesigns MMLU to address two specific problems: saturation of the 4-choice format and insufficient reasoning depth.

Key changes. MMLU-Pro has 10 answer choices per question (not 4). This makes process of elimination far less effective and forces more careful reasoning. The questions were filtered for higher difficulty and curated from a wider pool including textbooks, STEM exams, and competition problems. MMLU-Pro was designed assuming chain-of-thought (CoT) prompting: the standard evaluation uses CoT, and scores drop significantly without it.
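
The arithmetic behind the elimination point can be sketched in a few lines. This assumes a model that rules out some wrong options and then guesses uniformly among the rest; `guess_accuracy` is a hypothetical helper, not part of either benchmark's tooling.

```python
def guess_accuracy(num_choices: int, eliminated: int = 0) -> float:
    """Expected accuracy of a uniform guess after ruling out `eliminated`
    wrong options (the correct option is assumed never to be ruled out)."""
    remaining = num_choices - eliminated
    if eliminated < 0 or remaining < 1:
        raise ValueError("eliminated must be between 0 and num_choices - 1")
    return 1.0 / remaining


for k in (0, 1, 2):
    print(f"eliminate {k}: 4-choice {guess_accuracy(4, k):.3f}, 10-choice {guess_accuracy(10, k):.3f}")
```

With two wrong options ruled out, a 4-choice question becomes a coin flip (0.5), while a 10-choice question still leaves eight candidates (0.125). The chance floor also drops from 25% to 10%.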

The dataset contains roughly 12,000 questions. As of April 2026, the frontier band sits in the high 80s, with eight to ten percentage points of spread between the top tier and the next tier of frontier models. That spread provides meaningful discrimination. MMLU-Pro is the current standard for knowledge benchmarking in 2026.
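
One way to check that a spread of this size is more than sampling error is a normal-approximation confidence interval. The sketch below makes several assumptions: 12,000 questions, two illustrative accuracies (0.88 and 0.79, not measured scores), and independence between the two models' scores. It covers sampling error only, not prompt or harness effects, and the function names are hypothetical.

```python
import math


def ci_half_width(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width for one model's accuracy,
    in percentage points."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return 100.0 * z * se


def gap_half_width(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> float:
    """95% half-width for the difference between two accuracies, in
    percentage points, treating the two scores as independent."""
    var = (acc_a * (1.0 - acc_a) + acc_b * (1.0 - acc_b)) / n_questions
    return 100.0 * z * math.sqrt(var)


n = 12_000                 # assumed question count
top, nxt = 0.88, 0.79      # illustrative accuracies, not real scores
print(f"single-model interval: +/- {ci_half_width(top, n):.2f} points")
print(f"gap interval:          +/- {gap_half_width(top, nxt, n):.2f} points")
print(f"observed gap:          {100 * (top - nxt):.1f} points")
```

Under these assumptions the sampling interval on the gap is under one percentage point, so an eight to ten point spread is far outside it.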

Section II.i.3

MMMU: Multimodal Understanding

MMMU (Massive Multitask Multimodal Understanding) was created by Xiang Yue et al. in 2024 as the multimodal equivalent of MMLU. The benchmark contains 11,550 questions drawn from college exams and textbooks across 30 subjects, with each question requiring interpretation of one or more images alongside the text.

MMMU's images span diverse types: diagrams, charts, scientific figures, photographs, maps, music scores, and chemical structures. The multimodal requirement captures a capability dimension that text-only benchmarks entirely miss; a model can score in the high 80s on MMLU-Pro while failing to interpret a basic physics diagram. For evaluating vision-language models, MMMU is the current standard.

As of April 2026 the frontier sits in the high 70s to low 80s, with three to five percentage points between leaders. The gap provides useful discrimination. Unlike MMLU, MMMU has not yet saturated.

Section II.i.4

Known Contamination Issues

MMLU has significant contamination concerns. The test questions were sourced from freely available study materials; many appear verbatim or near-verbatim on websites included in Common Crawl, the primary web-crawl training dataset used by most frontier models. Researchers have documented specific MMLU test questions appearing in training data for several frontier model lineages.
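
A minimal sketch of how verbatim or near-verbatim overlap between a test question and a crawled document might be flagged, using shared word n-grams. The 8-word window, the tokenisation, and the function names are assumptions chosen for illustration; this is not the method used by any particular study, and the example texts are invented.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-case the text, keep alphanumeric tokens, return all n-word windows."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, document: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the document."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n tokens: nothing to compare
    return len(question_grams & ngrams(document, n)) / len(question_grams)


# Invented example texts.
question = "Which of the following is the largest planet in the solar system by mass and by volume?"
leaked_page = "Study guide. " + question + " Answer: Jupiter."
clean_page = "Jupiter is a gas giant and the fifth planet from the Sun."

print(overlap_fraction(question, leaked_page))
print(overlap_fraction(question, clean_page))
```

A high fraction suggests the question text appears on the page; a low fraction does not prove the model never saw a paraphrase, so checks like this give a lower bound on contamination rather than a clean bill of health.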

The practical consequence: a model scoring in the low 90s on MMLU may be recalling test questions it saw during training rather than reasoning through them. This is one reason MMLU scores cluster so tightly at the high end; models may have converged on the training data rather than on the underlying reasoning capability.

MMLU-Pro partially addresses contamination by sourcing questions from a wider pool with less overlap with common training corpora. MMMU has lower contamination risk because image content is less frequently included in text-focused training crawls. Neither benchmark has zero contamination risk; all public test sets face this problem structurally.

Reader Questions
Q.01 Is a 90% MMLU score impressive in 2026?
No. By April 2026, frontier models cluster in the low 90s on MMLU and the benchmark is saturated. A 90% score in 2026 means the model is at or below the current frontier band. For meaningful comparisons, use MMLU-Pro (frontier in the high 80s) or GPQA-Diamond (frontier in the high 70s). MMLU is useful only as a historical comparison with pre-2024 models.
Q.02 What subjects does MMLU cover?
MMLU covers 57 subjects across STEM, humanities, social sciences, and professional domains. STEM subjects include mathematics, physics, chemistry, biology, computer science, and engineering. Humanities include history, philosophy, and law. Professional domains include medicine, accounting, and business. Each subject has between 100 and 500+ questions.
Q.03 What is the difference between MMLU and MMLU-Pro?
MMLU has 4 answer choices per question; MMLU-Pro has 10. MMLU allows models to score well by process of elimination; MMLU-Pro requires more deliberate reasoning. MMLU's standard evaluation is 5-shot without chain-of-thought; MMLU-Pro was designed assuming chain-of-thought prompting. MMLU-Pro is harder and still discriminates between frontier models in 2026.
Q.04 Does MMLU predict real-world LLM usefulness?
Weakly. MMLU correlates with general knowledge breadth. But MMLU does not test reasoning under ambiguity, long-context comprehension, instruction following, or agentic capability, all of which matter more for practical deployment. Use MMLU as a floor test, not a quality certificate.

Sources

  [1] Hendrycks et al., MMLU · arxiv.org/abs/2009.03300 · 2020
  [2] Wang et al., MMLU-Pro · arxiv.org/abs/2406.01574 · 2024
  [3] Yue et al., MMMU · mmmu-benchmark.github.io · 2024
  [4] HuggingFace Open LLM Leaderboard v2 · Captured April 2026
  [5] Papers With Code MMLU · paperswithcode.com · Captured April 2026
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
