MMLU, MMLU-Pro, and MMMU: The Knowledge Benchmarks in 2026
The benchmark that defined an era, the successor that fixed it, and the multimodal sibling that extended it.
MMLU (Massive Multitask Language Understanding) was the defining benchmark of the 2020 to 2023 era. By 2026 it is saturated and should not be used for frontier model comparisons. MMLU-Pro is the current standard. MMMU extends the concept to multimodal understanding. This page explains all three, when to use each, and what the contamination concerns mean in practice.
Original MMLU
MMLU was created by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (UC Berkeley, Columbia University, University of Chicago, and UIUC) and published in 2020. The benchmark contains 15,908 multiple-choice questions across 57 subjects, from elementary mathematics to professional law and medicine. Each question has 4 answer choices. The standard evaluation is 5-shot: the prompt shows five solved example questions from the same subject before the test question.
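To make the 5-shot setup concrete, here is a minimal sketch of how such a prompt might be assembled. The function names, the tuple layout, and the exact header wording are assumptions for illustration, not the official harness; real evaluation frameworks differ in formatting details.

```python
from typing import Optional, Sequence, Tuple

LETTERS = "ABCD"  # MMLU items have four answer choices

def format_question(question: str, choices: Sequence[str],
                    answer: Optional[str] = None) -> str:
    """Render one item; include the answer letter only for solved exemplars."""
    if len(choices) != len(LETTERS):
        raise ValueError("expected exactly 4 choices")
    lines = [question.strip()]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(subject: str,
                          exemplars: Sequence[Tuple[str, Sequence[str], str]],
                          target: Tuple[str, Sequence[str]],
                          k: int = 5) -> str:
    """Concatenate k solved exemplars followed by the unsolved target item."""
    # Header wording is an assumption modelled on common MMLU harnesses.
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.")
    blocks = [format_question(q, c, a) for q, c, a in exemplars[:k]]
    blocks.append(format_question(target[0], target[1]))
    return header + "\n\n" + "\n\n".join(blocks)
```

The model is then scored on which letter it produces (or assigns the highest probability to) after the final "Answer:".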
MMLU was transformative because it established a broad standardised benchmark when evaluation was fragmented. GPT-3.5 scored around 70% in 2022. GPT-4 scored 86.4% in 2023, triggering widespread coverage. That 86.4% figure is still quoted in 2026 as if it represents the frontier. It does not. As of April 2026, the best frontier models have moved into the low 90s.
MMLU-Pro
MMLU-Pro was created by Yubo Wang, Xueguang Ma, Ge Zhang, and colleagues (TIGER-Lab, University of Waterloo and others), published in 2024. It redesigns MMLU to address two specific problems: 4-choice saturation and insufficient reasoning depth.
Key changes. MMLU-Pro has up to 10 answer choices per question (MMLU has 4). This makes process-of-elimination far less effective and forces more careful reasoning. The questions were filtered for higher difficulty and curated from a wider pool including textbooks, STEM exams, and competition problems. MMLU-Pro was designed with chain-of-thought (CoT) prompting in mind: the standard evaluation uses CoT, and scores drop significantly without it.
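A small arithmetic sketch shows why the larger option set weakens guessing. The helper names are hypothetical, and the model assumes a uniform random guess among whatever options remain, which is a simplification of how a real model behaves.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of a uniform random guess."""
    return 1.0 / num_choices

def guess_after_elimination(num_choices: int, eliminated: int) -> float:
    """Chance of guessing right after ruling out `eliminated` wrong options."""
    remaining = num_choices - eliminated
    if remaining < 1:
        raise ValueError("cannot eliminate every option")
    return 1.0 / remaining

def chance_adjusted(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so that pure guessing maps to 0 and perfect to 1."""
    chance = chance_accuracy(num_choices)
    return (accuracy - chance) / (1.0 - chance)

# Ruling out two distractors helps far more with 4 options than with 10.
print(guess_after_elimination(4, 2))   # 0.5
print(guess_after_elimination(10, 2))  # 0.125
```

Under this simplified model, a test-taker who can discard two distractors reaches 50% on a 4-choice item but only 12.5% on a 10-choice item, so partial knowledge earns much less credit.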
The dataset contains roughly 12,000 questions. As of April 2026, the frontier band sits in the high 80s, with eight to ten percentage points of spread between the top tier and the next tier of frontier models. That spread provides meaningful discrimination. MMLU-Pro is the current standard for knowledge benchmarking in 2026.
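One way to sanity-check that such a spread is meaningful is to compare it with the sampling noise of the score itself. The sketch below uses a normal-approximation binomial interval. The 0.88 accuracy is an illustrative stand-in for "high 80s", and treating questions as independent draws is an assumption that ignores prompt sensitivity and run-to-run variance.

```python
import math

def accuracy_ci_half_width(accuracy: float, num_questions: int,
                           z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for accuracy."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / num_questions)
    return z * standard_error

# Illustrative values: a score of 0.88 on about 12,000 questions.
half_width = accuracy_ci_half_width(0.88, 12_000)
print(f"+/- {100 * half_width:.2f} percentage points")
```

Under these assumptions the interval is well under one percentage point either way, so a gap of eight to ten points between tiers is far larger than sampling noise alone would explain.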
MMMU: Multimodal Understanding
MMMU (Massive Multi-discipline Multimodal Understanding) was created by Xiang Yue et al. in 2024 as the multimodal equivalent of MMLU. The benchmark contains 11,550 questions drawn from college exams and textbooks across 30 subjects, with each question requiring interpretation of one or more images alongside the text.
MMMU's images span diverse types: diagrams, charts, scientific figures, photographs, maps, music scores, and chemical structures. The multimodal requirement captures a capability dimension that text-only benchmarks entirely miss; a model can score in the high 80s on MMLU-Pro while failing to interpret a basic physics diagram. For evaluating vision-language models, MMMU is the current standard.
As of April 2026 the frontier sits in the high 70s to low 80s, with three to five percentage points between leaders. The gap provides useful discrimination. Unlike MMLU, MMMU has not yet saturated.
Known Contamination Issues
MMLU has significant contamination concerns. The test questions were sourced from freely available study materials; many appear verbatim or near-verbatim on websites included in Common Crawl, the primary web-crawl training dataset used by most frontier models. Researchers have documented specific MMLU test questions appearing in training data for several frontier model lineages.
The practical consequence: a model scoring in the low 90s on MMLU may be recalling test questions it saw during training rather than reasoning through them. This is one reason MMLU scores cluster so tightly at the high end; models may have converged on the training data rather than on the underlying reasoning capability.
MMLU-Pro partially addresses contamination by sourcing questions from a wider pool with less overlap with common training corpora. MMMU has lower contamination risk because image content is less frequently included in text-focused training crawls. Neither benchmark has zero contamination risk; all public test sets face this problem structurally.
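The verbatim and near-verbatim overlap described above is often probed with word n-gram matching between test questions and training documents. The following is a minimal sketch of that idea, not any lab's actual decontamination pipeline: the function names are hypothetical, and the default window of 8 words is an arbitrary choice for illustration.

```python
import re
from typing import Set, Tuple

def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(question: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the document."""
    question_grams = word_ngrams(question, n)
    if not question_grams:  # question shorter than n words
        return 0.0
    shared = question_grams & word_ngrams(corpus_doc, n)
    return len(shared) / len(question_grams)

def looks_contaminated(question: str, corpus_doc: str,
                       n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a question when most of its n-grams appear in the document."""
    return overlap_fraction(question, corpus_doc, n) >= threshold
```

A check like this catches copied text but not paraphrases or translated questions, which is one reason a clean n-gram scan does not prove a benchmark is uncontaminated.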
Sources
- [1] Hendrycks et al., MMLU · arxiv.org/abs/2009.03300 · 2020
- [2] Wang et al., MMLU-Pro · arxiv.org/abs/2406.01574 · 2024
- [3] Yue et al., MMMU · mmmu-benchmark.github.io · 2024
- [4] HuggingFace Open LLM Leaderboard v2 · Captured April 2026
- [5] Papers With Code MMLU · paperswithcode.com · Captured April 2026