MMLU, MMLU-Pro, and MMMU Explained - The Knowledge Benchmarks in 2026
MMLU (Massive Multitask Language Understanding) was the defining benchmark of the 2020-2023 era. By 2026 it is saturated and should not be used for frontier model comparisons. MMLU-Pro is the current standard for knowledge benchmarking. MMMU extends the concept to multimodal understanding. This page explains all three, when to use each, and what the contamination concerns mean in practice.
Original MMLU
MMLU was created by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (UC Berkeley, University of Chicago), published in 2020. The benchmark contains 15,908 multiple-choice questions across 57 subjects, from elementary mathematics to professional law and medicine. Questions have 4 answer choices. The standard evaluation is 5-shot (the model sees 5 examples before each test question).
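To make the protocol concrete, here is a minimal sketch of how a 5-shot MMLU prompt is typically assembled, assuming the benchmark's Hugging Face mirror at `cais/mmlu` (whose dev split carries the five exemplars per subject) and the original harness's header wording:

```python
from datasets import load_dataset  # assumes the cais/mmlu mirror on the Hugging Face Hub

CHOICES = ["A", "B", "C", "D"]

def format_question(row, include_answer=True):
    """Render one MMLU row as: question, lettered choices, 'Answer: X'."""
    lines = [row["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, row["choices"])]
    lines.append(f"Answer: {CHOICES[row['answer']] if include_answer else ''}".rstrip())
    return "\n".join(lines)

subject = "professional_law"
dev = load_dataset("cais/mmlu", subject, split="dev")    # the 5 few-shot exemplars
test = load_dataset("cais/mmlu", subject, split="test")

# 5-shot prompt: five solved examples, then the unanswered test question.
shots = "\n\n".join(format_question(row) for row in dev)
prompt = (
    f"The following are multiple choice questions (with answers) about "
    f"{subject.replace('_', ' ')}.\n\n{shots}\n\n"
    f"{format_question(test[0], include_answer=False)}"
)
print(prompt)  # the model's next token is scored against A/B/C/D
```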
MMLU was transformative because it established a broad, standardised benchmark at a time when evaluation was fragmented. GPT-3.5 scored approximately 70% (2022). GPT-4 scored 86.4% (2023), triggering widespread coverage. That 86.4% figure is still quoted in 2026 as if it represents the frontier - it does not. As of April 2026, the best frontier models exceed 93%.
Saturation verdict
MMLU is saturated. Frontier models score 92-94% and the spread between them is under 2 percentage points. A gap that small carries little signal: pure sampling error on a test this large is a fraction of a point, but documented question errors and prompt-format sensitivity are easily large enough to account for it. The benchmark can no longer discriminate frontier models. Do not use MMLU for current model comparisons. Cite it only when comparing with pre-2024 models for historical analysis.
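A back-of-envelope check makes the point. Treating accuracy as a binomial estimate over roughly 14,000 scored test questions (an assumption; the dev and validation questions are excluded from the 15,908 total):

```python
import math

# Back-of-envelope: binomial sampling error on an MMLU-sized test.
n, p = 14_000, 0.93

se = math.sqrt(p * (1 - p) / n)             # standard error of the accuracy estimate
print(f"standard error: {se * 100:.2f} pp")             # ~0.22 pp
print(f"95% interval:  +/- {1.96 * se * 100:.2f} pp")   # ~0.42 pp

# Sampling noise alone is well under half a point, so a ~2-point spread
# between frontier models is dominated by question errors and prompt
# sensitivity rather than sample size - either way, too small to rank models.
```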
MMLU-Pro
MMLU-Pro was created by Yubo Wang, Xueguang Ma, Ge Zhang, and colleagues (TIGER-Lab, University of Waterloo and others), published in 2024. It redesigns MMLU to address two specific problems: 4-choice saturation and insufficient reasoning depth.
The key changes: MMLU-Pro has 10 answer choices per question (not 4), which cuts the random-guess floor from 25% to 10% and makes process-of-elimination much less effective, forcing the model to reason more carefully. The questions were filtered for higher difficulty and curated from a wider pool including textbooks, STEM exams, and competition problems. MMLU-Pro was designed assuming chain-of-thought prompting - the standard evaluation uses CoT, and scores drop significantly without it.
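The sketch below shows the shape of that evaluation loop - ten lettered options, a CoT instruction, and regex extraction of the final letter. The "The answer is (X)" convention follows the MMLU-Pro authors' harness; `model_generate` is a placeholder for whatever inference call you use:

```python
import re
import string

CHOICES = string.ascii_uppercase[:10]  # A-J: ten options instead of MMLU's four

def build_cot_prompt(question: str, options: list[str]) -> str:
    """Question plus lettered options, ending with a CoT instruction."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in zip(CHOICES, options))
    return (
        f"Question: {question}\nOptions:\n{opts}\n"
        'Think step by step, then finish with "The answer is (X)".\nAnswer:'
    )

def extract_answer(completion: str) -> str | None:
    """Pull the final letter out of a CoT completion; None counts as wrong."""
    match = re.search(r"answer is \(?([A-J])\)?", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None

def score(model_generate, dataset) -> float:
    """dataset rows assumed as {'question': str, 'options': [10 str], 'answer': 'A'-'J'}."""
    correct = sum(
        extract_answer(model_generate(build_cot_prompt(r["question"], r["options"])))
        == r["answer"]
        for r in dataset
    )
    return correct / len(dataset)
```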
The dataset contains just over 12,000 questions. As of April 2026, GPT-5 leads at 86.1% and Claude 4.5 Opus is at 85.2%. The spread across the frontier models tabulated below is roughly 7 percentage points, providing meaningful discrimination. MMLU-Pro is the current standard for knowledge benchmarking in 2026.
| Model | MMLU-Pro | MMLU | Captured |
|---|---|---|---|
| GPT-5 | 86.1% | ~~93.4%~~ | Apr 2026 |
| Claude 4.5 Opus | 85.2% | ~~92.8%~~ | Apr 2026 |
| Gemini 2.5 Pro | 83.7% | ~~92.1%~~ | Apr 2026 |
| Grok 4 | 82.3% | ~~91.7%~~ | Apr 2026 |
| Claude 4 Sonnet | 81.4% | ~~91.2%~~ | Apr 2026 |
| Llama 4 Maverick | 79.8% | ~~90.4%~~ | Apr 2026 |
| DeepSeek V3 | 78.9% | ~~89.8%~~ | Apr 2026 |

*MMLU-Pro: 5-shot with CoT. MMLU: 5-shot, no CoT; scores struck through to indicate saturation, not useful for current comparisons.*
MMMU - Multimodal Understanding
MMMU (Massive Multi-discipline Multimodal Understanding) was created by Xiang Yue et al. (2024) as the multimodal equivalent of MMLU. The benchmark contains 11,550 questions drawn from college exams and textbooks across 30 subjects, with each question requiring interpretation of one or more images alongside the text.
MMMU's images span diverse types: diagrams, charts, scientific figures, photographs, maps, music scores, and chemical structures. The multimodal requirement captures a capability dimension that text-only benchmarks entirely miss - a model can score 93% on MMLU while failing to interpret a basic physics diagram. For evaluating vision-language models, MMMU is the current standard.
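A minimal loading sketch, assuming the benchmark's Hugging Face release under `MMMU/MMMU` with one config per subject (field names follow that release; verify against the current version):

```python
from datasets import load_dataset

# Assumes the MMMU/MMMU release on the Hugging Face Hub, one config per subject.
physics = load_dataset("MMMU/MMMU", "Physics", split="validation")

row = physics[0]
print(row["question"])   # question text with inline <image 1> placeholders
print(row["options"])    # answer choices (stored as a stringified list)
row["image_1"].show()    # PIL image referenced by the placeholder
```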
As of April 2026, GPT-5 leads at 81.2%, with Claude 4.5 Opus at approximately 79.4% and Gemini 2.5 Pro at 78.7%. The roughly 2-3 point spread between the leaders still provides useful discrimination. Unlike MMLU, MMMU has not yet saturated.
Known Contamination Issues
MMLU has significant contamination concerns. The test questions were sourced from freely available study materials - many appear verbatim or near-verbatim on websites included in Common Crawl, the primary web-crawl training dataset used by most frontier models. Researchers have documented specific MMLU test questions appearing in training data for several frontier models.
The practical consequence: a model scoring 93% on MMLU may be recalling test questions it saw during training rather than reasoning through them. This is one reason MMLU scores cluster so tightly at the high end - models may have converged on the training data rather than on the underlying reasoning capability.
MMLU-Pro partially addresses contamination by sourcing questions from a wider pool with less overlap with common training corpora. MMMU has lower contamination risk because image content is less frequently included in text-focused training crawls. Neither benchmark has zero contamination risk - all public test sets face this problem structurally.
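How is contamination detected in practice? The common approach is verbatim n-gram overlap, in the spirit of the 13-gram checks popularised by the GPT-3 paper. A minimal sketch, with the corpus n-gram set assumed to be precomputed:

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Lowercased word n-grams, the unit typically used for decontamination checks."""
    words = text.lower().split()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}

def contaminated(test_question: str, corpus_ngrams: set[str], n: int = 13) -> bool:
    """Flag a test question if any of its n-grams appears verbatim in training data."""
    return not ngrams(test_question, n).isdisjoint(corpus_ngrams)

# Usage: corpus_ngrams would be built by streaming the training corpus once,
# in practice via hashes or a Bloom filter, since the full set is enormous.
```

Near-verbatim paraphrases slip past a check like this, which is why verbatim overlap gives a lower bound on contamination, not a clean bill of health.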
Frequently Asked Questions
Is a 90% MMLU score impressive in 2026?
No. MMLU is saturated: frontier models cluster at 92-94%, so 90% signals broad competence but says nothing about frontier standing. Use MMLU-Pro for current comparisons.

What subjects does MMLU cover?
57 subjects, from elementary mathematics to professional law and medicine, all in 4-choice multiple-choice format.

What is the difference between MMLU and MMLU-Pro?
MMLU-Pro uses 10 answer choices instead of 4, filters for harder questions, and is evaluated with chain-of-thought prompting. The result is a benchmark that still discriminates frontier models where MMLU no longer can.

Does MMLU predict real-world LLM usefulness?
Only weakly at this point. Saturation and contamination mean a high score may reflect memorised test material rather than reasoning, and a text-only score says nothing about multimodal capability.
Sources
- [1] Hendrycks et al., MMLU - arxiv.org/abs/2009.03300 - 2020
- [2] Wang et al., MMLU-Pro - arxiv.org/abs/2406.01574 - 2024
- [3] Yue et al., MMMU - mmmu-benchmark.github.io - 2024
- [4] HuggingFace Open LLM Leaderboard v2 - Captured April 2026
- [5] Papers With Code MMLU - paperswithcode.com - Captured April 2026