MMLU-Pro: The Discriminating Successor to MMLU
When MMLU saturated above 90 percent in 2024, the field needed a harder breadth-of-knowledge benchmark that could distinguish frontier models. MMLU-Pro is that benchmark: 10 options instead of 4, deliberately harder questions, and a 40-point spread across models restored at launch. Two years later it is starting to saturate too, but it remains the right headline knowledge benchmark.
What MMLU-Pro is and why it exists
MMLU-Pro, introduced by Wang et al. at TIGER-Lab in June 2024, is the discriminating replacement for MMLU. The original MMLU had served the field well since 2020 but was effectively saturated by mid-2024: frontier models all scored between 88 and 92 percent, the random baseline was 25 percent (compressing the meaningful range), and contamination concerns had been documented. MMLU-Pro addresses all three problems by introducing harder questions, 10 multiple-choice options instead of 4, and a freshly curated test set that drops items every model already answers easily.
The headline difference is the spread. At launch, frontier models scored in the low 70s on MMLU-Pro while scoring in the low 90s on MMLU. That roughly 20-point drop restored a meaningful score range, and the 10-option format brought the random baseline down from 25 percent to 10 percent, expanding the discriminating zone from roughly 25-90 to roughly 10-90. This is what made MMLU-Pro the default knowledge benchmark for frontier comparisons throughout 2024 and 2025.
The benchmark contains roughly 12,000 questions across 14 consolidated categories (down from MMLU's 57 subjects). The consolidation was intentional: MMLU's subject granularity produced noisy per-subject scores because some subjects had only a few dozen questions. MMLU-Pro's 14 categories each contain hundreds to thousands of questions, which produces more reliable per-category comparisons. Math, physics, and law dominate the corpus; psychology, history, and philosophy each contribute smaller buckets.
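Readers who want to check the category distribution themselves can do so directly from the public dataset. The sketch below is minimal and assumes the TIGER-Lab/MMLU-Pro dataset on Hugging Face exposes a per-item category field; field names may differ between releases.

from collections import Counter
from datasets import load_dataset

# Load the public test split; "TIGER-Lab/MMLU-Pro" is the dataset id on Hugging Face.
test = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Count questions per category and print the largest buckets first.
counts = Counter(row["category"] for row in test)
for category, n in counts.most_common():
    print(f"{category:<22} {n:>5} questions")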
The 14 categories
MMLU-Pro's categories are unbalanced by design. The largest categories (math, physics, law) reflect both the available source material and the fields where multi-step reasoning is most testable. Per-category reporting is essential because relative performance varies meaningfully across categories even for frontier models.
The relative difficulty pattern across categories is stable. Math, physics, and engineering are typically the hardest for any given model; law and health sit in the middle (knowledge-heavy with some reasoning); psychology and history are the easiest (mostly knowledge recall). A model that scores 90 percent overall on MMLU-Pro typically scores 95 percent on psychology and 80 percent on math; this kind of per-category disclosure is more informative than the headline number.
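Per-category disclosure is also easy to automate. The sketch below is a hypothetical example, not part of the official harness: it assumes you already have graded results as a list of dicts with "category" and "correct" keys produced by your own evaluation code.

from collections import defaultdict

def per_category_accuracy(results):
    # results: list of dicts with "category" and "correct" keys (assumed schema).
    totals, hits = defaultdict(int), defaultdict(int)
    for row in results:
        totals[row["category"]] += 1
        hits[row["category"]] += int(row["correct"])
    return {cat: hits[cat] / totals[cat] for cat in sorted(totals)}

# Report the headline number alongside the per-category breakdown.
results = [
    {"category": "math", "correct": False},
    {"category": "math", "correct": True},
    {"category": "psychology", "correct": True},
]
overall = sum(r["correct"] for r in results) / len(results)
print(f"overall      {overall:.3f}")
for cat, acc in per_category_accuracy(results).items():
    print(f"{cat:<12} {acc:.3f}")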
Why 10 options instead of 4
The 10-option choice is the single most consequential design decision in MMLU-Pro. Increasing options from 4 to 10 reduces random-baseline accuracy from 25 percent to 10 percent, which expands the meaningful score range. It also reduces the probability that a model can guess correctly by elimination: with 4 options, a model that confidently rules out two is left with a coin flip; with 10 options, ruling out two still leaves an 8-way choice with substantial residual uncertainty.
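The arithmetic is worth spelling out. A one-line helper, purely illustrative, makes the contrast concrete:

def guess_prob(n_options: int, ruled_out: int) -> float:
    # Probability of a correct uniform guess over the options that remain.
    return 1.0 / (n_options - ruled_out)

print(guess_prob(4, 0), guess_prob(10, 0))   # random baselines: 0.25 vs 0.10
print(guess_prob(4, 2), guess_prob(10, 2))   # after eliminating two: 0.50 vs 0.125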
The 10-option format also reduces the value of pattern-matching heuristics. On MMLU, several papers documented that models exploited statistical biases in option phrasing (longer options more likely correct, options containing certain syntactic markers more likely correct) to boost scores by 2-4 points. With 10 options, these heuristics are more difficult to apply reliably; the TIGER-Lab team also balanced option lengths and phrasing more carefully than the original MMLU.
The trade-off is that the 10-option format is harder for the model to read and to answer in a consistently parseable way. Some early MMLU-Pro submissions reported parsing errors where the model produced a free-form answer that did not exactly match one of the 10 options; the leaderboard now uses a more robust answer-extraction step that mitigates this issue.
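A sketch of what such an extraction step can look like follows. It is illustrative, not the leaderboard's actual implementation, and it assumes the model is prompted to end its reasoning with an explicit "answer is (X)" statement.

import re

def extract_answer(completion: str) -> str | None:
    # Prefer an explicit "answer is (X)" statement at the end of the chain of thought.
    match = re.search(r"answer is \(?([A-J])\)?", completion, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fallback: take the last standalone option letter A-J mentioned anywhere.
    letters = re.findall(r"\b([A-J])\b", completion)
    return letters[-1] if letters else None

print(extract_answer("... so the answer is (C)."))   # -> C
print(extract_answer("Option F fits best here."))    # -> F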
SOTA progression Jun 2024 to May 2026
MMLU-Pro scores have climbed steadily for two years. The launch frontier of 72 percent has become a 2026 frontier of 86-89 percent; that is genuine capability progress, at a pace comparable to the one at which MMLU itself saturated (MMLU went from 70 to 90 percent over a similar two-year window). The benchmark is starting to show saturation signals in 2026: the gap between the top three frontier models is narrowing toward 2 points, similar to where MMLU stood in late 2023.
Chain-of-thought and N-shot settings
MMLU-Pro scores are sensitive to two evaluation settings: whether chain-of-thought reasoning is enabled, and whether the model is given few-shot examples. The standard leaderboard reports CoT-enabled, 5-shot scores; non-CoT scores are typically 8-12 points lower for capable models. This is a much larger CoT effect than on MMLU, where the mostly recall-style questions made CoT less helpful. MMLU-Pro questions are intentionally reasoning-heavy, which amplifies the CoT advantage.
The 5-shot examples are drawn from a fixed development set. Some submissions vary this to maximise their score; the leaderboard's default 5-shot dev examples are the canonical comparison point. Zero-shot MMLU-Pro scores are also reported and are useful when comparing to deployment scenarios where few-shot prompting is impractical. The zero-shot frontier is roughly 5 points below the 5-shot frontier.
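For concreteness, here is a hedged sketch of assembling a 5-shot CoT prompt in the spirit of this setup. The exact template, answer trigger, and dev examples used by the official harness may differ, and the question/options/cot field names are assumptions.

LETTERS = "ABCDEFGHIJ"  # up to 10 options per question

def format_question(q: dict) -> str:
    options = "\n".join(f"{letter}. {text}" for letter, text in zip(LETTERS, q["options"]))
    return f"Question: {q['question']}\nOptions:\n{options}"

def build_prompt(dev_examples: list[dict], target: dict, n_shot: int = 5) -> str:
    parts = []
    for ex in dev_examples[:n_shot]:
        # Each dev example carries a worked chain of thought that ends in its answer.
        parts.append(f"{format_question(ex)}\nAnswer: {ex['cot']}")
    # The target question ends with a CoT trigger instead of an answer.
    parts.append(f"{format_question(target)}\nAnswer: Let's think step by step.")
    return "\n\n".join(parts)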
When quoting an MMLU-Pro score, always disclose CoT, N-shot, and date. A 2024 zero-shot non-CoT score and a 2026 5-shot CoT-enabled score on the same underlying model can differ by 20 points; treating them as equivalent is misleading.
Contamination considerations
MMLU-Pro is cleaner than MMLU on contamination but is not contamination-free. The TIGER-Lab team removed easy items that all models scored well on (often a contamination signal) and curated new items from textbooks and professional exams. However, the source materials overlap heavily with model pre-training corpora: any model trained on books and academic papers will have seen the underlying content (just not the exact MMLU-Pro phrasing).
This is a softer kind of contamination than verbatim test-leak. The model has not memorised the answer; it has memorised the field, which is what knowledge benchmarks are supposed to measure. The wider question, addressed in our contamination explainer, is whether benchmark scores reflect underlying capability or training-corpus exposure. For MMLU-Pro the honest answer is "both, and they are hard to separate".
The cleanest contamination defence in the knowledge-benchmark family is Humanity's Last Exam (HLE), which curates questions from active researchers and updates the test set periodically. MMLU-Pro is the right pragmatic compromise for general capability evaluation; HLE is the right frontier-difficulty test.
When to use MMLU-Pro in 2026
MMLU-Pro remains the right headline knowledge benchmark in 2026 for general capability comparison. It is broad (14 fields), discriminating (40-point launch spread, 8-point spread remaining in May 2026), open (public dataset and leaderboard), and well-instrumented (per-category scores reported by most submissions). Quote MMLU-Pro alongside SWE-bench Verified, GPQA-Diamond, and the relevant agent benchmarks for any 2026 model comparison.
For frontier-difficulty work prefer HLE; for reasoning specifically prefer GPQA-Diamond; for coding prefer SWE-bench Verified or LiveCodeBench; for agent capability prefer the dedicated agent benchmarks. MMLU-Pro is the knowledge-breadth slice of a portfolio, not a complete evaluation.
Sources
- [1] Wang, Y. et al. (2024). MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv:2406.01574.
- [2] TIGER-Lab MMLU-Pro leaderboard on Hugging Face Spaces. huggingface.co/spaces/TIGER-Lab/MMLU-Pro. Accessed May 2026.
- [3] Hendrycks, D. et al. (2020). Measuring Massive Multitask Language Understanding. arXiv:2009.03300. The original MMLU paper.