MMLU-Pro: The Discriminating Successor to MMLU
When MMLU saturated above 90 percent in 2024, the field needed a harder breadth-of-knowledge benchmark that could distinguish frontier models. MMLU-Pro is that benchmark: 10 options instead of 4, deliberately harder questions, and a 40-point spread across models restored at launch. Two years later it is starting to saturate too, but it remains the right headline knowledge benchmark.
What MMLU-Pro is and why it exists
MMLU-Pro, introduced by Wang et al. at TIGER-Lab in June 2024, is the discriminating replacement for MMLU. The original MMLU had served the field well since 2020 but was effectively saturated by mid-2024: frontier models all scored between 88 and 92 percent, the random baseline was 25 percent (compressing the meaningful range), and contamination concerns had been documented. MMLU-Pro addresses all three problems by introducing harder questions, 10 multiple-choice options instead of 4, and a freshly curated test set that drops items every model already answers easily.
The headline difference is the spread. At launch, frontier models scored in the low 70s on MMLU-Pro while scoring in the low 90s on MMLU. That roughly 20-point drop restored a meaningful score range, and the 10-option format brought the random baseline down from 25 percent to 10 percent, expanding the discriminating zone from roughly 25-90 to roughly 10-90. This is what made MMLU-Pro the default knowledge benchmark for frontier comparisons throughout 2024 and 2025.
The benchmark contains roughly 12,000 questions across 14 consolidated categories (down from MMLU's 57 subjects). The consolidation was intentional: MMLU's subject granularity produced noisy per-subject scores because some subjects had only a few dozen questions. MMLU-Pro's 14 categories each contain hundreds to thousands of questions, which produces more reliable per-category comparisons. Math, physics, and law dominate the corpus; psychology, history, and philosophy each contribute smaller buckets.
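Readers who want to check the category distribution themselves can do so directly from the public dataset. The sketch below is minimal and assumes the TIGER-Lab/MMLU-Pro dataset on Hugging Face exposes a per-item category field; field names may differ between releases.

from collections import Counter
from datasets import load_dataset

# Load the public test split; "TIGER-Lab/MMLU-Pro" is the dataset id on Hugging Face.
test = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Count questions per category and print the largest buckets first.
counts = Counter(row["category"] for row in test)
for category, n in counts.most_common():
    print(f"{category:<22} {n:>5} questions")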
The 14 categories
MMLU-Pro's categories are unbalanced by design. The largest categories (math, physics, law) reflect both the available source material and the fields where multi-step reasoning is most testable. Per-category reporting is essential because relative performance varies meaningfully across categories even for frontier models.
The relative difficulty pattern across categories is stable. Math, physics, and engineering are typically the hardest for any given model; law and health sit in the middle (knowledge-heavy with some reasoning); psychology and history are the easiest (mostly knowledge recall). A model that scores 90 percent overall on MMLU-Pro typically scores 95 percent on psychology and 80 percent on math; this kind of per-category disclosure is more informative than the headline number.
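Per-category disclosure is also easy to automate. The sketch below is a hypothetical example, not part of the official harness: it assumes you already have graded results as a list of dicts with "category" and "correct" keys produced by your own evaluation code.

from collections import defaultdict

def per_category_accuracy(results):
    # results: list of dicts with "category" and "correct" keys (assumed schema).
    totals, hits = defaultdict(int), defaultdict(int)
    for row in results:
        totals[row["category"]] += 1
        hits[row["category"]] += int(row["correct"])
    return {cat: hits[cat] / totals[cat] for cat in sorted(totals)}

# Report the headline number alongside the per-category breakdown.
results = [
    {"category": "math", "correct": False},
    {"category": "math", "correct": True},
    {"category": "psychology", "correct": True},
]
overall = sum(r["correct"] for r in results) / len(results)
print(f"overall      {overall:.3f}")
for cat, acc in per_category_accuracy(results).items():
    print(f"{cat:<12} {acc:.3f}")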
Why 10 options instead of 4
The 10-option choice is the single most consequential design decision in MMLU-Pro. Increasing options from 4 to 10 reduces random-baseline accuracy from 25 percent to 10 percent, which expands the meaningful score range. It also reduces the probability that a model can guess correctly by elimination: with 4 options, a model that confidently rules out two is left with a coin flip; with 10 options, ruling out two still leaves an 8-way choice with substantial residual uncertainty.
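The arithmetic is worth spelling out. A one-line helper, purely illustrative, makes the contrast concrete:

def guess_prob(n_options: int, ruled_out: int) -> float:
    # Probability of a correct uniform guess over the options that remain.
    return 1.0 / (n_options - ruled_out)

print(guess_prob(4, 0), guess_prob(10, 0))   # random baselines: 0.25 vs 0.10
print(guess_prob(4, 2), guess_prob(10, 2))   # after eliminating two: 0.50 vs 0.125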
The 10-option format also reduces the value of pattern-matching heuristics. On MMLU, several papers documented that models exploited statistical biases in option phrasing (longer options more likely correct, options containing certain syntactic markers more likely correct) to boost scores by 2-4 points. With 10 options, these heuristics are more difficult to apply reliably; the TIGER-Lab team also balanced option lengths and phrasing more carefully than the original MMLU.
The trade-off is that the 10-option format is harder for the model to read and to answer in a consistently parseable way. Some early MMLU-Pro submissions reported parsing errors where the model produced a free-form answer that did not exactly match one of the 10 options; the leaderboard now uses a more robust answer-extraction step that mitigates this issue.
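A sketch of what such an extraction step can look like follows. It is illustrative, not the leaderboard's actual implementation, and it assumes the model is prompted to end its reasoning with an explicit "answer is (X)" statement.

import re

def extract_answer(completion: str) -> str | None:
    # Prefer an explicit "answer is (X)" statement at the end of the chain of thought.
    match = re.search(r"answer is \(?([A-J])\)?", completion, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Fallback: take the last standalone option letter A-J mentioned anywhere.
    letters = re.findall(r"\b([A-J])\b", completion)
    return letters[-1] if letters else None

print(extract_answer("... so the answer is (C)."))   # -> C
print(extract_answer("Option F fits best here."))    # -> F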
SOTA progression Jun 2024 to May 2026
MMLU-Pro scores have climbed steadily for two years. The launch frontier of 72 percent has become a 2026 frontier of 86-89 percent; that is genuine capability progress, at a pace comparable to the one at which MMLU itself saturated (MMLU went from 70 to 90 percent over a similar two-year window). The benchmark is starting to show saturation signals in 2026: the gap between the top three frontier models is narrowing toward 2 points, similar to where MMLU stood in late 2023.
Chain-of-thought and N-shot settings
MMLU-Pro scores are sensitive to two evaluation settings: whether chain-of-thought reasoning is enabled, and whether the model is given few-shot examples. The standard leaderboard reports CoT-enabled, 5-shot scores; non-CoT scores are typically 8-12 points lower for capable models. This is a much larger CoT effect than on MMLU, where the mostly recall-style questions made CoT less helpful. MMLU-Pro questions are intentionally reasoning-heavy, which amplifies the CoT advantage.
The 5-shot examples are drawn from a fixed development set. Some submissions vary this to maximise their score; the leaderboard's default 5-shot dev examples are the canonical comparison point. Zero-shot MMLU-Pro scores are also reported and are useful when comparing to deployment scenarios where few-shot prompting is impractical. The zero-shot frontier is roughly 5 points below the 5-shot frontier.
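For concreteness, here is a hedged sketch of assembling a 5-shot CoT prompt in the spirit of this setup. The exact template, answer trigger, and dev examples used by the official harness may differ, and the question/options/cot field names are assumptions.

LETTERS = "ABCDEFGHIJ"  # up to 10 options per question

def format_question(q: dict) -> str:
    options = "\n".join(f"{letter}. {text}" for letter, text in zip(LETTERS, q["options"]))
    return f"Question: {q['question']}\nOptions:\n{options}"

def build_prompt(dev_examples: list[dict], target: dict, n_shot: int = 5) -> str:
    parts = []
    for ex in dev_examples[:n_shot]:
        # Each dev example carries a worked chain of thought that ends in its answer.
        parts.append(f"{format_question(ex)}\nAnswer: {ex['cot']}")
    # The target question ends with a CoT trigger instead of an answer.
    parts.append(f"{format_question(target)}\nAnswer: Let's think step by step.")
    return "\n\n".join(parts)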
When quoting an MMLU-Pro score, always disclose CoT, N-shot, and date. A 2024 zero-shot non-CoT score and a 2026 5-shot CoT-enabled score on the same underlying model can differ by 20 points; treating them as equivalent is misleading.
Contamination considerations
MMLU-Pro is cleaner than MMLU on contamination but is not contamination-free. The TIGER-Lab team removed easy items that all models scored well on (often a contamination signal) and curated new items from textbooks and professional exams. However, the source materials overlap heavily with model pre-training corpora: any model trained on books and academic papers will have seen the underlying content (just not the exact MMLU-Pro phrasing).
This is a softer kind of contamination than verbatim test-leak. The model has not memorised the answer; it has memorised the field, which is what knowledge benchmarks are supposed to measure. The wider question, addressed in our contamination explainer, is whether benchmark scores reflect underlying capability or training-corpus exposure. For MMLU-Pro the honest answer is "both, and they are hard to separate".
The cleanest contamination defence in the knowledge-benchmark family is Humanity's Last Exam (HLE), which curates questions from active researchers and updates the test set periodically. MMLU-Pro is the right pragmatic compromise for general capability evaluation; HLE is the right frontier-difficulty test.
When to use MMLU-Pro in 2026
MMLU-Pro remains the right headline knowledge benchmark in 2026 for general capability comparison. It is broad (14 fields), discriminating (40-point launch spread, 8-point spread remaining in May 2026), open (public dataset and leaderboard), and well-instrumented (per-category scores reported by most submissions). Quote MMLU-Pro alongside SWE-bench Verified, GPQA-Diamond, and the relevant agent benchmarks for any 2026 model comparison.
For frontier-difficulty work prefer HLE; for reasoning specifically prefer GPQA-Diamond; for coding prefer SWE-bench Verified or LiveCodeBench; for agent capability prefer the dedicated agent benchmarks. MMLU-Pro is the knowledge-breadth slice of a portfolio, not a complete evaluation.
Sources
- [1] Wang, Y. et al. (2024). MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv:2406.01574.
- [2] TIGER-Lab MMLU-Pro leaderboard on Hugging Face Spaces. huggingface.co/spaces/TIGER-Lab/MMLU-Pro. Accessed May 2026.
- [3] Hendrycks, D. et al. (2020). Measuring Massive Multitask Language Understanding. arXiv:2009.03300. The original MMLU paper.