Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. Scores cited with source and capture date.
TL;DR
MMLU SOTA: 93.4% (saturated - do not use)
MMLU-Pro SOTA: 86.1% (GPT-5, Apr 2026)
MMMU SOTA: 81.2% (GPT-5, Apr 2026)
Last verified April 2026

MMLU, MMLU-Pro, and MMMU Explained - The Knowledge Benchmarks in 2026

MMLU (Massive Multitask Language Understanding) was the defining benchmark of the 2020-2023 era. By 2026 it is saturated and should not be used for frontier model comparisons. MMLU-Pro is the current standard for knowledge benchmarking. MMMU extends the concept to multimodal understanding. This page explains all three, when to use each, and what the contamination concerns mean in practice.

Original MMLU

MMLU was created by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (UC Berkeley, University of Chicago), published in 2020. The benchmark contains 15,908 multiple-choice questions across 57 subjects, from elementary mathematics to professional law and medicine. Questions have 4 answer choices. The standard evaluation is 5-shot (the model sees 5 examples before each test question).
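The 5-shot format described above can be sketched as simple prompt construction. This is an illustrative sketch, not the official harness: the helper names, the single repeated example, and the exact header wording are assumptions, and real harnesses (e.g. lm-evaluation-harness) differ in whitespace and answer-extraction details.

```python
# Sketch of 5-shot MMLU prompt construction. Illustrative only: real
# evaluation harnesses differ in formatting and answer extraction.

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_question(question, choices, answer=None):
    """Render one MMLU item; the answer is shown only for few-shot examples."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_5shot_prompt(subject, dev_examples, test_item):
    """dev_examples: five (question, choices, answer) tuples shown before the test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(q, c, a) for q, c, a in dev_examples)
    return header + shots + "\n\n" + format_question(*test_item)

# Toy demo: one worked example repeated five times, then the test question.
dev = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")] * 5
prompt = build_5shot_prompt("elementary mathematics",
                            dev, ("3 + 3 = ?", ["5", "6", "7", "8"]))
print(prompt.endswith("Answer:"))  # True - the model is scored on the next token
```

The model's completion after the final `Answer:` is compared against the gold label, which is why formatting consistency between the five shots and the test item matters.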

MMLU was transformative because it established a broad, standardised benchmark at a time when evaluation was fragmented. GPT-3.5 scored approximately 70% (2022). GPT-4 scored 86.4% (2023), triggering widespread coverage. That 86.4% figure is still quoted in 2026 as if it represents the frontier - it does not. As of April 2026, the best frontier models exceed 93%.

Saturation verdict

MMLU is saturated. Frontier models score 92-94%, and the differences between them - under 2 percentage points - are dominated by prompt-formatting and evaluation-harness variance rather than by capability. The benchmark can no longer discriminate frontier models. Do not use MMLU for current model comparisons. Cite it only when comparing against pre-2024 models for historical analysis.

MMLU-Pro

MMLU-Pro was created by Yubo Wang, Xueguang Ma, Ge Zhang, and Jiale Yan (TIGER-Lab, University of Waterloo and others), published in 2024. It redesigns MMLU to address two specific problems: 4-choice saturation and insufficient reasoning depth.

The key changes: MMLU-Pro has 10 answer choices per question (not 4). This alone makes process-of-elimination much less effective and forces the model to reason more carefully. The questions were filtered for higher difficulty and curated from a wider pool including textbooks, STEM exams, and competition problems. MMLU-Pro was designed assuming chain-of-thought prompting - the standard evaluation uses CoT, and scores drop significantly without it.
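The effect of moving from 4 to 10 choices can be made concrete with plain probability. The arithmetic below is elementary and not taken from any published evaluation; it just shows why process of elimination pays off far less on MMLU-Pro.

```python
# Why 10 options blunt process-of-elimination: if a model can confidently
# rule out k options and guesses uniformly among the rest, its expected
# accuracy is 1 / (n_choices - k). Plain probability, not benchmark data.

def guess_accuracy(n_choices, eliminated):
    return 1.0 / (n_choices - eliminated)

print(guess_accuracy(4, 0))   # 0.25  - MMLU random baseline
print(guess_accuracy(10, 0))  # 0.1   - MMLU-Pro random baseline
print(guess_accuracy(4, 2))   # 0.5   - eliminate 2 of 4: coin flip
print(guess_accuracy(10, 2))  # 0.125 - eliminate 2 of 10: barely helps
```

Eliminating two options already yields 50% on a 4-choice question but only 12.5% on a 10-choice one, which is why MMLU-Pro forces fuller reasoning.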

The dataset contains about 12,000 questions. As of April 2026, GPT-5 leads at 86.1% and Claude 4.5 Opus follows at 85.2%. The spread across the frontier models tracked here is roughly 7 percentage points, enough to discriminate between them meaningfully. MMLU-Pro is the current standard for knowledge benchmarking in 2026.
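A quick way to sanity-check whether a score gap on a test of this size is meaningful is a two-proportion standard error. The sketch below uses scores cited on this page; the independent-binomial normal approximation is an assumption (it ignores question overlap and harness variance), so treat the result as a rough order of magnitude.

```python
import math

def score_gap_sigma(p1, p2, n):
    """Approximate number of standard errors separating two accuracies
    measured on the same n-question benchmark (independent-binomial model)."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p1 - p2) / se

# GPT-5 (86.1%) vs Claude 4.5 Opus (85.2%) on ~12,000 MMLU-Pro questions:
gap = score_gap_sigma(0.861, 0.852, 12_000)
print(round(gap, 1))  # ~2.0 sigma - a real, if modest, separation
```

Even a 0.9-point gap is around two standard errors on 12,000 questions; larger spreads between frontier models are well outside sampling noise.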

| Model | MMLU-Pro | MMLU | Captured |
| --- | --- | --- | --- |
| GPT-5 | 86.1% | ~~93.4%~~ | Apr 2026 |
| Claude 4.5 Opus | 85.2% | ~~92.8%~~ | Apr 2026 |
| Gemini 2.5 Pro | 83.7% | ~~92.1%~~ | Apr 2026 |
| Grok 4 | 82.3% | ~~91.7%~~ | Apr 2026 |
| Claude 4 Sonnet | 81.4% | ~~91.2%~~ | Apr 2026 |
| Llama 4 Maverick | 79.8% | ~~90.4%~~ | Apr 2026 |
| DeepSeek V3 | 78.9% | ~~89.8%~~ | Apr 2026 |

MMLU-Pro: 5-shot with CoT. MMLU: 5-shot, no CoT. MMLU scores struck through to indicate saturation - not useful for current comparisons.

MMMU - Multimodal Understanding

MMMU (Massive Multitask Multimodal Understanding) was created by Xiang Yue et al. (2024) as the multimodal equivalent of MMLU. The benchmark contains 11,550 questions drawn from college exams and textbooks across 30 subjects, with each question requiring interpretation of one or more images alongside the text.

MMMU's images span diverse types: diagrams, charts, scientific figures, photographs, maps, music scores, and chemical structures. The multimodal requirement captures a capability dimension that text-only benchmarks entirely miss - a model can score 93% on MMLU-Pro while failing to interpret a basic physics diagram. For evaluating vision-language models, MMMU is the current standard.

As of April 2026, GPT-5 leads at 81.2%, with Claude 4.5 Opus at approximately 79.4% and Gemini 2.5 Pro at 78.7%. The gap between frontier models is 3-5 percentage points, providing useful discrimination. Unlike MMLU, MMMU has not yet saturated.

Known Contamination Issues

MMLU has significant contamination concerns. The test questions were sourced from freely available study materials - many appear verbatim or near-verbatim on websites included in Common Crawl, the primary web-crawl training dataset used by most frontier models. Researchers have documented specific MMLU test questions appearing in training data for several frontier models.

The practical consequence: a model scoring 93% on MMLU may be recalling test questions it saw during training rather than reasoning through them. This is one reason MMLU scores cluster so tightly at the high end - models may have converged on the training data rather than on the underlying reasoning capability.

MMLU-Pro partially addresses contamination by sourcing questions from a wider pool with less overlap with common training corpora. MMMU has lower contamination risk because image content is less frequently included in text-focused training crawls. Neither benchmark has zero contamination risk - all public test sets face this problem structurally.
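Contamination checks of the kind described above are often implemented as verbatim n-gram overlap between test questions and crawl text, in the spirit of the overlap audits reported in model papers. The sketch below is a toy version: the question, the crawl snippet, and the choice of 8-grams are illustrative assumptions, and real audits run against terabyte-scale corpora with hashing and more careful normalisation.

```python
import re

def ngrams(text, n=8):
    """Lowercased, punctuation-stripped word n-grams of a string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_question, training_text, n=8):
    """Fraction of the question's n-grams appearing verbatim in training text."""
    q = ngrams(test_question, n)
    if not q:
        return 0.0
    return len(q & ngrams(training_text, n)) / len(q)

question = ("Which of the following is the primary function of the "
            "mitochondria in a eukaryotic cell")
crawl_snippet = ("Study guide: which of the following is the primary function "
                 "of the mitochondria in a eukaryotic cell? Answer below.")
print(overlap_fraction(question, crawl_snippet))  # 1.0 - flag as contaminated
```

A question whose n-grams appear verbatim in the crawl would be flagged; scores on flagged questions can then be reported separately from the clean subset.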


Frequently Asked Questions

Is a 90% MMLU score impressive in 2026?
No. By April 2026, frontier models score 92-94% on MMLU, and the benchmark is considered saturated. A 90% MMLU score in 2026 means a model is performing below the current frontier. For meaningful comparisons, use MMLU-Pro (where frontier models score 79-86%) or GPQA-Diamond (71-78%).
What subjects does MMLU cover?
MMLU covers 57 subjects across STEM, humanities, social sciences, and professional domains. STEM subjects include mathematics, physics, chemistry, biology, computer science, and engineering. Professional domains include medicine, accounting, and business. The breadth is MMLU's original strength and also makes it susceptible to contamination.
What is the difference between MMLU and MMLU-Pro?
MMLU has 4 answer choices per question; MMLU-Pro has 10. MMLU-Pro requires more deliberate reasoning because process of elimination is much less effective with 10 choices. MMLU-Pro was designed assuming chain-of-thought prompting and better discriminates frontier models in 2026.
Does MMLU predict real-world LLM usefulness?
Weakly. MMLU correlates with general knowledge breadth, which is relevant for many tasks. But MMLU does not test reasoning under ambiguity, long-context comprehension, instruction following, or agentic capability. A model with 94% MMLU can still produce confidently wrong answers or struggle with multi-step agent tasks.

Sources

  1. Hendrycks et al., MMLU - arxiv.org/abs/2009.03300 - 2020
  2. Wang et al., MMLU-Pro - arxiv.org/abs/2406.01574 - 2024
  3. Yue et al., MMMU - mmmu-benchmark.github.io - 2024
  4. HuggingFace Open LLM Leaderboard v2 - Captured April 2026
  5. Papers With Code MMLU - paperswithcode.com - Captured April 2026