Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: 11,500 college-exam multimodal questions across 6 disciplines, 30 subjects.
Who: Yue, Ni, Zhang, Liu, Yu, Sun, et al. (CVPR 2024 best paper finalist).
2026 Tier: Frontier 75 to 80% on MMMU, around 60% on MMMU-Pro.
Project: mmmu-benchmark.github.io
Section I.vi Multimodal | Last verified April 2026

MMMU Benchmark: 11,500 Multimodal Questions, Frontier Above 70%

The college-exam benchmark for multimodal reasoning. MMMU-Pro is the version frontier teams quote in 2026.

I

Construction

The Yue group sourced questions from college textbooks, quizzes, and exam practice sets across 30 subjects. Each question pairs natural-language text with one or more images: anatomical diagrams in Medicine, equipment photos in Engineering, sheet music in Music, financial charts in Business. The images are not decorative. The reasoning required to answer typically depends on extracting information from the image and combining it with subject-matter knowledge.

The dataset splits into 150 dev (few-shot development), 900 validation, and 10,500 test questions. Test labels are private; reported numbers are usually on the validation split unless otherwise stated.
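As a rough illustration of the structure described above, the sketch below models one question as a record and scores a set of predictions by exact match, both overall and per discipline. The field names (`question_id`, `image_paths`, and so on) and the sample records are hypothetical and are not the official MMMU schema; the official evaluation code may parse and normalize answers differently.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Question:
    """One multimodal question. Field names are illustrative, not the official schema."""
    question_id: str
    discipline: str
    subject: str
    text: str
    image_paths: list[str]   # one or more images per question
    answer: str              # option letter, or a short free-form answer
    options: list[str] = field(default_factory=list)  # empty for free-form questions


def score(questions, predictions):
    """Return (overall accuracy, per-discipline accuracy) using exact match."""
    if not questions:
        raise ValueError("no questions to score")
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.discipline] += 1
        # A missing prediction counts as wrong.
        predicted = predictions.get(q.question_id, "").strip().lower()
        if predicted == q.answer.strip().lower():
            correct[q.discipline] += 1
    per_discipline = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_discipline


# Made-up records for illustration only.
questions = [
    Question("q1", "Business", "Finance", "Which trend does the chart show?",
             ["chart.png"], "B", ["Up", "Down", "Flat", "Cyclical"]),
    Question("q2", "Art and Design", "Music", "Name the key of this excerpt.",
             ["score.png"], "G major"),
]
predictions = {"q1": "b", "q2": "D major"}
overall, per_discipline = score(questions, predictions)
print(overall, per_discipline)
```

Reporting the per-discipline breakdown alongside the overall number makes it easier to see which subjects drive a headline score.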

II

SOTA Progression

| Date | Tier / Score | Note |
|---|---|---|
| Nov 2023 | GPT-4V at 56.8% (val) | Original Yue et al. paper, CVPR 2024. |
| May 2024 | Gemini 1.5 Pro at 62.2% | Mainstream multimodal closed-source. |
| Sep 2024 | Claude 3.5 Sonnet at 68.3% | Anthropic model card. |
| Apr 2025 | Frontier above 73%, MMMU-Pro around 55% | Two-tier reporting becomes standard. |
| May 2026 | Sonnet 4.7 around 78%, GPT-5 around 80%, MMMU-Pro frontier around 60% | MMMU saturating, MMMU-Pro is the live benchmark. |

III

MMMU vs MMMU-Pro

The MMMU follow-up paper showed that around 30% of original MMMU questions are answerable from the text alone, no image required. MMMU-Pro removes these questions and extends answer options from 4 to 10. The result is a benchmark that resists text-only shortcuts and has more headroom. In 2026, headline multimodal claims should quote MMMU-Pro; MMMU is more useful as a comparison baseline than a frontier signal.
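The filtering idea can be sketched as follows. This is a minimal illustration, not the procedure from the MMMU-Pro paper: the two stub functions stand in for language models that see the question text and options but no image, and the `max_text_only_hits` threshold and sample questions are assumptions made up for the example.

```python
LETTERS = "ABCDEFGHIJ"


def always_a(text, options):
    """Stub text-only 'model' that always answers A."""
    return "A"


def longest_option(text, options):
    """Stub text-only 'model' that picks the longest option (first one on ties)."""
    best = max(range(len(options)), key=lambda i: len(options[i]))
    return LETTERS[best]


def needs_image(question, models, max_text_only_hits=0):
    """True if at most `max_text_only_hits` text-only models answer correctly."""
    hits = sum(
        1 for model in models
        if model(question["text"], question["options"]) == question["answer"]
    )
    return hits <= max_text_only_hits


def filter_questions(questions, models, max_text_only_hits=0):
    """Drop questions that text-only models can already answer."""
    return [q for q in questions if needs_image(q, models, max_text_only_hits)]


# Made-up questions for illustration only.
questions = [
    {"id": "q1", "text": "Which organ is highlighted in the scan?",
     "options": ["Liver", "Kidney", "Spleen", "Heart"], "answer": "C"},
    {"id": "q2", "text": "What is the capital of France?",
     "options": ["Paris", "Rome", "Oslo", "Bern"], "answer": "A"},
]
kept = filter_questions(questions, [always_a, longest_option])
kept_ids = [q["id"] for q in kept]
print(kept_ids)
```

In practice the stubs would be replaced by real text-only models, and a stricter or looser threshold trades benchmark size against resistance to text-only shortcuts.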

Reader Questions
Q.01 What is MMMU?
MMMU stands for Massive Multi-discipline Multimodal Understanding. It is a college-exam-style benchmark of 11,500 multimodal questions spanning 6 disciplines (Art and Design, Business, Science, Health and Medicine, Humanities and Social Science, Tech and Engineering) and 30 subjects. Each question pairs text with one or more images (diagrams, plots, photographs, sheet music) and expects a college-level reasoning answer.
Q.02 How is MMMU different from MMLU?
MMLU is text-only multiple choice. MMMU is multimodal, requires understanding of images alongside text, and includes free-form answers as well as multiple choice. MMMU has 30 subjects to MMLU's 57 but each question is much more demanding. They probe different capabilities and a high MMLU score does not predict a high MMMU score.
Q.03 What was the launch baseline?
The original paper (CVPR 2024) reported GPT-4V at 56.8% (validation set). Open-weight LLaVA-1.5 13B trailed at 36.4%. The human-expert ceiling was 88.6%. More than two years later, frontier multimodal models score in the 70s on MMMU, but human experts retain a roughly 15-point lead on the hardest subjects.
Q.04 What is MMMU-Pro?
MMMU-Pro (Yue et al., 2024 follow-up) is a harder variant that removes text-only answerable questions (about 30% of original MMMU) and increases the answer-option count from 4 to 10. The same frontier models that score above 70% on MMMU sit around 55% on MMMU-Pro. MMMU-Pro is the better current discriminator.
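Part of that gap comes from the larger option set alone: uniform random guessing scores 25% with 4 options but only 10% with 10. One rough way to compare the two numbers is to rescale each so that chance maps to 0 and a perfect score maps to 1. The sketch below does this for the illustrative 70% and 55% figures; it assumes every question is multiple choice with a fixed option count, which is a simplification because MMMU also contains some free-form questions.

```python
def chance_level(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over `num_options` choices."""
    if num_options < 2:
        raise ValueError("need at least two options")
    return 1.0 / num_options


def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that chance maps to 0.0 and a perfect score to 1.0."""
    chance = chance_level(num_options)
    return (accuracy - chance) / (1.0 - chance)


# Illustrative figures: 70% on 4-option MMMU, 55% on 10-option MMMU-Pro.
mmmu = chance_corrected(0.70, num_options=4)
mmmu_pro = chance_corrected(0.55, num_options=10)
print(f"MMMU: {mmmu:.2f}, MMMU-Pro: {mmmu_pro:.2f}")  # MMMU: 0.60, MMMU-Pro: 0.50
```

Under these assumptions the 15-point raw gap narrows to about 10 points on the rescaled view, which suggests that most of the drop reflects harder questions and only part of it reflects the option count.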
Q.05 Is MMMU at risk of saturation?
MMMU itself is approaching saturation at the top end. Sonnet 4.7 reaches roughly 78% and GPT-5 around 80%. MMMU-Pro still has substantial headroom and is the version to quote for frontier comparisons. The original MMMU is now best used as a baseline check rather than a frontier discriminator.

Sources

[1] Yue et al. (MMMU, 2023): arxiv.org/abs/2311.16502
[2] MMMU project: mmmu-benchmark.github.io
[3] Yue et al. (MMMU-Pro, 2024): arxiv.org/abs/2409.02813
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
