MMMU Benchmark: 11,500 Multimodal Questions, Frontier Above 70%
The college-exam benchmark for multimodal reasoning. MMMU-Pro is the version frontier teams quote in 2026.
Construction
Yue et al. sourced questions from college textbooks, quizzes, and exam practice sets across 30 subjects. Each question pairs natural-language text with one or more images: anatomical diagrams in Medicine, equipment photos in Engineering, sheet music in Music, financial charts in Business. The images are not decorative: answering typically requires extracting information from the image and combining it with subject-matter knowledge.
The dataset splits into 150 dev (few-shot development), 900 validation, and 10,500 test questions. Test labels are private; reported numbers are usually on the validation split unless otherwise stated.
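Because public numbers come from a split of only a few hundred questions, small gaps between models can fall inside sampling noise. The sketch below estimates that noise for a 900-question split, assuming each question is an independent pass/fail trial and using a simple normal approximation. The function names are illustrative and not part of any MMMU tooling.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def normal_interval(accuracy: float, n_questions: int, z: float = 1.96):
    """Approximate 95% interval (normal approximation), clipped to [0, 1]."""
    half_width = z * accuracy_standard_error(accuracy, n_questions)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


if __name__ == "__main__":
    # Hypothetical model scoring 70% on a 900-question split.
    low, high = normal_interval(0.70, 900)
    print(f"70.0% on 900 questions: roughly {low:.1%} to {high:.1%}")
```

Under these assumptions, a 70% score on 900 questions carries an uncertainty of about plus or minus 3 points, so differences of a point or two on such a split are weak evidence on their own.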
SOTA Progression
MMMU vs MMMU-Pro
The MMMU-Pro paper showed that around 30% of original MMMU questions are answerable from the text alone, with no image required. MMMU-Pro removes these questions and expands the answer options from 4 to as many as 10, which lowers the score a model can reach by guessing. The result is a benchmark that resists text-only shortcuts and has more headroom. In 2026, headline multimodal claims should quote MMMU-Pro; MMMU is more useful as a comparison baseline than as a frontier signal.
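The effect of both changes can be sketched in a few lines. The example assumes uniformly random guessing over exactly 4 or exactly 10 options, and it models the text-only filter as a plain boolean flag per question. That is a simplification for illustration, not the procedure from the MMMU-Pro paper, and every name and record field here is hypothetical.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that 0.0 is chance level and 1.0 is a perfect score."""
    chance = chance_accuracy(num_options)
    return (accuracy - chance) / (1.0 - chance)


def drop_text_only_solvable(questions, solved_without_image):
    """Keep only questions that a text-only model did NOT answer correctly.

    questions: list of dicts with an "id" key (hypothetical record shape).
    solved_without_image: dict mapping question id -> bool.
    """
    return [q for q in questions if not solved_without_image[q["id"]]]


if __name__ == "__main__":
    for k in (4, 10):
        print(f"{k} options: chance = {chance_accuracy(k):.0%}, "
              f"70% raw = {chance_corrected(0.70, k):.1%} above chance")

    toy = [{"id": "q1"}, {"id": "q2"}, {"id": "q3"}]
    flags = {"q1": True, "q2": False, "q3": False}
    print([q["id"] for q in drop_text_only_solvable(toy, flags)])
```

Moving from 4 to 10 options drops the guessing floor from 25% to 10%, so the same raw accuracy reflects more genuine capability and leaves less room for lucky answers.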
Q.01 What is MMMU?
Q.02 How is MMMU different from MMLU?
Q.03 What was the launch baseline?
Q.04 What is MMMU-Pro?
Q.05 Is MMMU at risk of saturation?
Sources
- [1] Yue et al. (MMMU, 2023): arxiv.org/abs/2311.16502
- [2] MMMU project: mmmu-benchmark.github.io
- [3] MMMU-Pro (Yue et al., 2024): arxiv.org/abs/2409.02813