MATH Benchmark: 12,500 Problems, 99%+ Saturation, What Replaced It
The benchmark that drove the chain-of-thought era, now functionally retired at the top.
The Construction
MATH was published in March 2021 by Dan Hendrycks and collaborators at UC Berkeley. It draws 12,500 problems from American math competitions (AMC 10, AMC 12, AIME, and similar contests). Problems are split 7,500 train / 5,000 test and tagged by subject (seven subjects, from Prealgebra to Precalculus) and by difficulty level 1 through 5. Each problem ships with a full step-by-step solution written in LaTeX.
The grading rule is exact match on the final boxed answer after the model emits its full chain-of-thought. There is no partial credit, no judge model, no human in the loop at scoring time. This makes MATH cheap to run and reproducible, two reasons it became the de facto math benchmark for three years.
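The grading rule can be sketched in a few lines of Python. This is a minimal illustration, not the official grader: the function names (`extract_boxed`, `normalize`, `is_correct`) are hypothetical, the normalization step is a deliberately light assumption, and `reference` is assumed to be the final answer already extracted from the LaTeX solution.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Light, assumed normalization: trim, drop spaces, unify \\frac variants."""
    answer = answer.strip().rstrip(".")
    answer = answer.replace(" ", "")
    return answer.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")


def is_correct(model_output: str, reference: str) -> bool:
    """Exact match on the final boxed answer; no partial credit."""
    predicted = extract_boxed(model_output)
    if predicted is None:
        return False
    return normalize(predicted) == normalize(reference)
```

Brace counting matters here because answers such as `\frac{1}{2}` contain nested braces that a simple regular expression would cut short. A response with no boxed answer is scored as wrong, which matches the no-partial-credit rule.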
SOTA Progression 2021 to 2026
Why MATH Saturated
Three forces compounded. First, chain-of-thought prompting plus self-consistency (Wang et al. 2022) lifted GPT-3-class models from single-digit accuracy to the 50s. Second, math-specific fine-tunes (Minerva, WizardMath, DeepSeekMath) extracted another 15 to 20 points. Third, test-time compute scaling (o1, o3, Sonnet thinking modes) pushed reasoning chains long enough that the remaining error budget on MATH-500, a 500-problem subset of the test set, collapsed to a few percentage points.
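Self-consistency, the first of these forces, amounts to sampling several reasoning chains and keeping the most common final answer. The sketch below shows only the voting step; `majority_vote` is a hypothetical helper, and it assumes the final answers have already been extracted from each sampled chain, with `None` standing in for chains that produced no answer.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer across sampled chains.

    Ties go to the answer that appeared first among the samples.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None  # no chain produced a usable final answer
    return Counter(valid).most_common(1)[0][0]


# Five sampled chains for one problem: three agree, one differs, one failed.
samples = ["12", "15", "12", None, "12"]
print(majority_vote(samples))
```

The voted answer is then graded once with the usual exact-match rule, so the extra cost is entirely in sampling, not in scoring.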
By mid-2025 the headline number stopped moving. When every frontier model scores between 98.6% and 99.4%, the benchmark is no longer measuring capability differences; it is measuring noise plus residual contamination.
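A rough way to see why such a narrow band is mostly noise is to compute the sampling error of an accuracy estimate. The sketch below treats each problem as an independent pass/fail trial, which is a simplifying assumption, and `standard_error` is a hypothetical helper rather than part of any benchmark harness.

```python
import math


def standard_error(accuracy: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy measured on n_problems items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# Accuracy near 99% on the 500-problem subset and on the full 5,000-problem test set.
for n in (500, 5000):
    se = standard_error(0.99, n)
    print(f"n={n}: standard error = {100 * se:.2f} percentage points")
```

Under this assumption, a 99% score on 500 problems carries a standard error of roughly 0.44 percentage points, and the whole 98.6% to 99.4% band corresponds to a difference of four problems. On the full 5,000-problem test set the standard error drops to roughly 0.14 points, but contamination and labeling errors do not shrink with sample size.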
What Replaced It
AIME 2024 and AIME 2025 are the new defaults for top-tier math comparison. Each AIME exam has 15 problems with integer answers between 0 and 999, and two exams (AIME I and AIME II) are held each year. The problems are harder than the average MATH item, and pass rates remain spread across the frontier (Sonnet 4.7 around 89%, o3 around 91%, Gemini 3 Pro around 88% on AIME 2025). For Olympiad-level math, PutnamBench and the HARP project both retain headroom.
FrontierMath (Epoch AI, late 2024) was designed explicitly to resist saturation. It contains research-grade problems written by working mathematicians and remains below 30% for the strongest 2025 reasoning models. It is the closest thing to an unsaturated math benchmark in 2026.
Sources
- [1] Hendrycks et al. (2021), MATH paper: arxiv.org/abs/2103.03874
- [2] OpenAI o1 system card (MATH-500): openai.com/index/learning-to-reason-with-llms
- [3] FrontierMath (Epoch AI): epochai.org/frontiermath
- [4] Sclar et al. (2023), prompt sensitivity: arxiv.org/abs/2310.11324
- [5] Papers With Code MATH leaderboard: paperswithcode.com