Abstract

What: 12,500 competition-math problems from AMC, AIME, and similar contests.

Who: Hendrycks, Burns, Kadavath, Arora, Basart, Tang, Song, Steinhardt (UC Berkeley, 2021).

2026 Tier: Saturated, with the frontier above 99% on MATH-500.

Section I.iv Knowledge and Reasoning | Last verified April 2026

MATH Benchmark: 12,500 Problems, 99%+ Saturation, What Replaced It

The benchmark that drove the chain-of-thought era, now functionally retired at the top.

The Construction

MATH was published in March 2021 by Dan Hendrycks and collaborators at UC Berkeley. It draws 12,500 problems from American math competitions (AMC 10, AMC 12, AIME, USAMO qualifiers, and similar). Problems are split 7,500 train / 5,000 test, tagged by subject and by difficulty level 1 through 5. Each ships with a full LaTeX solution.

The grading rule is exact match on the final boxed answer after the model emits its full chain-of-thought. There is no partial credit, no judge model, no human in the loop at scoring time. This makes MATH cheap to run and reproducible, two reasons it became the de facto math benchmark for three years.
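The grading rule can be sketched in a few lines of Python. This is a minimal illustration, not the official MATH grader: the function names (`extract_last_boxed`, `normalize`, `is_correct`) and the very light normalization are assumptions made for this example, and real evaluation harnesses usually canonicalize answers more aggressively.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def normalize(answer: str) -> str:
    # Assumption: only strip spaces and dollar signs before comparing.
    return answer.replace(" ", "").replace("$", "")


def is_correct(model_output: str, reference_solution: str) -> bool:
    """Exact match on the final boxed answer; no partial credit."""
    pred = extract_last_boxed(model_output)
    gold = extract_last_boxed(reference_solution)
    if pred is None or gold is None:
        return False
    return normalize(pred) == normalize(gold)
```

Under this sketch, a response ending in `\boxed{0.5}` is marked wrong against a reference of `\boxed{\frac{1}{2}}`. That strictness is the price of having no judge model in the loop.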

SOTA Progression 2021 to 2026

| Date | Model and score | Note |
| --- | --- | --- |
| Mar 2021 | Best baseline (GPT-2 1.5B, fine-tuned) at 6.9% | Original MATH paper, no CoT, no tools; few-shot GPT-3 175B scored lower. |
| Jun 2022 | Minerva 540B at 50.3% | First serious math-finetune; CoT and majority vote. |
| Mar 2023 | GPT-4 at 42.5% (0-shot CoT) | Pre-tool, pre-self-consistency. |
| Dec 2023 | Frontier with tools at 78-84% | Code interpreter unlocks big gains on arithmetic. |
| Sep 2024 | o1-preview at 94.8% (MATH-500) | Test-time compute scaling, reasoning chains. |
| Apr 2025 | Frontier above 99% (MATH-500) | Effectively saturated for top-tier comparison. |
| May 2026 | Used as sanity check only | Headline math comparison has moved to AIME 2025 and FrontierMath. |


Why MATH Saturated

Three forces compounded. First, chain-of-thought prompting plus self-consistency (Wang et al. 2022) lifted GPT-3-class models from single digits to the 50s. Second, math-specific fine-tunes (Minerva, WizardMath, DeepSeekMath) extracted another 15 to 20 points. Third, test-time compute scaling (o1, o3, Sonnet thinking modes) lengthened reasoning chains enough that the remaining error budget on MATH-500 collapsed to single percentage points.
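The self-consistency step amounts to sampling several reasoning chains and keeping the most common final answer. A minimal sketch follows, assuming the final answers have already been extracted from each chain (with `None` for chains that produced none). The function name `majority_vote` and the sample data are illustrative, not taken from Wang et al.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent final answer among sampled chains."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common orders equal counts by first occurrence,
    # so ties go to the answer that was sampled first.
    return Counter(valid).most_common(1)[0][0]


# Hypothetical final answers from six sampled chains for one problem.
sampled = ["42", "41", "42", None, "42", "17"]
chosen = majority_vote(sampled)
```

Here `chosen` is `"42"`, which is then graded by exact match like any single-sample answer.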

By mid-2025 the headline number stopped moving. When every frontier model scores between 98.6% and 99.4%, the benchmark is no longer measuring capability differences; it is measuring noise plus residual contamination.
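A rough calculation shows why a spread this narrow is hard to tell apart from noise. The sketch below uses a normal-approximation (Wald) interval, which is itself crude this close to 100%, and assumes 500 independent problems and a single run per model. The function name `accuracy_interval` is an assumption for this example.

```python
import math


def accuracy_interval(accuracy: float, n_problems: int, z: float = 1.96):
    """Approximate 95% interval for an accuracy measured on n_problems."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_problems)
    return accuracy - z * se, accuracy + z * se


# A model measured at 99% on a 500-problem set.
low, high = accuracy_interval(0.99, 500)

# How many problems separate 98.6% from 99.4% on 500 problems?
problems_apart = round((0.994 - 0.986) * 500)
```

Under these assumptions the interval runs from roughly 98.1% to 99.9%, and the whole 98.6% to 99.4% band corresponds to a difference of 4 problems. Two models at opposite ends of that band are not reliably distinguishable on MATH-500 alone.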

What Replaced It

AIME 2024 and AIME 2025 are the new defaults for top-tier math comparison. Each AIME paper has 15 problems with integer answers from 0 to 999, and two papers (AIME I and II) are set each year. The problems are harder than the average MATH item, and pass rates remain spread across the frontier (Sonnet 4.7 around 89%, o3 around 91%, Gemini 3 Pro around 88% on AIME 2025). For Olympiad-level math, Putnam-Bench and the HARP project both retain headroom.
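Because AIME answers are integers from 0 to 999, scoring needs no LaTeX parsing. The sketch below is illustrative only: `parse_aime_answer` and `aime_score` are hypothetical helper names, and it assumes the model's final answer has already been isolated as a string.

```python
from typing import Optional, Sequence


def parse_aime_answer(raw: str) -> Optional[int]:
    """Return the answer as an int if it is a valid AIME response, else None."""
    s = raw.strip()
    # Valid responses are one to three ASCII digits, e.g. "7", "073", "999".
    if not (s.isascii() and s.isdigit()) or len(s) > 3:
        return None
    return int(s)


def aime_score(predictions: Sequence[str], answer_key: Sequence[int]) -> float:
    """Fraction of problems whose parsed answer equals the key."""
    correct = sum(
        parse_aime_answer(p) == k for p, k in zip(predictions, answer_key)
    )
    return correct / len(answer_key)
```

With only 15 problems per paper, one problem is worth about 6.7 percentage points. This is why reported AIME percentages are usually averages over several sampled runs rather than a single pass.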

FrontierMath (Epoch AI, late 2024) was designed explicitly to resist saturation. It contains research-grade problems written by working mathematicians and remains below 30% for the strongest 2025 reasoning models. It is the closest thing to an unsaturated math benchmark in 2026.


Reader Questions

Q.01 What does the MATH benchmark cover?

MATH is a dataset of 12,500 competition mathematics problems sourced from AMC 10, AMC 12, AIME, and similar contests, classified into seven subjects (algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, precalculus) and five difficulty levels. Each problem ships with a step-by-step solution, and accuracy is judged by exact match on the final boxed answer.

Q.02 Is MATH still useful in 2026?

Not for top-tier model comparison. Frontier models score above 99% on MATH-500 with chain-of-thought and tool use, leaving no resolution between leaders. MATH is still useful as a sanity check, for mid-tier and open-weight comparisons, and as a teaching artefact for evaluation methodology.

Q.03 What replaced MATH?

AIME 2024 and AIME 2025 (15 problems per paper, integer answers, harder than the MATH average) are the main replacements, alongside Olympiad-level benchmarks like Putnam-Bench and HARP. For very hard math reasoning, FrontierMath (Epoch AI 2024) was designed explicitly to resist saturation and remains below 30% for the strongest 2025 models.

Q.04 Can a model cheat MATH through training-data leakage?

Yes, partly. The MATH dataset has been on the public web since 2021. Sclar et al. and the BIG-bench team have documented near-verbatim test problems appearing in scraped corpora. MATH-500, the 500-problem subset popularized by OpenAI, is drawn from the same public test split, so it lowers evaluation cost but does not eliminate contamination.

Q.05 What is MATH-500 and how does it differ?

MATH-500 is a 500-problem subset of the MATH test split. OpenAI selected it for the "Let's Verify Step by Step" work (Lightman et al., 2023) and later reported on it in the o1 release. It makes evaluation tractable and reduces dataset-overlap artefacts. It is what most 2024 and 2025 model cards mean when they quote MATH accuracy. Direct comparison of full-MATH (5,000 test problems) numbers to MATH-500 numbers is not strictly valid.

Sources

[1] Hendrycks et al. (2021), MATH paper: arxiv.org/abs/2103.03874
[2] OpenAI (2024), Learning to Reason with LLMs (o1, MATH-500): openai.com/index/learning-to-reason-with-llms
[3] FrontierMath (Epoch AI): epochai.org/frontiermath
[4] Sclar et al. (2023), prompt sensitivity: arxiv.org/abs/2310.11324
[5] Papers With Code MATH leaderboard: paperswithcode.com