Frontier Models on Benchmarks 2026: Sonnet 4.7, GPT-5, Gemini 3 Cited
One table, twelve benchmarks, five models. Every number sourced. No single overall ranking.
The Table
Numbers below are the most recent vendor or leaderboard score we could source for each model on each benchmark. Where the vendor reported a higher headline with a substantially weaker methodology, we use the methodology-disclosed number and note the gap in the per-benchmark page linked below.
What This Table Shows
Three patterns stand out. First, the frontier is tight on most benchmarks: the leader and second place are separated by only 2 to 4 points on SWE-bench, MMLU-Pro, BFCL, and Tau-Bench. Second, reasoning-tuned models (GPT-5, Sonnet 4.7) lead on AIME and HLE by larger margins (4 to 8 points) because those benchmarks reward extended-thinking budgets. Third, open-weight models trail by a meaningful margin on agentic benchmarks (SWE-bench, Tau-Bench, GAIA) and by a smaller margin on knowledge benchmarks (MMLU-Pro). The agentic gap is the most consequential one for practitioners choosing a deployment model.
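The leader-to-second-place gap described above can be computed for any single benchmark. This is a minimal sketch, not the method used to build the table: the function name `leader_gap` and the model labels and scores in the usage lines are placeholders invented for illustration, not figures from the table.

```python
def leader_gap(scores: dict[str, float]) -> tuple[str, float]:
    """Return (leading model, point gap between first and second place).

    `scores` maps a model name to its score on ONE benchmark.
    """
    if len(scores) < 2:
        raise ValueError("need at least two models to compute a gap")
    # Sort models from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (leader, top), (_, second) = ranked[0], ranked[1]
    return leader, top - second


# Placeholder scores for illustration only; NOT real benchmark results.
example_scores = {"model_a": 70.0, "model_b": 67.0, "model_c": 55.0}
leader, gap = leader_gap(example_scores)
print(f"{leader} leads by {gap:.1f} points")
```

Running this per benchmark, rather than averaging across benchmarks, keeps the comparison in line with the article's choice not to publish a single overall ranking.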
What This Table Hides
Cost. Sonnet 4.7 input is $3 per million tokens; Llama 4 70B can be run at roughly $0.40 per million tokens through a hosted endpoint. The cost spread is roughly 8x. A 5-point benchmark gap is rarely worth an 8x cost multiplier in production; a 20-point gap usually is. Read the table alongside the cost calculator linked below.
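The cost arithmetic above can be checked with a few lines. This is a rough sketch that uses the two per-million-token prices quoted in the text; the monthly volume of 500 million input tokens is a hypothetical workload chosen for illustration, and output-token pricing is ignored.

```python
# Input prices in USD per million tokens, as quoted in the text above.
SONNET_INPUT_USD_PER_MTOK = 3.00
LLAMA_HOSTED_USD_PER_MTOK = 0.40


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per token."""
    return expensive / cheap


def monthly_cost(million_tokens: float, usd_per_mtok: float) -> float:
    """Input-token spend for a month at a given volume."""
    return million_tokens * usd_per_mtok


ratio = cost_ratio(SONNET_INPUT_USD_PER_MTOK, LLAMA_HOSTED_USD_PER_MTOK)
print(f"cost spread: {ratio:.1f}x")  # 3.00 / 0.40, which the text rounds to 8x

# Hypothetical workload: 500 million input tokens per month (an assumption).
VOLUME_MTOK = 500
for name, price in [("Sonnet 4.7", SONNET_INPUT_USD_PER_MTOK),
                    ("Llama 4 70B (hosted)", LLAMA_HOSTED_USD_PER_MTOK)]:
    print(f"{name}: ${monthly_cost(VOLUME_MTOK, price):,.2f} per month")
```

Putting the spread in monthly dollars for your own volume makes it easier to judge whether a given benchmark gap justifies the multiplier.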
Q.01 How were these numbers selected?
Q.02 Why do some numbers differ from vendor model cards?
Q.03 Should I trust these numbers for my deployment decision?
Q.04 Why is there no single overall ranking?
Q.05 When was this last updated?
Sources
- [1] SWE-bench leaderboard: swebench.com
- [2] HF Open LLM Leaderboard v2: huggingface.co/spaces/open-llm-leaderboard
- [3] BFCL leaderboard: gorilla.cs.berkeley.edu/leaderboard
- [4] HLE leaderboard: lastexam.ai
- [5] GAIA leaderboard: huggingface.co/spaces/gaia-benchmark/leaderboard
- [6] LiveCodeBench: livecodebench.github.io