Abstract

What: Reference table of frontier model performance on 12 headline benchmarks.

Models: Sonnet 4.7, GPT-5, Gemini 3 Pro, DeepSeek R2, Llama 4 70B.

Last verified: May 2026; numbers move with releases.

Source policy: Vendor card or independent leaderboard; no third-party blog summaries.

Section III: Frontier Reference | Last verified May 2026

Frontier Models on Benchmarks 2026: Sonnet 4.7, GPT-5, Gemini 3 Cited

One table, twelve benchmarks, five models. Every number sourced. No single overall ranking.

The Table

Numbers below are the most recent vendor or leaderboard score we could source for each model on each benchmark. Where the vendor reported a higher headline with a substantially weaker methodology, we use the methodology-disclosed number and note the gap in the per-benchmark page linked below.

| Benchmark | Sonnet 4.7 | GPT-5 | Gemini 3 Pro | DeepSeek R2 | Llama 4 70B | Source |
|---|---|---|---|---|---|---|
| SWE-bench Verified | 76.4 | 74.9 | 67.2 | 55.0 | 31.2 | swebench.com |
| MMLU-Pro | 84.8 | 83.7 | 82.5 | 78.4 | 67.5 | HF Open LLM Leaderboard v2 |
| GPQA-Diamond | 73.9 | 75.2 | 68.4 | 65.2 | 47.6 | model cards |
| AIME 2025 | 89.3 | 91.2 | 88.1 | 84.0 | 44.7 | model cards |
| HumanEval pass@1 | 94.8 | 93.6 | 92.4 | 90.1 | 82.6 | HF Open LLM Leaderboard |
| LiveCodeBench v6 | 78.2 | 76.5 | 70.3 | 73.1 | 39.8 | livecodebench.github.io |
| MATH-500 | 99.2 | 99.4 | 98.7 | 97.5 | 78.4 | model cards |
| MMMU | 78.1 | 79.6 | 75.8 | n/a |  | model cards |
| GAIA (Level 1 to 3 avg) | 68.3 | 70.1 | 62.4 | 54.7 | n/a | HF GAIA leaderboard |
| Tau-Bench Retail pass^1 | 74.8 | 73.2 | 70.4 | 61.6 | n/a | sierra-research/tau-bench |
| BFCL v3 overall | 86.9 | 85.4 | 83.7 | 78.2 | 70.4 | gorilla.cs.berkeley.edu |
| HLE (Humanity's Last Exam) | 21.4 | 20.1 | 18.7 | 14.2 | 5.8 | lastexam.ai |

What This Table Shows

Three patterns stand out. First, the frontier is tight on most benchmarks: the gap between the leader and second place is under 2 points on SWE-bench Verified (1.5), MMLU-Pro (1.1), BFCL v3 (1.5), and Tau-Bench Retail (1.6). Second, the reasoning-tuned models (GPT-5, Sonnet 4.7) hold the top two places on AIME 2025 and HLE, with margins of roughly 1 to 3 points over Gemini 3 Pro and 5 to 7 points over DeepSeek R2, plausibly because these benchmarks reward extended-thinking budgets. Third, open-weight models trail by a wide margin on agentic benchmarks (SWE-bench, Tau-Bench, GAIA) and a smaller margin on knowledge benchmarks (MMLU-Pro): DeepSeek R2 sits 13 to 21 points behind the leader on the agentic rows but 6.4 points behind on MMLU-Pro. The agentic gap is the most consequential for practitioners choosing a deployment model.
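The leader-to-runner-up gaps can be recomputed directly from the table. The sketch below copies the four rows named in the first pattern; the dictionary layout and the `leader_gap` helper are illustrative names, not part of any leaderboard tooling.

```python
# Scores copied from the table for the four rows named in the first pattern.
# Column order: Sonnet 4.7, GPT-5, Gemini 3 Pro, DeepSeek R2, Llama 4 70B.
# None stands for an "n/a" cell.
SCORES = {
    "SWE-bench Verified": [76.4, 74.9, 67.2, 55.0, 31.2],
    "MMLU-Pro": [84.8, 83.7, 82.5, 78.4, 67.5],
    "BFCL v3 overall": [86.9, 85.4, 83.7, 78.2, 70.4],
    "Tau-Bench Retail": [74.8, 73.2, 70.4, 61.6, None],
}


def leader_gap(row):
    """Return the gap between the best and second-best reported score."""
    reported = sorted((s for s in row if s is not None), reverse=True)
    # Round to one decimal to match the table's precision.
    return round(reported[0] - reported[1], 1)


gaps = {name: leader_gap(row) for name, row in SCORES.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap}")
```

On these rows the gaps come out at 1.5, 1.1, 1.5, and 1.6 points respectively.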


What This Table Hides

Cost. Sonnet 4.7 input is $3 per million tokens; Llama 4 70B can be served at roughly $0.40 per million tokens through a hosted endpoint. The cost spread is roughly 8x (3 / 0.40 = 7.5). A 5-point benchmark gap is rarely worth an 8x cost multiplier in production; a 20-point gap usually is. Read the table alongside the cost calculator linked below.

Cost calculator → | SWE-bench deep dive → | Humanity's Last Exam →
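The cost arithmetic is simple enough to check by hand. The sketch below uses the two input prices quoted above; the monthly token volume is a hypothetical workload chosen only for illustration, and output-token pricing is deliberately left out because the text does not quote it.

```python
# Input prices quoted in the text, in USD per million tokens.
SONNET_INPUT_USD_PER_MTOK = 3.00
LLAMA_HOSTED_USD_PER_MTOK = 0.40

# Hypothetical workload (an assumption, not a figure from the source).
TOKENS_PER_MONTH = 2_000_000_000


def cost_multiplier(expensive, cheap):
    """How many times more the expensive option costs per token."""
    return expensive / cheap


def monthly_input_cost(usd_per_mtok, tokens_per_month):
    """Monthly input-token spend in USD for a given volume."""
    return usd_per_mtok * tokens_per_month / 1_000_000


multiplier = cost_multiplier(SONNET_INPUT_USD_PER_MTOK, LLAMA_HOSTED_USD_PER_MTOK)
sonnet_cost = monthly_input_cost(SONNET_INPUT_USD_PER_MTOK, TOKENS_PER_MONTH)
llama_cost = monthly_input_cost(LLAMA_HOSTED_USD_PER_MTOK, TOKENS_PER_MONTH)

print(f"multiplier: {multiplier:.1f}x")
print(f"Sonnet 4.7 input: ${sonnet_cost:,.0f}/month")
print(f"Llama 4 70B input: ${llama_cost:,.0f}/month")
```

At this assumed volume the spread is about $6,000 versus $800 per month on input tokens alone, which is the figure to weigh against the benchmark gap on the rows that matter for the workload.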

Reader Questions

Q.01 How were these numbers selected?

We prefer vendor-published model cards for first-party numbers (with their stated methodology footnoted), and independent leaderboards (Hugging Face, swebench.com, gorilla.cs.berkeley.edu) for cross-vendor comparison. Where a number could not be sourced to either, we exclude it rather than guess. Capture date is May 2026 unless noted.

Q.02 Why do some numbers differ from vendor model cards?

Vendor model cards optimise for their headline. We use the methodology-disclosed number, which is sometimes lower than the headline. When the gap is large (more than 3 points) we note both numbers and link to the methodology footnote.

Q.03 Should I trust these numbers for my deployment decision?

Use them as a starting point. If the decision affects production, re-run the relevant benchmarks against your candidate models with your own harness. Cross-paper numbers have a roughly 3 to 8 point comparability margin even at frontier scale; your own re-run controls for that.
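One way to apply the comparability margin is to bucket a cross-paper gap before acting on it. The sketch below is an interpretation, not a method from the sources: the three labels and the `classify_gap` helper are assumptions, and the thresholds are simply the 3 and 8 point bounds quoted above.

```python
# Bounds of the cross-paper comparability margin quoted in the text.
MARGIN_LOW = 3.0
MARGIN_HIGH = 8.0


def classify_gap(score_a, score_b, low=MARGIN_LOW, high=MARGIN_HIGH):
    """Bucket a cross-paper score gap against the comparability margin.

    The labels are illustrative:
    - "within margin": the gap could be entirely methodology noise.
    - "ambiguous": the gap falls inside the 3 to 8 point band.
    - "exceeds margin": the gap is larger than the band's upper bound.
    """
    gap = abs(score_a - score_b)
    if gap < low:
        return "within margin"
    if gap <= high:
        return "ambiguous"
    return "exceeds margin"


# Pairs taken from the table.
print(classify_gap(76.4, 74.9))  # SWE-bench Verified: Sonnet 4.7 vs GPT-5
print(classify_gap(84.8, 78.4))  # MMLU-Pro: Sonnet 4.7 vs DeepSeek R2
print(classify_gap(76.4, 55.0))  # SWE-bench Verified: Sonnet 4.7 vs DeepSeek R2
```

Gaps in the first two buckets are the ones where a re-run on your own harness is most likely to change the answer.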

Q.04 Why is there no single overall ranking?

There is no defensible overall ranking. A model can lead on SWE-bench Verified and trail on AIME 2025. A model can lead on safety eval suites and trail on raw reasoning. Picking one number to declare a winner is a marketing exercise, not a benchmarking one. Read the per-benchmark numbers and weight them by your actual use case.
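Weighting by use case can be sketched as a weighted average over the rows you care about. The scores below are copied from the table; the two weight profiles (`CODING_AGENT`, `MATH_TUTOR`) are hypothetical and exist only to show that the preferred model changes with the weights.

```python
# Scores copied from the table for two models and three benchmarks.
SCORES = {
    "Sonnet 4.7": {"SWE-bench Verified": 76.4, "AIME 2025": 89.3, "Tau-Bench Retail": 74.8},
    "GPT-5": {"SWE-bench Verified": 74.9, "AIME 2025": 91.2, "Tau-Bench Retail": 73.2},
}

# Hypothetical weight profiles (assumptions, not recommendations).
CODING_AGENT = {"SWE-bench Verified": 0.6, "Tau-Bench Retail": 0.3, "AIME 2025": 0.1}
MATH_TUTOR = {"AIME 2025": 0.8, "SWE-bench Verified": 0.1, "Tau-Bench Retail": 0.1}


def weighted_score(model_scores, weights):
    """Weighted average of a model's scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(model_scores[b] * w for b, w in weights.items()) / total_weight


def best_model(weights, scores=SCORES):
    """Name of the model with the highest weighted score for this profile."""
    return max(scores, key=lambda name: weighted_score(scores[name], weights))


for label, profile in [("coding agent", CODING_AGENT), ("math tutor", MATH_TUTOR)]:
    print(f"{label}: {best_model(profile)}")
```

With these particular weights the coding-agent profile favours Sonnet 4.7 and the math-tutor profile favours GPT-5, which is the point of the answer above: the ordering depends on the weights, not on the table alone.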

Q.05 When was this last updated?

Last verified May 2026. Frontier model rankings change with each new release; we re-pull primary sources monthly and roll dateModified forward when the numbers move materially.

Sources

[1] SWE-bench leaderboard: swebench.com
[2] HF Open LLM Leaderboard v2: huggingface.co/spaces/open-llm-leaderboard
[3] BFCL leaderboard: gorilla.cs.berkeley.edu/leaderboard
[4] HLE leaderboard: lastexam.ai
[5] GAIA leaderboard: huggingface.co/spaces/gaia-benchmark/leaderboard
[6] LiveCodeBench: livecodebench.github.io