Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Reference table of frontier model performance on 12 headline benchmarks.
Models: Sonnet 4.7, GPT-5, Gemini 3 Pro, DeepSeek R2, Llama 4 70B.
Last verified: May 2026; numbers move with releases.
Source policy: Vendor card or independent leaderboard, no third-party blog summaries.
Section III Frontier Reference | Last verified May 2026

Frontier Models on Benchmarks 2026: Sonnet 4.7, GPT-5, Gemini 3 Cited

One table, twelve benchmarks, five models. Every number sourced. No single overall ranking.

I

The Table

Numbers below are the most recent vendor or leaderboard score we could source for each model on each benchmark. Where the vendor reported a higher headline with a substantially weaker methodology, we use the methodology-disclosed number and note the gap in the per-benchmark page linked below.

| Benchmark | Sonnet 4.7 | GPT-5 | Gemini 3 Pro | DeepSeek R2 | Llama 4 70B | Source |
|---|---|---|---|---|---|---|
| SWE-bench Verified | 76.4 | 74.9 | 67.2 | 55.0 | 31.2 | swebench.com |
| MMLU-Pro | 84.8 | 83.7 | 82.5 | 78.4 | 67.5 | HF Open LLM Leaderboard v2 |
| GPQA-Diamond | 73.9 | 75.2 | 68.4 | 65.2 | 47.6 | model cards |
| AIME 2025 | 89.3 | 91.2 | 88.1 | 84.0 | 44.7 | model cards |
| HumanEval pass@1 | 94.8 | 93.6 | 92.4 | 90.1 | 82.6 | HF Open LLM Leaderboard |
| LiveCodeBench v6 | 78.2 | 76.5 | 70.3 | 73.1 | 39.8 | livecodebench.github.io |
| MATH-500 | 99.2 | 99.4 | 98.7 | 97.5 | 78.4 | model cards |
| MMMU | 78.1 | 79.6 | 75.8 | n/a | n/a | model cards |
| GAIA (Level 1 to 3 avg) | 68.3 | 70.1 | 62.4 | 54.7 | n/a | HF GAIA leaderboard |
| Tau-Bench Retail pass^1 | 74.8 | 73.2 | 70.4 | 61.6 | n/a | sierra-research/tau-bench |
| BFCL v3 overall | 86.9 | 85.4 | 83.7 | 78.2 | 70.4 | gorilla.cs.berkeley.edu |
| HLE (Humanity's Last Exam) | 21.4 | 20.1 | 18.7 | 14.2 | 5.8 | lastexam.ai |

II

What This Table Shows

Three patterns stand out. First, the frontier is tight on most benchmarks: the gap between the leader and second place is under 2 points (1.1 to 1.6) on SWE-bench Verified, MMLU-Pro, BFCL v3, and Tau-Bench Retail. Second, the reasoning-tuned models (GPT-5, Sonnet 4.7) top AIME 2025 and HLE, ahead of Gemini 3 Pro by about 3 points and DeepSeek R2 by about 7 points, because the benchmarks reward extended-thinking budgets. Third, open-weight models trail by a wide margin on agentic benchmarks and a smaller margin on knowledge benchmarks: DeepSeek R2, the stronger of the two, sits 13 to 21 points behind the leader on SWE-bench Verified, Tau-Bench Retail, and GAIA, but only 6.4 points behind on MMLU-Pro. The agentic gap is the most consequential for practitioners choosing a deployment model.
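The leader-to-runner-up gaps can be recomputed directly from the table. The sketch below is one way to do it; the scores are copied from four rows of the table, while the function name `leader_gap` and the data layout are illustrative choices, not part of any leaderboard's tooling.

```python
# Column order matches the table: one score per model, None where the table says n/a.
MODELS = ["Sonnet 4.7", "GPT-5", "Gemini 3 Pro", "DeepSeek R2", "Llama 4 70B"]

SCORES = {
    "SWE-bench Verified": [76.4, 74.9, 67.2, 55.0, 31.2],
    "MMLU-Pro": [84.8, 83.7, 82.5, 78.4, 67.5],
    "Tau-Bench Retail": [74.8, 73.2, 70.4, 61.6, None],
    "BFCL v3": [86.9, 85.4, 83.7, 78.2, 70.4],
}


def leader_gap(row):
    """Return (leader, runner_up, gap in points) for one benchmark row."""
    ranked = sorted(
        ((score, model) for score, model in zip(row, MODELS) if score is not None),
        reverse=True,
    )
    (top, leader), (second, runner_up) = ranked[0], ranked[1]
    return leader, runner_up, round(top - second, 1)


for name, row in SCORES.items():
    leader, runner_up, gap = leader_gap(row)
    print(f"{name}: {leader} leads {runner_up} by {gap} points")
```

Adding the remaining rows to `SCORES` extends the same check to every benchmark in the table.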

III

What This Table Hides

Cost. Sonnet 4.7 input is $3 per million tokens; Llama 4 70B can be served at roughly $0.40 per million tokens through a hosted endpoint. The cost spread is roughly 7.5x. A 5-point benchmark gap is rarely worth a 7.5x cost multiplier in production; a 20-point gap usually is. Read the table alongside the cost calculator linked below.
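The cost arithmetic is easy to rerun for your own volume. The sketch below uses the two input prices quoted above; the monthly token volume is a hypothetical workload chosen for illustration, and output-token pricing is left out because this page does not quote it.

```python
# Input prices quoted in the paragraph above, in USD per million tokens.
SONNET_INPUT_PER_MTOK = 3.00
LLAMA_HOSTED_PER_MTOK = 0.40

# Hypothetical workload (an assumption, not a figure from this page).
TOKENS_PER_MONTH = 2_000_000_000


def cost_multiplier(expensive_per_mtok, cheap_per_mtok):
    """How many times more the expensive model costs per input token."""
    return expensive_per_mtok / cheap_per_mtok


def monthly_input_cost(price_per_mtok, tokens_per_month):
    """Monthly input-token spend in USD at a given per-million-token price."""
    return price_per_mtok * tokens_per_month / 1_000_000


multiplier = cost_multiplier(SONNET_INPUT_PER_MTOK, LLAMA_HOSTED_PER_MTOK)
sonnet = monthly_input_cost(SONNET_INPUT_PER_MTOK, TOKENS_PER_MONTH)
llama = monthly_input_cost(LLAMA_HOSTED_PER_MTOK, TOKENS_PER_MONTH)

print(f"Cost multiplier: {multiplier:.1f}x")
print(f"Sonnet 4.7 input: ${sonnet:,.0f} per month")
print(f"Llama 4 70B input: ${llama:,.0f} per month")
```

Comparing the monthly difference against the benchmark gap on the task you care about gives a rough sense of what each point is costing.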

Cost calculator · SWE-bench deep dive · Humanity's Last Exam
Reader Questions
Q.01 How were these numbers selected?
We prefer vendor-published model cards for first-party numbers (with their stated methodology footnoted), and independent leaderboards (Hugging Face, swebench.com, gorilla.cs.berkeley.edu) for cross-vendor comparison. Where a number could not be sourced to either, we exclude it rather than guess. Capture date is May 2026 unless noted.
Q.02 Why do some numbers differ from vendor model cards?
Vendor model cards optimise for their headline. We use the methodology-disclosed number, which is sometimes lower than the headline. When the gap is large (more than 3 points) we note both numbers and link to the methodology footnote.
Q.03 Should I trust these numbers for my deployment decision?
Use them as a starting point. Re-run the relevant benchmarks against your candidate models with your own harness if the decision affects production. Cross-paper numbers have a roughly 3 to 8 point comparability margin even at frontier scale; your own re-run controls for that.
Q.04 Why is there no single overall ranking?
There is no defensible overall ranking. A model can lead on SWE-bench Verified and trail on AIME 2025. A model can lead on safety eval suites and trail on raw reasoning. Picking one number to declare a winner is a marketing exercise, not a benchmarking one. Read the per-benchmark numbers and weight them by your actual use case.
Q.05 When was this last updated?
Last verified May 2026. Frontier model rankings change with each new release; we re-pull primary sources monthly and roll dateModified forward when the numbers move materially.
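The advice in Q.03 and Q.04 (weight the per-benchmark numbers by your use case, and treat small gaps with caution) can be sketched as follows. The scores are copied from the table; the weights describe a hypothetical coding-agent deployment, and applying the 8-point upper end of the comparability margin to a weighted score is a rough heuristic assumed here, not a method this page prescribes. The result is a shortlist aid for one use case, not an overall ranking.

```python
# Scores copied from the table for three models and three benchmarks.
SCORES = {
    "Sonnet 4.7": {"SWE-bench Verified": 76.4, "Tau-Bench Retail": 74.8, "AIME 2025": 89.3},
    "GPT-5": {"SWE-bench Verified": 74.9, "Tau-Bench Retail": 73.2, "AIME 2025": 91.2},
    "DeepSeek R2": {"SWE-bench Verified": 55.0, "Tau-Bench Retail": 61.6, "AIME 2025": 84.0},
}

# Hypothetical weights for a coding-agent use case; they should sum to 1.
WEIGHTS = {"SWE-bench Verified": 0.6, "Tau-Bench Retail": 0.3, "AIME 2025": 0.1}


def weighted_score(model_scores, weights):
    """Use-case-weighted average of one model's benchmark scores."""
    return sum(model_scores[bench] * weight for bench, weight in weights.items())


def clears_margin(score_a, score_b, margin=8.0):
    """True when the gap exceeds the comparability margin (assumed default: 8 points)."""
    return abs(score_a - score_b) > margin


totals = {model: weighted_score(scores, WEIGHTS) for model, scores in SCORES.items()}
best = max(totals, key=totals.get)

for model, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    if model == best:
        note = "highest weighted score"
    elif clears_margin(totals[best], total):
        note = "gap exceeds the margin"
    else:
        note = "within the margin; re-run with your own harness"
    print(f"{model}: {total:.2f} ({note})")
```

Models that land within the margin of each other are the ones worth re-running on your own harness before deciding.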

Sources

  1. SWE-bench leaderboard: swebench.com
  2. HF Open LLM Leaderboard v2: huggingface.co/spaces/open-llm-leaderboard
  3. BFCL leaderboard: gorilla.cs.berkeley.edu/leaderboard
  4. HLE leaderboard: lastexam.ai
  5. GAIA leaderboard: huggingface.co/spaces/gaia-benchmark/leaderboard
  6. LiveCodeBench: livecodebench.github.io

From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
