Abstract

What: Estimated dollar cost to run 12 popular agent benchmarks per frontier model, single pass@1 run.

Pricing source: Anthropic and OpenAI public pricing, May 2026.

Reasoning mode: Included; cost drops 40-70% if disabled (accuracy also drops).

Section IV.vii Methodology | Last verified April 2026

Agent Eval Cost Calculator: Estimate $ to Run SWE-bench, GAIA, TauBench

Order-of-magnitude estimates of what each benchmark costs to run end-to-end at frontier-model pricing.

Reference Cost Table

Estimates assume a single pass@1 run with reasoning mode enabled. Sonnet 4.7 is priced at $3 per million input tokens and $15 per million output tokens (including reasoning tokens). GPT-5 is priced at $2.50 per million input tokens and $12 per million output tokens. Token columns are per-task averages; the dollar columns are the total cost of one full benchmark run.

| Benchmark | Tasks | Avg input tokens | Avg output tokens | Sonnet 4.7 | GPT-5 |
|---|---|---|---|---|---|
| HumanEval | 164 | 1,500 | 500 | $1.20 | $0.95 |
| MMLU-Pro | 12,032 | 1,200 | 400 | $65 | $52 |
| MATH-500 | 500 | 1,000 | 4,000 | $24 | $19 |
| GPQA-Diamond | 198 | 1,200 | 6,000 | $14 | $11 |
| SWE-bench Verified | 500 | 90,000 | 20,000 | $2,400 | $1,900 |
| GAIA | 466 | 25,000 | 8,000 | $380 | $300 |
| WebArena | 812 | 60,000 | 15,000 | $1,500 | $1,180 |
| Tau-Bench Retail (pass@1) | 115 | 18,000 | 6,000 | $80 | $63 |
| Tau-Bench Airline (pass@1) | 105 | 22,000 | 7,000 | $92 | $72 |
| BFCL v3 (all categories) | 2,000 | 3,500 | 800 | $38 | $30 |
| AssistantBench | 214 | 30,000 | 7,000 | $180 | $142 |
| AppWorld | 750 | 28,000 | 9,000 | $650 | $510 |

DIY Formula

For your own model and benchmark, the formula is straightforward: cost = tasks × (average input tokens × input price per million + average output tokens × output price per million) / 1,000,000. Pass^k or pass@k runs multiply the cost by k. Multi-seed runs multiply it by the number of seeds. Best-of-N runs multiply the model-side cost by N and then add a verifier pass (usually cheap) at the end.
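The formula can be sketched as a small Python function. This is a minimal illustration, not an official calculator: the function name `estimate_run_cost`, the `repeats` parameter, and the example benchmark (200 tasks, 10,000 input and 2,000 output tokens per task) are assumptions made up for the example. The $3 and $15 per-million prices are the Sonnet 4.7 figures quoted above.

```python
def estimate_run_cost(tasks, avg_in_tok, avg_out_tok,
                      in_price_per_m, out_price_per_m, repeats=1):
    """Dollar cost of a benchmark run from per-task token averages.

    repeats stands in for k (pass@k / pass^k), the number of seeds,
    or N (best-of-N). Verifier cost for best-of-N is NOT included.
    """
    input_cost = tasks * avg_in_tok * in_price_per_m
    output_cost = tasks * avg_out_tok * out_price_per_m
    # Prices are per million tokens, so divide once at the end.
    return repeats * (input_cost + output_cost) / 1_000_000


# Hypothetical benchmark, priced at $3 / $15 per million tokens.
single = estimate_run_cost(200, 10_000, 2_000, 3.0, 15.0)
pass_at_5 = estimate_run_cost(200, 10_000, 2_000, 3.0, 15.0, repeats=5)
print(f"single run: ${single:.2f}, pass@5: ${pass_at_5:.2f}")
```

For a best-of-N run, add the verifier's own token cost on top of the returned value; how large that is depends on the verifier you use.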


Reading The Table

SWE-bench Verified, WebArena, and AppWorld dominate the cost budget for agent benchmarking because they involve large per-task context and long agent trajectories. MMLU-Pro is expensive in aggregate because it has about 12,000 tasks (12,032), even though each task is cheap. GPQA-Diamond is small in task count but burns reasoning tokens on hard problems. The total cost ladder roughly tracks task count times average tokens per task, with reasoning-mode multipliers applying on reasoning-heavy benchmarks.
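One way to see the difference between "expensive per task" and "expensive in aggregate" is to divide each total by its task count. The sketch below does that for a handful of rows, using the task counts and Sonnet 4.7 totals from the reference table; the variable names are illustrative only.

```python
# (tasks, Sonnet 4.7 total in dollars), copied from the reference table.
table = {
    "SWE-bench Verified": (500, 2400),
    "WebArena": (812, 1500),
    "AppWorld": (750, 650),
    "MMLU-Pro": (12032, 65),
    "GPQA-Diamond": (198, 14),
}

# Per-task cost = total run cost / number of tasks.
per_task = {name: total / tasks for name, (tasks, total) in table.items()}

# Print from most to least expensive per task.
for name, cost in sorted(per_task.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} ${cost:.4f} per task")
```

SWE-bench Verified comes out at $4.80 per task, while MMLU-Pro is well under one cent per task despite its $65 total.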


Reader Questions

Q.01 How are these cost estimates calculated?

For each benchmark we estimate the average input and output token count per task by reviewing the public eval-run logs released by the benchmark maintainers and by reproducing 50 sample tasks per benchmark with a frontier model and recording actuals. We multiply by the public per-million-token pricing for each model. The estimates are accurate within roughly a factor of 1.5 for any given run; your actuals will depend on harness, retry strategy, and reasoning-mode settings.
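As a rough illustration of that procedure, the sketch below averages recorded token counts from a few sample tasks and applies the factor-of-1.5 band to a point estimate. The sample numbers and the helper name `uncertainty_band` are hypothetical; they are not taken from the actual reproduction logs.

```python
from statistics import mean

# Hypothetical (input_tokens, output_tokens) actuals from sample tasks.
samples = [(24_000, 7_500), (26_500, 8_200), (23_000, 9_100), (27_500, 7_200)]

avg_in = mean(s[0] for s in samples)
avg_out = mean(s[1] for s in samples)


def uncertainty_band(estimate, factor=1.5):
    """Return the (low, high) range implied by a multiplicative error factor."""
    return estimate / factor, estimate * factor


# Band around a $300 point estimate.
low, high = uncertainty_band(300.0)
print(f"avg in: {avg_in}, avg out: {avg_out}, range: ${low:.0f} to ${high:.0f}")
```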

Q.02 Why is SWE-bench Verified so expensive?

Two reasons. First, the agent reads several files of repository context, often tens of thousands of tokens per task. Second, the agent commonly iterates: read, plan, edit, run tests, debug, re-edit. A pass on a single SWE-bench Verified task averages 80,000 to 250,000 tokens in our reproduction. Multiply by 500 tasks and frontier-model pricing and a full run lands in the low thousands of dollars.

Q.03 Are reasoning-mode tokens included?

Yes. The figures include reasoning tokens (extended thinking, hidden CoT) for models that have them. Reasoning tokens are billed at the same per-token rate as output tokens on Anthropic and OpenAI pricing as of May 2026. Disabling thinking lowers cost by roughly 40 to 70% for the same task at the cost of 5 to 20 points of accuracy depending on the benchmark.
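To turn the 40 to 70% saving into a dollar range, a small helper is enough. The function name and its defaults are assumptions for illustration; the $380 input is the GAIA figure for Sonnet 4.7 from the reference table.

```python
def cost_without_reasoning(cost_with_reasoning, min_saving=0.40, max_saving=0.70):
    """Return the (low, high) dollar cost after disabling reasoning mode."""
    low = cost_with_reasoning * (1 - max_saving)   # best case: 70% saved
    high = cost_with_reasoning * (1 - min_saving)  # worst case: 40% saved
    # Round to cents to hide floating-point noise.
    return round(low, 2), round(high, 2)


low, high = cost_without_reasoning(380)  # GAIA, Sonnet 4.7
print(f"reasoning off: ${low:.0f} to ${high:.0f}")
```

The range says nothing about the accuracy lost; that has to be measured per benchmark.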

Q.04 What is the cheapest way to run a benchmark for regression testing?

Pick a small representative subset (10 to 25 tasks), pin a fixed seed, run greedy decoding, and disable reasoning mode where applicable. This produces a cheap, reproducible signal for change detection in CI. The price drops to roughly 2 to 5% of a full run. Promote to a full run only when the signal moves enough to matter.
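A minimal sketch of the subset step, assuming task ids are available as a list of strings. The ids, the subset size of 20, and the function name `regression_subset` are hypothetical. Random sampling is only one way to choose a subset and does not guarantee that it is representative.

```python
import random


def regression_subset(task_ids, k=20, seed=0):
    """Pick a fixed, reproducible subset of k task ids."""
    rng = random.Random(seed)  # private RNG, leaves the global state untouched
    # Sort first so the result does not depend on the input order.
    return sorted(rng.sample(sorted(task_ids), k))


task_ids = [f"task-{i:03d}" for i in range(500)]  # hypothetical ids
subset = regression_subset(task_ids, k=20, seed=0)
fraction = len(subset) / len(task_ids)
print(f"{len(subset)} tasks, {fraction:.0%} of a full run")
```

Committing the resulting id list to the repository, rather than re-sampling in CI, keeps the subset stable across language and library versions.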

Q.05 Do these numbers include API throttling or retry?

No. Per-token cost only. Retries on transient errors typically add 1 to 5% to total cost; rate-limit backoff adds wall-clock time but not dollar cost. Anthropic and OpenAI rate limits as of May 2026 are sufficient for full-benchmark runs without throttling at the Tier 4 and above account levels.

Sources

[1] Anthropic pricing: anthropic.com/pricing
[2] OpenAI pricing: openai.com/api/pricing
[3] SWE-bench leaderboard: swebench.com
[4] GAIA leaderboard: huggingface.co/spaces/gaia-benchmark/leaderboard