Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Estimated dollar cost to run 12 popular agent benchmarks per frontier model, single pass-1 run.
Pricing source: Anthropic and OpenAI public pricing, May 2026.
Reasoning mode: Included; cost drops 40-70% if disabled (accuracy also drops).
Anthropic pricing: anthropic.com/pricing
Section IV.vii Methodology | Last verified April 2026

Agent Eval Cost Calculator: Estimate $ to Run SWE-bench, GAIA, TauBench

Order-of-magnitude estimates of what each benchmark costs to run end-to-end at frontier-model pricing.

I

Reference Cost Table

Estimates assume a single pass-1 run with reasoning mode enabled. Sonnet 4.7 is priced at $3 per million input tokens and $15 per million output tokens (including reasoning tokens). GPT-5 is priced at $2.50 per million input tokens and $12 per million output tokens. The Sonnet 4.7 and GPT-5 figures below are the total dollar cost of one full benchmark run; token counts are per-task averages.

| Benchmark | Tasks | Avg input tokens | Avg output tokens | Sonnet 4.7 | GPT-5 |
|---|---|---|---|---|---|
| HumanEval | 164 | 1,500 | 500 | $1.20 | $0.95 |
| MMLU-Pro | 12,032 | 1,200 | 400 | $65 | $52 |
| MATH-500 | 500 | 1,000 | 4,000 | $24 | $19 |
| GPQA-Diamond | 198 | 1,200 | 6,000 | $14 | $11 |
| SWE-bench Verified | 500 | 90,000 | 20,000 | $2,400 | $1,900 |
| GAIA | 466 | 25,000 | 8,000 | $380 | $300 |
| WebArena | 812 | 60,000 | 15,000 | $1,500 | $1,180 |
| Tau-Bench Retail (pass@1) | 115 | 18,000 | 6,000 | $80 | $63 |
| Tau-Bench Airline (pass@1) | 105 | 22,000 | 7,000 | $92 | $72 |
| BFCL v3 (all categories) | 2,000 | 3,500 | 800 | $38 | $30 |
| AssistantBench | 214 | 30,000 | 7,000 | $180 | $142 |
| AppWorld | 750 | 28,000 | 9,000 | $650 | $510 |
II

DIY Formula

To estimate the cost for your own model and benchmark, the formula is straightforward. Cost equals tasks times average input tokens times the input price per million tokens, plus tasks times average output tokens times the output price per million tokens, with the sum divided by one million. For pass^k or pass@k runs, multiply by k. For multi-seed runs, multiply by the number of seeds. For best-of-N runs, multiply the model cost by N and add the cost of the final verifier pass, which is usually cheap.
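The formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `run_cost`, its parameters, and the 100-task benchmark in the usage lines are hypothetical. It assumes the average token counts are the total billed tokens per task, summed over every model call in that task, and it leaves out the verifier cost for best-of-N.

```python
def run_cost(tasks, avg_in_tok, avg_out_tok,
             in_price_per_m, out_price_per_m, repeats=1):
    """Estimated dollar cost of one benchmark run.

    avg_in_tok / avg_out_tok: assumed to be total billed tokens per task,
    summed across all model calls in that task.
    repeats: k for pass@k or pass^k, the number of seeds, or N for best-of-N
    (verifier cost is not included here).
    """
    input_cost = tasks * avg_in_tok * in_price_per_m / 1_000_000
    output_cost = tasks * avg_out_tok * out_price_per_m / 1_000_000
    return repeats * (input_cost + output_cost)


# Hypothetical benchmark: 100 tasks, 10,000 input and 2,000 output tokens
# per task, priced at $3 / $15 per million tokens.
single = run_cost(100, 10_000, 2_000, 3.00, 15.00)
pass_at_4 = run_cost(100, 10_000, 2_000, 3.00, 15.00, repeats=4)

print(f"single run: ${single:.2f}")     # single run: $6.00
print(f"pass@4:     ${pass_at_4:.2f}")  # pass@4:     $24.00
```

Keeping `repeats` as a single multiplier mirrors the text: pass@k, multi-seed, and best-of-N all scale the model-side cost linearly.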

III

Reading The Table

SWE-bench Verified, WebArena, and AppWorld dominate the cost budget for agent benchmarking because they involve large per-task context and long agent trajectories. MMLU-Pro is expensive in aggregate because it has roughly 12,000 tasks, even though each task is cheap. GPQA-Diamond has few tasks but burns reasoning tokens on hard problems. The total cost ladder roughly tracks task count times average tokens per task, with reasoning-mode multipliers applying on reasoning-heavy benchmarks.

Reader Questions
Q.01 How are these cost estimates calculated?
For each benchmark we estimate the average input and output token count per task by reviewing the public eval-run logs released by the benchmark maintainers and by reproducing 50 sample tasks per benchmark with a frontier model and recording actuals. We multiply by the public per-million-token pricing for each model. The estimates are accurate within roughly a factor of 1.5 for any given run; your actuals will depend on harness, retry strategy, and reasoning-mode settings.
Q.02 Why is SWE-bench Verified so expensive?
Two reasons. First, the agent reads several files of repository context, often tens of thousands of tokens per task. Second, the agent commonly iterates: read, plan, edit, run tests, debug, re-edit. A pass on a single SWE-bench Verified task averages 80,000 to 250,000 tokens in our reproduction. Multiply by 500 tasks and frontier-model pricing and a full run lands in the low thousands of dollars.
Q.03 Are reasoning-mode tokens included?
Yes. The figures include reasoning tokens (extended thinking, hidden CoT) for models that have them. Reasoning tokens are billed at the same per-token rate as output tokens on Anthropic and OpenAI pricing as of May 2026. Disabling thinking lowers cost by roughly 40 to 70% for the same task, but accuracy falls by 5 to 20 points depending on the benchmark.
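As a rough worked example of the 40 to 70% range, the snippet below turns a reasoning-enabled cost into a low and high estimate with reasoning disabled. The function name `reasoning_off_range` and the $1,000 starting figure are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def reasoning_off_range(cost_with_reasoning, min_drop=0.40, max_drop=0.70):
    """Return (low, high) dollar estimates with reasoning mode disabled.

    min_drop / max_drop are the fractional savings quoted in the text.
    """
    low = cost_with_reasoning * (1 - max_drop)   # largest saving
    high = cost_with_reasoning * (1 - min_drop)  # smallest saving
    return low, high


# Hypothetical run that costs $1,000 with reasoning enabled.
low, high = reasoning_off_range(1000.0)
print(f"${low:.0f} to ${high:.0f}")  # $300 to $600
```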
Q.04 What is the cheapest way to run a benchmark for regression testing?
Pick a small representative subset (10 to 25 tasks), pin a fixed seed, run greedy decoding, and disable reasoning mode where applicable. This produces a cheap, reproducible signal for change detection in CI. The price drops to roughly 2 to 5% of a full run. Promote to a full run only when the signal moves enough to matter.
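One way to make the subset itself reproducible is to draw it with a seeded random generator, so every CI run evaluates the same tasks. The sketch below assumes a plain list of task IDs; the `pick_subset` helper and the `task-0000` style IDs are hypothetical. Uniform sampling is only one reading of "representative"; stratifying by task category may suit some benchmarks better.

```python
import random


def pick_subset(task_ids, size=20, seed=0):
    """Draw a fixed-size subset of tasks, reproducible for a given seed."""
    rng = random.Random(seed)  # local generator; global random state untouched
    return sorted(rng.sample(list(task_ids), size))


# Hypothetical 500-task benchmark.
task_ids = [f"task-{i:04d}" for i in range(500)]
subset = pick_subset(task_ids)

# The same seed yields the same tasks on every run.
assert subset == pick_subset(task_ids)
print(f"{len(subset)} of {len(task_ids)} tasks ({len(subset) / len(task_ids):.0%})")
```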
Q.05 Do these numbers include API throttling or retry?
No. Per-token cost only. Retries on transient errors typically add 1 to 5% to total cost; rate-limit backoff adds wall-clock time but not dollar cost. Anthropic and OpenAI rate limits as of May 2026 are sufficient for full-benchmark runs without throttling at the Tier 4 and above account levels.

Sources

  [1] Anthropic pricing: anthropic.com/pricing
  [2] OpenAI pricing: openai.com/api/pricing
  [3] SWE-bench leaderboard: swebench.com
  [4] GAIA leaderboard: huggingface.co/spaces/gaia-benchmark/leaderboard
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
