Agent Eval Cost Calculator: Estimate $ to Run SWE-bench, GAIA, TauBench
Order-of-magnitude estimates of what each benchmark costs to run end-to-end at frontier-model pricing.
Reference Cost Table
Estimates assume a single pass@1 run with reasoning mode enabled. Sonnet 4.7 is priced at $3 per million input tokens and $15 per million output tokens (reasoning tokens are billed as output). GPT-5 is priced at $2.50 per million input tokens and $12 per million output tokens. Numbers below are the total dollar cost for one full benchmark run.
DIY Formula
To estimate the cost for your own model and benchmark, use this formula: cost = (tasks × average input tokens per task × input price per million + tasks × average output tokens per task × output price per million) / 1,000,000. Pass^k and pass@k runs multiply the total by k. Multi-seed runs multiply it by the number of seeds. Best-of-N runs multiply the model-side cost by N and then add the cost of a final verifier pass, which is usually cheap.
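The formula above can be sketched in Python as follows. This is a minimal sketch, not a definitive calculator: the `RunConfig` fields and the `estimate_cost` helper are illustrative names, the token counts in the example are hypothetical, and the verifier cost for best-of-N is left out because it depends on the verifier used.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    tasks: int                 # number of benchmark tasks
    avg_input_tokens: float    # average input tokens per task (whole trajectory)
    avg_output_tokens: float   # average output tokens per task, incl. reasoning
    input_price_per_m: float   # USD per million input tokens
    output_price_per_m: float  # USD per million output tokens
    k: int = 1                 # attempts per task (pass@k, pass^k, or best-of-N)
    seeds: int = 1             # number of independent seeds


def estimate_cost(cfg: RunConfig) -> float:
    """Return the estimated USD cost of a benchmark run (verifier cost excluded)."""
    input_cost = cfg.tasks * cfg.avg_input_tokens * cfg.input_price_per_m
    output_cost = cfg.tasks * cfg.avg_output_tokens * cfg.output_price_per_m
    single_pass = (input_cost + output_cost) / 1_000_000
    return single_pass * cfg.k * cfg.seeds


# Hypothetical workload (made-up token counts): 500 tasks, 200k input and
# 10k output tokens per task, priced at $3 / $15 per million tokens.
example = RunConfig(500, 200_000, 10_000, 3.0, 15.0)
print(f"${estimate_cost(example):,.2f}")  # prints "$375.00"
```

Setting `k=4, seeds=3` on the same hypothetical workload scales the estimate by 12, which shows how quickly repeated-sampling protocols come to dominate the budget.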
Reading The Table
SWE-bench Verified, WebArena, and AppWorld dominate the cost budget for agent benchmarking because they involve large per-task context and long agent trajectories. MMLU-Pro is expensive in aggregate because it has 12,000 tasks, even though each task is cheap. GPQA-Diamond is small in task count but burns reasoning tokens on hard problems. The total cost ladder roughly tracks task count times average tokens per task, with reasoning-mode multipliers applying on reasoning-heavy benchmarks.
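The "task count times average tokens per task" heuristic can be illustrated with a rough ranking sketch. Everything here is an assumption made for illustration: the workload labels, the per-task token counts, and the reasoning multipliers are placeholders rather than measured values. Only the 12,000-task figure for MMLU-Pro comes from the text.

```python
# Placeholder workloads: (tasks, avg tokens per task, reasoning multiplier).
# Token counts and multipliers are made up purely for illustration.
workloads = {
    "many cheap tasks (MMLU-Pro-like)": (12_000, 2_000, 1.0),
    "few long trajectories (agentic)": (500, 300_000, 1.0),
    "few hard reasoning tasks": (200, 5_000, 4.0),
}


def token_volume(tasks: int, avg_tokens: float, reasoning_multiplier: float) -> float:
    """Approximate total tokens: tasks x tokens per task x reasoning multiplier."""
    return tasks * avg_tokens * reasoning_multiplier


# Rank workloads from most to least expensive by approximate token volume.
ranked = sorted(workloads.items(), key=lambda kv: token_volume(*kv[1]), reverse=True)
for name, params in ranked:
    print(f"{name}: {token_volume(*params):,.0f} tokens")
```

With these placeholder numbers the long-trajectory workload ranks first even though it has far fewer tasks, which matches the pattern described above.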
Q.01 How are these cost estimates calculated?
Q.02 Why is SWE-bench Verified so expensive?
Q.03 Are reasoning-mode tokens included?
Q.04 What is the cheapest way to run a benchmark for regression testing?
Q.05 Do these numbers include API throttling or retry?
Sources
- [1] Anthropic pricing: anthropic.com/pricing
- [2] OpenAI pricing: openai.com/api/pricing
- [3] SWE-bench leaderboard: swebench.com
- [4] GAIA leaderboard: huggingface.co/spaces/gaia-benchmark/leaderboard