Cost Per Eval: What Each Benchmark Costs to Run in 2026
The reference nobody publishes but everyone needs. What it actually costs to run each major benchmark at 2026 frontier-model prices, where the cost goes, how to budget for a launch cycle, and when self-hosting on open weights pays for itself. Triangulated estimates with the assumptions made explicit.
The cost-per-eval question and why the answer is hard
Asking "what does it cost to run benchmark X" sounds straightforward but is in practice surprisingly hard. The answer depends on the underlying model's per-token price (which varies by 10-100x across providers and tiers), the harness configuration (single-shot vs agentic scaffold), the number of repeat rollouts (pass@1 vs pass@4 vs pass@n), the tool execution overhead (hosted browsing, code interpreter, computer use), and several smaller factors. Vendors rarely publish their per-eval costs because the numbers are commercially sensitive and methodology-dependent.
The estimates below are triangulated from three sources: published per-token API prices for frontier models (Anthropic, OpenAI, Google) as of May 2026, harness configurations described in published benchmark submissions and papers, and direct measurement on a subset of benchmarks where we have run evaluations ourselves. Each cost is a range rather than a point estimate because the variation across configurations is genuinely 2-3x. We have made the assumptions explicit in the table below; treat the numbers as planning estimates rather than precise quotes.
The headline pattern: knowledge benchmarks (MMLU-Pro, GPQA-Diamond, HumanEval) cost tens of dollars to run on a frontier model. Agent benchmarks (SWE-bench Verified, GAIA, OSWorld, WebArena) cost hundreds to thousands of dollars. The 100x cost gap reflects fundamental differences in how the benchmarks work: knowledge benchmarks ask one question and get one answer; agent benchmarks orchestrate multi-step trajectories with many model calls per task.
Cost reference table
The table below summarises per-evaluation cost estimates for twelve major benchmarks at two cost tiers: frontier-model API pricing (Claude 4-class, GPT-5-class, Gemini 2-class) and self-hosted open-weight model pricing (Llama-3-70B-class on adequate GPU infrastructure). Costs assume a single complete evaluation run with reasonable production-comparable harness configurations.
The cost ranges reflect harness sensitivity (the same benchmark with a stronger or weaker scaffold can differ in cost by 2-3x), pass@n inflation (pass@4 quadruples the cost of pass@1), and per-token price variation across providers. Vendor-published frontier numbers often use more expensive configurations (extended scaffolding, larger n in pass@n, longer reasoning budgets) that can push individual benchmark costs 4-16x above these ranges.
Cost drivers
Six factors drive evaluation cost. Understanding the relative weight of each is essential for budgeting and for choosing benchmark configurations that fit a given budget.
Tokens per task is the dominant driver within any single benchmark; best-of-n inference is the dominant multiplier across configurations. Vision and reasoning each add 30-100 percent depending on usage. Verification loops and tool execution overhead add modest but compounding cost on top. The honest budgeting approach is to estimate token volume per task first, multiply by per-token price and number of tasks, then apply a multiplicative overhead for each additional cost driver in your specific configuration.
Knowledge benchmarks: cheap and reliable
Knowledge benchmarks (MMLU-Pro, GPQA-Diamond, HLE, HumanEval, MATH, BIG-Bench Hard) are cheap because each task is a single question with a single expected answer. Token counts per task range from 1K to 10K depending on chain-of-thought length and reasoning configuration. Total task counts are modest (HLE 2,500, MMLU-Pro 12,000, others smaller). Combined cost on a frontier model is in the tens of dollars per benchmark.
The exception within knowledge benchmarks is HLE with reasoning models and extended thinking budgets. A reasoning model with a generous thinking budget can generate 50K+ tokens per HLE question (most of it discarded reasoning, with a final short answer), which pushes per-task cost 5-10x above non-reasoning configurations. HLE with extended thinking can cost $200-1,000 to run on a frontier reasoning model; without extended thinking it costs $20-50.
For most evaluation purposes, running the headline knowledge benchmarks on a frontier model costs less than $200 total for the full set (MMLU-Pro, GPQA-Diamond, HLE without extended thinking, HumanEval, LiveCodeBench, MATH). This is well within reach of any team running a model evaluation; the cost is rarely the limiting factor for knowledge benchmark coverage.
Agent benchmarks: where the budget goes
Agent benchmarks dominate evaluation budgets. SWE-bench Verified with strong agentic scaffolding can cost $500-2,000 for a single full run; with pass@4 inflation, that becomes $2,000-8,000 for a single confidence-interval-grade evaluation. GAIA and OSWorld are similar; WebArena and Visual WebArena slightly less. Tau-Bench is in the middle of the range.
The cost concentration is at the intersection of high token-per-task volume (50-200K) and high task count (300-900 tasks). A back-of-envelope calculation: 500 tasks at 100K tokens each and a frontier-model price of $10/million tokens = $500 just for the input + output tokens, before harness overhead, verification loops, and tool execution costs. With strong scaffolds those overheads typically add 50-100 percent, and variation in token volume and per-token price accounts for the rest, giving the $500-2,000 range we observe.
The implication for evaluation budgeting: if you want comprehensive coverage of agent benchmarks (SWE-bench Verified, GAIA, OSWorld, WebArena, Tau-Bench) on a frontier model, budget $5K-15K minimum for single-shot runs and $20K-60K for confidence-interval-grade evaluations with pass@4 or higher. This is well above casual evaluation budgets and is a real reason why the public agent-benchmark leaderboards have fewer entries than the knowledge benchmark leaderboards.
When self-hosting open-weight models pays
Self-hosting open-weight models on adequate GPU infrastructure dramatically reduces per-evaluation cost. A SWE-bench Verified run on Llama-3-70B-class hardware costs primarily the GPU time (around $5-30 for a complete evaluation on adequate hardware) rather than per-token API costs. The same pattern applies to other expensive benchmarks: GAIA, OSWorld, WebArena all become 10-100x cheaper on self-hosted infrastructure.
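The self-hosted figure follows from GPU wall-clock time rather than token counts. The hourly node rate below is an assumption (cloud rates for a single adequate GPU node vary widely); the 6-12 hour wall-clock range is the one discussed in the trade-offs below:

```python
GPU_NODE_USD_PER_HOUR = 2.0   # assumed rate for one adequate GPU node
for hours in (6, 12):         # plausible wall-clock range for a full run
    cost = hours * GPU_NODE_USD_PER_HOUR
    print(f"{hours}h self-hosted run: ${cost:.0f}")   # $12 and $24
```

Both endpoints land inside the $5-30 band, and against a $500-2,000 frontier API run that is roughly a 20-170x saving, consistent with the 10-100x pattern above.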
The trade-offs are real. Open-weight model performance trails frontier closed-source by 15-25 points on most agent benchmarks; self-hosted infrastructure requires significant operational expertise; throughput on a single GPU node is much lower than parallel API calls, so a benchmark run that takes 30 minutes on the API might take 6-12 hours self-hosted. For development-time evaluation where iteration speed matters more than absolute capability ceiling, self-hosting is much cheaper. For final-evaluation comparison against frontier numbers, API costs are usually unavoidable.
A common production pattern: develop and iterate against a self-hosted open-weight model to validate the harness and pipeline, then run the final headline numbers on a frontier API model for the publishable comparison. The development cost is low; the publication cost is irreducible.
Budgeting for a model-launch evaluation cycle
A reasonable budget for evaluating a new frontier model on the headline 2026 benchmarks is $10K-50K depending on harness sophistication. The high end ($50K) supports comprehensive coverage with pass@4 confidence intervals, strong agentic scaffolds, and reasoning-model extended thinking on HLE. The low end ($10K) supports headline single-shot numbers on the major benchmarks without the confidence-interval rigor.
The cheapest viable comprehensive evaluation in 2026 is around $5K with hand-tuned single-shot configurations. This covers the headline knowledge benchmarks ($200), the major agent benchmarks at minimum-viable harness ($3-4K), and Chatbot Arena participation (free, but requires accumulating votes which takes weeks). A serious public model release usually invests at the higher end of the range; a research paper investigating a specific capability slice can usually invest much less.
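The ~$5K floor can be itemised. The knowledge and agent line items come from the figures above; the harness-development slack is our own assumption added to round out the budget, not a figure from the text:

```python
# Assumed line items for the minimum-viable comprehensive evaluation
minimum_viable_budget = {
    "knowledge benchmarks (full headline set)": 200,
    "agent benchmarks (minimum-viable harness)": 3_500,  # midpoint of $3-4K
    "Chatbot Arena participation": 0,                    # free, but slow
    "harness development and reruns (assumed slack)": 1_000,
}
print(f"${sum(minimum_viable_budget.values()):,}")       # $4,700 -> ~$5K floor
```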
Whatever the budget, the most important practice is to disclose the configuration alongside the numbers. A $1,500 SWE-bench Verified evaluation and a $15,000 SWE-bench Verified evaluation will produce different scores using the same model; the first is honest if disclosed as "single-shot, baseline harness" and the second is honest if disclosed as "pass@4, strong scaffold with verification loops". The harness sensitivity is real and unavoidable; honest reporting is the only way to make benchmark scores comparable.
Sources
- [1] Anthropic API pricing. anthropic.com/pricing. Accessed May 2026.
- [2] OpenAI API pricing. openai.com/api/pricing. Accessed May 2026.
- [3] Google Gemini API pricing. ai.google.dev/pricing. Accessed May 2026.
- [4] Costs are triangulated estimates based on published per-token prices, benchmark task counts and trajectory lengths described in benchmark papers, and direct measurement on a subset of benchmarks. Treat as planning estimates rather than precise quotes; specific configurations vary by 2-3x in either direction.