Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
Cheapest: GPQA-Diamond at $5-15 on a frontier model (single-shot)
Most expensive: SWE-bench Verified with a strong scaffold at $500-2,000 per run
Cost driver #1: Tokens per task (knowledge: 1-5K vs agent: 30-200K)
Self-hosted: 10-100x cheaper for development; 15-25 point capability gap to frontier
Section III.iii · Practitioner Reference | Last verified May 2026

Cost Per Eval: What Each Benchmark Costs to Run in 2026

The reference nobody publishes but everyone needs. What it actually costs to run each major benchmark at 2026 frontier-model prices, where the cost goes, how to budget for a launch cycle, and when self-hosting on open weights pays for itself. Triangulated estimates with the assumptions made explicit.

01

The cost-per-eval question and why the answer is hard

Asking "what does it cost to run benchmark X" sounds straightforward but is in practice surprisingly hard. The answer depends on the underlying model's per-token price (which varies by 10-100x across providers and tiers), the harness configuration (single-shot vs agentic scaffold), the number of repeat rollouts (pass@1 vs pass@4 vs pass@n), the tool execution overhead (hosted browsing, code interpreter, computer use), and several smaller factors. Vendors rarely publish their per-eval costs because the numbers are commercially sensitive and methodology-dependent.

The estimates below are triangulated from three sources: published per-token API prices for frontier models (Anthropic, OpenAI, Google) as of May 2026, harness configurations described in published benchmark submissions and papers, and direct measurement on a subset of benchmarks where we have run evaluations ourselves. Each cost is a range rather than a point estimate because the variation across configurations is genuinely 2-3x. We have made the assumptions explicit in the table below; treat the numbers as planning estimates rather than precise quotes.

The headline pattern: knowledge benchmarks (MMLU-Pro, GPQA-Diamond, HumanEval) cost tens of dollars to run on a frontier model. Agent benchmarks (SWE-bench Verified, GAIA, OSWorld, WebArena) cost hundreds to thousands of dollars. The 100x cost gap reflects fundamental differences in how the benchmarks work: knowledge benchmarks ask one question and get one answer; agent benchmarks orchestrate multi-step trajectories with many model calls per task.

02

Cost reference table

The table below summarises per-evaluation cost estimates for twelve major benchmarks at two cost tiers: frontier-model API pricing (Claude 4-class, GPT-5-class, Gemini 2-class) and self-hosted open-weight model pricing (Llama-3-70B-class on adequate GPU infrastructure). Costs assume a single complete evaluation run with reasonable production-comparable harness configurations.

Benchmark | Tasks | Tokens/task | Frontier API | Self-hosted
MMLU-Pro (CoT, 5-shot) | 12,000 | ~3K | $10-30 | $0.50-2
GPQA-Diamond (CoT) | 198 | ~5K | $5-15 | ~$0.10
HLE (CoT, reasoning model) | ~2,500 | ~10K (extended thinking) | $50-200 | $2-5
HumanEval (pass@1) | 164 | ~2K | $3-8 | ~$0.05
LiveCodeBench (per window) | ~150-300 | ~3K | $10-30 | $0.20-0.50
MATH (pass@1) | 5,000 (test set) | ~4K | $20-60 | $0.50-2
SWE-bench Verified (agentic) | 500 | ~80K | $500-2,000 | $10-50
GAIA (test set, with tools) | 301 | ~50K (multi-tool) | $300-1,200 | $8-30
OSWorld (with vision scaffold) | 369 | ~60K | $500-1,500 | $15-50 (requires a VL model)
Tau-Bench (pass@4) | ~250 (4 rollouts each) | ~30K (multi-turn) | $200-800 | $5-20
WebArena (with scaffold) | 812 | ~40K | $400-1,500 | $10-40
Visual WebArena (multimodal) | 910 | ~50K (with screenshots) | $600-2,000 | $15-60

The cost ranges reflect harness sensitivity (the same benchmark with a stronger or weaker scaffold can differ in cost by 2-3x), pass@n inflation (pass@4 quadruples cost over pass@1), and per-token price variation across providers. The numbers above assume reasonable production-comparable configurations; vendor-published frontier numbers often use more expensive configurations (extended scaffolding, larger n in pass@n, longer reasoning budgets) that can push individual benchmark costs 4-16x above these ranges.

03

Cost drivers

Six factors drive evaluation cost. Understanding the relative weight of each is essential for budgeting and for choosing benchmark configurations that fit a given budget.

Tokens per task: Directly proportional to per-task cost. Knowledge benchmarks: 1-5K tokens per question. Agent benchmarks: 30-200K tokens per task across multi-step trajectories. The biggest single cost driver.
Best-of-n inference: Multiplies cost by n. Pass@4 quadruples cost over pass@1. Some published agent scores use best-of-16 or best-of-32 with selection, multiplying cost by 16-32x.
Vision and multimodal: Vision inputs cost more per token than text. OSWorld and Visual WebArena are 30-80 percent more expensive than equivalent text-only benchmarks at the same trajectory length.
Reasoning / extended thinking: Reasoning models with extended thinking budgets generate 2-10x more tokens than non-reasoning models on the same problem. HLE with reasoning models is the clearest case.
Verification loops: Production agentic scaffolds include verification (re-run, check, sanity-test) that adds 30-100 percent overhead on top of the base trajectory cost.
Tool execution overhead: Hosted tools (browsing, code interpreter, computer use) add per-call costs separately from token costs; OpenAI's Code Interpreter, for example, charges per session minute.

Tokens per task is the dominant driver for any single benchmark. Best-of-n inference is the dominant driver across configurations. Vision and reasoning each add 30-100 percent depending on usage. Verification loops and tool execution overhead add modest but compounding cost on top. The honest budgeting approach is to estimate token-per-task volume first, multiply by per-token price and number of tasks, then add multiplicative overheads for each additional cost driver in your specific configuration.
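A minimal sketch of that budgeting arithmetic in Python. The structure mirrors the drivers above; every default value is a placeholder to be replaced with your provider's current prices and your harness's measured overheads, not a quoted figure.

```python
def estimate_eval_cost(
    tasks: int,
    tokens_per_task: float,
    price_per_million_tokens: float,    # blended input+output price, USD
    n_rollouts: int = 1,                # pass@n / best-of-n multiplies token volume by n
    vision_overhead: float = 0.0,       # e.g. 0.3-0.8 for screenshot-heavy benchmarks
    reasoning_overhead: float = 0.0,    # e.g. 1.0-9.0 for extended thinking (2-10x tokens)
    verification_overhead: float = 0.0, # e.g. 0.3-1.0 for production scaffolds
    tool_cost_per_task: float = 0.0,    # per-call tool charges billed outside token pricing
) -> float:
    """Planning estimate only: token volume x price, then multiplicative overheads."""
    base_tokens = tasks * n_rollouts * tokens_per_task
    multiplier = (1 + vision_overhead) * (1 + reasoning_overhead) * (1 + verification_overhead)
    token_cost = base_tokens * multiplier * price_per_million_tokens / 1_000_000
    tool_cost = tasks * n_rollouts * tool_cost_per_task
    return token_cost + tool_cost
```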

04

Knowledge benchmarks: cheap and reliable

Knowledge benchmarks (MMLU-Pro, GPQA-Diamond, HLE, HumanEval, MATH, BIG-Bench Hard) are cheap because each task is a single question with a single expected answer. Token counts per task range from 1K to 10K depending on chain-of-thought length and reasoning configuration, and total task counts are modest (HLE 2,500, MMLU-Pro 12,000, others smaller). On a frontier model, each of these benchmarks costs tens of dollars to run.

The exception within knowledge benchmarks is HLE with reasoning models and extended thinking budgets. A reasoning model with a generous thinking budget can generate 50K+ tokens per HLE question (most of it discarded reasoning, with a final short answer), which pushes per-task cost 5-10x above non-reasoning configurations. HLE with extended thinking can cost $200-1,000 to run on a frontier reasoning model; without extended thinking it costs $20-50.

For most evaluation purposes, running the headline knowledge benchmarks on a frontier model costs less than $200 total for the full set (MMLU-Pro, GPQA-Diamond, HLE without extended thinking, HumanEval, LiveCodeBench, MATH). This is well within reach of any team running a model evaluation; the cost is rarely the limiting factor for knowledge benchmark coverage.
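As a sanity check on that figure, summing the frontier-API ranges from the reference table (using the $20-50 no-extended-thinking figure for HLE quoted above) lands comfortably under $200:

```python
# Frontier-API cost ranges (USD) from the reference table in section 02; HLE uses
# the no-extended-thinking figure of $20-50 from the previous paragraph.
knowledge_set = {
    "MMLU-Pro": (10, 30),
    "GPQA-Diamond": (5, 15),
    "HLE (no extended thinking)": (20, 50),
    "HumanEval": (3, 8),
    "LiveCodeBench": (10, 30),
    "MATH": (20, 60),
}
low = sum(lo for lo, _ in knowledge_set.values())
high = sum(hi for _, hi in knowledge_set.values())
print(f"Headline knowledge benchmarks: ${low}-{high}")  # $68-193
```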

05

Agent benchmarks: where the budget goes

Agent benchmarks dominate evaluation budgets. SWE-bench Verified with strong agentic scaffolding can cost $500-2,000 for a single full run; with pass@4 inflation, that becomes $2,000-8,000 for a single confidence-interval-grade evaluation. GAIA, OSWorld, and Visual WebArena sit in a similar range; WebArena costs slightly less, and Tau-Bench falls in the middle.

The cost concentration is at the intersection of high token-per-task volume (50-200K) and high task count (300-900 tasks). A back-of-envelope calculation: 500 tasks at 100K tokens each and a frontier-model price of $10/million tokens = $500 just for the input + output tokens, before harness overhead, verification loops, and tool execution costs. With strong scaffolds those overheads typically add 50-100 percent, giving the $500-2,000 range we observe.
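The same arithmetic, expressed with the estimate_eval_cost sketch from section 03. The $10/million blended price and the overhead values are the illustrative assumptions from the paragraph above, not measured figures.

```python
# 500 tasks x 100K tokens at $10/million blended tokens = $500 of raw token cost.
base = estimate_eval_cost(tasks=500, tokens_per_task=100_000,
                          price_per_million_tokens=10)            # 500.0

# Strong scaffolds add roughly 50-100% harness/verification overhead.
scaffolded = estimate_eval_cost(tasks=500, tokens_per_task=100_000,
                                price_per_million_tokens=10,
                                verification_overhead=1.0)         # 1000.0

# pass@4 multiplies the whole thing by 4.
pass_at_4 = estimate_eval_cost(tasks=500, tokens_per_task=100_000,
                               price_per_million_tokens=10,
                               n_rollouts=4,
                               verification_overhead=1.0)          # 4000.0
```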

The implication for evaluation budgeting: if you want comprehensive coverage of agent benchmarks (SWE-bench Verified, GAIA, OSWorld, WebArena, Tau-Bench) on a frontier model, budget $5K-15K minimum for single-shot runs and $20K-60K for confidence-interval-grade evaluations with pass@4 or higher. This is well above casual evaluation budgets and is a real reason why the public agent-benchmark leaderboards have fewer entries than the knowledge benchmark leaderboards.

06

When self-hosting open-weight models pays

Self-hosting open-weight models on adequate GPU infrastructure dramatically reduces per-evaluation cost. A SWE-bench Verified run on Llama-3-70B-class hardware costs primarily GPU time (around $5-30 for a complete evaluation on adequate hardware) rather than per-token API fees. The same pattern applies to other expensive benchmarks: GAIA, OSWorld, WebArena all become 10-100x cheaper on self-hosted infrastructure.

The trade-offs are real. Open-weight model performance trails frontier closed-source by 15-25 points on most agent benchmarks; self-hosted infrastructure requires significant operational expertise; throughput on a single GPU node is much lower than parallel API calls, so a benchmark run that takes 30 minutes on the API might take 6-12 hours self-hosted. For development-time evaluation where iteration speed matters more than absolute capability ceiling, self-hosting is much cheaper. For final-evaluation comparison against frontier numbers, API costs are usually unavoidable.
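To make the development-time trade-off concrete, here is a rough comparison for a hypothetical harness-debugging cycle. The per-run dollar figures are mid-range estimates from this section; the iteration count and wall-clock numbers are illustrative assumptions, not measurements.

```python
# Hypothetical development cycle: repeated agent-benchmark runs while debugging a harness.
n_iterations = 20                 # assumed number of full debug runs (illustrative)
api_cost_per_run = 1_000          # mid-range frontier-API SWE-bench Verified run (USD)
selfhosted_cost_per_run = 20      # mid-range GPU-time estimate from this section (USD)
api_hours_per_run = 0.5           # parallel API calls finish in roughly 30 minutes
selfhosted_hours_per_run = 9      # single-node throughput: roughly 6-12 hours per run

print(f"API:         ${n_iterations * api_cost_per_run:,} "
      f"over ~{n_iterations * api_hours_per_run:.0f} h of wall clock")
print(f"Self-hosted: ${n_iterations * selfhosted_cost_per_run:,} "
      f"over ~{n_iterations * selfhosted_hours_per_run:.0f} h of wall clock")
# API:         $20,000 over ~10 h of wall clock
# Self-hosted: $400 over ~180 h of wall clock
```

The inversion is the whole trade: self-hosting saves dollars at the price of iteration latency.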

A common production pattern: develop and iterate against a self-hosted open-weight model to validate the harness and pipeline, then run the final headline numbers on a frontier API model for the publishable comparison. The development cost is low; the publication cost is irreducible.

07

Budgeting for a model-launch evaluation cycle

A reasonable budget for evaluating a new frontier model on the headline 2026 benchmarks is $10K-50K depending on harness sophistication. The high end ($50K) supports comprehensive coverage with pass@4 confidence intervals, strong agentic scaffolds, and reasoning-model extended thinking on HLE. The low end ($10K) supports headline single-shot numbers on the major benchmarks without the confidence-interval rigor.

The cheapest viable comprehensive evaluation in 2026 is around $5K with hand-tuned single-shot configurations. This covers the headline knowledge benchmarks ($200), the major agent benchmarks at minimum-viable harness ($3-4K), and Chatbot Arena participation (free, but accumulating votes takes weeks). A serious public model release usually invests at the higher end of the range; a research paper investigating a specific capability slice can usually invest much less.
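A minimal budget worksheet along those lines, summing the frontier-API ranges from the reference table in section 02. The rollout count is the only knob here; the extended-thinking, scaffold-strength, and rerun costs discussed above are what push the total toward the $50K ceiling.

```python
# Frontier-API ranges (USD) from the reference table; the knowledge-set total is the
# under-$200 figure from section 04.
knowledge_set_total = (70, 200)
agent_suite = {
    "SWE-bench Verified": (500, 2_000),
    "GAIA": (300, 1_200),
    "OSWorld": (500, 1_500),
    "Tau-Bench": (200, 800),   # table range already assumes pass@4, so this over-counts slightly
    "WebArena": (400, 1_500),
    "Visual WebArena": (600, 2_000),
}
rollouts = 4  # pass@4 for confidence-interval-grade agent runs; use 1 for single-shot headlines
low = knowledge_set_total[0] + rollouts * sum(lo for lo, _ in agent_suite.values())
high = knowledge_set_total[1] + rollouts * sum(hi for _, hi in agent_suite.values())
print(f"Launch-cycle estimate: ${low:,}-${high:,}")  # $10,070-$36,200 at pass@4
```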

Whatever the budget, the most important practice is to disclose the configuration alongside the numbers. A $1,500 SWE-bench Verified evaluation and a $15,000 SWE-bench Verified evaluation will produce different scores using the same model; the first is honest if disclosed as "single-shot, baseline harness" and the second is honest if disclosed as "pass@4, strong scaffold with verification loops". The harness sensitivity is real and unavoidable; honest reporting is the only way to make benchmark scores comparable.

Editor's verdict: Knowledge benchmarks cost tens of dollars; agent benchmarks cost hundreds to thousands. Best-of-n inference and strong scaffolds are the dominant cost drivers. Self-hosting open weights pays for development; frontier APIs pay for publication. Budget $10K-50K for comprehensive 2026 model-launch evaluation.
Reader Questions
Q.01 How much does it cost to run a single benchmark in 2026?
Range varies enormously. The cheapest knowledge benchmarks (MMLU-Pro, GPQA-Diamond) cost $5-30 to run on a frontier model with single-shot inference. The mid-cost benchmarks (HumanEval, MATH) cost $20-80. The expensive agent benchmarks (SWE-bench Verified, GAIA, OSWorld with strong scaffolds) cost $200-2,000 for a single full run because they require many tokens per task across long agentic trajectories. Running an agent benchmark with best-of-n inference can multiply the cost by 4 to 16.
Q.02 Why are agent benchmarks so much more expensive?
Three reasons. First, token volume: an agentic trajectory on SWE-bench Verified typically uses 50-200K tokens per task across read-think-act cycles, versus 1-5K tokens for a single MMLU-Pro question. Second, multi-step harnesses: agents make many model calls per task, each with prompt overhead. Third, tool calls and verification: production agentic scaffolds include verification loops that re-run model calls to check work, adding 30-100 percent overhead on top of the base trajectory. The combination of these three factors makes agent benchmarks 100-1000x more expensive per task than knowledge benchmarks.
Q.03 Should I run benchmarks myself or trust published numbers?
It depends on the stakes. For ranking models for production deployment in your specific domain, running your own evaluation on your own data matters more than running public benchmarks. For general capability comparison or for sanity-checking vendor claims, the published numbers are usually correct (when reported with disclosed harnesses) and running them yourself is mostly redundant. The exception is when you suspect the published number used a configuration you cannot reproduce in production (best-of-16 with extended scaffolding, for example); in that case running the cheaper baseline configuration yourself gives a more honest comparison.
Q.04 Are there free ways to evaluate models?
Yes for some benchmarks, no for others. Some providers' free-tier API quotas are sufficient to run MMLU-Pro and similar academic benchmarks on small open-weight models. Hosted benchmarks (HumanEval on Replit, GAIA on Hugging Face) sometimes provide free evaluation runs for new models. For frontier closed-source models (Claude, GPT, Gemini), most benchmarks require paid API access. The cheapest path to comprehensive evaluation in 2026 is to use the published vendor numbers for headline comparisons and reserve your eval budget for evaluation on your own data.
Q.05 What about open-weight models and self-hosted inference?
Self-hosted inference on open-weight models can dramatically reduce per-eval cost. A SWE-bench Verified run on a self-hosted Llama-3-70B-class model costs primarily the GPU time (around $5-30 for a full evaluation run on adequate hardware) rather than per-token API costs. The trade-off: self-hosting requires infrastructure expertise and the open-weight model performance gap to frontier closed-source models is real (typically 15-25 points on SWE-bench Verified). For development-time evaluation where capability ceiling matters less than iteration speed, self-hosted is much cheaper.
Q.06 How do I budget for benchmark evaluation in a model-launch cycle?
A reasonable budget for evaluating a new frontier model on the headline benchmarks (MMLU-Pro, GPQA-Diamond, HLE, SWE-bench Verified, LiveCodeBench, GAIA, OSWorld, Tau-Bench, Chatbot Arena participation) is $10K-50K depending on harness sophistication and number of repeat rollouts. The single biggest cost driver is SWE-bench Verified with strong agentic scaffolding ($1K-5K per full run; 3-5 runs needed for confidence intervals). The cheapest comprehensive evaluation budget for a serious model release in 2026 is around $5K with hand-tuned single-shot configurations.

Sources

[1] Anthropic API pricing. anthropic.com/pricing. Accessed May 2026.
[2] OpenAI API pricing. openai.com/api/pricing. Accessed May 2026.
[3] Google Gemini API pricing. ai.google.dev/pricing. Accessed May 2026.
[4] Costs are triangulated estimates based on published per-token prices, benchmark task counts and trajectory lengths described in benchmark papers, and direct measurement on a subset of benchmarks. Treat as planning estimates rather than precise quotes; specific configurations vary by 2-3x in either direction.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
