Abstract

What: 220 tasks across two simulated customer-service domains; policy-grounded dialogue with tool calls.

Who: Sierra Research (Yao, Mantena, Hingmire, Saraf, Schiff, Murty, Yin, Doulkeridis, Wang).

2026 Tier: Retail pass^1 around 75%, Airline around 65%; pass^4 around 15 points lower.

Paper: arxiv.org/abs/2406.12045

Section II.ix Agent Benchmarks | Last verified April 2026

Tau-Bench Retail and Airline: 220 Tasks, Frontier Pass^1 Between 65% and 75%

Two customer-service simulations that surface the consistency tax most benchmarks hide.

The Domains

Retail covers a typical consumer e-commerce help desk: order modification, returns, refunds, exchanges, shipping changes. The agent reads a written policy (retail_policy.md is part of the benchmark and is shown to the agent in-context) and must follow it. The simulated user is itself an LLM playing a customer with a goal.

Airline is a tougher version of the same shape. The policies are stricter: refund eligibility depends on fare class, change fees apply, some changes are forbidden, and the agent must look up loyalty-program rules. The action space is larger and the cost of a wrong tool call is higher.

SOTA Progression

Date | Tier / Score | Note
Jun 2024 | GPT-4o at pass^1 Retail 61.2%, Airline 35.2% | Original Sierra Tau-Bench paper baselines.
Oct 2024 | Claude 3.5 Sonnet at pass^1 Retail 69.2%, Airline 46.0% | Anthropic-reported on the launch evaluation.
Mar 2025 | Frontier reasoning models reach pass^1 Retail 72%, Airline 58% | o1 and Sonnet thinking modes close the consistency gap.
Apr 2026 | Frontier Retail pass^1 around 75%, Airline around 65% | Captured from Sierra public reports.


The pass^k Insight

Most published agent numbers are best-of-k (try k times, take the best), the pattern behind pass@k. Tau-Bench reports pass^k instead, which requires all k runs to succeed and so surfaces consistency failures that best-of-k hides. The gap of 15 or more points between pass^1 and pass^4 for frontier models on Tau-Bench Retail is the number to watch when asking "would I ship this in production?", because production users do not retry.


Reader Questions

Q.01 What does the Tau-Bench Retail domain test?

Tau-Bench Retail (Sierra Research, 2024) is a 115-task simulated retail customer-service environment. The agent talks to a simulated user, must follow company policies on returns, exchanges, refunds, and order modifications, and must call tools (cancel order, issue refund, modify shipping). A task is scored 1 if the final database state matches what policy required, 0 otherwise.
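The 1-or-0 scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own evaluator: the toy database, the order ID `W100`, and the helpers `cancel_order` and `score_episode` are all hypothetical names made up for this example, and the sketch covers only the database-state comparison.

```python
import copy

def score_episode(final_db: dict, expected_db: dict) -> int:
    """Return 1 if the agent left the database in the policy-required state, else 0."""
    return 1 if final_db == expected_db else 0

def cancel_order(db: dict, order_id: str) -> None:
    """Hypothetical tool: mark an order as cancelled in place."""
    db["orders"][order_id]["status"] = "cancelled"

# Hypothetical starting state for one task.
initial_db = {"orders": {"W100": {"status": "pending", "address": "1 Old St"}}}

# The state that policy requires for this task: the order is cancelled.
expected_db = copy.deepcopy(initial_db)
cancel_order(expected_db, "W100")

# An agent that made the right tool call.
agent_db = copy.deepcopy(initial_db)
cancel_order(agent_db, "W100")

# An agent that changed the address instead of cancelling.
wrong_db = copy.deepcopy(initial_db)
wrong_db["orders"]["W100"]["address"] = "2 New St"

print(score_episode(agent_db, expected_db))  # 1
print(score_episode(wrong_db, expected_db))  # 0
```

Comparing end states rather than transcripts means a polite, plausible-sounding conversation still scores 0 if the tool calls left the wrong data behind.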

Q.02 What does the Airline domain test?

Tau-Bench Airline is a 105-task simulated airline customer-service environment. Tasks include booking, change-of-itinerary, cancellation, special-meal requests, and frequent-flyer-program lookups. The policies are stricter than Retail (refunds depend on fare class, change fees apply, some changes are forbidden), which makes Airline harder than Retail for most models.

Q.03 What is pass^k and why is it different from pass@k?

pass^k measures consistency: the same task is run k times, and pass^k is the fraction of tasks where the agent succeeds on all k runs. pass@k (HumanEval style) is the fraction of tasks where the agent succeeds at least once across k runs. pass^k is harsher because it requires reproducible success. For agentic systems, pass^k is the more honest signal because users do not retry until something works.
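A small sketch makes the contrast concrete. It assumes each task was run n times and records how many of those runs succeeded; the function names `pass_hat_k` and `pass_at_k` and the sample counts are illustrative, not taken from the benchmark code. Both functions use the usual combinatorial estimators, which reduce to "all k runs succeed" and "at least one run succeeds" when n equals k.

```python
from math import comb

def pass_hat_k(successes: list[int], n: int, k: int) -> float:
    """pass^k: chance that k runs drawn from the n recorded runs ALL succeed, averaged over tasks."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)

def pass_at_k(successes: list[int], n: int, k: int) -> float:
    """pass@k: chance that AT LEAST ONE of k drawn runs succeeds, averaged over tasks."""
    return sum(1 - comb(n - c, k) / comb(n, k) for c in successes) / len(successes)

# Hypothetical results: four tasks, four runs each, successes per task.
successes = [4, 3, 0, 2]
n = 4

print(pass_hat_k(successes, n, 1))  # 0.5625
print(pass_hat_k(successes, n, 4))  # 0.25
print(pass_at_k(successes, n, 4))   # 0.75
```

On the same runs, pass@4 credits three of the four tasks while pass^4 credits only the one task that never failed, which is why the two metrics can tell very different stories about the same agent.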

Q.04 What are the headline 2026 numbers?

On Tau-Bench Retail, frontier 2026 models reach pass^1 around 75%, pass^4 around 55%. On Airline, pass^1 sits at roughly 65% with pass^4 in the high 40s. The pass^4 to pass^1 gap is the consistency tax: even strong agents are not deterministic. This gap closes faster on Retail than on Airline because Airline's stricter policy surface punishes any tool-call mistake.
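One way to read the consistency tax is to ask what pass^4 would be under different assumptions about where the failures sit. The sketch below is a back-of-the-envelope illustration only: it assumes runs are independent, and the two task mixes are invented for the arithmetic, not fitted to any Tau-Bench data.

```python
def expected_pass_hat_k(groups: list[tuple[float, float]], k: int) -> float:
    """Expected pass^k when each group is (fraction of tasks, per-run success probability)."""
    return sum(fraction * p ** k for fraction, p in groups)

# Assumption A: every task succeeds 75% of the time.
uniform = [(1.0, 0.75)]

# Assumption B: half the tasks always succeed, half succeed 50% of the time.
mixed = [(0.5, 1.0), (0.5, 0.5)]

for name, groups in [("uniform", uniform), ("mixed", mixed)]:
    p1 = expected_pass_hat_k(groups, 1)
    p4 = expected_pass_hat_k(groups, 4)
    print(f"{name}: pass^1={p1:.2f} pass^4={p4:.2f}")
```

Both mixes give pass^1 of 0.75, but pass^4 comes out near 0.32 for the uniform case and near 0.53 for the mixed one. Under these assumptions, a pass^4 in the mid-50s alongside a pass^1 of 75% looks more like a set of reliably solved tasks plus a set of coin-flip tasks than like uniform flakiness.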

Q.05 How is Tau-Bench different from BFCL?

BFCL evaluates atomic function-calling: pick the right tool and fill in the right arguments, in isolation. Tau-Bench evaluates policy-grounded multi-turn dialogue where the agent must combine tool calls with dialogue management, policy adherence, and state tracking. In that sense the Tau-Bench task subsumes the BFCL task: strong BFCL scores are necessary but not sufficient for strong Tau-Bench scores.

Sources

[1] Yao et al. (2024): arxiv.org/abs/2406.12045
[2] Tau-Bench repository: github.com/sierra-research/tau-bench
[3] Sierra announcement: sierra.ai/blog/benchmarking-ai-agents