Tau-Bench Retail and Airline: 220 Tasks, Frontier Pass^1 Below 70%
Two customer-service simulations that surface the consistency tax most benchmarks hide.
The Domains
Retail covers a typical consumer e-commerce help desk: order modification, returns, refunds, exchanges, shipping changes. The agent reads a written policy (retail_policy.md is part of the benchmark and is shown to the agent in-context) and must follow it. The simulated user is itself an LLM playing a customer with a goal.
Airline is a tougher version of the same shape. The policies are stricter: refund eligibility depends on fare class, change fees apply, some changes are forbidden, and the agent must look up loyalty-program rules. The action space is larger and the cost of a wrong tool call is higher.
SOTA Progression
The pass^k Insight
Most published agent numbers are best-of-k, also written pass@k: try k times and count the task as solved if any attempt succeeds. Tau-Bench refuses that pattern. pass^k counts a task as solved only if all k independent runs of it succeed, which surfaces consistency failures that best-of-k hides. The 15-point gap between pass^1 and pass^4 for frontier models on Tau-Bench Retail is the number to look at when asking "would I ship this in production?", because production users do not retry.
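To make the difference concrete, here is a minimal sketch of both metrics computed from the same trial outcomes. It assumes the usual combinatorial estimators: for a task with c successes in n trials, pass^k is estimated as C(c, k) / C(n, k) and pass@k as 1 - C(n - c, k) / C(n, k), each averaged over tasks. The function names and the `results` data are hypothetical illustrations, not taken from the Tau-Bench codebase or from any reported run.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k sampled trials of one task ALL succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


def pass_at_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that AT LEAST ONE of k sampled trials succeeds."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return 1.0 - comb(trials - successes, k) / comb(trials, k)


def benchmark_score(results, k, metric) -> float:
    """Average a per-task metric over (successes, trials) pairs."""
    return sum(metric(c, n, k) for c, n in results) / len(results)


# Hypothetical outcomes for four tasks, four trials each: (successes, trials).
results = [(4, 4), (3, 4), (2, 4), (0, 4)]

for k in (1, 2, 4):
    strict = benchmark_score(results, k, pass_hat_k)
    lenient = benchmark_score(results, k, pass_at_k)
    print(f"k={k}  pass^k={strict:.3f}  pass@k={lenient:.3f}")
```

On this toy data the two metrics agree at k=1 (0.5625) and then diverge: pass^4 falls to 0.25 while pass@4 rises to 0.75. Only the task that succeeds on every trial still counts under pass^k, which is the consistency penalty described above.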
Q.01 What does the Tau-Bench Retail domain test?
Q.02 What does the Airline domain test?
Q.03 What is pass^k and why is it different from pass@k?
Q.04 What are the headline 2026 numbers?
Q.05 How is Tau-Bench different from BFCL?
Sources
- [1] Yao et al. (2024): arxiv.org/abs/2406.12045
- [2] Tau-Bench repository: github.com/sierra-research/tau-bench
- [3] Sierra announcement: sierra.ai/blog/benchmarking-ai-agents