Tau-Bench: Customer-Service Agents Under Realistic Policy
The first agent benchmark that treats policy adherence as a first-class metric. Retail and airline scenarios, simulated users, tool-using dialogue, and a clear separation of task success from policy compliance. The benchmark that surfaces over-promising agents, missed identity checks, and the failure modes simpler tool-use evals do not catch.
What Tau-Bench measures
Tau-Bench, introduced by Yao et al. at Sierra in June 2024, evaluates dialogue agents on customer-service tasks across two domains: retail and airline. Each task starts with a simulated user request (a return, a cancellation, a question about an account) and the agent must conduct a multi-turn conversation, calling typed API tools against a structured database, until the conversation reaches a terminal state. The agent succeeds when the user's underlying intent is satisfied, the database state reflects the correct outcome, and the trajectory complies with the published policy for the domain.
The benchmark introduces three pieces that previous tool-use benchmarks (BFCL, T-Eval, ToolBench) lacked: a user simulator that drives multi-turn dialogue, explicit per-domain policies that the agent must obey, and stateful scenarios where actions have lasting consequences on the database. These choices are what make Tau-Bench feel closer to real production work than to a synthetic tool-use puzzle. They are also what make the scores harder to beat: an agent that knows which tool to call but breaks policy or stops short of resolution still fails the task.
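To make that structure concrete, here is a minimal sketch of the interaction loop, with hypothetical Agent, UserSimulator, and ToolEnvironment interfaces standing in for the benchmark's actual components (the names and method signatures are illustrative, not taken from the sierra-research/tau-bench codebase):

```python
# Hypothetical sketch of a Tau-Bench-style interaction loop.
# The agent, user simulator, and tool environment are illustrative stand-ins.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str      # "user" or "agent"
    content: str

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)

def run_scenario(agent, user_sim, env, max_turns: int = 30) -> Trajectory:
    """Drive one scenario: the simulated user speaks, the agent responds and
    may call typed API tools that mutate the database, until the user ends
    the conversation or the turn budget runs out."""
    traj = Trajectory()
    user_msg = user_sim.first_message()                  # opening request
    while user_msg is not None and len(traj.turns) < max_turns:
        traj.turns.append(Turn("user", user_msg))
        reply, calls = agent.respond(traj.turns, env.tool_schemas())
        for call in calls:                               # tool calls have lasting DB effects
            result = env.execute(call["name"], call["arguments"])
            traj.tool_calls.append({"call": call, "result": result})
        traj.turns.append(Turn("agent", reply))
        user_msg = user_sim.next_message(reply)          # None once the user is satisfied or out of patience
    return traj
```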
Tau-Bench is the benchmark Sierra built to evaluate its own customer-service agent product, then open-sourced. This origin is visible in the design: the retail policies look like a real e-commerce returns desk, and the airline policies look like a real reservation system. The realism is the benchmark's primary strength and its primary limitation. The retail and airline domains transfer well to similar business workflows; they transfer poorly to (say) clinical decision support or open-ended research.
Two domains: retail and airline
Retail tasks operate on a fictional e-commerce store with products, orders, returns, exchanges, account settings, and reward points. Typical scenarios: "I want to return the shirt I bought last week but I've thrown away the receipt, can you help?" or "I'd like to change the shipping address on order 12345 if it hasn't shipped yet". Retail scenarios average four to seven turns and require three to six tool calls. Database state changes are concrete: a refund processed, an address updated, an order cancelled.
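For flavour, a retail tool might be declared in an OpenAI-style function-calling schema roughly like the sketch below. The tool name, parameters, and description are invented for illustration and do not reproduce the benchmark's actual tool inventory.

```python
# Illustrative typed tool definition. "process_return" and its fields are
# hypothetical examples, not taken from Tau-Bench.
process_return_tool = {
    "name": "process_return",
    "description": (
        "Create a return for items on an existing order and issue a refund "
        "to the original payment method. Policy requires the customer's "
        "identity to be verified earlier in the conversation."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order the items belong to."},
            "item_ids": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Items within the order being returned.",
            },
            "reason": {"type": "string", "description": "Customer-stated reason for the return."},
        },
        "required": ["order_id", "item_ids"],
    },
}
```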
Airline tasks operate on a fictional carrier with itineraries, fare classes, baggage allowances, change fees, seat assignments, and elite status. Typical scenarios: "I need to change my flight from Friday to Sunday but I'm on a non-changeable fare, what are my options?" or "I need to add a checked bag to my booking and add my wife's frequent-flyer number". Airline scenarios average six to ten turns and require five to nine tool calls. The combinatorics of fare rules, baggage allowances, and connected itineraries make airline scenarios harder than retail by a consistent margin.
The retail-to-airline score gap is a stable feature of the leaderboard. Frontier models score around 65 percent on retail and around 45 percent on airline in May 2026. The 20-point gap reflects genuine task complexity, not benchmark noise: airline policies are more numerous, the side effects of each action are larger, and the cost of a single wrong tool call is higher.
The user simulator and what it means for scoring
The user is an LLM, configured with a system prompt that describes the customer's intent, persona, available information, and patience level. The simulator drives the conversation realistically: it answers clarifying questions, sometimes withholds details until prompted, occasionally rejects the agent's first proposed resolution, and ends the dialogue when satisfied or when patience runs out. This is consequential because the simulator's behaviour affects the score. A more cooperative simulator inflates scores; a more demanding one deflates them.
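A simulator configuration of the kind described might look like the sketch below; the field names and prompt wording are illustrative assumptions, not the published GPT-4o system prompt.

```python
# Illustrative user-simulator profile and prompt builder. Structure and
# wording are assumptions; the official harness publishes its own prompt.
user_profile = {
    "intent": "Return a shirt from order W1234 bought last week; receipt discarded.",
    "persona": "Polite but rushed; answers only what is asked.",
    "known_info": {"name": "Ava Chen", "email": "ava@example.com", "order_id": "W1234"},
    "withheld_until_asked": ["order_id"],   # revealed only when the agent asks
    "patience_turns": 8,                    # ends the dialogue if unresolved by then
}

def build_simulator_prompt(profile: dict) -> str:
    return (
        "You are a customer contacting support. "
        f"Your goal: {profile['intent']} "
        f"Persona: {profile['persona']} "
        "Do not volunteer information the agent has not asked for, "
        f"especially: {', '.join(profile['withheld_until_asked'])}. "
        f"If the issue is unresolved after {profile['patience_turns']} of your turns, "
        "politely end the conversation."
    )
```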
The official leaderboard fixes the simulator model to GPT-4o with a published system prompt. Community submissions sometimes use a different simulator, which is a comparability hazard worth disclosing. We treat scores reported with a non-standard simulator as research artefacts rather than direct leaderboard comparisons.
The simulator design is also a source of variance: a single run can be lucky or unlucky depending on the conversational path the simulator drives. This is why most published Tau-Bench scores quote pass@4 (best of four independent rollouts per scenario): single-rollout numbers have higher variance and a single bad run can drop a model's score by 2 to 4 points.
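Under the best-of-k reading of pass@4 used here, the metric reduces to "at least one of k independent rollouts succeeds", averaged over scenarios. A minimal sketch, assuming per-scenario boolean rollout results:

```python
# Sketch: pass@k as best-of-k over independent rollouts, averaged over scenarios.
# rollouts[scenario_id] is a list of booleans, one per rollout.
def pass_at_k(rollouts: dict, k: int = 4) -> float:
    per_scenario = [any(results[:k]) for results in rollouts.values()]
    return sum(per_scenario) / len(per_scenario)

# Example: two scenarios, four rollouts each (scenario IDs are made up).
example = {
    "retail_017": [False, True, False, True],
    "airline_042": [False, False, False, False],
}
print(pass_at_k(example, k=4))   # 0.5: one scenario solved at least once, one never
```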
Policy adherence as a first-class metric
The policy-compliance scoring is the methodological move that most distinguishes Tau-Bench. Each domain ships with a written policy document, and each trajectory is evaluated against it programmatically. A run scores resolution-pass if the user's intent is satisfied and policy-pass if no action violated the policy. The headline score is the conjunction: both must pass for a run to count.
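A minimal sketch of that conjunction, assuming a gold final database state per scenario and a set of programmatic policy checks applied to the trajectory (the check descriptions are invented for illustration):

```python
# Sketch of the headline score: a rollout counts only if the database ends in
# the expected state AND no policy check flags a violation. The specific
# checks named in comments are hypothetical examples, not the actual rule set.
def resolution_pass(final_db_state: dict, gold_db_state: dict) -> bool:
    # The user's intent is satisfied when the database reflects the correct outcome.
    return final_db_state == gold_db_state

def policy_pass(trajectory, policy_checks) -> bool:
    # Each check inspects the full trajectory, e.g. "identity verified before
    # any sensitive tool call" or "confirmation sent after every cancellation".
    return all(check(trajectory) for check in policy_checks)

def headline_success(trajectory, final_db, gold_db, policy_checks) -> bool:
    return resolution_pass(final_db, gold_db) and policy_pass(trajectory, policy_checks)
```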
Policy violations in published submissions cluster into a handful of recurring categories. The most common single failure mode is skipped identity verification: an agent instructed to be helpful sometimes processes a sensitive action without confirming who it is talking to. The second most common is process incompleteness: cancelling a booking without sending a confirmation, or issuing a refund without updating the inventory record. Both failures look fine in conversation but break the system behind it.
SOTA progression, June 2024 to May 2026
Tau-Bench scores have climbed steadily since launch. Retail is closer to saturation than airline because retail policies are simpler; airline scores are still moving fastest among current frontier models. The benchmark remains one of the cleanest measures of real-world tool-use-plus-policy capability in 2026.
Strengths, limits, and harness sensitivity
The benchmark's strengths are clear: realistic scenarios, separate policy scoring, multi-turn dialogue, well-documented policies, open-source release. The standard configuration is reproducible and the leaderboard accepts third-party submissions. Tau-Bench has become the headline benchmark for customer-service agents in 2026, displacing earlier benchmarks like CB2 and PersuasionForGood that lacked database state or policy scoring.
The limits are also worth knowing. The two domains are narrow: an agent that does well on Tau-Bench has demonstrated competence at retail-and-airline-shaped business workflows, not at customer service in general. Pass@4 inflation is a real concern; the gap between pass@1 and pass@4 can be 8 to 12 points for borderline-capable agents. Sierra has a commercial interest in the benchmark, which it has been transparent about, but the close coupling between the benchmark design and Sierra's product means the test favours agents that think the way Sierra agents think.
Harness sensitivity is moderate. The standard configuration uses native function calling with a fixed tool inventory and a published system prompt. Some submissions wrap the agent in ReAct-style scratchpads, retrieval over policy documents, or planning loops; these typically add 3 to 6 points. The gap is smaller than on WebArena but still meaningful. See our tool-use benchmark comparison for how Tau-Bench fits alongside BFCL and ToolBench.
When to use Tau-Bench in 2026
Use Tau-Bench when your agent: takes API tool calls (not browser actions), conducts multi-turn dialogue with users, operates on stateful systems with measurable side effects, and must obey written policies. That covers a meaningful slice of production agent deployments: customer service, workflow automation, internal IT helpdesk, sales-ops co-pilots. For agents that do not match this shape, look elsewhere: SWE-bench Verified for engineering, WebArena for browser, OSWorld for general computer use, GAIA for assistant-style research tasks.
Sources
- [1] Yao, S. et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045.
- [2] Sierra Research GitHub. github.com/sierra-research/tau-bench. Accessed May 2026.
- [3] Anthropic Claude model card notes on Tau-Bench performance, 2025. anthropic.com/research.