Tau-Bench: Customer-Service Agents Under Realistic Policy
The first agent benchmark that treats policy adherence as a first-class metric. Retail and airline scenarios, simulated users, tool-using dialogue, and a clear separation of task success from policy compliance. The benchmark that surfaces over-promising agents, missed identity checks, and the failure modes simpler tool-use evals do not catch.
What Tau-Bench measures
Tau-Bench, introduced by Yao et al. at Sierra in June 2024, evaluates dialogue agents on customer-service tasks across two domains: retail and airline. Each task starts with a simulated user request (a return, a cancellation, a question about an account) and the agent must conduct a multi-turn conversation, calling typed API tools against a structured database, until the conversation reaches a terminal state. The agent succeeds when the user's underlying intent is satisfied, the database state reflects the correct outcome, and the trajectory complies with the published policy for the domain.
The benchmark introduces three pieces that previous tool-use benchmarks (BFCL, T-Eval, ToolBench) lacked: a user simulator that drives multi-turn dialogue, explicit per-domain policies that the agent must obey, and stateful scenarios where actions have lasting consequences on the database. These choices are what make Tau-Bench feel closer to real production work than to a synthetic tool-use puzzle. They are also what make the scores harder to beat: an agent that knows which tool to call but breaks policy or stops short of resolution still fails the task.
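To make that structure concrete, here is a minimal sketch of the interaction loop, with hypothetical Agent, UserSimulator, and ToolEnvironment interfaces standing in for the benchmark's actual components (the names and method signatures are illustrative, not taken from the sierra-research/tau-bench codebase):

```python
# Hypothetical sketch of a Tau-Bench-style interaction loop.
# The agent, user simulator, and tool environment are illustrative stand-ins.
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str      # "user" or "agent"
    content: str

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)

def run_scenario(agent, user_sim, env, max_turns: int = 30) -> Trajectory:
    """Drive one scenario: the simulated user speaks, the agent responds and
    may call typed API tools that mutate the database, until the user ends
    the conversation or the turn budget runs out."""
    traj = Trajectory()
    user_msg = user_sim.first_message()                  # opening request
    while user_msg is not None and len(traj.turns) < max_turns:
        traj.turns.append(Turn("user", user_msg))
        reply, calls = agent.respond(traj.turns, env.tool_schemas())
        for call in calls:                               # tool calls have lasting DB effects
            result = env.execute(call["name"], call["arguments"])
            traj.tool_calls.append({"call": call, "result": result})
        traj.turns.append(Turn("agent", reply))
        user_msg = user_sim.next_message(reply)          # None once the user is satisfied or out of patience
    return traj
```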
Tau-Bench is the benchmark Sierra built to evaluate its own customer-service agent product, then open-sourced. This origin is visible in the design: the retail policies look like a real e-commerce returns desk, and the airline policies look like a real reservation system. The realism is the benchmark's primary strength and its primary limitation. The retail and airline domains transfer well to similar business workflows; they transfer poorly to (say) clinical decision support or open-ended research.
Two domains: retail and airline
Retail tasks operate on a fictional e-commerce store with products, orders, returns, exchanges, account settings, and reward points. Typical scenarios: "I want to return the shirt I bought last week but I've thrown away the receipt, can you help?" or "I'd like to change the shipping address on order 12345 if it hasn't shipped yet". Retail scenarios average four to seven turns and require three to six tool calls. Database state changes are concrete: a refund processed, an address updated, an order cancelled.
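For flavour, a retail tool might be declared in an OpenAI-style function-calling schema roughly like the sketch below. The tool name, parameters, and description are invented for illustration and do not reproduce the benchmark's actual tool inventory.

```python
# Illustrative typed tool definition. "process_return" and its fields are
# hypothetical examples, not taken from Tau-Bench.
process_return_tool = {
    "name": "process_return",
    "description": (
        "Create a return for items on an existing order and issue a refund "
        "to the original payment method. Policy requires the customer's "
        "identity to be verified earlier in the conversation."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order the items belong to."},
            "item_ids": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Items within the order being returned.",
            },
            "reason": {"type": "string", "description": "Customer-stated reason for the return."},
        },
        "required": ["order_id", "item_ids"],
    },
}
```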
Airline tasks operate on a fictional carrier with itineraries, fare classes, baggage allowances, change fees, seat assignments, and elite status. Typical scenarios: "I need to change my flight from Friday to Sunday but I'm on a non-changeable fare, what are my options?" or "I need to add a checked bag to my booking and add my wife's frequent-flyer number". Airline scenarios average six to ten turns and require five to nine tool calls. The combinatorics of fare rules, baggage allowances, and connected itineraries make airline scenarios harder than retail by a consistent margin.
The retail-to-airline score gap is a stable feature of the leaderboard. Frontier models score around 65 percent on retail and around 45 percent on airline in May 2026. The 20-point gap reflects genuine task complexity, not benchmark noise: airline policies are more numerous, the side effects of each action are larger, and the cost of a single wrong tool call is higher.
The user simulator and what it means for scoring
The user is an LLM, configured with a system prompt that describes the customer's intent, persona, available information, and patience level. The simulator drives the conversation realistically: it answers clarifying questions, sometimes withholds details until prompted, occasionally rejects the agent's first proposed resolution, and ends the dialogue when satisfied or when patience runs out. This is consequential because the simulator's behaviour affects the score. A more cooperative simulator inflates scores; a more demanding one deflates them.
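A simulator configuration of the kind described might look like the sketch below; the field names and prompt wording are illustrative assumptions, not the published GPT-4o system prompt.

```python
# Illustrative user-simulator profile and prompt builder. Structure and
# wording are assumptions; the official harness publishes its own prompt.
user_profile = {
    "intent": "Return a shirt from order W1234 bought last week; receipt discarded.",
    "persona": "Polite but rushed; answers only what is asked.",
    "known_info": {"name": "Ava Chen", "email": "ava@example.com", "order_id": "W1234"},
    "withheld_until_asked": ["order_id"],   # revealed only when the agent asks
    "patience_turns": 8,                    # ends the dialogue if unresolved by then
}

def build_simulator_prompt(profile: dict) -> str:
    return (
        "You are a customer contacting support. "
        f"Your goal: {profile['intent']} "
        f"Persona: {profile['persona']} "
        "Do not volunteer information the agent has not asked for, "
        f"especially: {', '.join(profile['withheld_until_asked'])}. "
        f"If the issue is unresolved after {profile['patience_turns']} of your turns, "
        "politely end the conversation."
    )
```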
The official leaderboard fixes the simulator model to GPT-4o with a published system prompt. Community submissions sometimes use a different simulator, which is a comparability hazard worth disclosing. We treat scores reported with a non-standard simulator as research artefacts rather than direct leaderboard comparisons.
The simulator design is also a source of variance: a single run can be lucky or unlucky depending on the conversational path the simulator drives. This is why most published Tau-Bench scores quote pass@4 (best of four independent rollouts per scenario): single-rollout numbers have higher variance and a single bad run can drop a model's score by 2 to 4 points.
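Under the best-of-k reading of pass@4 used here, the metric reduces to "at least one of k independent rollouts succeeds", averaged over scenarios. A minimal sketch, assuming per-scenario boolean rollout results:

```python
# Sketch: pass@k as best-of-k over independent rollouts, averaged over scenarios.
# rollouts[scenario_id] is a list of booleans, one per rollout.
def pass_at_k(rollouts: dict, k: int = 4) -> float:
    per_scenario = [any(results[:k]) for results in rollouts.values()]
    return sum(per_scenario) / len(per_scenario)

# Example: two scenarios, four rollouts each (scenario IDs are made up).
example = {
    "retail_017": [False, True, False, True],
    "airline_042": [False, False, False, False],
}
print(pass_at_k(example, k=4))   # 0.5: one scenario solved at least once, one never
```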
Policy adherence as a first-class metric
The policy-compliance scoring is the methodological move that most distinguishes Tau-Bench. Each domain ships with a written policy document, and each trajectory is evaluated against it programmatically. A run scores resolution-pass if the user's intent is satisfied and policy-pass if no action violated the policy. The headline score is the conjunction: both must pass for a run to count.
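A minimal sketch of that conjunction, assuming a gold final database state per scenario and a set of programmatic policy checks applied to the trajectory (the check descriptions are invented for illustration):

```python
# Sketch of the headline score: a rollout counts only if the database ends in
# the expected state AND no policy check flags a violation. The specific
# checks named in comments are hypothetical examples, not the actual rule set.
def resolution_pass(final_db_state: dict, gold_db_state: dict) -> bool:
    # The user's intent is satisfied when the database reflects the correct outcome.
    return final_db_state == gold_db_state

def policy_pass(trajectory, policy_checks) -> bool:
    # Each check inspects the full trajectory, e.g. "identity verified before
    # any sensitive tool call" or "confirmation sent after every cancellation".
    return all(check(trajectory) for check in policy_checks)

def headline_success(trajectory, final_db, gold_db, policy_checks) -> bool:
    return resolution_pass(final_db, gold_db) and policy_pass(trajectory, policy_checks)
```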
Policy violations in published submissions cluster into a handful of recurring categories. The most common single failure mode is skipped identity verification: an agent instructed to be helpful sometimes processes a sensitive action without confirming who it is talking to. The second most common is process incompleteness: cancelling a booking without sending a confirmation, or issuing a refund without updating the inventory record. Both failures look fine in conversation but break the system behind it.
SOTA progression, June 2024 to May 2026
Tau-Bench scores have climbed steadily since launch. Retail is closer to saturation than airline because retail policies are simpler; airline scores are still moving fastest among current frontier models. The benchmark remains one of the cleanest measures of real-world tool-use-plus-policy capability in 2026.
Strengths, limits, and harness sensitivity
The benchmark's strengths are clear: realistic scenarios, separate policy scoring, multi-turn dialogue, well-documented policies, open-source release. The standard configuration is reproducible and the leaderboard accepts third-party submissions. Tau-Bench has become the headline benchmark for customer-service agents in 2026, displacing earlier benchmarks like CB2 and PersuasionForGood that lacked database state or policy scoring.
The limits are also worth knowing. The two domains are narrow: an agent that does well on Tau-Bench has demonstrated competence at retail-and-airline-shaped business workflows, not at customer service in general. Pass@4 inflation is a real concern; the gap between pass@1 and pass@4 can be 8 to 12 points for borderline-capable agents. Sierra has a commercial interest in the benchmark, which it has been transparent about, but the close coupling between the benchmark design and Sierra's product means the test favours agents that think the way Sierra agents think.
Harness sensitivity is moderate. The standard configuration uses native function calling with a fixed tool inventory and a published system prompt. Some submissions wrap the agent in ReAct-style scratchpads, retrieval over policy documents, or planning loops; these typically add 3 to 6 points. The gap is smaller than on WebArena but still meaningful. See our tool-use benchmark comparison for how Tau-Bench fits alongside BFCL and ToolBench.
When to use Tau-Bench in 2026
Use Tau-Bench when your agent: takes API tool calls (not browser actions), conducts multi-turn dialogue with users, operates on stateful systems with measurable side effects, and must obey written policies. That covers a meaningful slice of production agent deployments: customer service, workflow automation, internal IT helpdesk, sales-ops co-pilots. For agents that do not match this shape, look elsewhere: SWE-bench Verified for engineering, WebArena for browser, OSWorld for general computer use, GAIA for assistant-style research tasks.
Sources
- [1] Yao, S. et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045.
- [2] Sierra Research GitHub. github.com/sierra-research/tau-bench. Accessed May 2026.
- [3] Anthropic Claude model card notes on Tau-Bench performance, 2025. anthropic.com/research.