Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What: Customer-service agent benchmark across retail and airline, with user simulation and policy adherence
Who: Yao et al., Sierra AI, 2024 (arXiv:2406.12045)
Board top (frozen original board): Claude 3.5 Sonnet (Oct 2024) 69.2% retail / 46.0% airline; GPT-4o 60.4% / 42.0% (pass^1). Newer models live on the successor tau2-bench.
Repository: github.com/sierra-research/tau-bench
Section II.iv · Agent Benchmarks|Reviewed 2026|Tau-Bench board re-verified 17 Jun 2026

Tau-Bench: Customer-Service Agents Under Realistic Policy

The first agent benchmark that treats policy adherence as a first-class metric. Retail and airline scenarios, simulated users, tool-using dialogue, and a clear separation of task success from policy compliance. The benchmark that surfaces over-promising agents, missed identity checks, and the failure modes simpler tool-use evals do not catch.

01

What Tau-Bench measures

Tau-Bench, introduced by Yao et al. at Sierra in June 2024, evaluates dialogue agents on customer-service tasks across two domains: retail and airline. Each task starts with a simulated user request (a return, a cancellation, a question about an account) and the agent must conduct a multi-turn conversation, calling typed API tools against a structured database, until the conversation reaches a terminal state. The agent succeeds when the user's underlying intent is satisfied, the database state reflects the correct outcome, and the trajectory complies with the published policy for the domain.
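The "database state reflects the correct outcome" criterion can be illustrated with a small sketch. Everything here is hypothetical: the in-memory database, the `cancel_order` tool, and the `final_state_matches` helper are invented for illustration and are not the benchmark's actual schema or harness. The idea is only that the agent's tool calls and a reference sequence are replayed on fresh copies of the same starting state, and the end states are compared.

```python
from copy import deepcopy

# Hypothetical starting state; the real benchmark's schema differs.
INITIAL_DB = {"orders": {"W100": {"status": "pending", "refunded": False}}}


def cancel_order(db, order_id):
    """Hypothetical tool: cancel a pending order and mark it refunded."""
    order = db["orders"][order_id]
    if order["status"] != "pending":
        return {"ok": False, "error": "order is not pending"}
    order["status"] = "cancelled"
    order["refunded"] = True
    return {"ok": True}


def final_state_matches(agent_calls, expected_calls):
    """Replay both call sequences on fresh copies and compare the end states."""
    agent_db = deepcopy(INITIAL_DB)
    expected_db = deepcopy(INITIAL_DB)
    for tool, kwargs in agent_calls:
        tool(agent_db, **kwargs)
    for tool, kwargs in expected_calls:
        tool(expected_db, **kwargs)
    return agent_db == expected_db


expected = [(cancel_order, {"order_id": "W100"})]
print(final_state_matches(expected, expected))  # agent did the right thing
print(final_state_matches([], expected))        # agent stopped short of resolution
```

Comparing end states rather than call sequences means an agent is not penalised for taking a different but equivalent route to the same outcome.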

The benchmark introduces three pieces that previous tool-use benchmarks (BFCL, T-Eval, ToolBench) lacked: a user simulator that drives multi-turn dialogue, explicit per-domain policies that the agent must obey, and stateful scenarios where actions have lasting consequences on the database. These choices are what make Tau-Bench feel closer to real production work than to a synthetic tool-use puzzle. They are also what make the scores harder to beat: an agent that knows which tool to call but breaks policy or stops short of resolution still scores partial-fail.

Tau-Bench is the benchmark Sierra built to evaluate its own customer-service agent product, then open-sourced. This origin is visible in the design: the retail policies look like a real e-commerce returns desk, and the airline policies look like a real reservation system. The realism is the benchmark's primary strength and its primary limitation. The retail and airline domains transfer well to similar business workflows; they transfer poorly to (say) clinical decision support or open-ended research.

02

Two domains: retail and airline

Retail tasks operate on a fictional e-commerce store with products, orders, returns, exchanges, account settings, and reward points. Typical scenarios: "I want to return the shirt I bought last week but I've thrown away the receipt, can you help?" or "I'd like to change the shipping address on order 12345 if it hasn't shipped yet". Retail scenarios average four to seven turns and require three to six tool calls. Database state changes are concrete: a refund processed, an address updated, an order cancelled.

Airline tasks operate on a fictional carrier with itineraries, fare classes, baggage allowances, change fees, seat assignments, and elite status. Typical scenarios: "I need to change my flight from Friday to Sunday but I'm on a non-changeable fare, what are my options?" or "I need to add a checked bag to my booking and add my wife's frequent-flyer number". Airline scenarios average six to ten turns and require five to nine tool calls. The combinatorics of fare rules, baggage allowances, and connected itineraries make airline scenarios harder than retail by a consistent margin.

The retail-to-airline score gap is a stable feature of the leaderboard. The top of the official board is Claude 3.5 Sonnet (20241022) at 69.2 percent on retail and 46.0 percent on airline; GPT-4o sits at 60.4 and 42.0. The roughly 20-point retail-to-airline gap reflects genuine task complexity, not benchmark noise: airline policies are more numerous, the side effects of each action are larger, and the cost of a single wrong tool call is higher.
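As a quick check on the "roughly 20-point" figure, the gap can be computed from the four board numbers quoted above. This is plain arithmetic on the cited scores; the variable names are ours.

```python
# pass^1 scores quoted above: (retail, airline), in percent.
board = {
    "Claude 3.5 Sonnet (20241022)": (69.2, 46.0),
    "GPT-4o": (60.4, 42.0),
}

# Retail minus airline, rounded to one decimal place.
gaps = {model: round(retail - airline, 1) for model, (retail, airline) in board.items()}
mean_gap = round(sum(gaps.values()) / len(gaps), 1)

print(gaps)      # per-model gap in points
print(mean_gap)  # average gap across the two models
```

The two gaps come out at 23.2 and 18.4 points, averaging 20.8.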

03

The user simulator and what it means for scoring

The user is an LLM, configured with a system prompt that describes the customer's intent, persona, available information, and patience level. The simulator drives the conversation realistically: it answers clarifying questions, sometimes withholds details until prompted, occasionally rejects the agent's first proposed resolution, and ends the dialogue when satisfied or when patience runs out. This is consequential because the simulator's behaviour affects the score. A more cooperative simulator inflates scores; a more demanding one deflates them.

The official leaderboard fixes the simulator model to GPT-4o with a published system prompt. Community submissions sometimes use a different simulator, which is a comparability hazard worth disclosing. We treat scores reported with a non-standard simulator as research artefacts rather than direct leaderboard comparisons.
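One way to enforce that rule when collecting scores is to record the simulator alongside every number and filter on it. The sketch below is an assumption about how such bookkeeping might look: the `Submission` record, the `"gpt-4o"` identifier string, and the agent names are hypothetical, not fields from the official leaderboard.

```python
from dataclasses import dataclass

# Simulator named in the text; the exact identifier string is an assumption.
STANDARD_SIMULATOR = "gpt-4o"


@dataclass
class Submission:
    agent_model: str
    user_simulator: str


def is_board_comparable(sub: Submission) -> bool:
    """True only when the run used the standard user simulator."""
    return sub.user_simulator.strip().lower() == STANDARD_SIMULATOR


# Hypothetical submissions for illustration.
subs = [
    Submission("agent-a", "gpt-4o"),
    Submission("agent-b", "some-other-model"),
]
comparable = [s.agent_model for s in subs if is_board_comparable(s)]
print(comparable)
```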

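The simulator design is also a source of variance: a single run can be lucky or unlucky depending on the conversational path the simulator drives. This is why the benchmark's headline metric is pass^k, the fraction of tasks solved on all k independent rollouts, rather than a single-rollout score: single-rollout numbers have higher variance and a single bad run can drop a model's score by 2 to 4 points. Some write-ups instead quote pass@4 (best of four independent rollouts per scenario), which measures something different and reads higher.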
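The difference between best-of-k and all-of-k aggregation is easy to see numerically. The sketch below uses the standard combinatorial estimators for pass@k (at least one of k sampled rollouts succeeds) and pass^k (all k sampled rollouts succeed), given c successes out of n rollouts per task. The per-task success counts are invented for illustration and are not benchmark results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k rollouts drawn from n succeeds (c of n succeeded)."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k rollouts drawn from n succeed (c of n succeeded)."""
    return comb(c, k) / comb(n, k)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical data: successes per task over n = 4 rollouts each.
n = 4
successes = [4, 3, 1, 0]

pass_hat_1 = mean(pass_hat_k(n, c, 1) for c in successes)
pass_at_4 = mean(pass_at_k(n, c, 4) for c in successes)
pass_hat_4 = mean(pass_hat_k(n, c, 4) for c in successes)

print(pass_hat_1, pass_at_4, pass_hat_4)  # 0.5 0.75 0.25
```

On the same rollouts, best-of-four reads 75 percent while all-of-four reads 25 percent, so a reported score is only interpretable once the aggregation and k are stated.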

04

Policy adherence as a first-class metric

The policy-compliance scoring is the methodological move that most distinguishes Tau-Bench. Each domain ships with a written policy document, and each trajectory is evaluated against it programmatically. A run scores resolution-pass if the user's intent is satisfied and policy-pass if no action violated the policy. The headline score is the conjunction: both must pass for a run to count.

Policy category | What it checks
Identity verification | Agent must verify customer identity before sensitive operations (refunds, address changes, account access). A trajectory that issues a refund without verification scores policy-fail even if the user is satisfied.
Authorisation limits | Some operations require manager approval beyond a threshold (e.g. refunds over $500). Exceeding limits without escalation breaks policy.
Information disclosure | Agent must not disclose other customers' information. The simulator occasionally requests information about a different account; the correct response is to refuse, not to comply.
Process completeness | Multi-step processes (e.g. cancellation = refund + record update + notification) must complete every required step. Half-complete trajectories score partial-fail.

The four categories above account for the majority of policy violations in published submissions. The most common single failure mode is identity-verification skipping: an agent under instruction to be helpful sometimes processes a sensitive action without confirming who it is talking to. The second-most-common is process-incompleteness: cancelling a booking without sending a confirmation, or refunding without updating the inventory record. Both failures look fine in conversation but break the system.
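A minimal sketch of how an identity-verification rule and the conjunction scoring might be checked over a trajectory. The tool names (`verify_identity`, `issue_refund`, `update_address`) and the representation of a trajectory as a list of tool names are assumptions made for illustration; the benchmark's own checker is not reproduced here.

```python
# Hypothetical tool names, chosen only for this example.
VERIFY_TOOL = "verify_identity"
SENSITIVE_TOOLS = {"issue_refund", "update_address"}


def policy_pass(trajectory):
    """Fail if any sensitive tool is called before identity verification."""
    verified = False
    for tool_name in trajectory:
        if tool_name == VERIFY_TOOL:
            verified = True
        elif tool_name in SENSITIVE_TOOLS and not verified:
            return False
    return True


def run_passes(resolution_pass, trajectory):
    """Headline result: the user's intent is met AND no policy was violated."""
    return resolution_pass and policy_pass(trajectory)


good = ["verify_identity", "lookup_order", "issue_refund"]
skipped_check = ["lookup_order", "issue_refund"]

print(run_passes(True, good))           # resolved and compliant
print(run_passes(True, skipped_check))  # resolved, but refund came before verification
```

The second trajectory is the failure mode described above: the conversation looks successful, yet the run does not count because the refund preceded verification.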

05

Board progression and the freeze

The official board ran from the launch GPT-4o baseline to Claude 3.5 Sonnet (20241022), which still tops it at 69.2 percent retail and 46.0 percent airline. Sierra then froze the original tasks: the repository states they are not updated and points new evaluation at the successor tau2-bench (and tau3-bench), which fix known task bugs and add domains. Any 2025-26 model number you see quoted on "Tau-Bench" therefore comes from a vendor self-report or the successor benchmark, not this board, and is not directly comparable to the figures above.

Date | Tier | Note
Jun 2024 | Launch: GPT-4o pass^1 retail 60.4% / airline 42.0% | Official board figures (sierra-research/tau-bench); the launch paper, arXiv:2406.12045, originally reported airline at 35.2%.
Oct 2024 | Board top: Claude 3.5 Sonnet (20241022) retail 69.2% / airline 46.0% | Highest pass^1 on the official board; second-best is Claude 3.5 Sonnet (20240620) at 62.6% / 36.0%.
2025-26 | Original board frozen at the late-2024 model set | The repo states its tasks are not updated and directs new work to the successor tau2-bench / tau3-bench. Newer models (Gemini 2.5 Pro, Claude 4.x, GPT-5) are not on this board.

06

Strengths, limits, and harness sensitivity

The benchmark's strengths are clear: realistic scenarios, separate policy scoring, multi-turn dialogue, well-documented policies, open-source release. The standard configuration is reproducible and the leaderboard accepts third-party submissions. Tau-Bench has become the headline benchmark for customer-service agents in 2026, displacing earlier benchmarks like CB2 and PersuasionForGood that lacked database state or policy scoring.

The limits are also worth knowing. The two domains are narrow: an agent that does well on Tau-Bench has demonstrated competence at retail-and-airline-shaped business workflows, not all of customer service. Best-of-k inflation is a real concern when a score is quoted as pass@4 rather than pass^k; the difference between pass@1 and pass@4 can be 8 to 12 points for borderline-capable agents. Sierra has a commercial interest in the benchmark, which it has been transparent about, but the closeness between the benchmark design and Sierra's product means the test favours agents that think the way Sierra agents think.

Harness sensitivity is moderate. The standard configuration uses native function calling with a fixed tool inventory and a published system prompt. Some submissions wrap the agent in ReAct-style scratchpads, retrieval over policy documents, or planning loops; these typically add 3 to 6 points. The gap is smaller than on WebArena but still meaningful. See our tool-use benchmark comparison for how Tau-Bench fits alongside BFCL and ToolBench.

07

When to use Tau-Bench in 2026

Use Tau-Bench when your agent: takes API tool calls (not browser actions), conducts multi-turn dialogue with users, operates on stateful systems with measurable side effects, and must obey written policies. That covers a meaningful slice of production agent deployments: customer service, workflow automation, internal IT helpdesk, sales-ops co-pilots. For agents that do not match this shape, look elsewhere: SWE-bench Verified for engineering, WebArena for browser, OSWorld for general computer use, GAIA for assistant-style research tasks.
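The checklist above can be written down as a small selection helper. This is only a restatement of the paragraph in code: the function name, the `interface` labels, and the fallback message are our own choices, not part of any benchmark's tooling.

```python
def suggest_benchmark(interface, multi_turn=False, stateful=False, policy_bound=False):
    """Map an agent's shape to the closest public benchmark named in the text.

    `interface` is a hypothetical label: "api", "browser", "desktop",
    "code", or "research".
    """
    if interface == "api" and multi_turn and stateful and policy_bound:
        return "Tau-Bench"
    fallbacks = {
        "code": "SWE-bench Verified",
        "browser": "WebArena",
        "desktop": "OSWorld",
        "research": "GAIA",
    }
    return fallbacks.get(interface, "no close match in this list")


print(suggest_benchmark("api", multi_turn=True, stateful=True, policy_bound=True))
print(suggest_benchmark("browser"))
```

Requiring all four traits before returning Tau-Bench mirrors the text: an API agent that is single-turn or read-only falls through rather than being matched.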

Editor's verdict: Tau-Bench is the right benchmark for tool-using customer-service dialogue. The policy-compliance scoring catches failures that other tool-use evals miss. Quote pass^k (stating k) with the standard GPT-4o simulator, and label any pass@4 figure as best-of-four; treat retail and airline as separate numbers rather than averaging.
Reader Questions
Q.01 What is Tau-Bench?
Tau-Bench is a customer-service agent benchmark from Sierra AI, released in June 2024. It evaluates dialogue agents on realistic customer-service scenarios in two domains: retail (returns, exchanges, account questions) and airline (cancellations, rebookings, baggage claims). A simulated user submits a request, the agent must use API tools to read and modify the underlying database, and the trajectory succeeds only if both the user's intent is satisfied and the agent's actions comply with the published policy. The benchmark is publicly available at github.com/sierra-research/tau-bench.
Q.02 What makes Tau-Bench different from other tool-use benchmarks?
Three things. First, Tau-Bench includes a user simulator: the agent must drive a multi-turn conversation, not just make a single tool call. Second, the benchmark scores policy adherence separately from task success, so an agent that resolves the user's request by violating policy is scored partial-fail. Third, the scenarios are written by people with customer-service domain experience, which makes them harder than synthetic tool-use puzzles. The benchmark surfaces failures (over-promising, missing edge cases, breaking policy) that simpler tool-use evals miss.
Q.03 How is Tau-Bench scored?
Two metrics. 'Resolution rate' is the fraction of conversations where the user's intent is satisfied and the database state matches the expected outcome. 'Policy compliance' is the fraction of conversations where every agent action follows the documented policy (no unauthorised refunds, no missed identity checks, no shortcuts). The headline figure is pass^k (pass-hat-k): the fraction of tasks the agent solves on all k independent runs, which measures consistency rather than best-of-k luck. On the official board the highest pass^1 is Claude 3.5 Sonnet (20241022) at 69.2% retail and 46.0% airline, with GPT-4o at 60.4% and 42.0%. That board is frozen at the late-2024 model set, so newer models are evaluated on the successor tau2-bench rather than added here.
Q.04 Why is airline harder than retail in Tau-Bench?
Airline scenarios involve more complex policies: fare rules, baggage allowances, identity verification, change fees, time-window restrictions. They also have more side effects: rebooking a flight may cascade into seat selection, meal preferences, and connected itineraries. Retail scenarios are simpler: a return is a return, an exchange is an exchange. The roughly 20-point gap between retail and airline scores holds for both models at the top of the board (23.2 points for Claude 3.5 Sonnet, 18.4 for GPT-4o).
Q.05 What is the role of the user simulator?
The simulator is itself an LLM, configured with a system prompt that describes the customer's intent, persona, and patience. It drives the conversation: it answers clarifying questions, sometimes provides incomplete information, occasionally rejects the agent's first proposed solution, and ends the dialogue when satisfied (or frustrated). This is consequential because it means a Tau-Bench score depends on the simulator model as well as the agent model. The official leaderboard fixes the simulator to GPT-4o; community submissions sometimes vary it, which is a comparability hazard.
Q.06 Is Tau-Bench the right eval for my production agent?
If your agent does customer-service or workflow automation with tool use across a multi-turn conversation, Tau-Bench is the closest public benchmark to your task. Retail-flavour business logic transfers most cleanly. Airline-flavour scenarios are good for stress-testing policy compliance even outside travel. If your agent is single-turn or read-only, BFCL (Berkeley Function-Calling Leaderboard) is a better fit. If your agent is browser-based rather than API-based, WebArena or OSWorld are closer.

Sources

  [1] Yao, S. et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045.
  [2] Sierra Research GitHub leaderboard (pass^1 figures cited above). github.com/sierra-research/tau-bench. Accessed 17 Jun 2026; the README notes the original tasks are frozen.
  [3] Successor benchmark: github.com/sierra-research/tau2-bench (tau2-bench / tau3-bench), where 2025-26 model evaluations now run.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
