Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: 220 tasks across two simulated customer-service domains; policy-grounded dialogue with tool calls.
Who: Sierra Research (Yao, Shinn, Razavi, Narasimhan).
2026 Tier: Retail pass^1 around 75%, Airline around 65%; pass^4 around 15 points lower.
Paper: arxiv.org/abs/2406.12045
Section II.ix Agent Benchmarks | Last verified April 2026

Tau-Bench Retail and Airline: 220 Tasks, Frontier Pass^1 Around 65-75%

Two customer-service simulations that surface the consistency tax most benchmarks hide.

I

The Domains

Retail covers a typical consumer e-commerce help desk: order modification, returns, refunds, exchanges, shipping changes. The agent reads a written policy (retail_policy.md is part of the benchmark and is shown to the agent in-context) and must follow it. The simulated user is itself an LLM playing a customer with a goal.

Airline is a tougher version of the same shape. The policies are stricter: refund eligibility depends on fare class, change fees apply, some changes are forbidden, and the agent must look up loyalty-program rules. The action space is larger and the cost of a wrong tool call is higher.

II

SOTA Progression

| Date | Tier / Score | Note |
| --- | --- | --- |
| Jun 2024 | GPT-4o at pass^1 Retail 61.2%, Airline 35.2% | Original Sierra Tau-Bench paper baselines. |
| Oct 2024 | Claude 3.5 Sonnet at pass^1 Retail 69.2%, Airline 46.0% | Anthropic-reported on the launch evaluation. |
| Mar 2025 | Frontier reasoning models reach pass^1 Retail 72%, Airline 58% | o1 and Sonnet thinking modes close the consistency gap. |
| Apr 2026 | Frontier Retail pass^1 around 75%, Airline around 65% | Captured from Sierra public reports. |

III

The pass^k Insight

Many published agent numbers are best-of-k: run each task k times and count it as solved if any run succeeds. Tau-Bench reports pass^k instead, which requires all k runs of a task to succeed and so surfaces the consistency failures that best-of-k hides. For frontier models on Tau-Bench Retail, the drop from pass^1 to pass^4 is roughly 15 to 20 points, and that drop is the number to check when asking "would I ship this in production?", because production users do not retry.
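The difference between the two metrics is easiest to see on a small example. The sketch below is illustrative only: the function names and the per-task success counts are made up, and it assumes the common combinatorial estimator (chance that k runs drawn from n recorded trials all succeed, or that at least one succeeds). When k equals the number of trials, `pass_hat_k` reduces to "every run succeeded".

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """pass^k for one task: chance that k runs drawn from `trials` all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


def pass_at_k(successes: int, trials: int, k: int) -> float:
    """pass@k for one task: chance that at least one of k drawn runs succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return 1.0 - comb(trials - successes, k) / comb(trials, k)


def benchmark_score(success_counts, trials, k, metric) -> float:
    """Average a per-task metric over all tasks."""
    return sum(metric(c, trials, k) for c in success_counts) / len(success_counts)


# Hypothetical data: 4 tasks, 4 trials each, number of successful trials per task.
counts = [4, 4, 3, 0]

pass_hat_1 = benchmark_score(counts, 4, 1, pass_hat_k)  # 0.6875
pass_hat_4 = benchmark_score(counts, 4, 4, pass_hat_k)  # 0.5
best_of_4 = benchmark_score(counts, 4, 4, pass_at_k)    # 0.75

print(pass_hat_1, pass_hat_4, best_of_4)
```

The task that succeeds on 3 of 4 runs counts fully toward best-of-4 and not at all toward pass^4, which is the consistency failure the metric is designed to expose.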

Reader Questions
Q.01 What does the Tau-Bench Retail domain test?
Tau-Bench Retail (Sierra Research, 2024) is a 115-task simulated retail customer-service environment. The agent talks to a simulated user, must follow company policies on returns, exchanges, refunds, and order modifications, and must call tools (cancel order, issue refund, modify shipping). A task is scored 1 if the final database state matches what policy required, 0 otherwise.
Q.02 What does the Airline domain test?
Tau-Bench Airline is a 105-task simulated airline customer-service environment. Tasks include booking, change-of-itinerary, cancellation, special-meal requests, and frequent-flyer-program lookups. The policies are stricter than Retail (refunds depend on fare class, change fees apply, some changes are forbidden), which makes Airline harder than Retail for most models.
Q.03 What is pass^k and why is it different from pass@k?
pass^k measures consistency: the same task is run k times, and pass^k is the fraction of tasks where the agent succeeds on all k runs. pass@k (HumanEval style) is the fraction of tasks where the agent succeeds at least once across k runs. pass^k is harsher because it requires reproducible success. For agentic systems, pass^k is the more honest signal because users do not retry until something works.
Q.04 What are the headline 2026 numbers?
On Tau-Bench Retail, frontier 2026 models reach pass^1 around 75%, pass^4 around 55%. On Airline, pass^1 sits at roughly 65% with pass^4 in the high 40s. The pass^4 to pass^1 gap is the consistency tax: even strong agents are not deterministic. This gap closes faster on Retail than on Airline because Airline's stricter policy surface punishes any tool-call mistake.
Q.05 How is Tau-Bench different from BFCL?
BFCL evaluates atomic function-calling: pick the right tool and fill in the right arguments, in isolation. Tau-Bench evaluates policy-grounded multi-turn dialogue, where the agent must combine tool calls with dialogue management, policy adherence, and state tracking. The Tau-Bench task therefore includes everything BFCL tests and adds more on top, so strong BFCL scores are necessary but not sufficient for strong Tau-Bench scores.
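The binary scoring rule described in Q.01 (1 if the final database state matches what policy required, 0 otherwise) can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual grader: the `score_task` function, the order ID, and the database layout are all hypothetical.

```python
from copy import deepcopy


def score_task(final_db: dict, expected_db: dict) -> int:
    """Return 1 only if the final state equals the policy-correct state exactly."""
    return int(final_db == expected_db)


# Hypothetical starting state: one delivered order, nothing refunded yet.
initial_db = {"orders": {"W100": {"status": "delivered", "refund": 0.0}}}

# Policy-correct outcome for this made-up task: order returned and refunded in full.
expected_db = deepcopy(initial_db)
expected_db["orders"]["W100"].update(status="returned", refund=49.99)

# Run A performs both writes; run B issues the refund but never records the return.
run_a = deepcopy(expected_db)
run_b = deepcopy(initial_db)
run_b["orders"]["W100"]["refund"] = 49.99

print(score_task(run_a, expected_db))  # 1
print(score_task(run_b, expected_db))  # 0
```

Under a rule like this there is no partial credit: a run that gets the refund amount right but leaves the order status wrong scores the same as a run that does nothing.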

Sources

  [1] Yao et al. (2024): arxiv.org/abs/2406.12045
  [2] Tau-Bench repository: github.com/sierra-research/tau-bench
  [3] Sierra announcement: sierra.ai/blog/benchmarking-ai-agents

From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
