Tool-Use Benchmarks: The 2026 Selection Guide
Tool-use is not one capability but several: selecting the right function, parameterising it correctly, sequencing multiple calls, conducting a dialogue around the calls, and complying with policies. Different benchmarks measure different slices: BFCL for raw function-calling capability, Tau-Bench for tool-using dialogue under policy, ToolBench and T-Eval for foundational and research use. Pick by agent shape.
Tool-use is several capabilities
Tool-use looks like a single agent capability from the outside but decomposes into several distinct sub-capabilities:
- Function selection: given a user request and a tool inventory, identify the correct tool to call.
- Parameterisation: extract the parameters for the tool from the request.
- Sequencing: chain multiple tool calls when the result of one informs the next.
- Dialogue management: conduct a multi-turn conversation around the tool calls, asking clarifying questions or summarising results.
- Policy compliance: ensure the tool calls and dialogue adhere to documented policies (no unauthorised actions, identity verification before sensitive operations, scope limits).
Different benchmarks evaluate different subsets of these sub-capabilities. BFCL focuses on function selection, parameterisation, and sequencing. Tau-Bench adds dialogue management and policy compliance. ToolBench tests breadth of tool selection across many APIs. T-Eval breaks down the sub-capabilities for diagnostic purposes. The right benchmark depends on which sub-capability you want to evaluate.
The most common evaluation mistake is to quote a single benchmark and treat it as a complete tool-use claim. A model with high BFCL accuracy might still struggle on Tau-Bench, because Tau-Bench requires policy adherence and multi-turn dialogue management on top of function calling. A model that does well on Tau-Bench has demonstrated function calling plus dialogue but might not generalise to the wider tool inventory tested by ToolBench. Honest tool-use claims quote at least two benchmarks.
Benchmark-by-benchmark comparison
The full picture of tool-use benchmarks in 2026 spans the four headline benchmarks plus several more specialised options. The summary below lays out what each measures, where the frontier sits, the strengths, the weaknesses, and the recommendation.

| Benchmark | Measures | 2026 frontier | Strengths | Weaknesses | Recommendation |
|---|---|---|---|---|---|
| BFCL | Function selection, parameterisation, sequencing | 85-92% overall | Reproducible, well-instrumented, actively maintained leaderboard | Synthesised scenarios rather than production logs | Headline number for raw function-calling claims |
| Tau-Bench | Tool-using dialogue plus policy compliance | ~60-65% retail, 40-48% airline (pass@1) | Policy compliance as a first-class metric | Only two domains (retail, airline) | Headline number for customer-service and workflow-automation agents |
| ToolBench | Breadth of tool selection across 16,000+ real APIs | Rarely quoted in new claims | Wide inventory of real API descriptions | Partly superseded on reproducibility and production shape | Foundational reference and source of realistic API descriptions |
| T-Eval | Six tool-use aspects scored separately | Rarely quoted as a headline | Per-aspect diagnosis of sub-skills | Less informative for pass-or-fail deployment decisions | Research-stage bottleneck analysis |
Use-case-by-use-case selection guide
The right benchmark depends on the agent shape. Function-calling unit tests are different from production customer-service evaluations; both are different from research-stage diagnosis. The table below maps common use cases to recommended benchmarks.

| Agent shape or use case | Recommended benchmark |
|---|---|
| Raw function-calling layer (selection, parameterisation, sequencing) | BFCL |
| Customer-service or workflow-automation agent operating under policies | Tau-Bench |
| Agent choosing from a very wide inventory of real APIs | ToolBench |
| Research diagnosis of which tool-use sub-skill is the bottleneck | T-Eval |
| Public model capability claims | BFCL plus Tau-Bench together |
BFCL: the function-calling primitive
BFCL (Berkeley Function-Calling Leaderboard) is the canonical benchmark for raw function-calling capability. It tests function selection, parameterisation, and sequencing across thousands of synthesised scenarios spanning several categories: simple single-call (one function, one set of parameters), parallel multi-call (multiple independent function calls in one turn), multi-turn (where the result of one call informs the next), and several edge cases (irrelevance detection where no function should be called, ambiguity resolution, missing parameters).
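To make the categories concrete, here is a minimal sketch of what a simple single-call scenario and its checker can look like. The field names, the example function, and the matching logic are illustrative stand-ins under assumed conventions, not BFCL's actual schema or grader.

```python
# Illustrative shape of a BFCL-style "simple single-call" test case and checker.
# Field names and the example function are hypothetical; the real format lives
# in the gorilla repository behind the leaderboard.

simple_case = {
    "question": "What's the weather in Berlin in celsius?",
    "functions": [
        {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        }
    ],
    # Ground truth: the one correct call (name plus acceptable argument values).
    "expected": {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}},
}


def check_single_call(predicted: dict, expected: dict) -> bool:
    """Match on function selection first, then on parameterisation."""
    if predicted.get("name") != expected["name"]:
        return False  # function selection failure
    pred_args, exp_args = predicted.get("arguments", {}), expected["arguments"]
    # Parameterisation failure if any expected argument is missing or wrong.
    return all(pred_args.get(k) == v for k, v in exp_args.items())
```

The other categories extend the same idea: parallel multi-call compares a set of expected calls, multi-turn checks each turn against the state left by the previous one, and irrelevance detection expects no call at all.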
Frontier models in 2026 score 85-92 percent overall on BFCL, with significant per-category variation. Simple single-call is essentially saturated for frontier models; multi-turn is where the gap between top models and mid-tier models is largest. The benchmark is well-instrumented and reproducible; community submissions are common, and the leaderboard at gorilla.cs.berkeley.edu/leaderboard.html is updated regularly. BFCL is the right benchmark to quote for raw function-calling capability claims.
The main limitation is that BFCL scenarios are synthesised rather than drawn from real production logs. The function-call patterns are realistic but not necessarily representative of the long tail of production scenarios. For production realism, Tau-Bench is the closer benchmark; for breadth across many real APIs, ToolBench remains relevant.
Tau-Bench: tool-using dialogue with policy
Tau-Bench evaluates tool-using dialogue agents on customer-service scenarios across two domains (retail and airline). Each scenario includes a simulated user, a tool inventory exposing CRUD operations on a structured database, and a written policy the agent must obey. Success requires the user's intent to be satisfied, the database state to match the expected outcome, and every action to comply with the policy.
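A sketch of that three-part success check is below, with hypothetical field names rather than Tau-Bench's actual harness API.

```python
# Sketch of a Tau-Bench-style episode check: success requires all three of
# (1) the simulated user's intent satisfied, (2) the final database state equal
# to the annotated goal state, (3) no policy violation logged during the episode.
# Class and field names here are illustrative assumptions, not tau-bench's code.

from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    final_db: dict                 # database state after the dialogue ends
    goal_db: dict                  # annotated expected state for the scenario
    user_goal_met: bool            # did the simulated user get what they asked for?
    policy_violations: list = field(default_factory=list)  # e.g. refund before identity check


def episode_passes(r: EpisodeResult) -> bool:
    # All three conditions must hold; a correct tool call that breaks policy still fails.
    return r.user_goal_met and r.final_db == r.goal_db and not r.policy_violations
```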
Tau-Bench's most distinctive feature is policy compliance as a first-class metric. Other tool-use benchmarks check whether the agent called the right tool with the right parameters; Tau-Bench additionally checks whether the agent's actions violated any documented policy. This catches a class of failures (over-promising, missed identity checks, unauthorised actions) that simpler tool-use evals miss entirely.
The benchmark's 2026 frontier is around 60-65 percent on retail and 40-48 percent on airline (pass@1). The retail-airline gap reflects genuine task complexity rather than benchmark noise: airline policies are more numerous and the side effects of each action are larger. For tool-using customer-service or workflow-automation agents, Tau-Bench is the right headline benchmark.
ToolBench and T-Eval: foundational and diagnostic
ToolBench (Qin et al. 2023) was the first large-scale tool-use benchmark covering 16,000+ real APIs across many categories. It remains useful for testing breadth of tool selection across a wide inventory but has been partly superseded by BFCL for reproducibility and Tau-Bench for production-shape relevance. ToolBench's modern role is as a foundational reference and a source of realistic API descriptions; new model claims rarely lead with ToolBench numbers.
T-Eval (Shanghai AI Lab) provides finer-grained tool-use evaluation across six aspects: instruction following, planning, reasoning, retrieval, understanding, and review. Each aspect has its own sub-benchmark and per-aspect scoring. T-Eval is the right benchmark for research that wants to identify which specific tool-use sub-skill is the bottleneck for a given model. It is less commonly cited in production-deployment contexts where the headline pass-or-fail metric matters more than the per-aspect diagnosis.
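As a toy illustration of that diagnostic use, per-aspect scores can be reduced to a bottleneck report. The aspect names follow the paper; the numbers below are invented for the example.

```python
# Hypothetical per-aspect T-Eval scores (numbers invented), reduced to a
# "which sub-skill is the bottleneck?" diagnosis.

aspect_scores = {
    "instruction_following": 0.91,
    "planning": 0.74,
    "reasoning": 0.68,
    "retrieval": 0.83,
    "understanding": 0.79,
    "review": 0.71,
}

bottleneck = min(aspect_scores, key=aspect_scores.get)
mean_score = sum(aspect_scores.values()) / len(aspect_scores)
print(f"Weakest aspect: {bottleneck} ({aspect_scores[bottleneck]:.2f})")
print(f"Mean aspect score: {mean_score:.2f}")
```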
Combining tool-use benchmarks
The honest 2026 pattern for tool-use claims is to quote BFCL plus Tau-Bench. BFCL gives the function-calling primitive; Tau-Bench gives the production-shape integration. The combination covers the headline tool-use capabilities better than either benchmark alone. Add T-Eval when aspect-level diagnosis is needed; add ToolBench when wide-API-inventory breadth is the question.
As with all agent benchmarks, harness sensitivity matters. The same model with native function-calling support and a structured prompt scaffold scores meaningfully higher than the same model in a free-form text-completion harness with hand-parsed function calls. When citing a tool-use score, disclose the function-calling configuration: native function-calling API or text-based parsing, structured outputs schema or free-form, single-call or parallel-call enabled. The configuration is part of the score, not a separable variable.
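One way to keep the configuration attached to the number is to record it as structured metadata and publish it with the score. The field names below are illustrative assumptions, not any harness's real API.

```python
# A minimal record of the function-calling configuration to report next to a
# tool-use score. Each of these choices moves the number, so none of them
# should be left out of a claim. Field names are illustrative.

from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ToolUseHarnessConfig:
    call_interface: str     # "native_function_calling" or "text_parsed"
    output_schema: str      # "structured_outputs" or "free_form"
    parallel_calls: bool    # parallel multi-call enabled?
    max_turns: int          # dialogue turn budget for multi-turn scenarios
    retries: int            # retries allowed on malformed tool calls


config = ToolUseHarnessConfig(
    call_interface="native_function_calling",
    output_schema="structured_outputs",
    parallel_calls=True,
    max_turns=30,
    retries=0,
)

# Report the configuration as part of the result, not as a footnote.
print({"benchmark": "BFCL", "score": 0.89, **asdict(config)})  # score is a placeholder
```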
Sources
- [1] Berkeley Function-Calling Leaderboard. gorilla.cs.berkeley.edu/leaderboard.html. Accessed May 2026.
- [2] Yao, S. et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045.
- [3] Qin, Y. et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789. The ToolBench paper.
- [4] Chen, Z. et al. (2023). T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step. arXiv:2312.14033.