Tool-Use Benchmarks: The 2026 Selection Guide
Tool-use is not one capability but several: selecting the right function, parameterising it correctly, sequencing multiple calls, conducting a dialogue around the calls, and complying with policies. Different benchmarks measure different slices: BFCL for raw function-calling capability, Tau-Bench for tool-using dialogue under policy, ToolBench and T-Eval for foundational and research use. Pick by agent shape.
Tool-use is several capabilities
Tool-use looks like a single agent capability from the outside but decomposes into several distinct sub-capabilities:
- Function selection: given a user request and a tool inventory, identify the correct tool to call.
- Parameterisation: extract the parameters for the tool from the request.
- Sequencing: chain multiple tool calls when the result of one informs the next.
- Dialogue management: conduct a multi-turn conversation around the tool calls, asking clarifying questions or summarising results.
- Policy compliance: ensure the tool calls and dialogue adhere to documented policies (no unauthorised actions, identity verification before sensitive operations, scope limits).
Different benchmarks evaluate different subsets of these sub-capabilities. BFCL focuses on function selection, parameterisation, and sequencing. Tau-Bench adds dialogue management and policy compliance. ToolBench tests breadth of tool selection across many APIs. T-Eval breaks down the sub-capabilities for diagnostic purposes. The right benchmark depends on which sub-capability you want to evaluate.
The most common evaluation mistake is to quote a single benchmark and treat it as a complete tool-use claim. A model with high BFCL accuracy might still struggle on Tau-Bench, because Tau-Bench requires policy adherence and multi-turn dialogue management on top of function calling. A model that does well on Tau-Bench has demonstrated function calling plus dialogue but might not generalise to the wider tool inventory tested by ToolBench. Honest tool-use claims quote at least two benchmarks.
Benchmark-by-benchmark comparison
The full picture of tool-use benchmarks in 2026 spans the four headline benchmarks plus several more specialised options. The summary below lays out what each measures, where the frontier sits, the strengths, the weaknesses, and the recommendation.

| Benchmark | Measures | 2026 frontier | Strengths | Weaknesses | Recommendation |
|---|---|---|---|---|---|
| BFCL | Function selection, parameterisation, sequencing | 85-92% overall | Reproducible, well-instrumented, actively maintained leaderboard | Synthesised scenarios rather than production logs | Headline number for raw function-calling claims |
| Tau-Bench | Tool-using dialogue plus policy compliance | ~60-65% retail, 40-48% airline (pass@1) | Policy compliance as a first-class metric | Only two domains (retail, airline) | Headline number for customer-service and workflow-automation agents |
| ToolBench | Breadth of tool selection across 16,000+ real APIs | Rarely quoted in new claims | Wide inventory of real API descriptions | Partly superseded on reproducibility and production shape | Foundational reference and source of realistic API descriptions |
| T-Eval | Six tool-use aspects scored separately | Rarely quoted as a headline | Per-aspect diagnosis of sub-skills | Less informative for pass-or-fail deployment decisions | Research-stage bottleneck analysis |
Use-case-by-use-case selection guide
The right benchmark depends on the agent shape. Function-calling unit tests are different from production customer-service evaluations; both are different from research-stage diagnosis. The table below maps common use cases to recommended benchmarks.

| Agent shape or use case | Recommended benchmark |
|---|---|
| Raw function-calling layer (selection, parameterisation, sequencing) | BFCL |
| Customer-service or workflow-automation agent operating under policies | Tau-Bench |
| Agent choosing from a very wide inventory of real APIs | ToolBench |
| Research diagnosis of which tool-use sub-skill is the bottleneck | T-Eval |
| Public model capability claims | BFCL plus Tau-Bench together |
BFCL: the function-calling primitive
BFCL (Berkeley Function-Calling Leaderboard) is the canonical benchmark for raw function-calling capability. It tests function selection, parameterisation, and sequencing across thousands of synthesised scenarios spanning several categories: simple single-call (one function, one set of parameters), parallel multi-call (multiple independent function calls in one turn), multi-turn (where the result of one call informs the next), and several edge cases (irrelevance detection where no function should be called, ambiguity resolution, missing parameters).
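To make the categories concrete, here is a minimal sketch of what a simple single-call scenario and its checker can look like. The field names, the example function, and the matching logic are illustrative stand-ins under assumed conventions, not BFCL's actual schema or grader.

```python
# Illustrative shape of a BFCL-style "simple single-call" test case and checker.
# Field names and the example function are hypothetical; the real format lives
# in the gorilla repository behind the leaderboard.

simple_case = {
    "question": "What's the weather in Berlin in celsius?",
    "functions": [
        {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        }
    ],
    # Ground truth: the one correct call (name plus acceptable argument values).
    "expected": {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}},
}


def check_single_call(predicted: dict, expected: dict) -> bool:
    """Match on function selection first, then on parameterisation."""
    if predicted.get("name") != expected["name"]:
        return False  # function selection failure
    pred_args, exp_args = predicted.get("arguments", {}), expected["arguments"]
    # Parameterisation failure if any expected argument is missing or wrong.
    return all(pred_args.get(k) == v for k, v in exp_args.items())
```

The other categories extend the same idea: parallel multi-call compares a set of expected calls, multi-turn checks each turn against the state left by the previous one, and irrelevance detection expects no call at all.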
Frontier models in 2026 score 85-92 percent overall on BFCL, with significant per-category variation. Simple single-call is essentially saturated for frontier models; multi-turn is where the gap between top models and mid-tier models is largest. The benchmark is well-instrumented and reproducible; community submissions are common, and the leaderboard at gorilla.cs.berkeley.edu/leaderboard.html is updated regularly. BFCL is the right benchmark to quote for raw function-calling capability claims.
The main limitation is that BFCL scenarios are synthesised rather than drawn from real production logs. The function-call patterns are realistic but not necessarily representative of the long tail of production scenarios. For production realism, Tau-Bench is the closer benchmark; for breadth across many real APIs, ToolBench remains relevant.
Tau-Bench: tool-using dialogue with policy
Tau-Bench evaluates tool-using dialogue agents on customer-service scenarios across two domains (retail and airline). Each scenario includes a simulated user, a tool inventory exposing CRUD operations on a structured database, and a written policy the agent must obey. Success requires the user's intent to be satisfied, the database state to match the expected outcome, and every action to comply with the policy.
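A sketch of that three-part success check is below, with hypothetical field names rather than Tau-Bench's actual harness API.

```python
# Sketch of a Tau-Bench-style episode check: success requires all three of
# (1) the simulated user's intent satisfied, (2) the final database state equal
# to the annotated goal state, (3) no policy violation logged during the episode.
# Class and field names here are illustrative assumptions, not tau-bench's code.

from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    final_db: dict                 # database state after the dialogue ends
    goal_db: dict                  # annotated expected state for the scenario
    user_goal_met: bool            # did the simulated user get what they asked for?
    policy_violations: list = field(default_factory=list)  # e.g. refund before identity check


def episode_passes(r: EpisodeResult) -> bool:
    # All three conditions must hold; a correct tool call that breaks policy still fails.
    return r.user_goal_met and r.final_db == r.goal_db and not r.policy_violations
```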
Tau-Bench's most distinctive feature is policy compliance as a first-class metric. Other tool-use benchmarks check whether the agent called the right tool with the right parameters; Tau-Bench additionally checks whether the agent's actions violated any documented policy. This catches a class of failures (over-promising, missed identity checks, unauthorised actions) that simpler tool-use evals miss entirely.
The benchmark's 2026 frontier is around 60-65 percent on retail and 40-48 percent on airline (pass@1). The retail-airline gap reflects genuine task complexity rather than benchmark noise: airline policies are more numerous and the side effects of each action are larger. For tool-using customer-service or workflow-automation agents, Tau-Bench is the right headline benchmark.
ToolBench and T-Eval: foundational and diagnostic
ToolBench (Qin et al. 2023) was the first large-scale tool-use benchmark covering 16,000+ real APIs across many categories. It remains useful for testing breadth of tool selection across a wide inventory but has been partly superseded by BFCL for reproducibility and Tau-Bench for production-shape relevance. ToolBench's modern role is as a foundational reference and a source of realistic API descriptions; new model claims rarely lead with ToolBench numbers.
T-Eval (Shanghai AI Lab) provides finer-grained tool-use evaluation across six aspects: instruction following, planning, reasoning, retrieval, understanding, and review. Each aspect has its own sub-benchmark and per-aspect scoring. T-Eval is the right benchmark for research that wants to identify which specific tool-use sub-skill is the bottleneck for a given model. It is less commonly cited in production-deployment contexts where the headline pass-or-fail metric matters more than the per-aspect diagnosis.
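As a toy illustration of that diagnostic use, per-aspect scores can be reduced to a bottleneck report. The aspect names follow the paper; the numbers below are invented for the example.

```python
# Hypothetical per-aspect T-Eval scores (numbers invented), reduced to a
# "which sub-skill is the bottleneck?" diagnosis.

aspect_scores = {
    "instruction_following": 0.91,
    "planning": 0.74,
    "reasoning": 0.68,
    "retrieval": 0.83,
    "understanding": 0.79,
    "review": 0.71,
}

bottleneck = min(aspect_scores, key=aspect_scores.get)
mean_score = sum(aspect_scores.values()) / len(aspect_scores)
print(f"Weakest aspect: {bottleneck} ({aspect_scores[bottleneck]:.2f})")
print(f"Mean aspect score: {mean_score:.2f}")
```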
Combining tool-use benchmarks
The honest 2026 pattern for tool-use claims is to quote BFCL plus Tau-Bench. BFCL gives the function-calling primitive; Tau-Bench gives the production-shape integration. The combination covers the headline tool-use capabilities better than either benchmark alone. Add T-Eval when aspect-level diagnosis is needed; add ToolBench when wide-API-inventory breadth is the question.
As with all agent benchmarks, harness sensitivity matters. The same model with native function-calling support and a structured prompt scaffold scores meaningfully higher than the same model in a free-form text-completion harness with hand-parsed function calls. When citing a tool-use score, disclose the function-calling configuration: native function-calling API or text-based parsing, structured outputs schema or free-form, single-call or parallel-call enabled. The configuration is part of the score, not a separable variable.
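One way to keep the configuration attached to the number is to record it as structured metadata and publish it with the score. The field names below are illustrative assumptions, not any harness's real API.

```python
# A minimal record of the function-calling configuration to report next to a
# tool-use score. Each of these choices moves the number, so none of them
# should be left out of a claim. Field names are illustrative.

from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ToolUseHarnessConfig:
    call_interface: str     # "native_function_calling" or "text_parsed"
    output_schema: str      # "structured_outputs" or "free_form"
    parallel_calls: bool    # parallel multi-call enabled?
    max_turns: int          # dialogue turn budget for multi-turn scenarios
    retries: int            # retries allowed on malformed tool calls


config = ToolUseHarnessConfig(
    call_interface="native_function_calling",
    output_schema="structured_outputs",
    parallel_calls=True,
    max_turns=30,
    retries=0,
)

# Report the configuration as part of the result, not as a footnote.
print({"benchmark": "BFCL", "score": 0.89, **asdict(config)})  # score is a placeholder
```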
Sources
- [1] Berkeley Function-Calling Leaderboard. gorilla.cs.berkeley.edu/leaderboard.html. Accessed May 2026.
- [2] Yao, S. et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045.
- [3] Qin, Y. et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789. The ToolBench paper.
- [4] Chen, Z. et al. (2023). T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step. arXiv:2312.14033.