Abstract

What: Multi-category function-calling benchmark, roughly 2,000 test cases across Simple, Parallel, Multi-Turn, Live, and Relevance Detection.

Who: Gorilla team, UC Berkeley (Yan, Patil, Chen, et al.).

2026 Tier: Sonnet 4.7 near 87%, GPT-5 around 86%.

Leaderboard: gorilla.cs.berkeley.edu/leaderboard

Section II.viii Agent Benchmarks | Last verified April 2026

BFCL v3: Function-Calling Leaderboard, Frontier Tops 87% Overall

The benchmark every tool-using assistant team checks before shipping.

Construction

BFCL evaluates function-calling: given a user query and a candidate set of function definitions (name, description, parameters, return type), the model must emit a structured tool call. The Berkeley Gorilla team curates roughly 2,000 test cases spread across categories that escalate in difficulty.
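
To make the task concrete, here is a minimal sketch of what a function definition and an emitted call might look like, along with a basic structural check. The `get_weather` function, the dictionary layout, and the `validate_call` helper are all illustrative assumptions; they are not BFCL's actual schema or grader.

```python
import json

# Hypothetical function definition (illustrative only, not a BFCL test case).
GET_WEATHER = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "city": {"type": str, "required": True},
        "unit": {"type": str, "required": False},
    },
}


def validate_call(raw_output, functions):
    """Parse a model's JSON tool call and check it against the candidate set."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(call, dict):
        return False, "output is not a JSON object"

    by_name = {f["name"]: f for f in functions}
    spec = by_name.get(call.get("name"))
    if spec is None:
        return False, "unknown function name"

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, "arguments must be an object"

    params = spec["parameters"]
    # Reject invented parameter names and wrongly typed values.
    for arg_name, value in args.items():
        if arg_name not in params:
            return False, f"unexpected parameter: {arg_name}"
        if not isinstance(value, params[arg_name]["type"]):
            return False, f"wrong type for parameter: {arg_name}"
    # Every required parameter must be present.
    for param_name, rule in params.items():
        if rule["required"] and param_name not in args:
            return False, f"missing required parameter: {param_name}"
    return True, "ok"


good = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
bad = '{"name": "get_weather", "arguments": {"location": "Paris"}}'
print(validate_call(good, [GET_WEATHER]))  # (True, 'ok')
print(validate_call(bad, [GET_WEATHER]))   # (False, 'unexpected parameter: location')
```

A check like this only confirms that the call is well-formed against the candidate set; whether it is the right call for the query is what the benchmark's grading decides.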

v1 (Feb 2024) tested only Simple AST grading: one function in scope, exact-match on the emitted call. v2 (Aug 2024) added live function execution (the call actually runs against a real API), multi-step (the model must call several functions in sequence), and parallel (one turn, multiple calls). v3 (Dec 2024) added Relevance Detection (the model must refuse to call any tool when none is relevant) and Multi-Turn-Long (state tracking across long dialogues).
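
The difference between these categories can be sketched as different pass conditions on the list of calls a model emits. The grader below is a simplified illustration under stated assumptions: the category names, the `grade` function, and the rule that parallel calls are compared without regard to order are choices made for this sketch, not the official BFCL checker.

```python
def canonical(call):
    """Normalize a call so that argument order does not affect comparison."""
    return (call["name"], tuple(sorted(call["arguments"].items())))


def grade(category, emitted, expected):
    """Return True if the emitted calls pass a simplified rule for the category.

    Both `emitted` and `expected` are lists of
    {"name": str, "arguments": dict} entries.
    """
    if category == "relevance":
        # No tool is relevant, so the only passing output is zero calls.
        return emitted == []
    if category == "simple":
        # Exactly one call that matches the single reference call.
        return (
            len(emitted) == 1
            and len(expected) == 1
            and canonical(emitted[0]) == canonical(expected[0])
        )
    if category == "parallel":
        # Several calls in one turn; order is ignored in this sketch.
        return sorted(map(canonical, emitted), key=repr) == sorted(
            map(canonical, expected), key=repr
        )
    if category == "multi_step":
        # Calls must appear in the reference sequence.
        return [canonical(c) for c in emitted] == [canonical(c) for c in expected]
    raise ValueError(f"unknown category: {category}")


paris = {"name": "get_weather", "arguments": {"city": "Paris"}}
rome = {"name": "get_weather", "arguments": {"city": "Rome"}}

print(grade("simple", [paris], [paris]))                  # True
print(grade("parallel", [rome, paris], [paris, rome]))    # True
print(grade("multi_step", [rome, paris], [paris, rome]))  # False
print(grade("relevance", [paris], []))                    # False
```

The relevance case is the one that inverts the usual incentive: a model that always emits some plausible call passes the other categories more often but fails here every time.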

SOTA Progression

Date | Tier / Score | Note
Feb 2024 | GPT-4 at 84.5% (BFCL v1 overall AST) | Original Gorilla release.
Aug 2024 | BFCL v2 launches, GPT-4o around 80.6% | Multi-step and live categories added; harder benchmark.
Dec 2024 | BFCL v3 launches, Claude 3.5 Sonnet at 82.3% | Relevance detection and Multi-Turn-Long categories added.
May 2026 | Sonnet 4.7 near 87%, GPT-5 around 86%, Gemini 3 around 84% | Top of the leaderboard; captured from BFCL v3 leaderboard.

Reading BFCL Scores

Always inspect the per-category breakdown, not just the overall score. A model that scores 85% overall by acing Simple and tanking Multi-Turn is a different proposition for production than one that scores 80% by being merely competent in every category. The overall number averages over categories of varying difficulty; the category-level table is where you find which failure mode dominates.
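
A small worked example shows how the comparison plays out. The per-category scores below are invented for illustration (they are not leaderboard figures), and the five category keys and the `summarize` helper are assumptions of this sketch.

```python
# Hypothetical per-category accuracy in percent; not real leaderboard data.
model_a = {"simple": 98, "parallel": 95, "live": 90, "relevance": 92, "multi_turn": 50}
model_b = {"simple": 82, "parallel": 80, "live": 81, "relevance": 79, "multi_turn": 78}


def summarize(scores):
    """Report the unweighted mean alongside the weakest category."""
    weakest = min(scores, key=scores.get)
    return {
        "overall": sum(scores.values()) / len(scores),
        "weakest": weakest,
        "floor": scores[weakest],
    }


print(summarize(model_a))  # {'overall': 85.0, 'weakest': 'multi_turn', 'floor': 50}
print(summarize(model_b))  # {'overall': 80.0, 'weakest': 'multi_turn', 'floor': 78}
```

Model A leads by five points overall, yet fails half of the multi-turn cases; for a product built around long dialogues, Model B's floor of 78 may matter more than the headline gap.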

Reader Questions

Q.01 What is BFCL?

BFCL (Berkeley Function-Calling Leaderboard) is the standard benchmark for evaluating function-calling, the LLM capability where the model decides which tool to invoke and with what arguments. It is maintained by the Gorilla team at UC Berkeley. The original v1 graded only simple AST matches; v2 added multi-step, parallel, and live categories, and v3 added relevance detection and long multi-turn dialogues.

Q.02 What are the BFCL v3 categories?

BFCL v3 evaluates across seven categories: Simple (one function, AST match), Multiple (pick one from a candidate set), Parallel (call several functions in one turn), Parallel-Multiple (combine the previous two), Multi-Turn (track tool state across turns), Live (real APIs invoked, not just AST graded), and Relevance Detection (refuse if no tool is relevant). The headline number is the unweighted average of accuracy across all categories.

Q.03 Why does function-calling need a dedicated benchmark?

Function-calling failure modes do not show up in MMLU or HumanEval. A model can write correct Python but invent function signatures, hallucinate parameter names, fail to handle ambiguous user intent, or call a function when the user did not ask for a tool. BFCL surfaces these failure modes directly and is the only benchmark most function-calling teams trust as a regression signal.

Q.04 How do frontier models score in 2026?

Sonnet 4.7 sits at the top of the BFCL v3 leaderboard near 87% overall, with GPT-5 close behind at around 86%. Mid-tier closed models cluster in the high 70s. Strong open-weight tool-tuned models (Hammer, Functionary-v3, xLAM) reach the low 70s. Smaller open-weight models score below 50%, and do worst on the multi-turn and relevance detection categories.

Q.05 Is BFCL gameable?

Partially. AST grading rewards exact match on function names and arguments, so models trained on similar synthetic data (a common practice) can score higher on the Simple and Multiple categories than their underlying tool-use capability warrants. The Live and Multi-Turn categories are more robust because they require correct tool execution and state tracking, not just lexical match.

Sources

[1] BFCL leaderboard: gorilla.cs.berkeley.edu/leaderboard
[2] BFCL blog and v3 announcement: gorilla.cs.berkeley.edu/blogs/13_bfcl_v3
[3] Gorilla repository: github.com/ShishirPatil/gorilla