BFCL v3: Function-Calling Leaderboard, Frontier Tops 87% Overall
The benchmark every tool-using assistant team checks before shipping.
Construction
BFCL, the Berkeley Function-Calling Leaderboard, evaluates function calling: given a user query and a candidate set of function definitions (name, description, parameters, return type), the model must emit a structured tool call that names the right function and supplies valid arguments. The Berkeley Gorilla team curates roughly 2,000 test cases spread across categories that escalate in difficulty.
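To make the task shape concrete, here is a minimal sketch of one function definition and a well-formedness check on an emitted call. The `get_weather` function, the JSON-Schema-like field names, and the `validate_call` helper are all illustrative assumptions, not BFCL's actual data format or grader.

```python
from typing import Any

# Hypothetical function definition in a JSON-Schema-like shape (illustrative only).
GET_WEATHER = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Map schema type names to Python types for a shallow type check.
PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_call(call: dict[str, Any], candidates: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    by_name = {f["name"]: f for f in candidates}
    spec = by_name.get(call.get("name"))
    if spec is None:
        return [f"unknown function: {call.get('name')!r}"]

    props = spec["parameters"]["properties"]
    args = call.get("arguments", {})
    problems = []

    for req in spec["parameters"].get("required", []):
        if req not in args:
            problems.append(f"missing required argument: {req}")

    for key, value in args.items():
        if key not in props:
            problems.append(f"unexpected argument: {key}")
            continue
        type_name = props[key]["type"]
        expected = PY_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it explicitly for non-boolean fields.
        wrong_type = expected is not None and (
            not isinstance(value, expected)
            or (isinstance(value, bool) and type_name != "boolean")
        )
        if wrong_type:
            problems.append(f"wrong type for {key}: expected {type_name}")
        elif "enum" in props[key] and value not in props[key]["enum"]:
            problems.append(f"value not allowed for {key}: {value!r}")
    return problems
```

A call such as `{"name": "get_weather", "arguments": {"city": "Paris"}}` passes this check, while one that omits `city` or names a function outside the candidate set does not. A well-formed call can still be the wrong answer for the query; deciding that is the grader's job.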
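v1 (Feb 2024) covered single-turn calls, graded both by AST matching (the emitted call is parsed and compared structurally against the accepted answers) and by executing the call. Its categories were Simple (one function in scope), Multiple (pick one function from several candidates), Parallel (one turn, multiple calls), and Parallel Multiple, plus Relevance Detection (the model must decline to call any tool when none is relevant). v2 (Aug 2024) added the Live set: user-contributed queries and function definitions drawn from real usage. v3 (Sep 2024) added multi-step and multi-turn evaluation, where the model must call several functions in sequence and is graded on the resulting state, including long-context dialogues that demand state tracking.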
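The grading modes named above can be sketched in a few lines. This is a simplified illustration, not the benchmark's grader: the `OPTIONAL` marker, the accepted-values layout, and the helper names are assumptions chosen for the example.

```python
from itertools import permutations
from typing import Any

Call = dict[str, Any]

OPTIONAL = object()  # marker meaning "this argument may be omitted"

# Hypothetical expected answer: each argument maps to a list of accepted values.
EXPECTED_WEATHER: Call = {
    "name": "get_weather",
    "arguments": {"city": ["Paris", "Paris, France"], "unit": ["celsius", OPTIONAL]},
}
EXPECTED_TIME: Call = {"name": "get_time", "arguments": {"timezone": ["Europe/Paris"]}}


def call_matches(pred: Call, expected: Call) -> bool:
    """Structural match: same function name, every argument within its accepted values."""
    if pred["name"] != expected["name"]:
        return False
    allowed = expected["arguments"]
    pred_args = pred.get("arguments", {})
    if set(pred_args) - set(allowed):
        return False  # the model supplied an argument the answer does not allow
    for key, options in allowed.items():
        if key in pred_args:
            if pred_args[key] not in options:
                return False
        elif OPTIONAL not in options:
            return False  # a required argument is missing
    return True


def grade_simple(preds: list[Call], expected: Call) -> bool:
    """Simple: exactly one call, and it must match."""
    return len(preds) == 1 and call_matches(preds[0], expected)


def grade_parallel(preds: list[Call], expected: list[Call]) -> bool:
    """Parallel: every expected call is matched by a distinct prediction, in any order."""
    if len(preds) != len(expected):
        return False
    return any(
        all(call_matches(p, e) for p, e in zip(perm, expected))
        for perm in permutations(preds)
    )


def grade_relevance(preds: list[Call]) -> bool:
    """Relevance detection: the correct behaviour is to emit no call at all."""
    return len(preds) == 0
```

Trying every permutation is only practical for the handful of calls a single turn contains; a grader handling many calls per turn would want a proper bipartite matching instead.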
SOTA Progression
Reading BFCL Scores
Always inspect the per-category breakdown, not just the overall score. A model that scores 85% overall by acing Simple and tanking Multi-Turn is a different proposition for production than one that scores 80% by being merely competent in every category. The overall number averages over categories of very different difficulty, so it can hide a weak spot; the category-level table is where you find which failure mode dominates.
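The arithmetic behind that warning is easy to reproduce. The sketch below uses made-up category scores for two hypothetical models and assumes the overall figure is an unweighted mean of the categories, which may not be how the leaderboard weights them.

```python
from statistics import mean

# Invented per-category accuracies (percent) for two hypothetical models.
MODEL_A = {"simple": 98.0, "parallel": 95.0, "relevance": 92.0, "multi_turn": 55.0}
MODEL_B = {"simple": 82.0, "parallel": 80.0, "relevance": 80.0, "multi_turn": 78.0}


def summarize(scores: dict[str, float]) -> dict[str, object]:
    """Report the unweighted overall score alongside the weakest category."""
    weakest = min(scores, key=scores.get)
    return {
        "overall": round(mean(scores.values()), 1),
        "weakest": weakest,
        "floor": scores[weakest],
    }


for name, scores in (("A", MODEL_A), ("B", MODEL_B)):
    print(name, summarize(scores))
```

Under these numbers model A leads on the overall score (85.0 against 80.0) but its floor is 55.0 on multi-turn, while model B never drops below 78.0. For an assistant that holds long conversations, the floor is likely the more useful number.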
Q.01 What is BFCL?
Q.02 What are the BFCL v3 categories?
Q.03 Why does function-calling need a dedicated benchmark?
Q.04 How do frontier models score in 2026?
Q.05 Is BFCL gameable?
Sources
- [1] BFCL leaderboard: gorilla.cs.berkeley.edu/leaderboard
- [2] BFCL blog and v3 announcement: gorilla.cs.berkeley.edu/blogs/13_bfcl_v3
- [3] Gorilla repository: github.com/ShishirPatil/gorilla