Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Multi-category function-calling benchmark, roughly 2,000 test cases across Simple, Parallel, Multi-Turn, Live, and Relevance Detection.
Who: Gorilla team, UC Berkeley (Yan, Patil, Chen, et al.).
2026 Tier: Sonnet 4.7 near 87%, GPT-5 around 86%.
Leaderboard: gorilla.cs.berkeley.edu/leaderboard
Section II.viii Agent Benchmarks | Last verified April 2026

BFCL v3: Function-Calling Leaderboard, Frontier Tops 87% Overall

The benchmark every tool-using assistant team checks before shipping.

I

Construction

BFCL evaluates function-calling: given a user query and a candidate set of function definitions (name, description, parameters, return type), the model must emit a structured tool call. The Berkeley Gorilla team curates roughly 2,000 test cases spread across categories that escalate in difficulty.
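To make the setup concrete, here is a minimal sketch of what a candidate function set and an emitted tool call can look like. The JSON-Schema-like field layout, the `get_weather` and `get_time` functions, and the sample model output are all assumptions for illustration, not BFCL's exact data format.

```python
import json

# Hypothetical candidate functions; names and schema layout are illustrative only.
candidate_functions = [
    {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
    {
        "name": "get_time",
        "description": "Return the local time for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
]

# What a model might emit for "What's the weather in Paris, in celsius?"
model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

call = json.loads(model_output)
known_names = {f["name"] for f in candidate_functions}

print(call["name"] in known_names)   # True
print(sorted(call["arguments"]))     # ['city', 'unit']
```

The benchmark's job is to decide whether that structured call is the right one for the query, which is what the grading modes described next address.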

v1 (Feb 2024) covered only Simple cases under AST grading: one function in scope, with the emitted call matched exactly against the expected one. v2 (Aug 2024) added three categories: Live, where the call actually runs against a real API; Multi-Step, where the model must call several functions in sequence; and Parallel, where a single turn requires multiple calls. v3 (Dec 2024) added Relevance Detection, where the model must decline to call any tool when none is relevant, and Multi-Turn-Long, which tests state tracking across long dialogues.
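AST grading and relevance detection can both be sketched in a few lines of Python. The example below assumes calls are emitted as Python-style strings with keyword arguments only; `parse_call`, `ast_match`, and `relevance_ok` are hypothetical helper names, and the real BFCL checker is more elaborate than this.

```python
import ast

def parse_call(source):
    """Parse a string like "f(a=1, b='x')" into (function_name, keyword_args)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("not a simple function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    # literal_eval only accepts literals, so non-literal argument values raise ValueError.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs

def ast_match(emitted, expected):
    """True when name and arguments match, ignoring keyword order and whitespace."""
    try:
        return parse_call(emitted) == parse_call(expected)
    except (SyntaxError, ValueError):
        return False

def relevance_ok(emitted_calls, any_tool_relevant):
    """When no tool is relevant, the only correct output is no call at all.
    When a tool is relevant, this sketch only checks that some call was made."""
    return bool(emitted_calls) if any_tool_relevant else not emitted_calls

expected = "get_weather(city='Paris', unit='celsius')"
print(ast_match("get_weather(unit='celsius', city='Paris')", expected))  # True
print(ast_match("get_weather(city='paris', unit='celsius')", expected))  # False
print(relevance_ok([], any_tool_relevant=False))                         # True
```

Comparing parsed structures rather than raw strings means argument order and spacing do not matter, but a wrong value or a misspelled name still fails.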

II

SOTA Progression

Date | Tier / Score | Note
Feb 2024 | GPT-4 at 84.5% (BFCL v1 overall AST) | Original Gorilla release.
Aug 2024 | BFCL v2 launches, GPT-4o around 80.6% | Multi-step and live categories added; harder benchmark.
Dec 2024 | BFCL v3 launches, Claude 3.5 Sonnet at 82.3% | Relevance detection and Multi-Turn-Long categories added.
May 2026 | Sonnet 4.7 near 87%, GPT-5 around 86%, Gemini 3 around 84% | Top of the leaderboard; captured from BFCL v3 leaderboard.

III

Reading BFCL Scores

Always inspect the per-category breakdown, not just the overall. A model that scores 85% overall by acing Simple and tanking Multi-Turn is a different proposition for production than one that scores 80% by being merely competent across categories. The overall number averages over varying difficulty; the category-level table is where you find which failure mode dominates.
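The 85% versus 80% comparison can be made concrete with a small sketch. The per-category numbers below are invented for illustration and are not leaderboard data; `summarize` is a hypothetical helper that reports the unweighted overall score next to the weakest category.

```python
# Hypothetical per-category accuracies in percent, invented for illustration.
model_a = {"Simple": 97, "Parallel": 95, "Multi-Turn": 55, "Live": 88, "Relevance Detection": 90}
model_b = {"Simple": 82, "Parallel": 80, "Multi-Turn": 78, "Live": 80, "Relevance Detection": 80}

def summarize(scores):
    """Return the unweighted mean plus the weakest category and its score."""
    overall = round(sum(scores.values()) / len(scores), 1)
    weakest = min(scores, key=scores.get)
    return {"overall": overall, "weakest": weakest, "weakest_score": scores[weakest]}

print(summarize(model_a))  # {'overall': 85.0, 'weakest': 'Multi-Turn', 'weakest_score': 55}
print(summarize(model_b))  # {'overall': 80.0, 'weakest': 'Multi-Turn', 'weakest_score': 78}
```

Under these assumed numbers, the model with the higher overall score is 23 points worse on Multi-Turn, which matters more than the headline gap if the production workload is a long dialogue.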

Reader Questions
Q.01 What is BFCL?
BFCL (Berkeley Function-Calling Leaderboard) is the standard benchmark for evaluating function-calling, the LLM capability where the model decides which tool to invoke and with what arguments. It is maintained by the Gorilla team at UC Berkeley. Later versions expanded the original v1 (Simple cases, AST-graded) with multi-step, parallel, live, relevance-detection, and multi-turn categories.
Q.02 What are the BFCL v3 categories?
BFCL v3 evaluates across seven categories: Simple (one function, AST match), Multiple (pick one from a candidate set), Parallel (call several functions in one turn), Parallel-Multiple (combine the previous two), Multi-Turn (track tool state across turns), Live (real APIs invoked, not just AST graded), and Relevance Detection (refuse if no tool is relevant). The headline number is the unweighted average of the per-category accuracies.
Q.03 Why does function-calling need a dedicated benchmark?
Function-calling failure modes do not show up in MMLU or HumanEval. A model can write correct Python but invent function signatures, hallucinate parameter names, fail to handle ambiguous user intent, or call a function when the user did not ask for a tool. BFCL surfaces these failure modes directly and is the only benchmark most function-calling teams trust as a regression signal.
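The failure modes listed above can be told apart mechanically. The sketch below is an assumed labelling scheme, not BFCL's own taxonomy: `classify_failure` and its label strings are hypothetical, and `functions` maps each function name to the set of parameter names it accepts.

```python
def classify_failure(call, functions, tool_needed):
    """Label a model output with the first failure mode it exhibits, or "ok".

    call: dict with "name" and "arguments", or None if the model made no call.
    functions: mapping of function name -> set of valid parameter names.
    tool_needed: whether the user's request actually requires a tool.
    """
    if call is None:
        return "missed_call" if tool_needed else "ok"
    if not tool_needed:
        return "unnecessary_call"
    if call["name"] not in functions:
        return "invented_function"
    if set(call["arguments"]) - functions[call["name"]]:
        return "hallucinated_parameter"
    return "ok"

functions = {"get_weather": {"city", "unit"}}

print(classify_failure({"name": "get_weather", "arguments": {"city": "Paris"}}, functions, True))
# ok
print(classify_failure({"name": "fetch_weather", "arguments": {"city": "Paris"}}, functions, True))
# invented_function
print(classify_failure({"name": "get_weather", "arguments": {"town": "Paris"}}, functions, True))
# hallucinated_parameter
print(classify_failure({"name": "get_weather", "arguments": {"city": "Paris"}}, functions, False))
# unnecessary_call
```

Counting these labels across a regression suite shows which failure mode moved between model versions, which a single accuracy number hides.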
Q.04 How do frontier models score in 2026?
Sonnet 4.7 sits at the top of the BFCL v3 leaderboard near 87% overall, with GPT-5 close behind. Mid-tier closed models cluster in the high 70s. Strong open-weight tool-tuned models (Hammer, Functionary-v3, xLAM) reach the low 70s. Smaller open-weight models score below 50%, especially on the multi-turn and relevance detection categories.
Q.05 Is BFCL gameable?
Partially. AST grading rewards exact match on function names and arguments, which means models trained on similar synthetic data (a common practice) score higher on Simple and Multiple categories than their underlying tool-use capability deserves. Live and Multi-Turn categories are more robust because they require correct tool execution and state tracking, not just lexical match.
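One way to see why state tracking is harder to game than lexical matching is to grade on the end state a call sequence produces rather than on the calls themselves. The sketch below is an assumption about how such a check can work, not BFCL's implementation; `ToyFileSystem`, `final_state`, and `state_match` are hypothetical names.

```python
class ToyFileSystem:
    """Hypothetical stateful tool backend used only for this illustration."""

    def __init__(self):
        self.files = {}

    def write_file(self, path, text):
        self.files[path] = text

    def move_file(self, src, dst):
        # Raises KeyError if src does not exist, like a failed tool call.
        self.files[dst] = self.files.pop(src)

def final_state(calls):
    """Apply (function_name, kwargs) pairs in order and return the resulting files."""
    fs = ToyFileSystem()
    for name, kwargs in calls:
        getattr(fs, name)(**kwargs)
    return fs.files

def state_match(model_calls, reference_calls):
    """True when both call sequences leave the backend in the same state."""
    try:
        return final_state(model_calls) == final_state(reference_calls)
    except (AttributeError, KeyError, TypeError):
        return False

reference = [
    ("write_file", {"path": "a.txt", "text": "hi"}),
    ("move_file", {"src": "a.txt", "dst": "b.txt"}),
]
different_but_valid = [("write_file", {"path": "b.txt", "text": "hi"})]
wrong_order = [reference[1], reference[0]]

print(state_match(different_but_valid, reference))  # True
print(state_match(wrong_order, reference))          # False
```

A sequence that differs lexically from the reference can still pass, while one that reuses the exact reference calls in the wrong order fails, so memorizing call patterns does not help.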

Sources

[1] BFCL leaderboard: gorilla.cs.berkeley.edu/leaderboard
[2] BFCL blog and v3 announcement: gorilla.cs.berkeley.edu/blogs/13_bfcl_v3
[3] Gorilla repository: github.com/ShishirPatil/gorilla
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
