Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: 750 instruction-following tasks against 9 simulated apps and 457 APIs, with a Python interpreter as the action space.
Who: Trivedi, Khot, Hartmann, Manku, Dong, Li, Gupta, Sabharwal, Balasubramanian (Stony Brook + Allen AI + others), ACL 2024 Best Paper.
2026 Tier: Frontier around 65% TGC, 33% SGC; not saturated.
Project: appworld.dev
Section II.iv Agent Benchmarks | Last verified April 2026

AppWorld Benchmark: 750 Tasks Across 9 Apps, Frontier SGC Below 35%

Code-as-action across nine simulated apps. The benchmark that separates task completion from scenario completion.

I

Construction

AppWorld provides a fully controllable Python simulation of 9 apps that a typical phone user might touch in a day. The agent receives an instruction such as "split tonight's dinner bill on Splitwise with the three people I texted today about dinner", then writes Python that calls the simulated APIs to complete it. 457 APIs are available across the apps. The simulator records every state change, which makes scoring deterministic.
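A minimal sketch of the code-as-action loop described above, using a toy stand-in simulator. Every class, method, and value here (ToySplitwise, ToyPhone, run_agent_code, the messages, the bill amount) is hypothetical and exists only for illustration; it is not the AppWorld package or its API. It shows the shape of the interaction: the agent emits Python, the harness runs it against an `apis` object, and the simulator logs each state change so a checker can inspect it afterwards.

```python
# Toy stand-in for a code-as-action environment. Every name here is
# hypothetical; this is not the AppWorld package or its API.
from types import SimpleNamespace


class ToySplitwise:
    """Minimal simulated app that logs every state change."""

    def __init__(self, log):
        self.expenses = []
        self._log = log

    def add_expense(self, description, amount, participants):
        share = round(amount / len(participants), 2)
        expense = {"description": description, "share": share,
                   "participants": list(participants)}
        self.expenses.append(expense)
        self._log.append(("splitwise.add_expense", expense))
        return expense


class ToyPhone:
    """Read-only simulated app holding today's text messages."""

    def __init__(self, messages):
        self._messages = messages

    def search_messages(self, keyword):
        return [m for m in self._messages if keyword in m["text"].lower()]


def run_agent_code(code, apis):
    """Run agent-written Python with `apis` as the only injected name.

    Plain exec is not a sandbox; a real harness would isolate this step.
    """
    exec(code, {"apis": apis})


state_log = []
apis = SimpleNamespace(
    splitwise=ToySplitwise(state_log),
    phone=ToyPhone([
        {"to": "Ana", "text": "Dinner at 7?"},
        {"to": "Ben", "text": "dinner tonight, you in?"},
        {"to": "Cy", "text": "Running late for dinner"},
        {"to": "Dee", "text": "Send me the report"},
    ]),
)

# Code an agent might emit for the dinner-bill instruction.
agent_code = """
people = sorted({m["to"] for m in apis.phone.search_messages("dinner")})
apis.splitwise.add_expense("Dinner", 120.0, ["me"] + people)
"""
run_agent_code(agent_code, apis)
print(state_log)
```

Because the log captures the resulting state rather than the agent's wording, a checker can compare it against the expected end state without parsing free-form output.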

The dataset ships 750 supervisor-curated tasks grouped into 244 scenarios. A scenario is a chain of related tasks that share state (the same supervisor wants several things done in sequence). Task Goal Completion (TGC) is scored per task; Scenario Goal Completion (SGC) requires the whole scenario to finish without breaking earlier state.
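The relationship between the two metrics can be sketched in a few lines, assuming per-task pass/fail results are available as (scenario_id, task_id, passed) tuples. The function name and the sample data are illustrative assumptions, not the official scorer or real results.

```python
from collections import defaultdict


def goal_completion(results):
    """results: iterable of (scenario_id, task_id, passed) tuples.

    Returns (tgc, sgc) as fractions in [0, 1].
    """
    by_scenario = defaultdict(list)
    for scenario_id, _task_id, passed in results:
        by_scenario[scenario_id].append(bool(passed))
    if not by_scenario:
        raise ValueError("no results to score")
    outcomes = [p for tasks in by_scenario.values() for p in tasks]
    # TGC: share of individual tasks that passed.
    tgc = sum(outcomes) / len(outcomes)
    # SGC: share of scenarios in which every task passed.
    sgc = sum(all(tasks) for tasks in by_scenario.values()) / len(by_scenario)
    return tgc, sgc


# Illustrative data only, not real leaderboard results.
example = [
    ("s1", "t1", True), ("s1", "t2", True), ("s1", "t3", True),
    ("s2", "t1", True), ("s2", "t2", False), ("s2", "t3", True),
]
tgc, sgc = goal_completion(example)
print(f"TGC={tgc:.1%} SGC={sgc:.1%}")  # TGC=83.3% SGC=50.0%
```

One failed task out of six costs about 17 points of TGC but half of SGC in this toy case, which is why the two numbers diverge on long scenarios.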

II

SOTA Progression

Date | Tier / Score | Note
Jul 2024 | GPT-4 at 48.7% TGC, 21.0% SGC | Original AppWorld paper, ACL 2024 Best Paper.
Nov 2024 | Claude 3.5 Sonnet at 54.2% TGC | Anthropic-reported with native tool use.
Mar 2025 | o1 at 60.1% TGC, 28.9% SGC | First reasoning-tuned model run.
Apr 2026 | Frontier around 65% TGC, 33% SGC | Captured from public leaderboard, methodology disclosed.

III

Where Models Fail

The failure analysis in the original paper found three dominant error modes. First, API confusion: the agent invents methods that do not exist or uses the wrong app's API. Second, state-tracking drift: across a long scenario the agent loses track of what it has and has not done. Third, hallucinated entities: the agent claims to have sent a message to a contact that does not exist. All three are exacerbated when the scenario has more than 5 tasks.
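The first error mode, API confusion, is the easiest to catch before execution. A minimal sketch, assuming agent code reaches APIs through calls shaped like `apis.<app>.<method>(...)` and that a catalog of valid names is available: the catalog below and the function name are hypothetical, and the check covers only invented or misplaced methods, not state-tracking drift or hallucinated entities.

```python
import ast

# Hypothetical catalog; a real one would be loaded from the API docs.
KNOWN_APIS = {
    "phone": {"search_messages", "send_message"},
    "splitwise": {"add_expense"},
}


def unknown_api_calls(code, catalog=KNOWN_APIS):
    """Return sorted 'app.method' names called as apis.<app>.<method>(...)
    that are missing from the catalog."""
    missing = set()
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match the shape apis.<app>.<method>
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Attribute)
                and isinstance(func.value.value, ast.Name)
                and func.value.value.id == "apis"):
            app, method = func.value.attr, func.attr
            if method not in catalog.get(app, set()):
                missing.add(f"{app}.{method}")
    return sorted(missing)


snippet = """
msgs = apis.phone.search_messages("dinner")
apis.splitwise.split_bill(msgs)      # invented method
apis.venmo.add_expense("Dinner")     # method belongs to another app
"""
print(unknown_api_calls(snippet))  # ['splitwise.split_bill', 'venmo.add_expense']
```

A static check like this costs nothing at run time and lets a harness hand the error back to the agent before any simulated state is touched.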

IV

When to Pick AppWorld

Pick AppWorld for evaluating tool-using assistants where code-as-action is the deployment model. Pick WebArena or OSWorld for browser or desktop agents. Pick TauBench for dialogue-driven customer-support agents. The benchmarks measure different stacks and a high score on one does not imply a high score on the others.

Reader Questions
Q.01 What does AppWorld measure?
AppWorld is a controllable simulation of 9 everyday apps (Amazon, Gmail, Phone, Spotify, SimpleNote, Splitwise, Todoist, Venmo, FileSystem) with a Python interpreter as the action space. The agent reads a natural-language instruction, then writes and executes Python code that calls the app APIs to satisfy the task. Tasks draw on 457 APIs and often require coordination across multiple apps.
Q.02 What are the headline numbers?
On the original AppWorld test split, GPT-4 scored 48.7% Task Goal Completion (TGC) and 21.0% Scenario Goal Completion (SGC). Open-weight models trailed below 25% TGC. As of April 2026, frontier reasoning models (o3, Claude Sonnet 4.7) reach roughly 65% TGC; SGC remains well below TGC because scenarios chain multiple tasks.
Q.03 Why does TGC outpace SGC?
TGC (Task Goal Completion) measures whether a single task completes correctly. SGC (Scenario Goal Completion) measures whether all tasks within a multi-step scenario complete, so one failed task fails the whole scenario and per-task errors compound. SGC is the more demanding metric and the one that maps to real assistant workflows. AppWorld is unusual in reporting both, which keeps the headline number from overstating capability.
Q.04 How is AppWorld different from WebArena?
WebArena gives the agent a browser. AppWorld gives the agent a Python interpreter with API access. AppWorld is closer to how tool-using agents (function calling, code interpreter) actually operate in production. WebArena is closer to how a human-mimicking browser-use agent operates. The two benchmarks measure overlapping but distinct stacks.
Q.05 Is AppWorld saturated?
No. SGC remains below 35% even for frontier reasoning models, which leaves substantial headroom. The simulator is open-source on the AppWorld GitHub, so the benchmark can be extended with new scenarios if contamination from training data becomes a concern.

Sources

  1. Trivedi et al. (2024): arxiv.org/abs/2407.18901
  2. AppWorld project: appworld.dev
  3. AppWorld repository: github.com/StonyBrookNLP/appworld
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
