Abstract

What: 750 instruction-following tasks against 9 simulated apps and 457 APIs, with a Python interpreter as the action space.

Who: Trivedi, Khot, Hartmann, Manku, Dong, Li, Gupta, Sabharwal, Balasubramanian (Stony Brook + Allen AI + others), ACL 2024 Best Resource Paper.

2026 Tier: Frontier around 65% TGC, 33% SGC; not saturated.

Section II.iv Agent Benchmarks | Last verified April 2026

AppWorld Benchmark: 750 Tasks Across 9 Apps, Frontier Around 65% TGC

Code-as-action across nine simulated apps. The benchmark that separates task completion from scenario completion.

Construction

AppWorld provides a fully controllable Python simulation of 9 apps that a typical phone user might touch in a day. The agent receives an instruction such as "split tonight's dinner bill on Splitwise with the three people I texted today about dinner", then writes Python that calls the simulated APIs to complete it. 457 APIs are available across the apps. The simulator records every state change, which makes scoring deterministic.
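To make the code-as-action loop concrete, here is a minimal sketch of the idea. It does not use the AppWorld package: `ToyWorld`, `search_texts`, `add_expense`, and the sample messages are all hypothetical stand-ins invented for illustration. The agent's output is a snippet of Python, the harness executes it against simulated state, and scoring inspects the recorded end state.

```python
# Toy stand-in for a simulated app environment. Every class, method, and
# field name here is hypothetical and is NOT the AppWorld API.
from dataclasses import dataclass, field


@dataclass
class ToyWorld:
    # Messages the user sent today, as (contact, text) pairs.
    texts: list = field(default_factory=lambda: [
        ("ana", "dinner at 8?"),
        ("ben", "dinner tonight!"),
        ("cy", "see you at dinner"),
        ("dee", "meeting moved to 3pm"),
    ])
    expenses: list = field(default_factory=list)
    log: list = field(default_factory=list)  # every state change is recorded

    def search_texts(self, keyword):
        """Read-only call: contacts whose messages mention the keyword."""
        return [who for who, text in self.texts if keyword in text.lower()]

    def add_expense(self, description, total, participants):
        """State-changing call: split a bill equally and log the change."""
        share = round(total / len(participants), 2)
        self.expenses.append({
            "description": description,
            "share": share,
            "participants": list(participants),
        })
        self.log.append(("add_expense", description, tuple(participants)))
        return share


# Code an agent might emit for the dinner-bill instruction; run as-is.
agent_code = """
people = world.search_texts("dinner")
share = world.add_expense("dinner", 120.0, ["me"] + people)
"""

world = ToyWorld()
namespace = {"world": world}
exec(agent_code, namespace)

# Scoring looks at the recorded end state, not at the agent's text output.
task_passed = (
    len(world.expenses) == 1
    and set(world.expenses[0]["participants"]) == {"me", "ana", "ben", "cy"}
)
```

Because the check reads the final state and the change log, two different agent programs that reach the same state score the same, which is what makes this style of evaluation deterministic.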

The dataset ships 750 supervisor-curated tasks grouped into 244 scenarios. A scenario is a chain of related tasks that share state (the same supervisor wants several things done in sequence). Task Goal Completion (TGC) is scored per task; Scenario Goal Completion (SGC) credits a scenario only when every task in it finishes without breaking earlier state.
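The difference between the two metrics is easiest to see in a small calculation. The sketch below assumes results arrive as `(scenario_id, task_id, passed)` tuples, which is a hypothetical format rather than the benchmark's own output, and it simplifies scenario credit to "every task in the scenario passed".

```python
from collections import defaultdict


def goal_completion(results):
    """Compute (TGC, SGC) percentages from (scenario_id, task_id, passed) tuples."""
    results = list(results)
    by_scenario = defaultdict(list)
    for scenario_id, _task_id, passed in results:
        by_scenario[scenario_id].append(bool(passed))

    # TGC: share of individual tasks that passed.
    tgc = 100.0 * sum(bool(p) for _, _, p in results) / len(results)
    # SGC: share of scenarios in which every task passed.
    sgc = 100.0 * sum(all(v) for v in by_scenario.values()) / len(by_scenario)
    return tgc, sgc


example = [
    ("s1", "t1", True), ("s1", "t2", True), ("s1", "t3", True),
    ("s2", "t1", True), ("s2", "t2", False), ("s2", "t3", True),
]
tgc, sgc = goal_completion(example)
```

In this example five of six tasks pass, but a single failure in `s2` costs the whole scenario, so TGC is about 83% while SGC is 50%.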

SOTA Progression

Date | Tier / Score | Note
Jul 2024 | GPT-4 at 48.7% TGC, 21.0% SGC | Original AppWorld paper, ACL 2024 best paper.
Nov 2024 | Claude 3.5 Sonnet at 54.2% TGC | Anthropic-reported with native tool use.
Mar 2025 | o1 at 60.1% TGC, 28.9% SGC | First reasoning-tuned model run.
Apr 2026 | Frontier around 65% TGC, 33% SGC | Captured from public leaderboard, methodology disclosed.

Where Models Fail

The failure analysis in the original paper found three dominant error modes. First, API confusion: the agent invents methods that do not exist or uses the wrong app's API. Second, state-tracking drift: across a long scenario the agent loses track of what it has and has not done. Third, hallucinated entities: the agent claims to have sent a message to a contact that does not exist. All three are exacerbated when the scenario has more than 5 tasks.
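The first error mode, API confusion, can be caught before execution with a static check. The sketch below is illustrative only and is not something the paper describes: it assumes agent code calls APIs as `apis.<app>.<method>(...)`, and `KNOWN_APIS` is a hypothetical two-app registry, not the real 457-API catalogue.

```python
import ast

# Hypothetical registry of app -> API names; not the real AppWorld catalogue.
KNOWN_APIS = {
    "splitwise": {"add_expense", "list_groups"},
    "phone": {"search_text_messages", "send_text_message"},
}


def unknown_api_calls(code):
    """Return sorted 'app.method' names called in code but absent from KNOWN_APIS.

    Only calls written literally as apis.<app>.<method>(...) are inspected.
    """
    missing = set()
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Attribute)
                and isinstance(func.value.value, ast.Name)
                and func.value.value.id == "apis"):
            app, method = func.value.attr, func.attr
            if method not in KNOWN_APIS.get(app, set()):
                missing.add(f"{app}.{method}")
    return sorted(missing)


agent_code = """
msgs = apis.phone.search_text_messages(query="dinner")
apis.splitwise.split_bill(amount=120)   # invented method
apis.venmo.add_expense(amount=120)      # method belongs to another app
"""
flagged = unknown_api_calls(agent_code)
```

Here `flagged` contains `splitwise.split_bill` and `venmo.add_expense`. A check like this only addresses invented or misplaced calls; state-tracking drift and hallucinated entities depend on runtime state and would need a different kind of guard.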

When to Pick AppWorld

Pick AppWorld for evaluating tool-using assistants where code-as-action is the deployment model. Pick WebArena or OSWorld for browser or desktop agents. Pick TauBench for dialogue-driven customer-support agents. The benchmarks measure different stacks and a high score on one does not imply a high score on the others.

Reader Questions

Q.01 What does AppWorld measure?

AppWorld is a controllable simulation of 9 everyday apps (Amazon, Gmail, Phone, Spotify, SimpleNote, Splitwise, Todoist, Venmo, FileSystem) with a Python interpreter as the action space. The agent reads a natural-language instruction, then writes and executes Python code that calls the app APIs to satisfy the task. Tasks span 457 APIs and require coordination across multiple apps.

Q.02 What are the headline numbers?

On the original AppWorld test split, GPT-4 scored 48.7% Task Goal Completion (TGC) and 21.0% Scenario Goal Completion (SGC). Open-weight models trailed below 25% TGC. By April 2026, frontier reasoning models (o3, Claude Sonnet 4.7) reach roughly 65% TGC; SGC, at roughly 33%, remains well below TGC because scenarios chain multiple tasks.

Q.03 Why does TGC outpace SGC?

TGC measures whether a single task completes correctly. SGC (Scenario Goal Completion) measures whether all tasks within a multi-step scenario complete. SGC is the more demanding metric and is the one that maps to real assistant workflows. AppWorld is unusual in reporting both, which prevents the headline from over-stating capability.
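A back-of-the-envelope calculation shows how quickly per-task success erodes at the scenario level. It uses only the figures quoted in this article and rests on one strong assumption that is unlikely to hold in practice: every task succeeds independently with the same probability.

```python
tasks, scenarios = 750, 244      # dataset figures quoted in this article
avg_len = tasks / scenarios      # roughly 3.07 tasks per scenario
tgc = 0.65                       # frontier TGC quoted in this article

# Assumption: each task passes independently with probability tgc,
# so a scenario of avg_len tasks passes with probability tgc ** avg_len.
sgc_if_independent = tgc ** avg_len
```

Under that assumption a 65% TGC implies an SGC of about 0.27. The quoted frontier SGC of 33% sits somewhat above this estimate, which is consistent with successes clustering within scenarios rather than being independent, though the estimate is too crude to support a stronger conclusion.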

Q.04 How is AppWorld different from WebArena?

WebArena gives the agent a browser. AppWorld gives the agent a Python interpreter with API access. AppWorld is closer to how tool-using agents (function calling, code interpreter) actually operate in production. WebArena is closer to how a human-mimicking browser-use agent operates. The two benchmarks measure overlapping but distinct stacks.

Q.05 Is AppWorld saturated?

No. SGC remains below 35% even for frontier reasoning models, which leaves substantial headroom. The simulator is open-source on the AppWorld GitHub, so contamination from training data is limited and the benchmark can be extended with new scenarios.

Sources

[1] Trivedi et al. (2024): arxiv.org/abs/2407.18901
[2] AppWorld project: appworld.dev
[3] AppWorld repository: github.com/StonyBrookNLP/appworld