AppWorld Benchmark: 750 Tasks Across 9 Apps, Frontier Below 50%
Code-as-action across nine simulated apps. The benchmark that separates task completion from scenario completion.
Construction
AppWorld provides a fully controllable Python simulation of 9 apps that a typical phone user might touch in a day. The agent receives an instruction such as "split tonight's dinner bill on Splitwise with the three people I texted today about dinner", then writes Python that calls the simulated APIs to complete it. 457 APIs are available across the apps. The simulator records every state change, which makes scoring deterministic.
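To make the code-as-action loop concrete, here is a minimal sketch of the kind of program an agent might write for the dinner-bill instruction. It does not use the real AppWorld API: `FakePhone`, `FakeSplitwise`, `search_sent_messages`, `add_expense`, and `split_dinner_bill` are hypothetical stand-ins invented for illustration, and the sample messages are made up.

```python
from dataclasses import dataclass, field


@dataclass
class FakePhone:
    """Hypothetical stand-in for a simulated messaging app (not AppWorld's API)."""
    messages: list  # each message is {"to": str, "text": str}

    def search_sent_messages(self, keyword):
        # Case-insensitive substring match over sent messages.
        return [m for m in self.messages if keyword.lower() in m["text"].lower()]


@dataclass
class FakeSplitwise:
    """Hypothetical stand-in for a simulated bill-splitting app (not AppWorld's API)."""
    expenses: list = field(default_factory=list)

    def add_expense(self, description, total, participants):
        share = round(total / len(participants), 2)
        expense = {
            "description": description,
            "total": total,
            "shares": {name: share for name in participants},
        }
        self.expenses.append(expense)  # the state change a simulator would record
        return expense


def split_dinner_bill(phone, splitwise, total, me="me"):
    # Step 1: find everyone who was texted about dinner.
    recipients = sorted({m["to"] for m in phone.search_sent_messages("dinner")})
    # Step 2: split the bill evenly between the user and those contacts.
    return splitwise.add_expense("Dinner", total, [me, *recipients])


phone = FakePhone(messages=[
    {"to": "ana", "text": "Dinner at 8?"},
    {"to": "ben", "text": "Are you free for dinner tonight?"},
    {"to": "cara", "text": "dinner is on, see you there"},
    {"to": "dan", "text": "Running late for the meeting"},
])
splitwise = FakeSplitwise()
expense = split_dinner_bill(phone, splitwise, total=120.0)
print(expense["shares"])
```

The point of the sketch is the shape of the task: the agent has to read state from one app, derive the participants, and write state to another app, and the recorded write is what a deterministic scorer can check.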
The dataset ships 750 supervisor-curated tasks grouped into 244 scenarios. A scenario is a chain of related tasks that share state (the same supervisor wants several things done in sequence). Task Goal Completion (TGC) is scored per task; Scenario Goal Completion (SGC) counts a scenario only when every task in it finishes without breaking earlier state.
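The gap between the two metrics is easiest to see with a small calculation. The sketch below assumes per-task pass/fail results tagged with a scenario id; the `goal_completion` helper and the input format are hypothetical, not the official evaluator, and the sample results are invented.

```python
from collections import defaultdict


def goal_completion(results):
    """Compute (TGC, SGC) from (scenario_id, task_passed) pairs, one pair per task.

    TGC is the fraction of tasks that passed.
    SGC is the fraction of scenarios in which every task passed.
    """
    results = [(scenario_id, bool(passed)) for scenario_id, passed in results]
    if not results:
        raise ValueError("no task results given")

    by_scenario = defaultdict(list)
    for scenario_id, passed in results:
        by_scenario[scenario_id].append(passed)

    tgc = sum(passed for _, passed in results) / len(results)
    sgc = sum(all(flags) for flags in by_scenario.values()) / len(by_scenario)
    return tgc, sgc


# Illustrative data: three scenarios, one failed task in two of them.
results = [
    ("s1", True), ("s1", True), ("s1", True),
    ("s2", True), ("s2", False), ("s2", True),
    ("s3", True), ("s3", True), ("s3", False),
]
tgc, sgc = goal_completion(results)
print(f"TGC = {tgc:.2f}, SGC = {sgc:.2f}")  # TGC = 0.78, SGC = 0.33
```

Seven of nine tasks pass, yet only one of three scenarios is fully completed: a single failed task zeroes out its whole scenario, so SGC can never exceed TGC.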
SOTA Progression
Where Models Fail
The failure analysis in the original paper found three dominant error modes. First, API confusion: the agent invents methods that do not exist or uses the wrong app's API. Second, state-tracking drift: across a long scenario the agent loses track of what it has and has not done. Third, hallucinated entities: the agent claims to have sent a message to a contact that does not exist. All three are exacerbated when the scenario has more than 5 tasks.
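The first error mode, API confusion, can be caught with a static check before any agent code runs. The sketch below is one possible approach, not the paper's analysis method: it assumes agent code calls APIs as `apis.<app>.<method>(...)`, and the `KNOWN_APIS` catalogue and its method names are hypothetical placeholders rather than the real 457-API list.

```python
import ast

# Hypothetical catalogue: app name -> set of method names (illustrative only).
KNOWN_APIS = {
    "splitwise": {"add_expense", "list_groups"},
    "phone": {"search_sent_messages", "send_message"},
}


def find_api_confusion(agent_code, catalogue=KNOWN_APIS):
    """Return sorted (app, method, reason) tuples for calls missing from the catalogue."""
    problems = []
    for node in ast.walk(ast.parse(agent_code)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match only calls shaped like apis.<app>.<method>(...).
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Attribute)
            and isinstance(func.value.value, ast.Name)
            and func.value.value.id == "apis"
        ):
            app, method = func.value.attr, func.attr
            if app not in catalogue:
                problems.append((app, method, "unknown app"))
            elif method not in catalogue[app]:
                # Distinguish a wrong-app call from a method that exists nowhere.
                owners = sorted(a for a, methods in catalogue.items() if method in methods)
                reason = f"belongs to {owners[0]}" if owners else "invented method"
                problems.append((app, method, reason))
    return sorted(problems)


agent_code = """
apis.phone.search_sent_messages(keyword="dinner")
apis.splitwise.send_message(to="ana", text="hi")
apis.splitwise.settle_all_debts()
"""
problems = find_api_confusion(agent_code)
for problem in problems:
    print(problem)
```

The check separates the two flavors of API confusion named above: a method that exists under a different app, and a method that exists nowhere in the catalogue.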
When to Pick AppWorld
Pick AppWorld for evaluating tool-using assistants where code-as-action is the deployment model. Pick WebArena or OSWorld for browser or desktop agents. Pick TauBench for dialogue-driven customer-support agents. The benchmarks measure different stacks, so a high score on one does not imply a high score on the others.
Q.01 What does AppWorld measure?
Q.02 What are the headline numbers?
Q.03 Why does TGC outpace SGC?
Q.04 How is AppWorld different from WebArena?
Q.05 Is AppWorld saturated?
Sources
- [1] Trivedi et al. (2024): arxiv.org/abs/2407.18901
- [2] AppWorld project: appworld.dev
- [3] AppWorld repository: github.com/StonyBrookNLP/appworld