AssistantBench: 214 Realistic Web Tasks, GPT-4 Scores 26%
The benchmark is designed around what users actually ask assistants to do: multi-step web research with a verifiable answer.
Construction
The Tel Aviv University team built AssistantBench by sourcing tasks from three streams: (1) MTurk worker submissions of recent tasks they actually wanted help with, (2) university administrator workflows (course catalogue lookups, scholarship eligibility checks), and (3) author-curated tasks designed to stress specific failure modes. Each task ships with a gold answer extracted by humans and a verification protocol.
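To make the task structure concrete, here is a minimal sketch of how one task record and its sourcing stream could be represented. The class names, field names, and enum values are illustrative assumptions, not the schema of the released dataset.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TaskSource(Enum):
    # The three sourcing streams described above; names are hypothetical.
    CROWD_WORKER = "mturk_submission"
    ADMIN_WORKFLOW = "university_admin_workflow"
    AUTHOR_CURATED = "author_curated"


@dataclass(frozen=True)
class Task:
    # Hypothetical fields; the real dataset may name and type these differently.
    task_id: str
    question: str
    gold_answer: str  # answer extracted by a human annotator
    source: TaskSource


def count_by_source(tasks):
    """Return how many tasks came from each sourcing stream."""
    return Counter(task.source for task in tasks)
```

Tagging each task with its stream makes it possible to report scores per stream, which helps separate failures on crowd-sourced requests from failures on the author-curated stress cases.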
The benchmark deliberately excludes tasks that are answerable from training data alone. Closed-book GPT-4 hits only 11.1% on the test set, which the authors use to argue that browsing capability, not memory, is what is being measured.
SOTA Progression
Limitations
Live-web evaluation has a reproducibility cost. A 2024 task that asked for the rent of a specific listing can have a different correct answer in 2026 because the listing has changed. The AssistantBench team partly addresses this by snapshotting page contents at task creation time, but agents running in a live browser see the current page, not the snapshot. As a result, scores from 2024 papers and 2026 papers are not strictly comparable on time-sensitive tasks.
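One way to keep cross-year comparisons meaningful is to report accuracy separately for tasks whose answers are stable and for tasks whose answers drift with the live page. The sketch below assumes a hypothetical per-task `time_sensitive` flag and a binary `correct` field; neither is stated to be part of the benchmark, and the benchmark's own scoring may award partial credit instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    task_id: str
    time_sensitive: bool  # hypothetical flag: gold answer depends on live page state
    correct: bool         # simplifying assumption: binary correctness


def accuracy(results):
    """Fraction of correct results, or None if there are no results."""
    results = list(results)
    if not results:
        return None
    return sum(r.correct for r in results) / len(results)


def split_accuracy(results):
    """Report accuracy separately for stable and time-sensitive tasks."""
    results = list(results)
    stable = [r for r in results if not r.time_sensitive]
    drifting = [r for r in results if r.time_sensitive]
    return {"stable": accuracy(stable), "time_sensitive": accuracy(drifting)}
```

Under this split, only the `stable` figure would be compared across papers from different years, while the `time_sensitive` figure is read as a snapshot of one evaluation date.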
Q.01 What does AssistantBench test?
Q.02 What is the headline score?
Q.03 How does AssistantBench score answers?
Q.04 Why is GPT-4 stuck below 30%?
Q.05 How is this different from WebArena and Mind2Web?
Sources
- [1] Yoran et al. (2024): arxiv.org/abs/2407.15711
- [2] AssistantBench project: assistantbench.github.io
- [3] Anthropic Computer Use system card: anthropic.com/news/3-5-models-and-computer-use