Abstract

What: 214 realistic web-browsing tasks, 5 to 90 minutes of human effort each.

Who: Yoran, Amouyal, Malaviya, Bogin, Press, Berant (Tel Aviv University, EMNLP 2024).

2026 Tier: Frontier 38 to 42% with strong browser harnesses.

Project: assistantbench.github.io

Section II.v Agent Benchmarks | Last verified April 2026

AssistantBench: 214 Realistic Web Tasks, GPT-4 Scores About 25%

A benchmark designed around what users actually ask assistants to do: multi-step web research with a verifiable answer.

Construction

The Tel Aviv University team built AssistantBench by sourcing tasks from three streams: (1) MTurk worker submissions of recent tasks they actually wanted help with, (2) university administrator workflows (course catalogue lookups, scholarship eligibility checks), and (3) author-curated tasks designed to stress specific failure modes. Each task ships with a gold answer extracted by humans and a verification protocol.

The benchmark deliberately excludes tasks that are answerable from training data alone. Closed-book GPT-4 hits only 11.1% on the test set, which the authors use to argue that browsing capability, not memory, is what is being measured.

SOTA Progression

| Date | Tier / Score | Note |
| --- | --- | --- |
| Jul 2024 | GPT-4 + SeePlanAct at 25.2% | Original Yoran et al. paper baseline. |
| Oct 2024 | Claude 3.5 Sonnet + Computer Use at 32.8% | Anthropic Computer Use system card. |
| Mar 2025 | OpenAI Operator at 35.4% | Reported in Operator launch blog. |
| Apr 2026 | Frontier around 38 to 42% | Captured from public reports; harness-dependent. |
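As a rough check on the progression above, the reported scores can be turned into point gains per step. The sketch below uses only the three single-valued scores listed; the April 2026 entry is left out because it is a range rather than one number. The function and variable names are illustrative.

```python
# Scores (accuracy, %) copied from the SOTA progression above.
progression = [
    ("Jul 2024", "GPT-4 + SeePlanAct", 25.2),
    ("Oct 2024", "Claude 3.5 Sonnet + Computer Use", 32.8),
    ("Mar 2025", "OpenAI Operator", 35.4),
]


def point_gains(rows):
    """Return (date, system, gain over the previous row) in percentage points."""
    gains = []
    for (_, _, prev_score), (date, system, score) in zip(rows, rows[1:]):
        # Round to one decimal to hide floating-point noise.
        gains.append((date, system, round(score - prev_score, 1)))
    return gains


gains = point_gains(progression)
for date, system, gain in gains:
    print(f"{date}: {system} gained {gain} points over the previous entry")
```

Because each row pairs a different model with a different harness, these gains describe systems, not models in isolation.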


Limitations

Live-web evaluation has a reproducibility cost. A 2024 task that asked for the rent of a specific listing has a different gold answer in 2026 because the listing changed. The AssistantBench team partly addresses this by snapshotting page contents at task creation time, but agents in the live browser see the current page, not the snapshot. This means scores from 2024 papers and 2026 papers are not strictly comparable for time-sensitive tasks.
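One way to make cross-year comparisons more defensible is to report accuracy separately on tasks whose gold answers should not drift. The sketch below assumes each result carries a manually assigned `time_sensitive` label; that field, the `TaskResult` type, and the toy task names are hypothetical and are not described as part of the benchmark.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    correct: bool
    time_sensitive: bool  # hypothetical manual label, not a benchmark field


def accuracy(results):
    """Fraction of correct results, or None for an empty subset."""
    if not results:
        return None
    return sum(r.correct for r in results) / len(results)


def split_accuracy(results):
    """Accuracy on all tasks and on the subset whose answers should be stable."""
    stable = [r for r in results if not r.time_sensitive]
    return {"all": accuracy(results), "stable_only": accuracy(stable)}


# Toy data for illustration only; these are not real benchmark results.
demo = [
    TaskResult("rent-listing", False, True),
    TaskResult("course-catalogue", True, False),
    TaskResult("conference-speakers", True, False),
    TaskResult("ticket-price", False, True),
]
report = split_accuracy(demo)
print(report)
```

Comparing the `stable_only` figure across years sidesteps drifting gold answers, at the cost of scoring a smaller and possibly easier subset.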


Reader Questions

Q.01 What does AssistantBench test?

AssistantBench is 214 web-browsing tasks designed to take a human between 5 and 90 minutes. Examples include 'find the average price of a 2-bedroom rental in three named neighbourhoods' or 'list the speakers at the most recent NeurIPS keynote and their affiliations'. Tasks require multi-page navigation, cross-source synthesis, and structured-output extraction.

Q.02 What is the headline score?

GPT-4 with the SeePlanAct agent (from the paper) scored 25.2% accuracy. SeeAct (a baseline web agent) scored 14.4%. Closed-book GPT-4 (no browsing) scored 11.1%, which establishes that the questions are not solvable from memory alone. Stronger browser harnesses raised the score to 32.8% (Claude 3.5 Sonnet + Computer Use) and 35.4% (OpenAI Operator), and the 2026 frontier sits around 38 to 42%.

Q.03 How does AssistantBench score answers?

Most answers are short structured outputs (numbers, lists, JSON-like records). The scoring function is a programmatic match with normalisation for units, ordering, and minor formatting. Tasks where the answer is a free-text summary are excluded from the scored split because LLM-as-judge introduces variance the paper authors wanted to avoid.
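To make "programmatic match with normalisation" concrete, here is a minimal sketch of that kind of scorer. It is not the official AssistantBench scoring code: the function names are hypothetical, it uses exact equality after normalisation, and its unit handling is limited to stripping currency and percent symbols.

```python
import re


def normalise_scalar(value):
    """Turn numbers and numeric strings into floats; clean up other strings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    text = " ".join(str(value).strip().lower().split())
    # Drop thousands separators, currency symbols, percent signs, and spaces.
    numeric = re.sub(r"[,$€£%\s]", "", text)
    try:
        return float(numeric)
    except ValueError:
        return text


def normalise(answer):
    """Recursively normalise records, lists, and scalars."""
    if isinstance(answer, dict):
        items = sorted(
            ((str(k).strip().lower(), v) for k, v in answer.items()),
            key=lambda kv: kv[0],
        )
        return {k: normalise(v) for k, v in items}
    if isinstance(answer, (list, tuple, set)):
        # Sort by repr so that list order does not affect the match.
        return sorted((normalise(item) for item in answer), key=repr)
    return normalise_scalar(answer)


def is_match(predicted, gold):
    """True when the two answers are equal after normalisation."""
    return normalise(predicted) == normalise(gold)


print(is_match("$1,250", 1250))
print(is_match(["Alice", "bob "], ["Bob", "alice"]))
print(is_match({"Price": "40%"}, {"price": 40}))
print(is_match("1300", 1250))
```

A real scorer might award partial credit for near-miss numbers or partly correct lists; this sketch keeps to all-or-nothing matching so the normalisation steps stay easy to read.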

Q.04 Why is GPT-4 stuck below 30%?

The failure analysis identified three dominant causes: (1) the agent stops too early and submits a partial answer; (2) the agent fails to verify an answer against a second source and accepts the first plausible value; (3) the agent's JSON extraction breaks on complex web layouts. None of these is a model capability ceiling. They are harness limitations, which is why scores rise quickly when better harnesses ship.
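The first two failure causes can be guarded against with a check that runs before the agent submits. The sketch below is one hypothetical way to do that, not a description of any shipped harness: the `Candidate` structure, the `submission_problems` function, and the two-source threshold are all assumptions. It does not address the third cause (extraction on complex layouts).

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Candidate:
    # Field name -> extracted value (None when the agent found nothing).
    answer: dict
    # Field name -> list of URLs where the value was observed.
    sources: dict = field(default_factory=dict)


def submission_problems(candidate, required_fields, min_sources=2):
    """List reasons the candidate answer should not be submitted yet."""
    problems = []
    for name in required_fields:
        if candidate.answer.get(name) is None:
            problems.append(f"{name}: missing value (partial answer)")
            continue
        # Count distinct hosts so two pages on one site count as one source.
        hosts = {urlparse(url).netloc for url in candidate.sources.get(name, [])}
        if len(hosts) < min_sources:
            problems.append(
                f"{name}: confirmed by {len(hosts)} source(s), need {min_sources}"
            )
    return problems


candidate = Candidate(
    answer={"avg_rent": 2100, "neighbourhood": None},
    sources={"avg_rent": ["https://a.example/listing", "https://a.example/other"]},
)
problems = submission_problems(candidate, ["avg_rent", "neighbourhood"])
for problem in problems:
    print(problem)
```

An agent loop could keep browsing while `problems` is non-empty and submit once it is empty or a step budget runs out.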

Q.05 How is this different from WebArena and Mind2Web?

WebArena uses controlled simulated environments (a self-hosted Reddit, a self-hosted Gitea). Mind2Web uses static page snapshots with offline action prediction. AssistantBench uses the live public web. The cost is that the web changes under the agent and reproducibility weakens over time; the benefit is that AssistantBench measures performance on the kind of tasks users actually want done.

Sources

[1] Yoran et al. (2024): arxiv.org/abs/2407.15711
[2] AssistantBench project: assistantbench.github.io
[3] Anthropic Computer Use system card: anthropic.com/news/3-5-models-and-computer-use