Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What: 2,350 task demonstrations on 137 real public websites across 31 domains; offline action prediction.
Who: Deng, Gu, Zheng, Chen, Stevens, Wang, Sun, Su (Ohio State University, NeurIPS 2023).
2026 Tier: Frontier above 35% step success on Mind2Web-Live; still discriminating.
Project: osu-nlp-group.github.io/Mind2Web
Section II.vii Agent Benchmarks | Last verified April 2026

Mind2Web: 2,350 Web Tasks, 137 Sites, GPT-4 at 11.2% Step Success

The offline counterpart to WebArena. Snapshot-based evaluation across the public web.

I

Construction

The Ohio State NLP group recruited annotators to demonstrate everyday web tasks: book a flight, find a recipe, sign up for a newsletter, navigate to a specific event page. Each demonstration captures the high-level instruction, the page DOM at each step, the screenshot, and the target action (click, type, select). The dataset spans 137 distinct websites across 31 domains (travel, shopping, services, entertainment, government, education, news).

Evaluation is offline. The agent reads the current page snapshot and predicts the next action. Three step-level metrics: Element Accuracy (did the agent identify the right DOM element?), Operation F1 (token-level F1 over the predicted operation, meaning the action type plus any typed or selected value), and Step Success Rate (are both the element and the operation correct?). Step Success is the headline.
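As a rough illustration of how the three metrics relate, the sketch below scores a single predicted step against its gold action. It assumes Operation F1 is a token-level F1 over the operation type plus any typed or selected value, and that a step succeeds only when both the element and the full operation match. The `Action` class, its field names, and the whitespace tokenisation are illustrative choices, not the benchmark's official scorer.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Action:
    """One web action. Field names are hypothetical, chosen for this sketch."""
    element_id: str   # identifier of the target DOM element
    op: str           # "CLICK", "TYPE", or "SELECT"
    value: str = ""   # typed text or selected option, empty for clicks


def operation_tokens(action: Action) -> List[str]:
    # Assumption: the operation is tokenised as its type followed by
    # the whitespace-split value, all lowercased.
    return [action.op.lower()] + action.value.lower().split()


def operation_f1(pred: Action, gold: Action) -> float:
    """Token-level F1 between predicted and gold operations."""
    p, g = operation_tokens(pred), operation_tokens(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def score_step(pred: Action, gold: Action) -> Dict[str, float]:
    """Return the three step-level metrics for one prediction."""
    element_ok = pred.element_id == gold.element_id
    operation_ok = operation_tokens(pred) == operation_tokens(gold)
    return {
        "element_acc": float(element_ok),
        "operation_f1": operation_f1(pred, gold),
        "step_success": float(element_ok and operation_ok),
    }
```

A prediction can land on the right element and still fail the step: typing "new york" where the gold value is "new york city" keeps Element Accuracy at 1.0 and earns partial Operation F1, but Step Success is 0. Benchmark-level numbers would then be averages of these per-step scores.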

II

SOTA Progression

Date | Tier / Score | Note
Jun 2023 | GPT-4 + MindAct at 11.2% step success | Original Deng et al. paper, NeurIPS 2023 Datasets and Benchmarks track.
Jan 2024 | GPT-4 + SeeAct (HTML + screenshots) at 20.3% | Multimodal augmentation lifts step success.
Aug 2024 | GPT-4o + grounded vision at 28.4% | Multimodal frontier models with image-grounded action prediction.
Apr 2026 | Frontier above 35% step success on Mind2Web-Live | Live re-snapshot tracking; captured from public reports.

III

When to Use Mind2Web

Mind2Web is the right benchmark when the question is "does this model pick the right action at each step?" without confounds from harness, browser tooling, or live web drift. It is the wrong benchmark when the question is end-to-end task completion. Pair Mind2Web (step-level diagnostics) with WebArena or AssistantBench (end-to-end signal).

Reader Questions
Q.01 What is Mind2Web?
Mind2Web is a dataset of 2,350 task demonstrations collected on 137 real public websites spanning 31 domains. Each demonstration is a sequence of (page snapshot, target action) pairs. The benchmark task is to predict the correct next action given the page snapshot and the high-level instruction. Mind2Web is offline: the agent does not interact with a live browser; its predictions are scored against a fixed gold trajectory.
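A minimal sketch of what "offline" means in practice, under stated assumptions: the `Demonstration` and `Step` classes, the tuple layout of an action, and the `predict` callable are all hypothetical names for illustration, not the dataset's actual schema. The loop also assumes the action history fed to the model always comes from the gold trajectory, so one wrong prediction does not derail the steps that follow.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed action layout for this sketch: (element_id, operation, value).
ActionTuple = Tuple[str, str, str]


@dataclass(frozen=True)
class Step:
    snapshot: str             # page DOM captured at this step
    gold_action: ActionTuple  # the annotator's demonstrated action


@dataclass(frozen=True)
class Demonstration:
    instruction: str          # high-level task description
    steps: List[Step]


# A predictor maps (instruction, snapshot, gold history) to one action.
Predictor = Callable[[str, str, List[ActionTuple]], ActionTuple]


def evaluate_offline(demo: Demonstration, predict: Predictor) -> List[bool]:
    """Score each step independently against the fixed gold trajectory."""
    history: List[ActionTuple] = []
    results: List[bool] = []
    for step in demo.steps:
        predicted = predict(demo.instruction, step.snapshot, list(history))
        results.append(predicted == step.gold_action)
        # No browser is driven: the next step sees the gold action,
        # whatever the model predicted here.
        history.append(step.gold_action)
    return results
```

Because every step is scored against a stored snapshot, two runs of the same model give the same result, which is the reproducibility an offline benchmark trades live interaction for.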
Q.02 Why is the GPT-4 step-success rate so low?
The original paper reported 11.2% step success rate for GPT-4 with the MindAct two-stage pipeline. Step success means the predicted action exactly matches the gold action at that step. Real pages have hundreds of candidate elements, so exact match is harsh. The element-grounding accuracy (does the agent identify the correct DOM element?) is much higher, around 53%.
Q.03 What is MindAct?
MindAct is the agent architecture introduced alongside the Mind2Web dataset. It uses a small model to rank candidate elements on the page, then a large model to pick the action conditional on the top-k candidates. The two-stage design works around the context-length limit that full-page HTML would otherwise hit, and most subsequent Mind2Web papers use a variant of it.
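The rank-then-choose idea can be sketched as follows. This is a toy illustration, not MindAct's implementation: word overlap stands in for the small ranking model, the `choose` callable stands in for the large model, and all function and parameter names are hypothetical.

```python
from typing import Callable, Dict, List

# Stand-in type for the large model: given the instruction and a
# shortlist of {element_id: element_text}, return one element_id.
Chooser = Callable[[str, Dict[str, str]], str]


def rank_candidates(instruction: str, candidates: Dict[str, str], k: int) -> List[str]:
    """Stage 1 stand-in: keep the k elements sharing most words with the task."""
    words = set(instruction.lower().split())

    def score(element_id: str) -> int:
        return len(words & set(candidates[element_id].lower().split()))

    # sorted() is stable, so tied elements keep their page order.
    return sorted(candidates, key=score, reverse=True)[:k]


def predict_element(
    instruction: str,
    candidates: Dict[str, str],
    choose: Chooser,
    k: int = 3,
) -> str:
    """Stage 2: the large model only ever sees the top-k shortlist."""
    shortlist = rank_candidates(instruction, candidates, k)
    options = {element_id: candidates[element_id] for element_id in shortlist}
    return choose(instruction, options)
```

The design point is the size of `options`: a page with hundreds of elements is reduced to k entries before the expensive model is called, so prompt length depends on k rather than on the page.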
Q.04 Has Mind2Web been replaced by WebArena?
Not replaced. Complementary. Mind2Web measures action prediction on snapshots of the live web (high realism, low controllability). WebArena measures end-to-end task completion in self-hosted environments (low realism on app diversity, high controllability and reproducibility). Both are useful and most browser-agent papers report on both.
Q.05 Is the live web's drift a problem for Mind2Web?
Less than for live-browser benchmarks. Mind2Web evaluates against static snapshots captured at collection time, so the gold answers do not change. The downside is the snapshots age: agents that learn 2023 layouts may not generalise to 2026 pages of the same sites. The Mind2Web-Live extension addresses this with periodic re-snapshotting.

Sources

  [1] Deng et al. (2023): arxiv.org/abs/2306.06070
  [2] Mind2Web project: osu-nlp-group.github.io/Mind2Web
  [3] SeeAct paper (2024): arxiv.org/abs/2401.01614
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
