Mind2Web: 2,350 Web Tasks, 137 Sites, GPT-4 at 11.2% Step Success
The offline counterpart to WebArena. Snapshot-based evaluation across the public web.
Construction
The Ohio State NLP group recruited annotators to demonstrate everyday web tasks: book a flight, find a recipe, sign up for a newsletter, navigate to a specific event page. Each demonstration captures the high-level instruction, the page DOM at each step, the screenshot, and the target action (click, type, select). The dataset spans 137 distinct websites across 31 domains (travel, shopping, services, entertainment, government, education, news).
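As a rough illustration of what one demonstration contains, the sketch below models the pieces named above (instruction, per-step DOM, screenshot, target action) as plain Python dataclasses. All class and field names here are hypothetical and chosen for readability; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical field names for illustration only; not the released schema.
@dataclass
class Step:
    dom_html: str                   # page DOM snapshot at this step
    screenshot_path: Optional[str]  # path to the screenshot, if available
    target_element_id: str          # element the annotator acted on
    op: str                         # "CLICK", "TYPE", or "SELECT"
    value: str = ""                 # text typed or option chosen; empty for clicks

@dataclass
class Demonstration:
    instruction: str                # high-level task description
    website: str
    domain: str
    steps: List[Step] = field(default_factory=list)

# A made-up two-step demonstration.
demo = Demonstration(
    instruction="Sign up for the newsletter",
    website="example.com",
    domain="services",
    steps=[
        Step("<html>...</html>", None, "input-email", "TYPE", "user@example.com"),
        Step("<html>...</html>", None, "btn-subscribe", "CLICK"),
    ],
)
```

Keeping the action as an element plus an operation plus an optional value mirrors the click, type, select split described above and makes per-step comparison against a gold action straightforward.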
Evaluation is offline. The agent reads the current page snapshot and predicts the next action. Three metrics: Element Accuracy (did the agent identify the right DOM element?), Operation F1 (token-level F1 over the predicted operation, so it covers the action type and, for type and select, the value entered), and Step Success Rate (a step counts only if both the element and the operation match the gold). Step Success is the headline.
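The three metrics can be sketched in a few lines. This is a minimal illustration, not the official evaluation script: the `Gold` and `Prediction` types, the string encoding of an operation (for example `"TYPE user@example.com"`), and whitespace tokenization are all assumptions made for the example. Operation F1 is computed at the token level, which is how the paper's definition is understood here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical containers; the operation is encoded as "<OP> <value>".
@dataclass
class Gold:
    acceptable_element_ids: Set[str]
    operation: str

@dataclass
class Prediction:
    element_id: str
    operation: str

def element_correct(pred: Prediction, gold: Gold) -> bool:
    """True if the predicted element is one of the acceptable gold elements."""
    return pred.element_id in gold.acceptable_element_ids

def operation_f1(pred: Prediction, gold: Gold) -> float:
    """Token-level F1 between predicted and gold operation strings."""
    p, g = pred.operation.split(), gold.operation.split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def step_success(pred: Prediction, gold: Gold) -> bool:
    """A step succeeds only if the element and the full operation both match."""
    return element_correct(pred, gold) and pred.operation.split() == gold.operation.split()

def score(preds: List[Prediction], golds: List[Gold]) -> Dict[str, float]:
    """Average each metric over all steps."""
    n = len(golds)
    return {
        "element_acc": sum(element_correct(p, g) for p, g in zip(preds, golds)) / n,
        "operation_f1": sum(operation_f1(p, g) for p, g in zip(preds, golds)) / n,
        "step_sr": sum(step_success(p, g) for p, g in zip(preds, golds)) / n,
    }

golds = [Gold({"btn-subscribe"}, "CLICK"), Gold({"input-email"}, "TYPE user@example.com")]
preds = [Prediction("btn-subscribe", "CLICK"), Prediction("input-email", "TYPE user@example.org")]
results = score(preds, golds)
```

In this made-up example the second step picks the right element but types the wrong value, so Element Accuracy stays at 1.0 while Operation F1 drops to 0.75 and Step Success Rate to 0.5. That gap is why Step Success is the strictest of the three.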
When to Use Mind2Web
Mind2Web is the right benchmark when the question is "does this model pick the right action at each step?" without confounds from harness, browser tooling, or live web drift. It is the wrong choice when the question is end-to-end task completion on a live site. Pair Mind2Web (step-level diagnostics) with WebArena or AssistantBench (end-to-end signal).