Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
Headline trio: WebArena (DOM agents), Visual WebArena (screenshot agents), OSWorld (whole-desktop)
Foundational: Mind2Web, WebShop (now embedded in WebArena and AgentBench)
2026 frontier: WebArena 55-65%; Visual WebArena 45-55%; OSWorld 35-45%
Most harness-sensitive: WebArena (20-30 point gaps depending on scaffold)
Section II.xiii · Benchmark Comparison · Last verified April 2026

Browser-Agent Benchmarks: The 2026 Selection Guide

The browser-agent benchmark landscape has settled around three benchmarks that match the three deployment shapes browser agents take in 2026: WebArena for DOM agents, Visual WebArena for screenshot agents, and OSWorld for whole-desktop agents. Pick by deployment shape; quote with the harness; expect 20-30 point swings between configurations.

01 · Three deployment shapes, three benchmarks

Browser agents in 2026 come in three structurally different deployment shapes, and each shape has a primary benchmark that matches it. Shape one is the DOM-consuming agent: the agent receives parsed HTML or a structured DOM representation, identifies elements by selector or attribute, and acts through DOM operations like click, type, and navigate. Web automation libraries (Playwright, Selenium) and the LangGraph + BrowserGym style fit this shape. WebArena is the natural benchmark.
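A minimal sketch of the shape-one loop using Playwright's sync API. The URL, selectors, and the elided model call are all hypothetical; only the structure (DOM observation in, selector-level action out) is the point.

```python
# Shape one in miniature: the agent reads parsed DOM and acts by selector.
# Task, URL, and selectors are hypothetical; the model call is elided.
from playwright.sync_api import sync_playwright

def dom_agent_step(url: str, query: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # the observation a DOM agent hands to the model
        # action = model.decide(html, query)  # elided: model picks a selector-level action
        page.fill("input[name='q']", query)   # act through DOM operations, not pixels
        page.click("button[type='submit']")
        text = page.inner_text("body")
        browser.close()
        return text
```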

Shape two is the screenshot-consuming agent: the agent receives rendered page screenshots, grounds elements visually, and acts through coordinate-level mouse and keyboard inputs. Anthropic Computer Use, OpenAI Operator, Google Project Mariner, and similar consumer products fit this shape. Visual WebArena is the natural benchmark: the same four sites as WebArena, with screenshot inputs and visual grounding required.
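The same step for shape two, sketched with pyautogui. Here `propose_action` stands in for a vision-language-model call, and its action schema is invented for illustration; nothing but pixels goes in, nothing but coordinates comes out.

```python
# Shape two in miniature: the agent sees pixels and acts at coordinates.
# propose_action is a hypothetical VLM call; its action schema is invented here.
import pyautogui

def screenshot_agent_step(propose_action, goal: str) -> None:
    screenshot = pyautogui.screenshot()        # rendered pixels only, no DOM access
    action = propose_action(screenshot, goal)  # e.g. {"type": "click", "x": 412, "y": 318}
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])  # ground the click at coordinates
    elif action["type"] == "type":
        pyautogui.write(action["text"])            # keystrokes go to the focused field
```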

Shape three is the whole-desktop agent: the agent operates across multiple desktop applications, of which a browser is one. The agent must switch between browser, office software, file manager, terminal, and other applications, often within a single task. OSWorld is the natural benchmark; 369 tasks spanning six task areas including browser-based work alongside office, file management, code editing, image editing, and terminal tasks.
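OSWorld ships a Gym-style desktop environment; the snippet below follows the pattern in the OSWorld repo's quickstart, with the task path illustrative and VM provider configuration elided. Treat the exact signatures as approximate and check the repo before relying on them.

```python
# Gym-style loop over a real Linux VM, following the pattern in the OSWorld
# repo's quickstart; the task path is illustrative and VM setup is elided.
import json
from desktop_env.desktop_env import DesktopEnv

with open("evaluation_examples/examples/chrome/some_task.json") as f:  # illustrative path
    task = json.load(f)

env = DesktopEnv(action_space="pyautogui")   # actions are pyautogui command strings
obs = env.reset(task_config=task)
obs, reward, done, info = env.step("pyautogui.hotkey('ctrl', 'l')")  # focus the URL bar
env.close()
```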

02 · Benchmark-by-benchmark comparison

The full picture of browser-agent benchmarks in 2026 includes the headline trio plus several historical or specialised benchmarks. The summary below lays out what each measures, the current frontier, the strengths, the weaknesses, and the recommendation.

| Benchmark | What it measures | 2026 frontier | Note | Recommend |
| --- | --- | --- | --- | --- |
| WebArena | 812 tasks across 4 self-hosted web apps; DOM-based actions | 55-65% (strong scaffold) | Browser-only; DOM actions only | Yes (DOM-consuming agents) |
| Visual WebArena | 910 multimodal tasks across the same 4 apps; screenshot-based | 45-55% (strong scaffold) | Trails WebArena by ~10 pts; same site set | Yes (screenshot-consuming agents) |
| OSWorld | 369 real-OS tasks; browser is one of 6 task areas | 35-45% (strong scaffold) | Hardest in the family; Linux-core | Yes (whole-desktop agents) |
| Mind2Web | Pre-recorded page sequences; trajectory classification | Variable; less commonly cited | Pre-recorded rather than live; less production-realistic | Research use only |
| WebShop | Simulated retailer with cart and search | Saturated for frontier models | Now embedded in WebArena and AgentBench | Use embedded versions |
| BrowserGym | Library + benchmark suite including WebArena, MiniWoB++, and more | Varies per benchmark | Not a benchmark itself; a meta-framework | Use as harness; quote underlying benchmarks |
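BrowserGym's role as harness rather than benchmark shows in its API: you make an environment for an underlying benchmark's task and step it. A sketch following the library's gymnasium-style interface; the task ID and action string are illustrative, so check the BrowserGym docs for the registered task lists and action DSL.

```python
# BrowserGym as harness: you gym.make a task from an underlying benchmark.
# Task ID and action string are illustrative; verify against the docs.
import gymnasium as gym
import browsergym.webarena  # importing registers browsergym/webarena.* task IDs

env = gym.make("browsergym/webarena.310")  # one WebArena task, by number
obs, info = env.reset()
# obs bundles the modalities a scaffold can choose from (DOM, AXTree, screenshot);
# actions are strings in BrowserGym's DSL, here a click on an element bid.
obs, reward, terminated, truncated, info = env.step("click('a324')")
env.close()
```

Quote the underlying benchmark (WebArena, MiniWoB++) in any claim, not BrowserGym itself.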
03 · Use-case-by-use-case selection guide

The right benchmark depends on what your browser agent does. The table below gives the recommended primary and secondary benchmark for each common browser-agent use case.

| Use case | Primary benchmark | Secondary |
| --- | --- | --- |
| DOM-consuming browser agent (parses HTML) | WebArena | OSWorld for breadth |
| Screenshot-consuming browser agent (vision-grounded) | Visual WebArena | OSWorld |
| Whole-desktop computer-use agent | OSWorld | Visual WebArena |
| E-commerce shopping or transactional workflows | WebArena (OneStopShop env) | Tau-Bench retail |
| Research / trajectory analysis | Mind2Web | WebArena |
| GitLab / GitHub / engineering workflow agents | WebArena (GitLab env) | Terminal-Bench |
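If you want the core rule as executable documentation, it reduces to a small lookup on deployment shape. The mapping below simply restates the table's first three rows; the shape labels are our own shorthand, not a standard taxonomy.

```python
# The selection guide as data: deployment shape -> (primary, secondary).
# Shape labels are our own shorthand, not a standard taxonomy.
BENCHMARKS_BY_SHAPE = {
    "dom":        ("WebArena", "OSWorld"),
    "screenshot": ("Visual WebArena", "OSWorld"),
    "desktop":    ("OSWorld", "Visual WebArena"),
}

def pick_benchmarks(shape: str) -> tuple[str, str]:
    if shape not in BENCHMARKS_BY_SHAPE:
        raise ValueError(f"unknown deployment shape: {shape!r}")
    return BENCHMARKS_BY_SHAPE[shape]
```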
04 · DOM vs screenshot: the modality choice

The DOM-versus-screenshot choice is the single most important determinant of which browser benchmark to quote. WebArena and Visual WebArena use the same four sites and the same success functions; they differ only in the modality the agent consumes. The 10-15 point gap between the two reflects the harder modality of screenshot grounding. If your agent works from the DOM, the DOM benchmark is the right comparison; if it works from screenshots, the screenshot benchmark is the right comparison.

The honest read of vendor browser-agent claims requires checking the modality. Anthropic Computer Use is screenshot-grounded; its WebArena claims should be read in the Visual WebArena framing, not the original WebArena framing. The same applies to OpenAI Operator. Conversely, BrowserGym + LangGraph configurations are typically DOM-based; their WebArena claims are correctly compared against the original WebArena leaderboard.

05 · OSWorld's broader scope and what it adds

OSWorld's value as a browser-agent benchmark comes from its breadth. The 369 tasks include genuine browser work (Firefox-based search, navigation, form filling) alongside office, file management, code editor, image editor, and terminal tasks. A browser-only agent that scores well on WebArena might struggle on OSWorld's browser subset because OSWorld's browser tasks often span multiple applications: download a file in Firefox, open it in LibreOffice Calc, paste a chart into Impress.

For agents whose deployment context is "browser plus other apps" (any consumer-grade computer-use product), OSWorld is the more realistic comparison. For agents whose deployment context is "browser only" (a web automation tool), WebArena is the more focused comparison. Many production agents fit somewhere in between, which is why we recommend quoting both for serious computer-use agent claims.

06 · Harness sensitivity is severe

Browser-agent scores are the most harness-sensitive numbers in the agent-benchmark family. The same model in a minimal harness and a strong agentic scaffold can differ by 20-30 points on WebArena. The strongest scaffolds add: planning loops that decompose tasks before acting, set-of-marks (SoM) annotation that overlays numbered boxes on UI elements, screenshot grounding that supplements DOM access with visual context, retrieval over the rendered DOM that lets the model focus on relevant elements, and self-correction loops that re-evaluate after each action.
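Strong scaffolds share a control-flow skeleton: plan, annotate, act, re-evaluate. A structural sketch only; every callable here (model.plan, annotate_som, model.decide, and so on) is a stand-in for harness-specific model and browser calls that vary widely in practice.

```python
# The loop shape strong scaffolds share; every callable is a stand-in for
# harness-specific model and browser calls, not a real API.
def scaffolded_episode(task, env, model, annotate_som, max_steps=30):
    plan = model.plan(task)                          # decompose before acting
    for _ in range(max_steps):
        obs = env.observe()                          # DOM, screenshot, or both
        marked = annotate_som(obs)                   # set-of-marks: numbered boxes on UI elements
        action = model.decide(task, plan, marked)    # grounded action choice
        env.execute(action)
        verdict = model.verify(task, env.observe())  # self-correction: re-evaluate after acting
        if verdict.done:
            return True
        if verdict.off_track:
            plan = model.replan(task, plan, verdict) # repair the plan and continue
    return False
```

Each of those components is a scaffolding investment, which is why two honest scores for the same model can sit 25 points apart.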

The implication is that bare model-vs-model browser benchmark comparisons are essentially meaningless without harness disclosure. A 35 percent WebArena score and a 60 percent WebArena score on the same underlying model can both be honest; they reflect different scaffolding investments. The honest pattern when citing a browser-agent number is "[model] in [harness] reaches X percent on [benchmark], according to [source]". Vague claims that hide the harness hide the most important variable.

07 · What the community is converging on

Three patterns appear in published 2026 browser-agent claims. First, vendor consumer products (Anthropic, OpenAI, Google) report Visual WebArena and OSWorld as primary numbers because their products are screenshot-grounded computer-use agents. Second, open-source automation tools and frameworks (LangGraph + BrowserGym, others) report WebArena as primary because they are DOM-consuming. Third, research papers comparing scaffold designs typically report all three (WebArena, Visual WebArena, OSWorld) to give a complete picture.

The community has converged on the headline trio as the appropriate benchmark portfolio for browser-agent claims. Mind2Web is now historical; WebShop is embedded; new benchmarks (BrowserBench, WebShop-2, others) are emerging but have not yet displaced the headline three. Expect this configuration to remain stable through 2027; the more interesting evolution is in scaffold design rather than benchmark design.

Editor's verdict: Quote WebArena for DOM-consuming browser agents, Visual WebArena for screenshot-consuming agents, and OSWorld for whole-desktop agents. Disclose the harness; treat 30-point gaps without harness detail as suspicious. Skip Mind2Web for current frontier comparisons.
Reader Questions
Q.01 Which browser-agent benchmark should I quote in 2026?
If your agent consumes the DOM (parsed HTML), quote WebArena. If your agent consumes screenshots (vision-language model), quote Visual WebArena. If your agent operates across multiple desktop applications including a browser, quote OSWorld. The three benchmarks cover the three deployment shapes browser agents take in 2026; the right choice depends on which shape matches your production agent.
Q.02 What's the difference between WebArena and OSWorld?
Scope and action space. WebArena is browser-only across four self-hosted web applications with DOM-level actions. OSWorld is whole-desktop across six task areas with mouse-and-keyboard actions on a real Linux VM. WebArena is a cleaner test of browser-specific capability; OSWorld is a more realistic test of general computer-use capability that includes browser tasks alongside office, file, code, and image work. Frontier models score 55-65 percent on WebArena and 35-45 percent on OSWorld in 2026; OSWorld is harder.
Q.03 Why is Visual WebArena harder than WebArena?
Same backend, different input modality. WebArena gives the agent parsed HTML and lets it act through DOM operations. Visual WebArena gives the agent screenshots and requires it to ground actions visually (click at coordinates, type into the focused field). Frontier multimodal models score 10-15 points lower on VWA than on WebArena; the gap reflects the harder modality. If your production agent consumes screenshots (Anthropic Computer Use, OpenAI Operator, Google Project Mariner), VWA is the closer benchmark.
Q.04 Are Mind2Web and WebShop still relevant?
Limited. Mind2Web pioneered browser-agent benchmarking with pre-recorded page sequences and remains useful for trajectory-classification research, but it has been largely superseded by WebArena and OSWorld for end-to-end agent evaluation. WebShop is now embedded as one environment in WebArena (and AgentBench); the standalone WebShop benchmark is rarely cited for current frontier comparisons. Both are foundational, neither is the right 2026 headline.
Q.05 What about commercial browser agents (Anthropic Computer Use, OpenAI Operator, Google Project Mariner)?
These are products built on top of vendor-specific scaffolding that runs against benchmarks like OSWorld and Visual WebArena. Each vendor publishes their own claimed scores; Anthropic and OpenAI both report OSWorld numbers in the high 30s to mid 40s with their respective Computer Use products. Vendor numbers should be read as 'product plus model plus scaffolding' claims; community submissions using the same model with open scaffolding typically score lower. The vendor advantage on browser benchmarks comes from the scaffolding integration as much as from the model.
Q.06 How harness-sensitive are browser benchmarks?
Very. Browser benchmark scores are the most harness-sensitive numbers in the agent-benchmark family. The same model in a minimal harness and a strong agentic scaffold can differ by 20-30 points on WebArena. The strongest scaffolds add: planning loops, set-of-marks UI annotation, screenshot grounding, retrieval over the rendered DOM, and self-correction loops. Always quote the harness when quoting a browser-agent number; bare 'model X scores Y on WebArena' claims hide the most important variable.

Sources

[1] WebArena project site. webarena.dev. Accessed May 2026.
[2] Visual WebArena. arXiv:2401.13649.
[3] OSWorld project site. os-world.github.io.
[4] BrowserGym library. github.com/ServiceNow/BrowserGym.
[5] Mind2Web project. osu-nlp-group.github.io/Mind2Web.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
