Browser-Agent Benchmarks: The 2026 Selection Guide
The browser-agent benchmark landscape has settled around three benchmarks that match the three deployment shapes browser agents take in 2026: WebArena for DOM agents, Visual WebArena for screenshot agents, and OSWorld for whole-desktop agents. Pick by deployment shape; quote the harness alongside the score; expect 20-30 point swings between configurations.
Three deployment shapes, three benchmarks
Browser agents in 2026 come in three structurally different deployment shapes, and each shape has a primary benchmark that matches it. Shape one is the DOM-consuming agent: the agent receives parsed HTML or a structured DOM representation, identifies elements by selector or attribute, and acts through DOM operations like click, type, and navigate. Web automation libraries (Playwright, Selenium) and LangGraph + BrowserGym-style stacks fit this shape. WebArena is the natural benchmark.
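A minimal sketch of shape one, assuming Playwright's sync API and a hypothetical choose_action function standing in for the LLM call; the DOM actions themselves (goto, click, fill, content) are real Playwright methods.

```python
# Sketch of a DOM-consuming agent loop. choose_action is a hypothetical
# LLM call; everything else is Playwright's real sync API.
from playwright.sync_api import sync_playwright

def choose_action(html: str) -> dict:
    """Hypothetical LLM call: map page HTML to a DOM-level action,
    e.g. {"op": "click", "selector": "#submit"} or
         {"op": "type", "selector": "input[name=q]", "text": "open issues"}."""
    raise NotImplementedError

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")
    for _ in range(10):  # bounded agent loop
        action = choose_action(page.content())  # DOM in, action out
        if action["op"] == "click":
            page.click(action["selector"])
        elif action["op"] == "type":
            page.fill(action["selector"], action["text"])
        elif action["op"] == "navigate":
            page.goto(action["url"])
```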
Shape two is the screenshot-consuming agent: the agent receives rendered page screenshots, grounds elements visually, and acts through coordinate-level mouse and keyboard inputs. Anthropic Computer Use, OpenAI Operator, Google Project Mariner, and similar consumer products fit this shape. Visual WebArena is the natural benchmark: it reuses WebArena's self-hosted sites, but with screenshot inputs and visual grounding required.
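A minimal sketch of shape two, assuming pyautogui for the real screenshot and input primitives and a hypothetical locate_target function standing in for the vision-model call.

```python
# Sketch of a screenshot-consuming agent step. locate_target is a
# hypothetical vision-model call; the I/O primitives are pyautogui's real API.
import io

import pyautogui

def locate_target(png_bytes: bytes, instruction: str) -> tuple[int, int]:
    """Hypothetical vision-model call: ground an instruction to (x, y) pixels."""
    raise NotImplementedError

shot = pyautogui.screenshot()  # PIL Image of the current screen
buf = io.BytesIO()
shot.save(buf, format="PNG")

x, y = locate_target(buf.getvalue(), "click the Submit button")
pyautogui.click(x, y)                    # coordinate-level mouse action
pyautogui.write("hello", interval=0.05)  # keyboard input, key by key
```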
Shape three is the whole-desktop agent: the agent operates across multiple desktop applications, of which a browser is one. The agent must switch between browser, office software, file manager, terminal, and other applications, often within a single task. OSWorld is the natural benchmark: 369 tasks spanning six areas, with browser-based work alongside office, file-management, code-editing, image-editing, and terminal tasks.
Benchmark-by-benchmark comparison
The full picture of browser-agent benchmarks in 2026 includes the headline trio plus several historical or specialised benchmarks. In brief:
- WebArena: DOM-level tasks on self-hosted sites; the primary benchmark for DOM-consuming agents.
- Visual WebArena: the same tasks consumed as screenshots; the primary benchmark for screenshot-grounded agents, and typically 10-15 points harder.
- OSWorld: 369 whole-desktop tasks across six areas; the primary benchmark for agents that leave the browser.
- Mind2Web: historical; useful context, no longer the number to quote.
- WebShop: embedded rather than standalone; likewise historical as a headline number.
Use-case-by-use-case selection guide
The right benchmark depends on what your browser agent does. For each common use case, the recommended primary and secondary benchmarks are:
- DOM-based web automation (Playwright, Selenium, LangGraph + BrowserGym): WebArena primary, Visual WebArena secondary.
- Screenshot-grounded computer-use products (Anthropic Computer Use, OpenAI Operator, Google Project Mariner): Visual WebArena primary, OSWorld secondary.
- Whole-desktop agents that mix the browser with office, files, and terminal work: OSWorld primary, WebArena or Visual WebArena secondary.
- Scaffold-design research: report all three.
DOM vs screenshot: the modality choice
The DOM-versus-screenshot choice is the single most important determinant of which browser benchmark to quote. WebArena and Visual WebArena share the same self-hosted site infrastructure and the same programmatic success functions; they differ only in the modality the agent consumes. The 10-15 point gap between the two reflects the harder modality of screenshot grounding. If your agent works from the DOM, the DOM benchmark is the right comparison; if it works from screenshots, the screenshot benchmark is the right comparison.
The honest read of vendor browser-agent claims requires checking the modality. Anthropic Computer Use is screenshot-grounded; its WebArena claims should be read in the Visual WebArena framing, not the original WebArena framing. The same applies to OpenAI Operator. Conversely, BrowserGym + LangGraph configurations are typically DOM-based; their WebArena claims are correctly compared against the original WebArena leaderboard.
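To make the modality check concrete, here is a sketch using BrowserGym, which exposes both views of one WebArena task. The task-id pattern, observation keys, and flatten_axtree_to_str helper match the BrowserGym versions we have used, but treat them as assumptions and verify against your installed version.

```python
# Sketch: one WebArena task, two modalities, via BrowserGym.
# Running this also requires the self-hosted WebArena sites to be configured.
import gymnasium as gym

import browsergym.webarena  # noqa: F401  (registers the webarena.* task ids)
from browsergym.utils.obs import flatten_axtree_to_str

env = gym.make("browsergym/webarena.0")
obs, info = env.reset()

dom_view = flatten_axtree_to_str(obs["axtree_object"])  # shape one: DOM/AXTree text
pixel_view = obs["screenshot"]                          # shape two: rendered pixels

# A DOM agent prompts on dom_view; a screenshot agent prompts on pixel_view.
# Same task, same success function, different modality, different difficulty.
```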
OSWorld's broader scope and what it adds
OSWorld's value as a browser-agent benchmark comes from its breadth. The 369 tasks include genuine browser work (Firefox-based search, navigation, form filling) alongside office, file management, code editor, image editor, and terminal tasks. A browser-only agent that scores well on WebArena might struggle on OSWorld's browser subset because OSWorld's browser tasks often span multiple applications: download a file in Firefox, open it in LibreOffice Calc, paste a chart into Impress.
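To make that multi-application structure concrete, here is a hypothetical task spec in the spirit of OSWorld's JSON configs; the field names are illustrative, not OSWorld's actual schema.

```python
# Hypothetical multi-application task spec, OSWorld-style in spirit only.
# The shape is the point: one instruction, several apps, one programmatic check.
task = {
    "instruction": (
        "Download the quarterly CSV from the intranet page in Firefox, "
        "open it in LibreOffice Calc, and save it as report.ods."
    ),
    "setup": [
        {"type": "launch", "app": "firefox"},
        {"type": "open_url", "url": "http://intranet.local/reports"},  # hypothetical URL
    ],
    "evaluator": {
        # Success is judged from the resulting machine state, not the action
        # trace, which is what makes cross-application tasks gradable.
        "check": "file_exists",
        "path": "~/Documents/report.ods",
    },
}
```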
For agents whose deployment context is "browser plus other apps" (any consumer-grade computer-use product), OSWorld is the more realistic comparison. For agents whose deployment context is "browser only" (a web automation tool), WebArena is the more focused comparison. Many production agents fit somewhere in between, which is why we recommend quoting both for serious computer-use agent claims.
Harness sensitivity is severe
Browser-agent scores are the most harness-sensitive numbers in the agent-benchmark family. The same model in a minimal harness and a strong agentic scaffold can differ by 20-30 points on WebArena. The strongest scaffolds add several components (a sketch of one follows this list):
- planning loops that decompose tasks before acting;
- set-of-marks (SoM) annotation that overlays numbered boxes on UI elements;
- screenshot grounding that supplements DOM access with visual context;
- retrieval over the rendered DOM that lets the model focus on relevant elements;
- self-correction loops that re-evaluate after each action.
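As a concrete example of one component, here is a minimal SoM sketch assuming Pillow for the annotation; where the candidate boxes come from (DOM geometry, a detector) is up to the scaffold, so they are plain inputs here.

```python
# Sketch of set-of-marks (SoM) annotation: overlay numbered boxes so the
# model can answer "click element 3" instead of emitting raw coordinates.
from PIL import Image, ImageDraw

def annotate_som(screenshot: Image.Image,
                 boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered red rectangle around each (x0, y0, x1, y1) box."""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")
    return marked
```

The model's chosen index is then mapped back to that element's center for the actual click.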
The implication is that bare model-vs-model browser benchmark comparisons are essentially meaningless without harness disclosure. A 35 percent WebArena score and a 60 percent WebArena score on the same underlying model can both be honest; they reflect different scaffolding investments. The honest pattern when citing a browser-agent number is "[model] in [harness] reaches X percent on [benchmark], according to [source]". Vague claims that hide the harness hide the most important variable.
What the community is converging on
Three patterns appear in published 2026 browser-agent claims. First, vendor consumer products (Anthropic, OpenAI, Google) report Visual WebArena and OSWorld as primary numbers because their products are screenshot-grounded computer-use agents. Second, open-source automation tools and frameworks (LangGraph + BrowserGym, others) report WebArena as primary because they are DOM-consuming. Third, research papers comparing scaffold designs typically report all three (WebArena, Visual WebArena, OSWorld) to give a complete picture.
The community has converged on the headline trio as the appropriate benchmark portfolio for browser-agent claims. Mind2Web is now historical; WebShop is embedded; new benchmarks (BrowserBench, WebShop-2, others) are emerging but have not yet displaced the headline three. Expect this configuration to remain stable through 2027; the more interesting evolution is in scaffold design rather than benchmark design.
Frequently asked questions
Q.01 Which browser-agent benchmark should I quote in 2026?
Pick by deployment shape: WebArena for DOM-consuming agents, Visual WebArena for screenshot-consuming agents, OSWorld for whole-desktop agents, and always name the harness alongside the score.
Q.02 What's the difference between WebArena and OSWorld?
WebArena is browser-only, built on self-hosted sites; OSWorld's 369 tasks treat the browser as one application among several, alongside office, file-management, code-editing, image-editing, and terminal work.
Q.03 Why is Visual WebArena harder than WebArena?
The tasks and success functions match, but the agent must ground actions in screenshots rather than the DOM; that harder modality accounts for the 10-15 point gap.
Q.04 Are Mind2Web and WebShop still relevant?
Mind2Web is now historical and WebShop is embedded rather than standalone; neither is the number to quote in 2026.
Q.05 What about commercial browser agents (Anthropic Computer Use, OpenAI Operator, Google Project Mariner)?
All three are screenshot-grounded, so their claims belong in the Visual WebArena and OSWorld framing, not the original WebArena leaderboard.
Q.06 How harness-sensitive are browser benchmarks?
Severely: the same model can swing 20-30 points on WebArena between a minimal harness and a strong scaffold, which is why honest claims name model, harness, and source together.
Sources
- [1] WebArena project site. webarena.dev. Accessed May 2026.
- [2] Visual WebArena. arXiv:2401.13649.
- [3] OSWorld project site. os-world.github.io.
- [4] BrowserGym library. github.com/ServiceNow/BrowserGym.
- [5] Mind2Web project. osu-nlp-group.github.io/Mind2Web.