WebArena: Real Browser-Agent Tasks on Self-Hosted Sites
The most reproducible browser-agent benchmark in 2026. Four self-hosted web applications, 812 tasks, deterministic success functions. The benchmark publishes its harness, and that openness is why third-party numbers can be trusted in a way most browser-agent claims cannot.
What WebArena measures
WebArena, introduced by Zhou et al. at CMU in July 2023, evaluates browser agents on 812 natural-language tasks across four web applications. The applications are dockerised: an e-commerce store, a Reddit-like forum, a GitLab instance, and a Wikipedia-style CMS. The agent reads HTML (or a screenshot in the Visual WebArena variant), selects actions (click, type, scroll, navigate), and must reach a goal state defined by the task. Success is determined by a programmatic check against the underlying database or page state, not by trajectory matching.
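The observe-act loop and state-based success check can be sketched in a few lines. This is a toy illustration, not the actual WebArena harness API: `ToyCartEnv`, `Action`, and `run_episode` are invented stand-ins for one dockerised site, a fixed action vocabulary, and the episode driver.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str          # "click" | "type" | "scroll" | "navigate"
    target: str = ""   # element id or URL
    text: str = ""     # payload for "type"

@dataclass
class ToyCartEnv:
    """Stand-in for one dockerised site: goal is 'cart contains sku-42'."""
    cart: set = field(default_factory=set)

    def observe(self):
        return f"<html><body>cart={sorted(self.cart)}</body></html>"

    def step(self, action):
        if action.kind == "click" and action.target.startswith("add:"):
            self.cart.add(action.target.split(":", 1)[1])
        return self.observe()

    def goal_reached(self):
        # Checks final state, never the trajectory that produced it.
        return "sku-42" in self.cart

def run_episode(agent, env, max_steps=30):
    """Agent reads the page, picks an action; episode ends on goal or timeout."""
    obs = env.observe()
    for _ in range(max_steps):
        obs = env.step(agent(obs))
        if env.goal_reached():
            return True
    return False

# A trivial scripted policy standing in for a model-driven agent.
scripted = lambda obs: Action("click", "add:sku-42")
print(run_episode(scripted, ToyCartEnv()))  # True
```

Note that `goal_reached` never inspects the action sequence, which is exactly the property that makes accidental successes (discussed below under scoring) possible by design.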
The four sites were chosen to span the categories of work people actually do online. The e-commerce site tests forms, cart manipulation, and account flows. The forum tests reading, search, and social actions. The GitLab instance tests structured CRUD and workflow management. The CMS tests reading and light editing. A model that does well across all four has demonstrated the kind of breadth that practical agent deployment requires. A model that does well only on the forum but fails on GitLab has revealed a real capability gap, not a benchmark artefact.
The benchmark's most important methodological choice is that the sites are real, full-stack web applications rather than simplified abstractions. The agent sees actual rendered HTML with realistic noise: navigation elements, ads, modals, broken layouts, and the long tail of structures that real software exhibits. This is what makes WebArena scores meaningful in a way that scores from benchmarks run over hand-cleaned page snapshots are not.
The four web applications
Each site is a real, well-known open-source web application configured for evaluation. Agents face the same kinds of pages that production browser agents encounter when deployed against real customer environments.
The relative difficulty profile is reasonably stable across model generations. CMS and forum sites are easier; GitLab is hardest because the action space is largest and the workflow semantics most specific. E-commerce sits in the middle: easy actions but long workflows. Researchers who break out per-site scores have a clearer picture than the headline overall number.
Scoring and success criteria
Each task has an explicit success function. For shopping tasks it checks the database state of the cart or order. For GitLab it checks the issue, MR, or label state. For CMS edits it checks page content. For forum actions it checks the vote, comment, or post state. Because the sites are local and resettable, these checks are deterministic and cheap to run at scale. There is no LLM judge in the success path.
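In spirit, the success functions look like the sketch below, with a plain dict standing in for each site's database. The function names and state schemas here are invented for illustration; the real checks query the actual Magento, GitLab, forum, and CMS backends.

```python
# Hypothetical success functions in the spirit of WebArena's checks:
# each inspects final site state, never the action trajectory.

def shopping_cart_contains(db, sku, qty=1):
    """Shopping task: the cart row for `sku` holds at least `qty` units."""
    return db.get("cart", {}).get(sku, 0) >= qty

def gitlab_issue_labeled(db, issue_id, label):
    """GitLab task: the issue carries the required label."""
    issue = db.get("issues", {}).get(issue_id, {})
    return label in issue.get("labels", [])

def forum_post_upvoted(db, post_id, user):
    """Forum task: `user` appears in the post's vote set."""
    return user in db.get("votes", {}).get(post_id, set())

# Deterministic: the same final state always scores the same.
state = {"cart": {"B09XYZ": 2}, "issues": {7: {"labels": ["bug"]}}}
assert shopping_cart_contains(state, "B09XYZ")
assert gitlab_issue_labeled(state, 7, "bug")
```

Because every check is a pure function of resettable local state, a full 812-task run can be scored without any model in the loop.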
The overall score reported on the leaderboard is the mean task success rate across all 812 tasks. Per-site means are also reported. Some papers report a weighted average that gives extra weight to longer tasks; we treat unweighted task-mean as the canonical comparison number because it matches the leaderboard convention.
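The leaderboard convention is easy to make concrete. The per-site task counts below are illustrative, not the real split of the 812 tasks; the point is that the canonical number is the unweighted mean over all tasks, which differs from the mean of per-site means when sites have different sizes.

```python
from statistics import mean

# One boolean per task, keyed by site (counts are illustrative).
results = {
    "shopping": [True, False, True, True],
    "forum":    [True, True, False],
    "gitlab":   [False, False, True],
    "cms":      [True, True],
}

per_site = {site: mean(r) for site, r in results.items()}

# Canonical leaderboard number: unweighted mean over ALL tasks.
all_tasks = [ok for r in results.values() for ok in r]
overall = mean(all_tasks)
print(round(overall, 3))  # 0.667  (8 of 12 tasks)
```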
One nuance: a task can succeed by accident. If the goal state is "cart contains item X" and the agent navigates to the wrong page, browses randomly, and happens to add X, the task is scored as success. This is intentional and matches the practical reality that users do not care how the agent reached the goal. It is also why intermediate-action accuracy is sometimes reported alongside task success, particularly in research papers comparing different harnesses.
SOTA progression 2023 to 2026
WebArena has been one of the steadiest agent benchmarks. Scores climbed from low teens at launch to mid-60s for the best frontier models with strong scaffolds by mid-2026. The progression is closer to a slope than the step-changes seen on saturated text benchmarks, which suggests there is real headroom left.
Visual WebArena and the multimodal extension
The same CMU group released Visual WebArena (VWA) in early 2024 with 910 additional tasks that require interpretation of screenshots rather than parsed HTML. VWA uses the same four sites and the same success-function approach, but the agent must read a rendered page image to ground its action. This matters because many real-world browser agents (Anthropic Computer Use, OpenAI Operator, Google Project Mariner) work primarily from screenshots, not the DOM.
VWA scores trail WebArena scores by 10 to 15 points across the frontier. Screenshot grounding remains the harder modality even for capable multimodal models. The gap has narrowed since VWA launched (initial frontier was around 30 percent VWA vs 40 percent WebArena; in 2026 it is around 50 percent VWA vs 60 percent WebArena) but has not closed. When choosing between the two for evaluation, the rule of thumb is: if your production agent consumes the DOM, quote WebArena; if it consumes screenshots, quote VWA.
Harness sensitivity and what it means for comparison
WebArena scores are the most harness-sensitive numbers in the agent-benchmark literature. The original CMU paper used a minimal ReAct-style harness with a fixed action vocabulary. Strong modern scaffolds add: planning, retrieval over the rendered DOM, hierarchical action selection, screenshot grounding, and self-correction loops. The same underlying model can score 25 percent in the original harness and 55 percent in a strong scaffold.
Three practical consequences. First, always quote the harness when quoting a WebArena number. Second, compare scaffold-to-scaffold, not raw model to raw model, when ranking. Third, be sceptical of numbers that exceed 60 percent without a clear harness disclosure: the most plausible explanation is best-of-n inference with a large n, which is not directly comparable to single-shot agentic deployment.
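The best-of-n caveat is easy to quantify. Under the standard independence assumption, if one rollout succeeds with probability p, at least one of n rollouts succeeds with probability 1 - (1 - p)^n, so modest single-shot rates inflate quickly:

```python
# Why best-of-n is not comparable to single-shot deployment:
# P(at least one of n independent rollouts succeeds) = 1 - (1 - p)**n.

def best_of_n_rate(p, n):
    return 1 - (1 - p) ** n

p = 0.30  # a plausible single-shot task success rate
print(round(best_of_n_rate(p, 1), 3))  # 0.3
print(round(best_of_n_rate(p, 8), 3))  # 0.942
```

A scaffold quietly sampling eight rollouts per task can therefore report a headline number near the top of the leaderboard from a model whose single-shot rate is unremarkable, which is why harness disclosure matters.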
The wider issue is documented in our editorial on what benchmarks miss. Harness sensitivity is a known limitation of the entire agentic-benchmark family. WebArena is unusually transparent about it because the harness is open-source; you can rerun the exact configuration anyone else used.
Where WebArena is brittle
Three brittleness vectors are worth knowing. First, the GitLab site contains tasks that depend on user permission states. A small fraction of tasks have ambiguous success criteria when the agent is logged in as the wrong account; this is a known issue documented in the project repo. Second, the forum site has a few tasks where multiple distinct trajectories reach the same goal state, which makes per-action accuracy metrics misleading even though task-success is well-defined. Third, the e-commerce site has been the most subject to prompt-engineering optimisation; published agent prompts in the 2025 literature are increasingly tuned for this specific Magento configuration, which is an honest form of overfitting.
None of these issues undermines WebArena as a research artefact. They do mean that small score differences (1 to 2 points) between two scaffolds are within the noise floor. Differences of 5 points or more reliably indicate a real capability gap.
Sources
- [1] Zhou, S. et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854.
- [2] Koh, J. Y. et al. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649.
- [3] WebArena project site and leaderboard, accessed May 2026. webarena.dev.
- [4] WebArena reproducibility repository. github.com/web-arena-x/webarena.