WebArena: Real Browser-Agent Tasks on Self-Hosted Sites
The most reproducible browser-agent benchmark in 2026. Four self-hosted web applications, 812 tasks, deterministic success functions. The benchmark publishes its harness, and that openness is why third-party numbers can be trusted in a way most browser-agent claims cannot.
What WebArena measures
WebArena, introduced by Zhou et al. at CMU in July 2023 (arXiv:2307.13854), evaluates browser agents on 812 natural-language tasks across four web applications. The applications are dockerised: an e-commerce store, a Reddit-like forum, a GitLab instance, and a content management system (CMS) for store administration. The agent reads HTML (or a screenshot in the Visual WebArena variant), selects actions (click, type, scroll, navigate), and must reach a goal state defined by the task. Success is determined by a programmatic check against the underlying database or page state, not by trajectory matching.
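The observe-act loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the WebArena harness: `ToySite`, `run_episode`, `scripted_policy`, and the action strings are all hypothetical names invented for this example, and a real agent would receive rendered HTML or a screenshot instead of a short string.

```python
from dataclasses import dataclass, field

Action = tuple[str, str]  # (kind, argument), e.g. ("navigate", "product/42")


@dataclass
class ToySite:
    """Hypothetical in-memory stand-in for one site; not the WebArena API."""
    page: str = "home"
    cart: list[str] = field(default_factory=list)

    def observe(self) -> str:
        # A real harness would return rendered HTML or a screenshot here.
        return f"page={self.page} cart={self.cart}"

    def step(self, action: Action) -> None:
        kind, arg = action
        if kind == "navigate":
            self.page = arg
        elif kind == "click" and arg.startswith("add:"):
            self.cart.append(arg[len("add:"):])
        # Any other action (e.g. scroll) leaves this toy state unchanged.


def run_episode(site, policy, goal_reached, max_steps=10) -> bool:
    """Observe, act, repeat; success depends only on the resulting state."""
    for _ in range(max_steps):
        if goal_reached(site):
            return True
        site.step(policy(site.observe()))
    return goal_reached(site)


def scripted_policy(observation: str) -> Action:
    # Stand-in for a model call: go to the product page, then add to cart.
    if "page=home" in observation:
        return ("navigate", "product/42")
    return ("click", "add:sku-42")


success = run_episode(ToySite(), scripted_policy, lambda s: "sku-42" in s.cart)
```

The goal check inspects only the site state, never the list of actions taken, which mirrors the state-based rather than trajectory-based scoring described above.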
The four sites were chosen to span the categories of work people actually do online. The e-commerce site tests forms, cart manipulation, and account flows. The forum tests reading, search, and social actions. The GitLab instance tests structured CRUD and workflow management. The CMS tests reading and light editing. A model that does well across all four has demonstrated the kind of breadth that practical agent deployment requires. A model that does well only on the forum but fails on GitLab has revealed a real capability gap, not a benchmark artefact.
The benchmark's most important methodological choice is that the sites are real, full-stack web applications rather than simplified abstractions. The agent sees actual rendered HTML with realistic noise: navigation elements, ads, modals, broken layouts, and the long tail of structures that real software exhibits. This is what makes WebArena scores meaningful in a way that scores from benchmarks running over hand-cleaned page snapshots are not.
The four web applications
Each site is a real, well-known open-source web application configured for evaluation. Agents face the same kinds of pages that production browser agents encounter when deployed against real customer environments.
The relative difficulty profile is reasonably stable across model generations. CMS and forum sites are easier; GitLab is hardest because the action space is largest and the workflow semantics most specific. E-commerce sits in the middle: easy actions but long workflows. Breaking out per-site scores gives a clearer picture than the headline overall number does.
Scoring and success criteria
Each task has an explicit success function. For shopping tasks it checks the database state of the cart or order. For GitLab it checks the issue, MR, or label state. For CMS edits it checks page content. For forum actions it checks the vote, comment, or post state. Because the sites are local and resettable, these checks are deterministic and cheap to run at scale. The state checks involve no LLM judge; the exception is the subset of information-seeking tasks whose free-text answers are compared against a reference with a model-based fuzzy match.
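A state-based success function of the kind described above might look like the following sketch. The `cart_items` table, its columns, and the `cart_contains` helper are assumptions made up for illustration; the actual evaluators are defined in the WebArena repository and are not reproduced here. An in-memory SQLite database stands in for the site's backing store.

```python
import sqlite3


def cart_contains(conn: sqlite3.Connection, customer_id: int, sku: str) -> bool:
    """Hypothetical success check: is the SKU in this customer's cart?"""
    row = conn.execute(
        "SELECT 1 FROM cart_items WHERE customer_id = ? AND sku = ? LIMIT 1",
        (customer_id, sku),
    ).fetchone()
    return row is not None


# Toy schema standing in for the site's database (assumed, not Magento's).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart_items (customer_id INTEGER, sku TEXT)")

before = cart_contains(conn, 7, "sku-42")  # state after an environment reset
conn.execute("INSERT INTO cart_items VALUES (7, 'sku-42')")  # the agent's effect
after = cart_contains(conn, 7, "sku-42")
```

Because the check is a plain query against resettable state, rerunning it on the same final state always gives the same verdict.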
The overall score reported on the leaderboard is the mean task success rate across all 812 tasks. Per-site means are also reported. Some papers report a weighted average that gives extra weight to longer tasks; we treat unweighted task-mean as the canonical comparison number because it matches the leaderboard convention.
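The difference between the unweighted task mean and a per-site breakdown is easy to see on a toy result set. The records below are invented for illustration and are not WebArena results; `task_mean` and `per_site_means` are hypothetical helper names.

```python
from collections import defaultdict

# (site, task succeeded) pairs; fabricated toy data, not benchmark results.
results = [
    ("shopping", True), ("shopping", False),
    ("gitlab", False),
    ("forum", True), ("forum", True),
    ("cms", True),
]


def task_mean(records) -> float:
    """Unweighted mean over tasks: every task counts once."""
    return sum(ok for _, ok in records) / len(records)


def per_site_means(records) -> dict[str, float]:
    """Mean success rate within each site."""
    buckets = defaultdict(list)
    for site, ok in records:
        buckets[site].append(ok)
    return {site: sum(oks) / len(oks) for site, oks in buckets.items()}


overall = task_mean(results)
by_site = per_site_means(results)
# Averaging the site means weights each site equally, so it can differ
# from the task mean when sites have different numbers of tasks.
mean_of_site_means = sum(by_site.values()) / len(by_site)
```

Here the task mean is 4/6 while the mean of site means is 0.625, which is why it matters to state which aggregate a quoted number uses.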
One nuance: a task can succeed by accident. If the goal state is "cart contains item X" and the agent navigates to the wrong page, browses randomly, and happens to add X, the task is scored as success. This is intentional and matches the practical reality that users do not care how the agent reached the goal. It is also why intermediate-action accuracy is sometimes reported alongside task success, particularly in research papers comparing different harnesses.
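The gap between outcome scoring and intermediate-action accuracy can be shown with a toy trajectory. The position-wise match used here is only one simple, assumed definition of action accuracy; papers define it in different ways, and the action strings are hypothetical.

```python
def task_success(final_cart: list[str], goal_item: str) -> bool:
    """Outcome-only scoring: did the goal state hold at the end?"""
    return goal_item in final_cart


def action_accuracy(trajectory: list[str], reference: list[str]) -> float:
    """Assumed metric: share of positions where the action matches a reference."""
    matches = sum(a == b for a, b in zip(trajectory, reference))
    return matches / max(len(trajectory), len(reference))


reference = ["search:X", "open:X", "add:X"]
# The agent wanders through two unrelated pages before stumbling on X.
wandering = ["open:home", "open:deals", "open:X", "add:X"]

succeeded = task_success(final_cart=["X"], goal_item="X")
accuracy = action_accuracy(wandering, reference)
```

The wandering run counts as a success on the task metric while matching the reference at no position, so the two metrics answer different questions.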
Reading the scores
The original WebArena paper reported a GPT-4 agent in a ReAct harness at 14.4% overall against a human baseline of roughly 78%, which is the one cleanly sourced anchor for the benchmark. Scores have climbed since, but we do not reprint a per-model progression table, for three reasons: the webarena.dev leaderboard has not been maintained continuously, current numbers come from scattered papers and vendor model cards under different harnesses, and a WebArena score is only meaningful with its harness attached. Numbers from different sources are not directly comparable.
For current results, read the official WebArena project site and the harness disclosure in each cited paper together. To choose the right benchmark for your use case rather than chase a single figure, start from the homepage task picker. Quote the scaffold alongside the model whenever you cite a WebArena number.
Visual WebArena and the multimodal extension
The same CMU group released Visual WebArena (VWA) in early 2024 with 910 additional tasks that require interpretation of screenshots rather than parsed HTML. VWA reuses the shopping and forum sites, adds a classifieds site, and keeps the same success-function approach, but the agent must read a rendered page image to ground its action. This matters because many real-world browser agents (Anthropic Computer Use, OpenAI Operator, Google Project Mariner) work primarily from screenshots, not the DOM.
VWA scores trail WebArena scores across the field, and the gap has not closed: screenshot grounding remains the harder modality even for capable multimodal models. When choosing between the two for evaluation, the rule of thumb is: if your production agent consumes the DOM, quote WebArena; if it consumes screenshots, quote VWA. For current numbers on either, read the project sites rather than a copied table.
Harness sensitivity and what it means for comparison
WebArena scores are among the most harness-sensitive numbers in the agent-benchmark literature. The original CMU paper used a minimal ReAct-style harness with a fixed action vocabulary. Strong modern scaffolds add planning, retrieval over the rendered DOM, hierarchical action selection, screenshot grounding, and self-correction loops. The same underlying model can score 25 percent in the original harness and 55 percent in a strong scaffold.
Three practical consequences. First, always quote the harness when quoting a WebArena number. Second, compare scaffold-to-scaffold, not raw model to raw model, when ranking. Third, be sceptical of numbers that exceed 60 percent without a clear harness disclosure: the most plausible explanation is best-of-n inference with a large n, which is not directly comparable to single-shot agentic deployment.
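A short calculation shows why best-of-n inflates a score. It assumes each attempt succeeds independently with the same probability and that the best attempt can be picked using the success check itself; both are simplifications, and `best_of_n_success` is a name chosen for this sketch.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** n


single_shot = best_of_n_success(0.25, 1)
best_of_four = best_of_n_success(0.25, 4)
best_of_eight = best_of_n_success(0.25, 8)
```

Under these assumptions a 25 percent single-shot agent already clears 68 percent with four attempts per task, without any change in the underlying capability.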
The wider issue is documented in our "what benchmarks miss" editorial. Harness sensitivity is a known limitation of the entire agentic-benchmark family. WebArena is unusually transparent about it because the harness is open-source; you can rerun the exact configuration anyone else used.
Where WebArena is brittle
Three brittleness vectors are worth knowing. First, the GitLab site contains tasks that depend on user permission states. A small fraction of tasks have ambiguous success criteria when the agent is logged in as the wrong account; this is a known issue documented in the project repo. Second, the forum site has a few tasks where multiple distinct trajectories reach the same goal state, which makes per-action accuracy metrics misleading even though task-success is well-defined. Third, the e-commerce site has been the most subject to prompt-engineering optimisation; published agent prompts in the 2025 literature are increasingly tuned for this specific Magento configuration, which is an honest form of overfitting.
None of these issues undermine WebArena as a research artefact. They do mean that small score differences (1 to 2 points) between two scaffolds are within the noise floor. Differences of 5 points or more reliably indicate a real capability gap.
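A rough sampling-error estimate supports these thresholds. The sketch below treats each of the 812 tasks as an independent pass/fail trial and the two scaffolds as independent samples; this is an assumption that ignores the pairing of tasks and any run-to-run environment variance, so read it as an order-of-magnitude check, not a formal test. The function names are hypothetical.

```python
import math

N_TASKS = 812  # size of the WebArena task set


def diff_standard_error(p_a: float, p_b: float, n: int = N_TASKS) -> float:
    """Standard error of the difference between two success rates."""
    return math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)


def is_distinguishable(p_a: float, p_b: float, n: int = N_TASKS, z: float = 1.96) -> bool:
    """True if the gap exceeds roughly a 95 percent sampling-noise band."""
    return abs(p_a - p_b) > z * diff_standard_error(p_a, p_b, n)


two_point_gap = is_distinguishable(0.50, 0.52)
six_point_gap = is_distinguishable(0.50, 0.56)
noise_band = 1.96 * diff_standard_error(0.50, 0.50)
```

Near a 50 percent success rate the band comes out at just under 5 points, so a 2-point gap falls inside it and a 6-point gap falls outside.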
Q.01 What is WebArena and how is it different from web shopping benchmarks?
Q.02 What does WebArena measure that other agent benchmarks miss?
Q.03 Is the WebArena environment public or private?
Q.04 What is Visual WebArena?
Q.05 How does harness choice affect WebArena scores?
Q.06 Is WebArena gameable?
Q.07 Where can I see current WebArena scores?
Sources
- [1] Zhou, S. et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854.
- [2] Koh, J. Y. et al. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649.
- [3] WebArena project site and leaderboard. webarena.dev.
- [4] WebArena reproducibility repository. github.com/web-arena-x/webarena.