Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What · 812-task web-agent benchmark across 4 self-hosted apps: shop, forum, GitLab, CMS
Who · Zhou et al., Carnegie Mellon, 2023 (arXiv:2307.13854)
2026 Tier · Frontier with strong scaffold: 55 to 65 percent; humans at 78 percent.
Leaderboard · webarena.dev
Section II.ii · Agent Benchmarks · Last verified April 2026

WebArena: Real Browser-Agent Tasks on Self-Hosted Sites

The most reproducible browser-agent benchmark in 2026. Four self-hosted web applications, 812 tasks, deterministic success functions. The benchmark publishes its harness, and that openness is why third-party numbers can be trusted in a way most browser-agent claims cannot.

01

What WebArena measures

WebArena, introduced by Zhou et al. at CMU in July 2023, evaluates browser agents on 812 natural-language tasks across four web applications. The applications are dockerised: an e-commerce store, a Reddit-like forum, a GitLab instance, and a Wikipedia-style CMS. The agent reads HTML (or a screenshot in the Visual WebArena variant), selects actions (click, type, scroll, navigate), and must reach a goal state defined by the task. Success is determined by a programmatic check against the underlying database or page state, not by trajectory matching.
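
In outline, an evaluation episode looks something like the following minimal Python sketch; run_episode, env, and agent are illustrative names, not the actual harness API.

# A minimal sketch of the WebArena episode loop described above.
# Names are illustrative; the real harness API differs.
def run_episode(env, agent, max_steps: int = 40) -> bool:
    obs = env.reset()                    # rendered HTML, or a screenshot in VWA
    for _ in range(max_steps):
        action = agent.act(obs)          # click, type, scroll, or navigate
        obs, done = env.step(action)
        if done:                         # agent signals it has finished
            break
    return env.check_success()           # programmatic check of final state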

The four sites were chosen to span the categories of work people actually do online. The e-commerce site tests forms, cart manipulation, and account flows. The forum tests reading, search, and social actions. The GitLab instance tests structured CRUD and workflow management. The CMS tests reading and light editing. A model that does well across all four has demonstrated the kind of breadth that practical agent deployment requires. A model that does well only on the forum but fails on GitLab has revealed a real capability gap, not a benchmark artefact.

The benchmark's most important methodological choice is that the sites are real, full-stack web applications rather than simplified abstractions. The agent sees actual rendered HTML with realistic noise: navigation elements, ads, modals, broken layouts, and the long tail of structures that real software exhibits. This is what makes WebArena scores meaningful in a way that scores from benchmarks run over hand-cleaned page snapshots are not.

02

The four web applications

Each site is a real, well-known open-source web application configured for evaluation. Agents face the same kinds of pages that production browser agents encounter when deployed against real customer environments.

Site · Task domain
OneStopShop (e-commerce) · A Magento-style store with products, categories, account flows, cart, checkout, order history. Tasks include 'buy three items in category X under $50' and 'cancel my most recent order and reorder with size large'.
Reddit (Postmill forum) · A self-hosted forum app. Tasks include find-and-upvote, cross-post, comment-thread navigation, search by user. Tests reading-comprehension-plus-action more than other sites.
GitLab (software dev) · Self-hosted GitLab instance. Tasks include create-issue, label, assign, merge-request review, milestone management. The hardest site for most agents because the action space is large.
Wikipedia-like CMS (MediaWiki) · MediaWiki-style CMS. Tasks include find-article, edit, citation handling, category navigation. Closer to knowledge retrieval than action; benchmarks reading-and-light-editing.

The relative difficulty profile is reasonably stable across model generations. CMS and forum sites are easier; GitLab is hardest because the action space is largest and the workflow semantics most specific. E-commerce sits in the middle: easy actions but long workflows. Researchers who break out per-site scores get a clearer picture than the headline number alone provides.

03

Scoring and success criteria

Each task has an explicit success function. For shopping tasks it checks the database state of the cart or order. For GitLab it checks the issue, MR, or label state. For CMS edits it checks page content. For forum actions it checks the vote, comment, or post state. Because the sites are local and resettable, these checks are deterministic and cheap to run at scale. There is no LLM judge in the success path.
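
As a concrete illustration, a shopping-task checker can query the store database directly. This is a hedged sketch: the cart_items table and its columns are invented for the example, not the actual Magento schema.

import sqlite3

# Illustrative success function for a task like "cart contains item X".
# The schema is invented for this sketch; WebArena's shop runs a full
# Magento database, not this toy layout.
def cart_contains(db_path: str, sku: str, min_qty: int = 1) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        (qty,) = conn.execute(
            "SELECT COALESCE(SUM(qty), 0) FROM cart_items WHERE sku = ?",
            (sku,),
        ).fetchone()
    finally:
        conn.close()
    return qty >= min_qty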

The overall score reported on the leaderboard is the mean task success rate across all 812 tasks. Per-site means are also reported. Some papers report a weighted average that gives extra weight to longer tasks; we treat unweighted task-mean as the canonical comparison number because it matches the leaderboard convention.
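
The arithmetic is simple enough to show directly. This sketch computes the unweighted overall mean alongside per-site means from (site, success) pairs:

from collections import defaultdict

# Unweighted task-mean scoring, matching the leaderboard convention:
# every task counts equally, with per-site means reported alongside.
def score(results: list[tuple[str, bool]]) -> tuple[float, dict[str, float]]:
    by_site: defaultdict[str, list[bool]] = defaultdict(list)
    for site, success in results:
        by_site[site].append(success)
    overall = sum(s for _, s in results) / len(results)
    per_site = {site: sum(v) / len(v) for site, v in by_site.items()}
    return overall, per_site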

One nuance: a task can succeed by accident. If the goal state is "cart contains item X" and the agent navigates to the wrong page, browses randomly, and happens to add X, the task is scored as success. This is intentional and matches the practical reality that users do not care how the agent reached the goal. It is also why intermediate-action accuracy is sometimes reported alongside task success, particularly in research papers comparing different harnesses.

04

SOTA progression 2023 to 2026

WebArena has been one of the steadiest agent benchmarks. Scores climbed from low teens at launch to mid-60s for the best frontier models with strong scaffolds by mid-2026. The progression is closer to a slope than the step-changes seen on saturated text benchmarks, which suggests there is real headroom left.

Date · Tier · Note
Oct 2023 · GPT-4 baseline at 14.4% overall · Original WebArena paper, ReAct harness, rendered HTML.
Feb 2024 · Frontier models climb to ~25% · Agentic scaffolds with planning and retrieval start to land.
Jul 2024 · Strong agentic frontier at ~40% · Multi-turn refinement and DOM retrieval close half the human gap.
Jan 2025 · Frontier at ~50% · Vision-grounded scaffolds available; agentic GPT-class around mid-50s.
Apr 2026 · Frontier 55 to 65% range · Closed-source with proprietary scaffolds at the top; open-weight catches up.
Human baseline · ~78% · Reported in the original paper; humans also fail tasks, particularly on GitLab.

05

Visual WebArena and the multimodal extension

The same CMU group released Visual WebArena (VWA) in early 2024 with 910 additional tasks that require interpretation of screenshots rather than parsed HTML. VWA reuses the shopping and forum environments, adds a new classifieds site, and keeps the same success-function approach, but the agent must read a rendered page image to ground its action. This matters because many real-world browser agents (Anthropic Computer Use, OpenAI Operator, Google Project Mariner) work primarily from screenshots, not the DOM.

VWA scores trail WebArena scores by 10 to 15 points across the frontier. Screenshot grounding remains the harder modality even for capable multimodal models. The gap has narrowed since VWA launched (initial frontier was around 30 percent VWA vs 40 percent WebArena; in 2026 it is around 50 percent VWA vs 60 percent WebArena) but has not closed. When choosing between the two for evaluation, the rule of thumb is: if your production agent consumes the DOM, quote WebArena; if it consumes screenshots, quote VWA.

06

Harness sensitivity and what it means for comparison

WebArena scores are the most harness-sensitive numbers in the agent-benchmark literature. The original CMU paper used a minimal ReAct-style harness with a fixed action vocabulary. Strong modern scaffolds add: planning, retrieval over the rendered DOM, hierarchical action selection, screenshot grounding, and self-correction loops. The same underlying model can score 25 percent in the original harness and 55 percent in a strong scaffold.
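
A useful mental model is a disclosure record that pins down every scaffold choice before two numbers are compared. The field names below are ours, not an official schema:

from dataclasses import dataclass

# The minimum a WebArena score report should pin down before two
# numbers can be compared. Field names are ours, not a standard.
@dataclass
class HarnessDisclosure:
    model: str              # underlying model checkpoint
    observation: str        # "html", "accessibility_tree", or "screenshot"
    planning: bool          # explicit planning step in the scaffold
    dom_retrieval: bool     # retrieval over the rendered DOM
    self_correction: bool   # retry / self-correction loops
    best_of_n: int = 1      # 1 = single-shot agentic deployment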

Three practical consequences. First, always quote the harness when quoting a WebArena number. Second, compare scaffold-to-scaffold, not raw model to raw model, when ranking. Third, be sceptical of numbers that exceed 60 percent without a clear harness disclosure: the most plausible explanation is best-of-n inference with a large n, which is not directly comparable to single-shot agentic deployment.
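
The arithmetic behind that scepticism: because the success checker is deterministic, it can serve as an oracle for resampling, so a modest single-shot agent posts frontier-looking numbers at moderate n. The figures below are illustrative, not measured.

# Best-of-n inflation with the checker as oracle. p_single is an
# assumed single-shot success rate, not a measured one.
p_single = 0.30
for n in (1, 5, 10):
    p_best = 1 - (1 - p_single) ** n
    print(f"best-of-{n}: {p_best:.0%}")   # 30%, 83%, 97%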

The wider issue is documented in our 'What Benchmarks Miss' editorial. Harness sensitivity is a known limitation of the entire agentic-benchmark family. WebArena is unusually transparent about it because the harness is open-source; you can rerun the exact configuration anyone else used.

07

Where WebArena is brittle

Three brittleness vectors are worth knowing. First, the GitLab site contains tasks that depend on user permission states. A small fraction of tasks have ambiguous success criteria when the agent is logged in as the wrong account; this is a known issue documented in the project repo. Second, the forum site has a few tasks where multiple distinct trajectories reach the same goal state, which makes per-action accuracy metrics misleading even though task-success is well-defined. Third, the e-commerce site has been the most subject to prompt-engineering optimisation; published agent prompts in the 2025 literature are increasingly tuned for this specific Magento configuration, which is an honest form of overfitting.

None of these issues undermine WebArena as a research artefact. They do mean that small score differences (1 to 2 points) between two scaffolds are within the noise floor. Differences of 5 points or more reliably indicate a real capability gap.
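
That noise floor follows from binomial sampling error over 812 tasks: near a 50 percent success rate the standard error of the estimate is roughly 1.8 points, so a 5-point gap is about three standard errors.

# Standard error of a success-rate estimate over 812 independent tasks.
n_tasks, p = 812, 0.50
se = (p * (1 - p) / n_tasks) ** 0.5
print(f"standard error: {se:.1%}")   # ~1.8 points; 5+ points is ~3 sigma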

Editor's verdict · WebArena is the most reproducible browser-agent benchmark and the right headline number for DOM-consuming agents in 2026. For screenshot-consuming agents prefer Visual WebArena. Always disclose the harness; treat numbers above 60 percent without scaffolding detail as unverified.
Reader Questions
Q.01 · What is WebArena and how is it different from web shopping benchmarks?
WebArena is a benchmark for browser agents created at Carnegie Mellon, hosted at webarena.dev. It evaluates agents on 812 tasks across four self-hosted web applications: an e-commerce store, a Reddit-style forum, GitLab, and a CMS. Unlike WebShop, WebArena's websites are full applications with realistic site structure, account systems, and CRUD workflows. Unlike Mind2Web, tasks are executed against a live, replayable environment rather than pre-recorded page sequences. Each task has a deterministic success function that checks final state, not the trajectory.
Q.02 · What does WebArena measure that other agent benchmarks miss?
WebArena tests grounded action on real web applications across long horizons. A typical task takes 15 to 40 actions and involves form filling, navigation, search, account state, and CRUD operations. The benchmark exposes weaknesses in three places that simpler browser benchmarks miss: maintaining state across multi-page workflows, recovering from incorrect actions, and parsing rendered HTML rather than a clean DOM API. Frontier models still struggle here in 2026; agentic success rate is around 50 to 60 percent overall, not the 90 percent territory of saturated text benchmarks.
Q.03 · Is the WebArena environment public or private?
WebArena is fully open-source. The four web applications are dockerised and run locally during evaluation. The task set, evaluation harness, and reproducibility scripts are all on the project GitHub. This is why WebArena has higher trust than vendor-published browser-agent numbers: anyone can rerun the exact same configuration. The Visual WebArena follow-up adds screenshot-grounded multimodal tasks but uses the same backend.
Q.04 · What is Visual WebArena?
Visual WebArena, released by the same CMU group in early 2024, extends WebArena with 910 multimodal tasks that require interpretation of screenshots. It reuses the shopping and forum sites and adds a classifieds site, with tasks that involve visual elements like product images, charts, or layout-based instructions. Frontier multimodal models scored 30 to 40 percent on VWA when it launched; that has improved to around 50 percent in 2026 for the best vision-capable agents. VWA is the more relevant variant for agents that consume screenshots rather than parsed HTML.
Q.05 · How does harness choice affect WebArena scores?
Significantly. The original WebArena paper used a relatively simple ReAct-style prompt over rendered HTML. Modern agentic scaffolding adds planning, retrieval over the page DOM, and screenshot-grounded vision. The same underlying model can score 25 percent in the original harness and 55 percent in a strong agentic scaffold. When reading a WebArena score, the harness disclosure matters as much as the model name. Anthropic and OpenAI publish WebArena numbers using proprietary scaffolds; community submissions typically use BrowserGym or the original CMU harness.
Q.06 · Is WebArena gameable?
Less than most agent benchmarks, but not immune. The deterministic success function checks final database state on the four self-hosted sites, which makes brute-force trajectory generation expensive but possible. The forum site has tasks like 'find the post about X and upvote it' where the agent can in principle traverse the entire site once and find the target. This brute-force strategy is impractical inside a turn budget, so most evaluation runs are honest. The risk is higher when teams publish best-of-n scores without disclosing n.
Q.07 · Which models lead WebArena in 2026?
The current frontier on WebArena sits in the 55 to 65 percent range for closed-source models with strong agentic scaffolds. Open-weight models with the same harness score 35 to 45 percent. The original WebArena leaderboard at webarena.dev has not been updated as aggressively as the SWE-bench leaderboard, so the most current numbers come from individual research papers and model cards rather than a central leaderboard. We treat self-reported numbers without a public reproduction recipe as unverified.
Agent Benchmarks Overview · AgentBench · OSWorld Benchmark · Best Browser-Agent Benchmarks · Tau-Bench · SWE-bench Verified · What Benchmarks Miss

Sources

[1] Zhou, S. et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854.
[2] Koh, J. Y. et al. (2024). VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649.
[3] WebArena project site and leaderboard, accessed May 2026. webarena.dev.
[4] WebArena reproducibility repository. github.com/web-arena-x/webarena.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.

Book a 30-min scoping call · Digital Signet →

30 minutes, free, independent. · 1-page action plan within 48h. · Honest if not the right fit.