Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What: Unified Gym-style API for browser-agent benchmarks; wraps WebArena, VisualWebArena, MiniWoB++, WorkArena, AssistantBench Live.
Who: ServiceNow Research (Drouin, Gasse, Lacoste et al.), 2024.
2026 Use: Cross-benchmark browser-agent evaluation; reduces per-benchmark adapter noise.
Repository: github.com/ServiceNow/BrowserGym
Section V.vii Tools | Last verified April 2026

BrowserGym: One Gym Interface for WebArena, VisualWebArena, MiniWoB++

The standardisation move that made browser-agent results comparable across benchmarks.

I

The API

BrowserGym mirrors the OpenAI Gym API that became standard in reinforcement learning. An environment exposes reset() to start a new task, step(action) to perform an action and receive the next observation, and a small set of higher-level helpers. Observations include the accessibility tree (AXTree), an optional rendered screenshot, the current URL, and the task instruction. Actions include click, fill, hover, scroll, navigate, select_option, and press, plus a few higher-level macros.
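The reset/step loop described above can be sketched without installing anything. The block below is a minimal illustration, assuming the five-value step return used by Gymnasium-style environments; StubBrowserEnv, its observation keys, and scripted_policy are hypothetical stand-ins written for this example, not BrowserGym classes or its actual observation schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubBrowserEnv:
    """Hypothetical stand-in for a Gym-style browser environment (not BrowserGym itself)."""

    goal: str = "Click the Submit button"
    target_action: str = 'click("submit")'
    max_steps: int = 5
    _steps: int = field(default=0, init=False)

    def _observation(self) -> dict[str, Any]:
        # Assumed observation keys, loosely following the fields named in the text.
        return {
            "goal": self.goal,
            "url": "https://example.test/form",
            "axtree": '[submit] button "Submit"',
            "screenshot": None,  # optional in this sketch
        }

    def reset(self):
        """Start a new task and return (observation, info)."""
        self._steps = 0
        return self._observation(), {}

    def step(self, action: str):
        """Apply one action; return (observation, reward, terminated, truncated, info)."""
        self._steps += 1
        terminated = action == self.target_action
        truncated = not terminated and self._steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self._observation(), reward, terminated, truncated, {}


def scripted_policy(obs: dict[str, Any]) -> str:
    """Toy policy: click the submit button if the accessibility tree shows one."""
    if "Submit" in obs["axtree"]:
        return 'click("submit")'
    return "scroll(0, 200)"


def run_episode(env, policy) -> float:
    """Run one episode and return the accumulated reward."""
    obs, _info = env.reset()
    total = 0.0
    done = False
    while not done:
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


episode_reward = run_episode(StubBrowserEnv(), scripted_policy)
```

The point of the sketch is that run_episode never inspects which benchmark produced the environment; it only relies on reset() and step(action).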

II

The Wrapped Benchmarks

| Benchmark | What it adds |
| --- | --- |
| MiniWoB++ | Classic short-horizon tasks; useful as a regression smoke test. |
| WebArena | Self-hosted multi-app environment; end-to-end task completion. |
| VisualWebArena | Image-aware extension; multimodal grounding required. |
| WorkArena | ServiceNow's own enterprise-app benchmark. |
| AssistantBench Live | Live public-web research tasks. |

III

When To Use BrowserGym

If you are publishing or comparing browser-agent results, use BrowserGym to remove the per-benchmark adapter as a confound. If your agent does something exotic that the BrowserGym action space cannot represent, extend the action space (the upstream maintainers accept PRs) rather than fall back to custom adapters that re-introduce the noise BrowserGym exists to eliminate.

Reader Questions
Q.01 What is BrowserGym?
BrowserGym is a unified Gym-style API for browser-agent benchmarks, released by ServiceNow Research in 2024. The framework wraps several otherwise-incompatible benchmarks (WebArena, VisualWebArena, MiniWoB++, WorkArena, AssistantBench Live) behind a single observation and action interface. An agent written once runs against all of them with no per-benchmark adapter code.
Q.02 Why does that matter?
Before BrowserGym, comparing a browser agent across WebArena and MiniWoB++ required writing two adapters, one per benchmark, with subtly different observation formats and action vocabularies. Comparisons were error-prone and depended on the adapter author. BrowserGym standardises the API, so two agents evaluated on the same wrapped benchmark see the same observations and action vocabulary, and one agent can be run on every wrapped benchmark without interface changes. That removes a class of methodology noise, although scores on different benchmarks still measure different tasks.
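A rough sketch of what "one agent, several benchmarks, no adapters" looks like in an evaluation harness follows. StubTask, the benchmark labels, the goal format, and click_only_agent are all hypothetical and exist only for this illustration; each task is reduced to a single step to keep the example short.

```python
from collections import defaultdict


class StubTask:
    """Hypothetical one-step task; every benchmark shares the same reset/step shape."""

    def __init__(self, benchmark: str, solution: str):
        self.benchmark = benchmark
        self.solution = solution

    def reset(self):
        return {"goal": f"perform {self.solution}"}, {}

    def step(self, action: str):
        reward = 1.0 if action == self.solution else 0.0
        # (observation, reward, terminated, truncated, info)
        return {"goal": ""}, reward, True, False, {}


def click_only_agent(obs: dict) -> str:
    """Toy agent that can only carry out click actions."""
    requested = obs["goal"][len("perform "):]
    return requested if requested.startswith("click(") else "noop()"


def evaluate(agent, tasks) -> dict:
    """Return the success rate per benchmark using one shared agent callable."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for task in tasks:
        obs, _info = task.reset()
        _obs, reward, *_rest = task.step(agent(obs))
        totals[task.benchmark] += 1
        wins[task.benchmark] += int(reward > 0)
    return {name: wins[name] / totals[name] for name in totals}


TASKS = [
    StubTask("miniwob", 'click("a")'),
    StubTask("miniwob", 'click("b")'),
    StubTask("webarena", 'click("c")'),
    StubTask("webarena", 'fill("d", "x")'),
]

scores = evaluate(click_only_agent, TASKS)
```

Because the harness contains no per-benchmark branches, any difference between the two success rates comes from the tasks and the agent rather than from adapter code.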
Q.03 Which benchmarks does BrowserGym wrap?
As of April 2026: WebArena (self-hosted shopping site, shopping admin panel, Reddit-style forum, GitLab, Wikipedia, map), VisualWebArena (image-aware WebArena), MiniWoB++ (the classic short-horizon web tasks), WorkArena (ServiceNow's own enterprise-app benchmark), AssistantBench Live (live web research), and several smaller research benchmarks. ServiceNow accepts upstream contributions for new wrappers, and the list grows roughly quarterly.
Q.04 Does using BrowserGym constrain my agent?
Slightly. The agent must read observations in the BrowserGym format (accessibility tree, optional screenshot) and emit actions in the BrowserGym vocabulary (click, fill, scroll, navigate, plus a small set of higher-level macros). If your agent does something exotic (raw mouse coordinates, multi-tab management beyond the standard set), you may need to extend the action space. For most agents the standard API is sufficient.
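One way to find out early whether an agent stays inside the standard vocabulary is to check the action name before the action reaches the environment. The sketch below assumes actions are emitted as call-like strings such as click("a12"); the vocabulary lists the action names mentioned in this article, and mouse_click_xy is a hypothetical extension used only for illustration. It does not reproduce BrowserGym's own parsing or extension mechanism.

```python
import re

# Action names taken from the article text; treated here as an assumed vocabulary.
STANDARD_ACTIONS = frozenset(
    {"click", "fill", "hover", "scroll", "navigate", "select_option", "press"}
)

# Matches a call-like string: a name followed by a parenthesised argument list.
_ACTION_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*\(.*\)\s*$", re.DOTALL)


def action_name(action: str) -> str:
    """Extract the function-like name from an action string."""
    match = _ACTION_RE.match(action)
    if match is None:
        raise ValueError(f"not a call-like action string: {action!r}")
    return match.group(1)


def is_supported(action: str, vocabulary=STANDARD_ACTIONS) -> bool:
    """Return True if the action's name belongs to the given vocabulary."""
    return action_name(action) in vocabulary


# Extending the vocabulary with a hypothetical raw-coordinate action.
EXTENDED_ACTIONS = STANDARD_ACTIONS | {"mouse_click_xy"}

standard_ok = is_supported('click("a12")')
exotic_ok = is_supported("mouse_click_xy(10, 20)")
exotic_ok_extended = is_supported("mouse_click_xy(10, 20)", EXTENDED_ACTIONS)
```

An agent whose actions fail this kind of check against the standard set is the case where extending the action space is worth considering.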
Q.05 How does it compare to AgentBench's harness?
AgentBench has its own harness for its 8 environments but does not unify with WebArena or MiniWoB++. BrowserGym is browser-only and unifies the browser benchmark family. The two are orthogonal: an agent might be wrapped in BrowserGym for browser eval and a separate harness for AgentBench's OS-shell and DB environments.

Sources

  [1] BrowserGym repository: github.com/ServiceNow/BrowserGym
  [2] WorkArena paper (Drouin et al. 2024): arxiv.org/abs/2403.07718
  [3] VisualWebArena paper (Koh et al. 2024): arxiv.org/abs/2401.13649

From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
