Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: 116 mobile UI tasks across 20 real Android apps, real emulator, real touch actions.
Who: Rawles, Clinckemaillie, Chang, Waltz, Lau, Vasselli, Vasudevan, Mougenot, Lillicrap, Riedmiller (Google DeepMind, 2024).
2026 Tier: Frontier around 52% with mobile-specific harness; not saturated.
Project: google-research.github.io/android_world
Section II.vi Agent Benchmarks | Last verified April 2026

AndroidWorld Benchmark: 116 Mobile Tasks, Launch Baseline at 30.6%

Mobile-UI agency is the frontier the web benchmarks do not measure.

I

Construction

The DeepMind team selected 20 Android apps representative of consumer use (Calendar, Camera, Chrome, Clock, Contacts, Files, Markor, Messages, OsmAnd, Recorder, Retro Music, Simple Calendar Pro, Simple Draw Pro, Simple Gallery Pro, Simple SMS Messenger, Tasks, VLC, Wikipedia, Browser, OpenTracks) and authored 116 tasks across them. Each task is parameterised: a "set an alarm for 7am tomorrow" task is instantiated with a randomly chosen time and label on every run, so an agent cannot pass by memorising a fixed answer.
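
To make the parameterisation concrete, here is a minimal sketch of how one task template could be instantiated per run. The `AlarmTask` class, the `LABELS` pool and `instantiate_alarm_task` are hypothetical names invented for illustration; they are not the benchmark's own API, and the real parameter pools are not described in this article.

```python
import random
from dataclasses import dataclass

# Hypothetical label pool, chosen only for this illustration.
LABELS = ["Gym", "Standup", "Take medication", "Call home"]


@dataclass(frozen=True)
class AlarmTask:
    """One concrete instance of a hypothetical 'set an alarm' template."""
    hour: int
    minute: int
    label: str

    @property
    def goal(self) -> str:
        # Natural-language instruction handed to the agent.
        return (f"Set an alarm for {self.hour:02d}:{self.minute:02d} "
                f"labelled '{self.label}'.")


def instantiate_alarm_task(seed: int) -> AlarmTask:
    """Draw task parameters from a seeded generator.

    The same seed reproduces the same instance; a new seed gives a new
    time and label, so a memorised answer does not transfer between runs.
    """
    rng = random.Random(seed)
    return AlarmTask(
        hour=rng.randrange(24),
        minute=rng.randrange(0, 60, 5),
        label=rng.choice(LABELS),
    )


if __name__ == "__main__":
    for seed in range(3):
        print(instantiate_alarm_task(seed).goal)
```

Seeding per run keeps an evaluation reproducible while still varying the target values between runs.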

The agent receives screenshots and the Android accessibility tree at each step. It returns a touch or text action. The framework executes the action, captures the new state, and continues. Scoring is binary success per task.
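
The observe, act, execute cycle and the binary scoring can be sketched as below. This is an illustrative toy, not the AndroidWorld framework: `Observation`, `Action`, `ToyAlarmEnv`, `scripted_policy`, `run_episode` and `success_rate` are all assumed names, and the toy environment stands in for a real emulator.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Observation:
    screenshot: bytes                                # raw pixels in a real run; empty here
    a11y_tree: List[dict] = field(default_factory=list)  # flattened accessibility nodes


@dataclass
class Action:
    kind: str          # "tap", "type", or "done"
    target: str = ""   # node id for taps
    text: str = ""     # payload for typing


class ToyAlarmEnv:
    """Stand-in for an emulator: succeeds once 'Gym' is typed and 'save' is tapped."""

    def __init__(self) -> None:
        self.typed = ""
        self.saved = False

    def observe(self) -> Observation:
        return Observation(b"", [{"id": "label_field"}, {"id": "save"}])

    def execute(self, action: Action) -> None:
        if action.kind == "type":
            self.typed = action.text
        elif action.kind == "tap" and action.target == "save":
            self.saved = True

    def is_successful(self) -> bool:
        return self.saved and self.typed == "Gym"


def scripted_policy(actions: Iterable[Action]) -> Callable[[Observation], Action]:
    """Replay a fixed action list, then signal 'done'. A real agent would call a model here."""
    it = iter(actions)
    return lambda obs: next(it, Action("done"))


def run_episode(env, policy, max_steps: int = 10) -> int:
    """Loop observe -> act -> execute, then return a binary score (1 or 0)."""
    for _ in range(max_steps):
        action = policy(env.observe())
        if action.kind == "done":
            break
        env.execute(action)
    return int(env.is_successful())


def success_rate(results: List[int]) -> float:
    """Fraction of tasks solved, i.e. the headline benchmark number."""
    return sum(results) / len(results)
```

Because each task scores 1 or 0 with no partial credit, the headline percentage is simply the mean of these per-task results.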

II

SOTA Progression

| Date | Tier / Score | Note |
|---|---|---|
| May 2024 | SeeAct-V (GPT-4V base) at 30.6% | Original Rawles et al. paper baseline. |
| Sep 2024 | M3A (GPT-4o + multimodal) at 38.0% | Set-of-mark grounding plus better vision. |
| Mar 2025 | Gemini 2 + UI-Net at 46.2% | Google DeepMind follow-up paper. |
| Apr 2026 | Frontier around 52% with mobile-specific harness | Captured from public reports; subset-dependent. |

III

Why Mobile Is Harder Than Web

The accessibility tree on Android is noisier than the DOM. Custom views, theming, and Compose-based UIs frequently omit semantic labels, leaving the agent to ground actions visually. Vision-language grounding on mobile screens is still the limiting factor, not language understanding. This is why headline scores climb fastest when better grounding methods (Set-of-Mark, ScreenAI, UI-Net) ship rather than when language models get better.
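
The grounding problem can be illustrated with a small sketch that measures how many clickable nodes carry a usable label, and returns nothing when no label matches so the caller must fall back to visual grounding. The node dictionaries and their keys (`text`, `content_desc`, `clickable`, `bounds`) are an assumed simplification of what an Android accessibility dump contains, not the benchmark's actual data format.

```python
from typing import List, Optional, Tuple


def has_semantic_label(node: dict) -> bool:
    """True if the node exposes visible text or a content description."""
    text = (node.get("text") or "").strip()
    desc = (node.get("content_desc") or "").strip()
    return bool(text or desc)


def label_coverage(nodes: List[dict]) -> float:
    """Share of clickable nodes an agent could address by label alone."""
    clickable = [n for n in nodes if n.get("clickable")]
    if not clickable:
        return 0.0  # nothing to ground against
    return sum(has_semantic_label(n) for n in clickable) / len(clickable)


def ground_by_label(nodes: List[dict], query: str) -> Optional[Tuple[int, int]]:
    """Return the tap point of the first clickable node whose label contains `query`.

    Returns None when no label matches, which is the case where the agent
    has to locate the control from the screenshot instead.
    """
    q = query.lower()
    for n in nodes:
        if not n.get("clickable"):
            continue
        label = f"{n.get('text') or ''} {n.get('content_desc') or ''}".lower()
        if q in label:
            left, top, right, bottom = n["bounds"]
            return ((left + right) // 2, (top + bottom) // 2)
    return None


if __name__ == "__main__":
    screen = [
        {"text": "Save", "clickable": True, "bounds": (0, 0, 100, 50)},
        {"text": "", "content_desc": None, "clickable": True, "bounds": (0, 60, 100, 110)},
        {"text": "Alarm label", "clickable": False, "bounds": (0, 120, 100, 170)},
    ]
    print(label_coverage(screen))           # half of the clickable nodes are labelled
    print(ground_by_label(screen, "save"))  # centre of the Save button
```

On a screen built from unlabelled custom views the coverage figure drops and `ground_by_label` returns `None` more often, which is where vision-based grounding has to take over.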

Reader Questions
Q.01 What is AndroidWorld?
AndroidWorld is a benchmark from Google DeepMind that evaluates autonomous mobile UI agents on 116 tasks spread across 20 real Android apps including Calendar, Contacts, Messages, Markor, Chrome, and Files. The agent receives a task in natural language, then interacts with an Android emulator via a screen-reading and touch-action API.
Q.02 What was the headline launch number?
The original paper reported a 30.6% success rate for SeeAct-V (a Set-of-Mark + reasoning agent built on GPT-4V) on the 116-task test set. Baselines using earlier vision-language models scored in the single digits. The leaderboard moves quickly as new mobile-vision models ship.
Q.03 How does AndroidWorld differ from WebArena?
AndroidWorld uses real Android applications running in an emulator. Its action space is touch gestures (tap, swipe, type) at screen coordinates derived from the accessibility tree. WebArena uses self-hosted web applications and a DOM-based action space. Mobile UIs are visually denser, lack consistent semantic markup, and require gesture composition that DOM agents do not encounter.
Q.04 Is AndroidWorld dynamic or static?
Dynamic. Each task runs on a real Android emulator, and apps maintain state between actions. The benchmark ships dynamic task generation (some tasks instantiate with random parameters per run), which reduces memorisation. The downside is that runs are expensive: a 116-task evaluation pass takes hours of compute on real emulator infrastructure.
Q.05 Why are mobile-UI scores lower than web scores?
Three reasons. First, mobile screens lack the structured DOM that web agents lean on for grounding. Second, gesture composition (multi-step swipe-and-tap sequences) is brittle for current vision-language models. Third, app-specific UX conventions (Material Design vs custom theming) confuse agents that have seen only a few apps in training. The gap closes as mobile-specific vision-language models ship.

Sources

  [1] Rawles et al. (2024): arxiv.org/abs/2405.14573
  [2] AndroidWorld project: google-research.github.io/android_world
  [3] AndroidWorld repository: github.com/google-research/android_world
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
