Section II.vi Agent Benchmarks | Last verified April 2026
AndroidWorld Benchmark: 116 Mobile Tasks, 30.6% at Launch
Mobile-UI agency is the frontier the web benchmarks do not measure.
I
Construction
The DeepMind team selected 20 Android apps representative of consumer use (Calendar, Camera, Chrome, Clock, Contacts, Files, Markor, Messages, OsmAnd, Recorder, Retro Music, Simple Calendar Pro, Simple Draw Pro, Simple Gallery Pro, Simple SMS Messenger, Tasks, VLC, Wikipedia, Browser, OpenTracks) and authored 116 tasks across them. Each task is parameterised: a "set an alarm for 7am tomorrow" task is instantiated with a randomly chosen time and label on each run, which reduces memorisation across runs.
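The parameterisation described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the `AlarmTask` class, the `instantiate_alarm_task` function, and the candidate labels are all hypothetical names chosen for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmTask:
    """Hypothetical parameterised task; not the benchmark's actual class."""
    hour: int
    minute: int
    label: str

    @property
    def goal(self) -> str:
        # Natural-language instruction handed to the agent.
        return (
            f"Set an alarm for {self.hour:02d}:{self.minute:02d} "
            f"labelled '{self.label}'."
        )


def instantiate_alarm_task(seed: int) -> AlarmTask:
    """Draw task parameters from a seeded RNG so a run can be reproduced."""
    rng = random.Random(seed)
    return AlarmTask(
        hour=rng.randrange(24),
        minute=rng.choice([0, 15, 30, 45]),
        label=rng.choice(["gym", "standup", "school run", "flight"]),
    )


if __name__ == "__main__":
    for seed in (1, 2, 3):
        print(instantiate_alarm_task(seed).goal)
```

Seeding the generator keeps a given run reproducible while still varying the instruction between runs, so an agent cannot pass by memorising one fixed time and label.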
The agent receives screenshots and the Android accessibility tree at each step. It returns a touch or text action. The framework executes the action, captures the new state, and continues. Scoring is binary success per task.
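The observe, act, execute cycle and the binary scoring can be sketched as a short loop. Everything here is an assumption made for illustration: the `Env` protocol, the `Observation` and `Action` fields, the `"done"` action, and the `max_steps` budget are hypothetical and do not reproduce the AndroidWorld API.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Observation:
    screenshot: bytes          # raw screen capture (format is an assumption)
    accessibility_tree: list   # UI nodes (structure is an assumption)


@dataclass
class Action:
    kind: str                  # "tap", "type", or "done" in this sketch
    x: int = 0
    y: int = 0
    text: str = ""


class Env(Protocol):
    """Hypothetical environment interface for this sketch."""
    def observe(self) -> Observation: ...
    def execute(self, action: Action) -> None: ...
    def task_succeeded(self) -> bool: ...


Policy = Callable[[str, Observation], Action]


def run_episode(env: Env, policy: Policy, goal: str, max_steps: int = 30) -> bool:
    """Run one task and return its binary outcome."""
    for _ in range(max_steps):
        obs = env.observe()            # screenshot plus accessibility tree
        action = policy(goal, obs)     # agent picks a touch or text action
        if action.kind == "done":      # agent declares the task finished
            break
        env.execute(action)            # framework applies it to the device
    return env.task_succeeded()        # success is checked from final state


def success_rate(results: list[bool]) -> float:
    """Headline score: fraction of tasks that ended in success."""
    return sum(results) / len(results) if results else 0.0
```

Because each task contributes only pass or fail, partial progress earns nothing, which is one reason small grounding errors weigh heavily on the headline number.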
II
SOTA Progression
| Date | Tier / Score | Note |
| --- | --- | --- |
| May 2024 | SeeAct-V (GPT-4V base) at 30.6% | Original Rawles et al. paper baseline. |
| Sep 2024 | M3A (GPT-4o + multimodal) at 38.0% | Set-of-mark grounding plus better vision. |
| Mar 2025 | Gemini 2 + UI-Net at 46.2% | Google DeepMind follow-up paper. |
| Apr 2026 | Frontier around 52% with mobile-specific harness | Captured from public reports; subset-dependent. |
III
Why Mobile Is Harder Than Web
The accessibility tree on Android is noisier than the DOM. Custom views, theming, and Compose-based UIs frequently omit semantic labels, leaving the agent to ground actions visually. Vision-language grounding on mobile screens is still the limiting factor, not language understanding. This is why headline scores climb fastest when better grounding methods (Set-of-Mark, ScreenAI, UI-Net) ship rather than when language models get better.
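One way to picture the labelling problem is to measure how many clickable nodes in a tree carry no usable label and to fall back to visual grounding when that share is high. The sketch below is illustrative only: `UiNode` is a simplified, hypothetical node type, and the 0.3 threshold is an arbitrary assumption rather than a value from any published agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UiNode:
    """Hypothetical, simplified accessibility node."""
    bounds: tuple[int, int, int, int]          # left, top, right, bottom (px)
    clickable: bool
    text: Optional[str] = None
    content_description: Optional[str] = None

    @property
    def label(self) -> str:
        # Prefer visible text, then the content description, else empty.
        return (self.text or self.content_description or "").strip()


def unlabeled_fraction(nodes: list[UiNode]) -> float:
    """Share of clickable nodes that expose no text or description."""
    clickable = [n for n in nodes if n.clickable]
    if not clickable:
        return 0.0
    return sum(1 for n in clickable if not n.label) / len(clickable)


def choose_grounding(nodes: list[UiNode], threshold: float = 0.3) -> str:
    """Return "visual" when the tree is too sparse to trust, else "tree"."""
    return "visual" if unlabeled_fraction(nodes) > threshold else "tree"


if __name__ == "__main__":
    screen = [
        UiNode((0, 0, 100, 50), clickable=True, text="Save"),
        UiNode((0, 60, 100, 110), clickable=True),    # custom view, no label
        UiNode((0, 120, 100, 170), clickable=False),  # decorative element
    ]
    print(unlabeled_fraction(screen), choose_grounding(screen))
```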
Q.01 What is AndroidWorld?
AndroidWorld is a benchmark from Google DeepMind that evaluates autonomous mobile UI agents on 116 tasks spread across 20 real Android apps including Calendar, Contacts, Messages, Markor, Chrome, and Files. The agent receives a task in natural language, then interacts with an Android emulator via a screen-reading and touch-action API.
Q.02 What was the headline launch number?
The original paper reported 30.6% success rate for SeeAct-V (a Set-of-Mark + reasoning agent built on GPT-4V) on the 116-task test set. Baselines using earlier vision-language models scored in the single digits. The leaderboard moves quickly as new mobile-vision models ship.
Q.03 How does AndroidWorld differ from WebArena?
AndroidWorld uses real Android applications running in an emulator. The action space is touch gestures and text entry (tap, swipe, type), targeted at screen coordinates or at elements exposed by the accessibility tree. WebArena uses self-hosted web applications and a DOM-based action space. Mobile UIs are visually denser, lack consistent semantic markup, and require gesture composition that DOM agents do not encounter.
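A touch-based action space of this kind can be sketched with three small action types and two helpers. The class names, the bounds convention (left, top, right, bottom in pixels), and the swipe proportions are assumptions made for this example and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tap:
    x: int
    y: int


@dataclass(frozen=True)
class Swipe:
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass(frozen=True)
class TypeText:
    text: str


def tap_center(bounds: tuple[int, int, int, int]) -> Tap:
    """Turn an element's bounding box into a tap at its centre."""
    left, top, right, bottom = bounds
    return Tap((left + right) // 2, (top + bottom) // 2)


def scroll_down(screen_w: int, screen_h: int) -> Swipe:
    """Drag a finger upward through the middle of the screen.

    Moving the finger from 80% to 20% of the screen height scrolls the
    content down in a typical vertically scrolling list.
    """
    x = screen_w // 2
    return Swipe(x, screen_h * 4 // 5, x, screen_h // 5)


def fill_field(bounds: tuple[int, int, int, int], text: str) -> list:
    """Compose two primitive actions: focus a field, then type into it."""
    return [tap_center(bounds), TypeText(text)]
```

A DOM agent can usually address an element by selector in one step; here even filling a text field is a composed sequence, and an off-screen target first needs a `scroll_down` followed by a fresh observation.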
Q.04 Is AndroidWorld dynamic or static?
Dynamic. Each task runs in a live Android emulator, and apps keep their state between actions. Some tasks are generated dynamically, instantiating with random parameters on each run, which reduces memorisation. The downside is cost: a full 116-task evaluation pass takes hours of compute on emulator infrastructure.
Q.05 Why are mobile-UI scores lower than web scores?
Three reasons. First, mobile screens lack the structured DOM that web agents lean on for grounding. Second, gesture composition (multi-step swipe-and-tap sequences) is brittle for current vision-language models. Third, app-specific UX conventions (Material Design vs custom theming) confuse agents that have seen only a few apps in training. The gap closes as mobile-specific vision-language models ship.
Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.