Abstract

What: 116 mobile UI tasks across 20 real Android apps, run on a real emulator with real touch actions.

Who: Rawles, Clinckemaillie, Chang, Waltz, Lau, et al. (Google DeepMind, 2024).

2026 Tier: Frontier around 52% with a mobile-specific harness; not saturated.

Project: google-research.github.io/android_world

Section II.vi Agent Benchmarks | Last verified April 2026

AndroidWorld Benchmark: 116 Mobile Tasks, 30.6% at Launch, Frontier Around 52%

Mobile-UI agency is the frontier the web benchmarks do not measure.

Construction

The DeepMind team selected 20 Android apps representative of consumer use (Calendar, Camera, Chrome, Clock, Contacts, Files, Markor, Messages, OsmAnd, Recorder, Retro Music, Simple Calendar Pro, Simple Draw Pro, Simple Gallery Pro, Simple SMS Messenger, Tasks, VLC, Wikipedia, Browser, OpenTracks) and authored 116 tasks across them. Each task is parameterised: a "set an alarm for 7am tomorrow" task instantiates with a randomly chosen time and label per run, which reduces memorisation across runs.
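As a rough illustration of what per-run parameterisation can look like, the sketch below instantiates an alarm task from a seed. The `AlarmTask` class, the `LABELS` pool, and `instantiate_alarm_task` are hypothetical names invented for this example; they are not taken from the AndroidWorld codebase.

```python
import random
from dataclasses import dataclass

# Hypothetical label pool; the real benchmark's parameter values are not shown here.
LABELS = ("Gym", "Standup", "Medication", "Flight")


@dataclass(frozen=True)
class AlarmTask:
    """One concrete instance of a parameterised 'set an alarm' task."""
    hour: int
    minute: int
    label: str

    @property
    def goal(self) -> str:
        # Natural-language instruction handed to the agent.
        return (
            f"Set an alarm for {self.hour:02d}:{self.minute:02d} "
            f"labelled '{self.label}'."
        )


def instantiate_alarm_task(seed: int) -> AlarmTask:
    """Draw task parameters from a seeded RNG so a run can be reproduced."""
    rng = random.Random(seed)
    return AlarmTask(
        hour=rng.randrange(24),
        minute=rng.randrange(0, 60, 5),
        label=rng.choice(LABELS),
    )


if __name__ == "__main__":
    for seed in range(3):
        print(seed, instantiate_alarm_task(seed).goal)
```

Seeding the generator keeps a given run reproducible while still letting the goal text vary between runs, which is the property that discourages memorising one fixed instruction.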

The agent receives screenshots and the Android accessibility tree at each step. It returns a touch or text action. The framework executes the action, captures the new state, and continues. Scoring is binary success per task.
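The observe, act, execute cycle and the binary scoring described above can be sketched as follows. Everything here is an assumption made for illustration: `Observation`, `Action`, `ToyEnv`, `scripted_agent`, `run_episode`, and `success_rate` are hypothetical names, and the toy environment stands in for a real emulator rather than reproducing AndroidWorld's interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Observation:
    screenshot: bytes                 # raw pixels in a real setup; empty here
    a11y_tree: List[Dict[str, str]]   # simplified accessibility nodes


@dataclass
class Action:
    kind: str        # "tap", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class ToyEnv:
    """Stand-in for an emulator: the task succeeds once 'hello' is typed."""

    def __init__(self) -> None:
        self.typed = ""

    def reset(self) -> Observation:
        self.typed = ""
        return Observation(b"", [{"text": ""}])

    def step(self, action: Action) -> Observation:
        if action.kind == "type":
            self.typed += action.text
        return Observation(b"", [{"text": self.typed}])

    def is_successful(self) -> bool:
        return self.typed == "hello"


def scripted_agent(obs: Observation) -> Action:
    """Trivial policy: type the target text, then declare the task done."""
    if obs.a11y_tree[0]["text"] == "hello":
        return Action("done")
    return Action("type", text="hello")


def run_episode(env: ToyEnv, agent: Callable[[Observation], Action],
                max_steps: int = 20) -> bool:
    """Observe, act, execute, repeat; return a binary success flag."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent(obs)
        if action.kind == "done":
            break
        obs = env.step(action)
    return env.is_successful()


def success_rate(outcomes: List[bool]) -> float:
    """Headline score: the fraction of tasks that ended in success."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    print(success_rate([run_episode(ToyEnv(), scripted_agent)]))
```

Because each task contributes only a pass or a fail, the headline percentage is simply the mean of these flags over the task suite.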

SOTA Progression

| Date | Tier / Score | Note |
|---|---|---|
| May 2024 | M3A (GPT-4 Turbo) at 30.6% | Original Rawles et al. paper baseline. |
| Sep 2024 | M3A (GPT-4o + multimodal) at 38.0% | Set-of-mark grounding plus better vision. |
| Mar 2025 | Gemini 2 + UI-Net at 46.2% | Google DeepMind follow-up paper. |
| Apr 2026 | Frontier around 52% with mobile-specific harness | Captured from public reports; subset-dependent. |


Why Mobile Is Harder Than Web

The accessibility tree on Android is noisier than the DOM. Custom views, theming, and Compose-based UIs frequently omit semantic labels, leaving the agent to ground actions visually. Vision-language grounding on mobile screens is still the limiting factor, not language understanding. This is why headline scores climb fastest when better grounding methods (Set-of-Mark, ScreenAI, UI-Net) ship rather than when language models get better.
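One way to make the missing-label problem concrete is to measure how many clickable nodes on a screen carry no text or description. The sketch below assumes a simplified node format with `clickable`, `text`, and `content_desc` keys; these field names and the `unlabeled_fraction` helper are illustrative assumptions, not the benchmark's actual schema.

```python
from typing import Dict, List, Union

Node = Dict[str, Union[str, bool]]


def unlabeled_fraction(nodes: List[Node]) -> float:
    """Share of clickable nodes that have neither text nor a content description."""
    clickable = [n for n in nodes if n.get("clickable")]
    if not clickable:
        return 0.0
    missing = [
        n for n in clickable
        if not (n.get("text") or n.get("content_desc"))
    ]
    return len(missing) / len(clickable)


if __name__ == "__main__":
    screen = [
        {"clickable": True, "text": "Save", "content_desc": ""},
        {"clickable": True, "text": "", "content_desc": ""},   # custom view, no label
        {"clickable": False, "text": "Title", "content_desc": ""},
    ]
    print(unlabeled_fraction(screen))
```

A high fraction on a given screen suggests the agent has little to work with in the tree and would need to ground its actions on the screenshot instead.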


Reader Questions

Q.01 What is AndroidWorld?

AndroidWorld is a benchmark from Google DeepMind that evaluates autonomous mobile UI agents on 116 tasks spread across 20 real Android apps including Calendar, Contacts, Messages, Markor, Chrome, and Files. The agent receives a task in natural language, then interacts with an Android emulator via a screen-reading and touch-action API.

Q.02 What was the headline launch number?

The original paper reported a 30.6% success rate for M3A, its best-performing agent (built on GPT-4 Turbo), on the 116-task suite. SeeAct, a web agent adapted to Android, scored lower. Baselines using earlier vision-language models scored in the single digits. The leaderboard moves quickly as new mobile-vision models ship.

Q.03 How does AndroidWorld differ from WebArena?

AndroidWorld uses real Android applications running in an emulator. The action space is touch gestures (tap, swipe, type) at screen coordinates, which the agent takes from accessibility-tree element bounds or from the screenshot. WebArena uses self-hosted web applications and a DOM-based action space. Mobile UIs are visually denser, lack consistent semantic markup, and require gesture composition that DOM agents do not encounter.
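To illustrate how element bounds turn into coordinate-level gestures, here is a minimal sketch. The `Bounds` type and the `tap_point` and `swipe_up` helpers are hypothetical and assume pixel coordinates with the origin at the top-left of the screen; they do not reproduce the benchmark's action API.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Bounds:
    """Pixel rectangle of a UI element, origin at the top-left of the screen."""
    left: int
    top: int
    right: int
    bottom: int


def tap_point(b: Bounds) -> Point:
    """Tap at the centre of the element's bounding box."""
    return ((b.left + b.right) // 2, (b.top + b.bottom) // 2)


def swipe_up(b: Bounds, fraction: float = 0.5) -> Tuple[Point, Point]:
    """Vertical swipe centred in the box, covering `fraction` of its height."""
    x, centre_y = tap_point(b)
    travel = int((b.bottom - b.top) * fraction)
    start_y = centre_y + travel // 2
    return (x, start_y), (x, start_y - travel)


if __name__ == "__main__":
    button = Bounds(left=100, top=200, right=300, bottom=260)
    print(tap_point(button))
    print(swipe_up(Bounds(0, 0, 1080, 2000)))
```

A DOM agent can usually act on an element by selector, whereas here every action reduces to coordinates, so an error in the bounds or in the visual grounding translates directly into a missed tap.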

Q.04 Is AndroidWorld dynamic or static?

Dynamic. Each task runs in a live Android emulator, and apps maintain state between actions. The benchmark ships dynamic task generation (some tasks instantiate with random parameters per run), which reduces memorisation. The downside is that runs are expensive: a 116-task evaluation pass takes hours of compute on real emulator infrastructure.

Q.05 Why are mobile-UI scores lower than web scores?

Three reasons. First, mobile screens lack the structured DOM that web agents lean on for grounding. Second, gesture composition (multi-step swipe-and-tap sequences) is brittle for current vision-language models. Third, app-specific UX conventions (Material Design vs custom theming) confuse agents that have seen only a few apps in training. The gap closes as mobile-specific vision-language models ship.

Sources

[1] Rawles et al. (2024): arxiv.org/abs/2405.14573
[2] AndroidWorld project: google-research.github.io/android_world
[3] AndroidWorld repository: github.com/google-research/android_world