Abstract

What: 369 real-OS tasks across browser, office, file management, code editor, image editor, and terminal.

Who: Xie et al., HKU + Salesforce + CMU, 2024 (arXiv:2404.07972).

2026 Tier: Frontier 35-45 percent (Linux-core); humans at ~72 percent.

Leaderboard: os-world.github.io

Section II.v · Agent Benchmarks | Last verified April 2026

OSWorld: Computer-Use Agents on Real Operating Systems

The hardest agent benchmark in active use in 2026. Real applications, real screenshots, low-level mouse-and-keyboard actions, 369 tasks across office, browser, files, code, image, and terminal. Frontier agents still trail humans by roughly 30 points; this is where the next two years of agent capability will be measured.

What OSWorld measures

OSWorld, introduced by Xie et al. in April 2024, evaluates agents on real computer-use tasks across an actual Ubuntu Linux desktop. The benchmark contains 369 task scenarios, each defined by an initial system state, a natural-language instruction, and a programmatic success function that checks the file system or application state after the agent finishes. Tasks span six broad areas (see below) and routinely require the agent to switch between native applications, handle real UIs, and reason over rendered screenshots rather than parsed text.
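The three-part task definition above (initial state, instruction, success function) can be pictured as a small record. This is a minimal sketch, not the OSWorld task schema: `Task`, `SystemState`, `evaluate`, and the example task are hypothetical names, and a dictionary of file paths stands in for real VM state.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for VM state: maps file paths to file contents.
SystemState = Dict[str, str]


@dataclass(frozen=True)
class Task:
    task_id: str
    area: str
    instruction: str                             # natural-language instruction
    initial_state: SystemState                   # state before the agent acts
    success_fn: Callable[[SystemState], bool]    # programmatic check on the final state


def evaluate(task: Task, final_state: SystemState) -> bool:
    """Run the task's success function on the state the agent left behind."""
    return task.success_fn(final_state)


# Illustrative task (not taken from the benchmark).
rename_task = Task(
    task_id="files-001",
    area="file_management",
    instruction="Rename notes.txt in the home directory to notes_2024.txt.",
    initial_state={"/home/user/notes.txt": "meeting notes"},
    success_fn=lambda s: "/home/user/notes_2024.txt" in s
    and "/home/user/notes.txt" not in s,
)

final_state = {"/home/user/notes_2024.txt": "meeting notes"}
passed = evaluate(rename_task, final_state)
```

The check looks only at the final state, never at the path the agent took to get there.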

The defining choice is that OSWorld uses a real OS, not a simulated environment. The agent runs against an actual Ubuntu VM with actual Firefox, LibreOffice, GNOME Files, GIMP, and so on. Screenshots are real renders; mouse and keyboard actions are sent to real applications; failures are real (windows can crash, dialogs can appear, file dialogs can have unexpected defaults). This is the closest public benchmark to the production conditions that Anthropic Computer Use, OpenAI Operator, and Google Project Mariner are built to handle. It is also why scores are so much lower than on cleaner abstractions like WebArena.

OSWorld's second defining choice is that the action space is keyboard and mouse, not DOM actions or tool calls. The agent decides what to click and where, what to type, where to scroll. Pixel-precision matters: misreading the position of a button by ten pixels on a small UI element can derail a long trajectory. This puts OSWorld squarely in the territory of vision-language agents; text-only models cannot meaningfully attempt the benchmark.
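To make the pixel-precision point concrete, here is a rough sketch of a keyboard-and-mouse action space and a hit test against a small UI element. The class names and the 16x16 icon geometry are assumptions for illustration, not the benchmark's action format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int


# The whole action space is low-level input, with no DOM or tool calls.
Action = Union[Click, TypeText, Scroll]


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        """True if the pixel (x, y) falls inside this element."""
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


# Hypothetical 16x16 toolbar icon.
icon = Box(left=400, top=120, width=16, height=16)
accurate = Click(x=408, y=128)     # centre of the icon
off_by_ten = Click(x=418, y=128)   # same click, misread by ten pixels

hit = icon.contains(accurate.x, accurate.y)
miss = icon.contains(off_by_ten.x, off_by_ten.y)
```

With an element this small, a ten-pixel error lands outside the target entirely, and every later step in the trajectory then runs against the wrong screen.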

The six task areas

The 369 tasks are distributed unevenly across six task areas, with web-and-office accounting for roughly half the corpus.

| Task area | What it tests |
| --- | --- |
| Office productivity | LibreOffice Writer/Calc/Impress tasks: edit documents, build spreadsheets, compose presentations from given data. Sensitive to UI navigation precision. |
| Web + browser | Firefox or Chrome tasks: similar to WebArena but with native browser navigation rather than DOM access. Tests visual grounding more than HTML parsing. |
| File management | GNOME Files / Nautilus: locate, move, rename, archive, extract. The most reliable signal of basic computer-use competence. |
| Code editing | VS Code tasks: open project, edit specific file, save, run. Overlaps with SWE-bench in spirit but tests the IDE workflow rather than the patch. |
| Image editing | GIMP tasks: crop, resize, adjust, export. The most VLM-heavy category; vision capability dominates. |
| Terminal | Bash tasks: navigate, run scripts, manage processes. Closest to AgentBench OS environment but with a real terminal UI rather than a clean tool interface. |

Per-area scores vary widely. File management is the easiest category (clear UI elements, deterministic outcomes) and image editing is the hardest (continuous parameter spaces, subjective output). The relative ordering of areas is consistent for most frontier agents and across model generations: file management at the top, then browser, then office, then code, then terminal, with image editing at the bottom.
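Per-area scores are simple to derive from per-task outcomes. The sketch below assumes results arrive as `(area, passed)` pairs; the function name and the sample data are made up for illustration and are not real OSWorld results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_area_success(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of passed tasks for each task area."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for area, passed in results:
        totals[area] += 1
        passes[area] += int(passed)
    return {area: passes[area] / totals[area] for area in totals}


# Made-up outcomes for two areas, four tasks each.
results = [
    ("file_management", True), ("file_management", True),
    ("file_management", False), ("file_management", True),
    ("image_editing", False), ("image_editing", True),
    ("image_editing", False), ("image_editing", False),
]

rates = per_area_success(results)
ranking = sorted(rates, key=rates.get, reverse=True)  # easiest area first
```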

Screenshot-grounded scoring

Each OSWorld task ships with a success function that runs after the agent declares completion. The success function reads the file system or application state and checks whether the expected outcome was achieved: a file exists at a specific path, a spreadsheet cell contains the expected value, a presentation has the expected number of slides with the expected titles. The check is programmatic; there is no LLM-as-judge in the success path.
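A success function of the kind described above might look like the following sketch. It uses a CSV file as a stand-in for a spreadsheet; `csv_cell_equals` and the file names are hypothetical, and the real checkers in the OSWorld repository may be structured quite differently.

```python
import csv
import os
import tempfile


def csv_cell_equals(path: str, row: int, col: int, expected: str) -> bool:
    """Check that the file exists and that one cell holds the expected value."""
    if not os.path.isfile(path):
        return False
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if row >= len(rows) or col >= len(rows[row]):
        return False
    return rows[row][col] == expected


# Simulate the state an agent might leave behind after finishing a task.
workdir = tempfile.mkdtemp()
report = os.path.join(workdir, "report.csv")
with open(report, "w", newline="", encoding="utf-8") as handle:
    csv.writer(handle).writerows([["item", "total"], ["widgets", "42"]])

task_passed = csv_cell_equals(report, row=1, col=1, expected="42")
```

The verdict comes from reading state directly; no model is asked to grade anything.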

This is the same approach WebArena and GAIA take, and it is a major reason these three benchmarks are more trustworthy than agent benchmarks that rely on LLM-graded outputs. The downside is that the success function must be carefully written: a slightly mis-specified check can either over-credit (a partial completion scores success because the check is too permissive) or under-credit (a correct answer in an unexpected format scores failure). The OSWorld authors have iterated on the success functions since launch, and the third-party reproducibility tests confirm they are now broadly stable.
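The over-credit and under-credit failure modes are easy to reproduce with string comparison. The three checks below are illustrative only (the names and the expected value are assumptions): a strict equality check, an overly permissive substring check, and a check that normalises whitespace and case before comparing.

```python
def strict_check(actual: str, expected: str) -> bool:
    """Exact match: under-credits answers that differ only in formatting."""
    return actual == expected


def permissive_check(actual: str, expected: str) -> bool:
    """Substring match: over-credits outputs that merely contain the answer."""
    return expected in actual


def normalized_check(actual: str, expected: str) -> bool:
    """Compare after collapsing whitespace and ignoring case."""
    def norm(text: str) -> str:
        return " ".join(text.split()).casefold()
    return norm(actual) == norm(expected)


expected = "Total: 42"
formatted_differently = "total:  42 "   # correct value, unexpected format
wrong_value = "Total: 420"              # wrong value that contains the answer

under_credit = strict_check(formatted_differently, expected)   # False
over_credit = permissive_check(wrong_value, expected)          # True
fair_accept = normalized_check(formatted_differently, expected)
fair_reject = normalized_check(wrong_value, expected)
```

Normalisation is itself a design decision: whether case or spacing should matter depends on the task.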

One nuance: a task can succeed by accident. If an agent fumbles around and happens to click the right thing, the success function will credit the trajectory. This is consistent with the WebArena methodology and matches the practical reality that users care about outcomes, not trajectories. Per-action accuracy is sometimes reported alongside task success in research papers to give a more granular view.
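The difference between task success and per-action accuracy shows up in a small worked example. It assumes each action has been labelled correct or incorrect against some reference, which is an assumption of this sketch rather than something the benchmark provides; the `Trajectory` type and both trajectories are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    action_correct: List[bool]   # one label per action, against a reference
    task_succeeded: bool         # verdict of the programmatic success function


def task_success_rate(trajectories: List[Trajectory]) -> float:
    return sum(t.task_succeeded for t in trajectories) / len(trajectories)


def per_action_accuracy(trajectories: List[Trajectory]) -> float:
    total = sum(len(t.action_correct) for t in trajectories)
    correct = sum(sum(t.action_correct) for t in trajectories)
    return correct / total


clean = Trajectory(action_correct=[True] * 4, task_succeeded=True)
fumbling = Trajectory(
    action_correct=[False, False, True, False, True, False, False, True],
    task_succeeded=True,   # reached the right end state anyway
)

success = task_success_rate([clean, fumbling])
accuracy = per_action_accuracy([clean, fumbling])
```

Both trajectories count as successes, so task success is 1.0, while per-action accuracy is 7 out of 12. The second number exposes the fumbling that the first one hides.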

Harness design and what it adds

The OSWorld baseline harness is intentionally simple: take a screenshot, call a vision-language model with the task instruction and screenshot, parse the response into a mouse-or-keyboard action, execute, repeat. Strong 2026 scaffolds add four ingredients on top of this baseline:
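The baseline loop described above can be sketched in a few lines. This is not the official harness: the text action format (`CLICK x y`, `TYPE text`, `DONE`), the function names, and the scripted stand-in for the vision-language model are all assumptions made so the example runs without a VM or a model API.

```python
from typing import Callable, List, Tuple


def parse_action(response: str) -> Tuple[str, str]:
    """Split a model response such as 'CLICK 120 48' into (verb, argument)."""
    parts = response.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("empty model response")
    verb = parts[0].upper()
    argument = parts[1] if len(parts) > 1 else ""
    return verb, argument


def run_episode(
    instruction: str,
    take_screenshot: Callable[[], bytes],
    call_model: Callable[[str, bytes], str],
    execute: Callable[[str, str], None],
    max_steps: int = 15,
) -> List[Tuple[str, str]]:
    """Screenshot, ask the model, parse, execute, repeat until DONE."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        response = call_model(instruction, screenshot)
        verb, argument = parse_action(response)
        if verb == "DONE":
            break
        execute(verb, argument)
        history.append((verb, argument))
    return history


# Scripted stand-ins for the VM and the model.
script = iter(["CLICK 120 48", "TYPE hello world", "DONE"])
executed: List[Tuple[str, str]] = []

history = run_episode(
    instruction="Type hello world into the search box.",
    take_screenshot=lambda: b"<png bytes>",
    call_model=lambda instruction, screenshot: next(script),
    execute=lambda verb, argument: executed.append((verb, argument)),
)
```

Passing the environment and model in as callables keeps the loop itself identical whether it runs against a stub or a real VM.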

First, set-of-marks (SoM) annotation: overlay numbered boxes on UI elements detected in the screenshot, so the model can reference them by index rather than coordinate. This alone typically adds 5 to 8 points to a model that previously had to specify pixel coordinates. Second, a planning loop: decompose the task into sub-goals and execute each, with re-planning on failure. Third, action verification: after each action, take a fresh screenshot and check whether the expected change happened before proceeding. Fourth, OCR over the screenshot to give the model text-content access in addition to visual grounding.
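Two of these ingredients are simple enough to sketch. For set-of-marks, the harness keeps a table from mark index to bounding box and turns the model's chosen index into a click coordinate. For action verification, the crudest possible proxy is whether the screen changed at all; a real scaffold would check for the specific expected change. The helper names and box values below are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]   # (left, top, width, height)


def mark_center(marks: Dict[int, Box], index: int) -> Tuple[int, int]:
    """Turn a set-of-marks index into the pixel at the centre of its box."""
    left, top, width, height = marks[index]
    return (left + width // 2, top + height // 2)


def screen_changed(before: bytes, after: bytes) -> bool:
    """Crude verification proxy: did the screenshot change after the action?"""
    return before != after


# Hypothetical detections overlaid on a screenshot as numbered boxes.
marks: Dict[int, Box] = {
    1: (10, 10, 80, 24),     # e.g. a "File" menu
    2: (400, 120, 16, 16),   # e.g. a small toolbar icon
}

# The model answers "click mark 2"; the harness supplies the coordinates.
click_xy = mark_center(marks, 2)

changed = screen_changed(b"frame-before", b"frame-after")
unchanged = screen_changed(b"frame-before", b"frame-before")
```

Because the harness computes the coordinate, the model no longer has to estimate pixel positions, which is the failure mode the low-level action space punishes most.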

Stacking all four ingredients on top of a strong frontier vision-language model is what produces the 40-percent-plus scores in May 2026. The same underlying model in the baseline harness might score 25 percent. As with all agentic benchmarks, harness disclosure is essential for meaningful comparison.

SOTA progression Apr 2024 to May 2026

OSWorld has progressed steadily but slowly. Headline 2026 numbers are still around 40 percent, compared with launch baselines around 12 percent. That is roughly a 3x improvement over two years, so agents are closing on the human baseline (which has stayed constant), but more slowly than models did on text-only benchmarks of similar age. The slow progress is a feature: computer use is genuinely hard, and the headroom on this benchmark is the largest of any active agent benchmark.

| Date | Tier | Note |
| --- | --- | --- |
| Apr 2024 | Launch, GPT-4V baseline at 12.2% | Xie et al. release paper at arXiv:2404.07972; Linux-core, single-shot harness. |
| Oct 2024 | Frontier with vision scaffold at 22% | Set-of-marks annotation overlays and planning loops add 8-10 points. |
| Mar 2025 | Frontier closed-source at 30% | Anthropic Computer Use and OpenAI agentic models trade leadership. |
| Oct 2025 | Strong scaffolds reach 38% | Multi-agent and self-verification scaffolds add another 5-6 points. |
| May 2026 | Frontier 35-45% range | Range reflects harness sensitivity; humans still at ~72%. |
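The arithmetic behind the progression can be checked directly. The sketch below takes the figures from the timeline and, as an assumption, uses 40 percent (the midpoint of the 35-45 range) for May 2026.

```python
# (date, headline score in percent); May 2026 uses the midpoint of 35-45.
progression = [
    ("Apr 2024", 12.2),
    ("Oct 2024", 22.0),
    ("Mar 2025", 30.0),
    ("Oct 2025", 38.0),
    ("May 2026", 40.0),
]
HUMAN_BASELINE = 72.0

launch = progression[0][1]
latest = progression[-1][1]

improvement_factor = latest / launch           # how many times the launch score
gap_to_human = HUMAN_BASELINE - latest         # points still missing
share_of_gap_closed = (latest - launch) / (HUMAN_BASELINE - launch)
```

Under that midpoint assumption the improvement factor is about 3.3, the remaining gap to humans is 32 points, and a little under half of the launch-to-human gap has been closed.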

Strengths and limitations

OSWorld's strengths are clear: real OS, real applications, real screenshots, real keyboard-and-mouse interactions, programmatic success checking, six task areas, large headroom. It is the closest public benchmark to the production conditions facing consumer-grade computer-use agents.

There are four main limitations. First, Linux-core means Windows and macOS performance is not directly measured (the extension sets help but are not always reported). Second, the task corpus is modest (369 tasks), so per-area noise is real. Third, the action space is more granular than most production agents need: modern computer-use models often have higher-level actions like "type into the focused field", which the benchmark partly accommodates but not uniformly. Fourth, the success functions are sensitive to small format variations that human users would not notice.
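The per-area noise point can be quantified with the binomial standard error, sqrt(p(1-p)/n). The sketch below compares the full 369-task corpus with a single area; the 30-task area size is an assumed round number, since per-area task counts are not given here.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Binomial standard error of a success rate p measured on n tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95 percent confidence half-width (normal approximation)."""
    return z * standard_error(p, n)


score = 0.40            # a 40 percent success rate
full_corpus = 369       # tasks in the core benchmark
one_area = 30           # assumed size of a single task area

overall_margin = ci_half_width(score, full_corpus)
area_margin = ci_half_width(score, one_area)
```

At a 40 percent score the overall margin is about plus or minus 5 points, while a 30-task area carries a margin of roughly plus or minus 17.5 points. Per-area differences of a few points are therefore within noise.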

The benchmark also has a contamination concern that is harder to quantify than for text benchmarks. Vision-language models trained on screenshots scraped from the web may have seen OSWorld-adjacent UIs, including LibreOffice and GIMP layouts. The risk is not symbolic recall (the test answers are not in pre-training data) but rather familiarity with the specific UI layouts, which lifts scores on this benchmark relative to genuinely novel UIs. The wider contamination concern applies here in a softer form.

When to use OSWorld in 2026

OSWorld is the right headline benchmark for general computer-use agents. If your agent takes screenshots, controls a real OS, operates across multiple native applications, and produces file-system or application-state changes, OSWorld is closer to your task than any other public benchmark. Quote the Linux-core 369-task number as the primary headline; quote per-area scores for credibility.

For more constrained tasks, look elsewhere: WebArena for browser-only DOM-based agents, SWE-bench Verified for engineering, Tau-Bench for tool-using customer-service dialogue, GAIA for research-assistant workflows, and Terminal-Bench for shell-only tasks. OSWorld is the union-of-everything benchmark, with the corresponding difficulty.

Editor's verdict: OSWorld is the hardest agent benchmark in active use, the most realistic, and the most informative for the computer-use agent class. Quote it for any vendor claiming general desktop competence. Treat single-shot baseline numbers as floors, agentic-scaffold numbers as the headline.

Reader Questions

Q.01 What is OSWorld?

OSWorld is a benchmark for computer-use agents released by researchers at the University of Hong Kong, Salesforce, and Carnegie Mellon in April 2024. It contains 369 real computer-task scenarios on Ubuntu Linux, with extension sets that add Windows and macOS variants. Tasks involve real applications: web browsers, file managers, terminals, code editors, office suites, image editors. The agent receives screenshots and controls the OS via keyboard and mouse actions. Success is checked programmatically against the file system or application state.

Q.02 How is OSWorld different from WebArena or GAIA?

OSWorld covers the whole desktop. WebArena is browser-only, and GAIA mostly resolves via web search and file reading. OSWorld tasks routinely involve switching between several native applications: download a file in the browser, open it in a spreadsheet, paste a chart into a slide deck, save the deck. The action space (mouse and keyboard) is also lower-level than WebArena (DOM actions) or GAIA (tool calls). This is what makes OSWorld harder and more useful for evaluating general computer-use agents like Anthropic Computer Use, OpenAI Operator, or Google Project Mariner.

Q.03 What is the current OSWorld SOTA?

As of May 2026 the frontier on OSWorld sits around 35 to 45 percent overall success across the 369 tasks. The strongest agentic scaffolds with vision-grounded models reach the top of that range. This is well below the human baseline (around 72 percent) and well below WebArena scores for comparable models. OSWorld is one of the agent benchmarks with the most headroom remaining; we expect 2027 numbers to be materially higher.

Q.04 Why are OSWorld scores so much lower than browser-agent scores?

Three reasons. First, the action space is lower-level: every click, keystroke, and scroll is a separate action, and precision matters. Second, the task surface is broader: a single OSWorld task might span four distinct native applications with different UIs, whereas WebArena is one site at a time. Third, the screenshot-grounding requirement is unforgiving: misreading a button position by ten pixels can derail a 30-step trajectory. The combination is what makes OSWorld a harder, more realistic computer-use evaluation.

Q.05 Does OSWorld test Windows, macOS, or only Linux?

The core 369-task benchmark uses Ubuntu Linux as the test environment for reproducibility. The OSWorld team has released extension sets that include Windows and macOS task variants. Most published scores report the Ubuntu-core number. When a vendor claims Windows performance, ask whether they're quoting the OSWorld Windows subset specifically or a separate internal benchmark. The naming is sometimes ambiguous.

Q.06 What harness is standard for OSWorld?

The official OSWorld code provides a baseline agent that takes screenshots, calls a VLM (vision-language model), and emits mouse-and-keyboard actions. Modern scaffolds add: a planning loop, a set-of-marks annotation overlay (numbered boxes on UI elements), a memory of past actions, and OCR over the screenshot. The strongest 2026 submissions combine all four. As with WebArena, harness disclosure is essential when quoting an OSWorld number; a 35 percent score in the baseline harness and a 45 percent score with a strong scaffold could be the same underlying model.


Sources

[1] Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972.
[2] OSWorld project site and leaderboard, accessed May 2026. os-world.github.io.
[3] OSWorld reproducibility repository. github.com/xlang-ai/OSWorld.