Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: 369 real-OS tasks across browser, office, file management, code editor, image editor, terminal
Who: Xie et al., HKU + Salesforce + CMU, 2024 (arXiv:2404.07972)
Baselines: GPT-4V at 12.2% in the launch paper (Linux-core); human baseline ~72%.
Leaderboard: os-world.github.io
Section II.v · Agent Benchmarks | Reviewed 2026

OSWorld: Computer-Use Agents on Real Operating Systems

The hardest agent benchmark in active use in 2026. Real applications, real screenshots, low-level mouse-and-keyboard actions, 369 tasks across office, browser, files, code, image, and terminal. Frontier agents still trail humans by 30 points; this is where the next two years of agent capability will be measured.

01

What OSWorld measures

OSWorld, introduced by Xie et al. in April 2024, evaluates agents on real computer-use tasks across an actual Ubuntu Linux desktop. The benchmark contains 369 task scenarios, each defined by an initial system state, a natural-language instruction, and a programmatic success function that checks the file system or application state after the agent finishes. Tasks span six broad areas (see below) and routinely require the agent to switch between native applications, handle real UIs, and reason over rendered screenshots rather than parsed text.
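To make the three-part task definition concrete, here is a minimal sketch in Python. It is not the OSWorld task schema: `TaskScenario`, `run_scenario`, and the rename task are hypothetical names invented for illustration, and a temporary directory stands in for the VM's file system.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class TaskScenario:
    """Hypothetical container for the three parts of a task (not OSWorld's real schema)."""
    setup: Callable[[Path], None]   # builds the initial system state
    instruction: str                # natural-language instruction given to the agent
    check: Callable[[Path], bool]   # programmatic success function run afterwards


def _setup(root: Path) -> None:
    (root / "draft.txt").write_text("quarterly notes")


def _check(root: Path) -> bool:
    # Success means the old name is gone and the new file holds the same content.
    target = root / "final.txt"
    return (not (root / "draft.txt").exists()
            and target.exists()
            and target.read_text() == "quarterly notes")


rename_task = TaskScenario(_setup, "Rename draft.txt to final.txt.", _check)


def run_scenario(task: TaskScenario, agent: Callable[[str, Path], None]) -> bool:
    """Set up the state, let the agent act, then inspect the resulting state."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        task.setup(root)
        agent(task.instruction, root)
        return task.check(root)


def scripted_agent(instruction: str, root: Path) -> None:
    # Stand-in for a real agent: performs the rename directly.
    (root / "draft.txt").rename(root / "final.txt")


def idle_agent(instruction: str, root: Path) -> None:
    # Stand-in for an agent that gives up without acting.
    pass
```

Calling `run_scenario(rename_task, scripted_agent)` passes and `run_scenario(rename_task, idle_agent)` fails, because the check only looks at the final state, never at how the agent got there.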

The defining choice is that OSWorld uses a real OS, not a simulated environment. The agent runs against an actual Ubuntu VM with actual Chrome, LibreOffice, GNOME Files, GIMP, and so on. Screenshots are real renders; mouse and keyboard actions are sent to real applications; failures are real (windows can crash, dialogs can appear, file dialogs can have unexpected defaults). This is the closest public benchmark to the production conditions that Anthropic Computer Use, OpenAI Operator, and Google Project Mariner are built for. It is also why scores are so much lower than on cleaner abstractions like WebArena.

OSWorld's second defining choice is that the action space is keyboard and mouse, not DOM actions or tool calls. The agent decides what to click and where, what to type, where to scroll. Pixel-precision matters: misreading the position of a button by ten pixels on a small UI element can derail a long trajectory. This puts OSWorld squarely in the territory of vision-language agents; text-only models cannot meaningfully attempt the benchmark.
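The pixel-precision point can be illustrated with a small hit test. This is a sketch under stated assumptions: the `Click`, `TypeText`, `Scroll`, and `Box` types and the two example widgets are hypothetical, not the benchmark's actual action format or any real UI geometry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in screen pixels."""
    left: int
    top: int
    width: int
    height: int

    def center(self) -> Click:
        return Click(self.left + self.width // 2, self.top + self.height // 2)

    def hit(self, click: Click) -> bool:
        # A click lands only if it falls inside the element's rectangle.
        return (self.left <= click.x < self.left + self.width
                and self.top <= click.y < self.top + self.height)


def offset(click: Click, dx: int) -> Click:
    """Shift a click horizontally to model a grounding error of dx pixels."""
    return Click(click.x + dx, click.y)


# Hypothetical widgets: a 16-pixel toolbar icon and a 120-pixel dialog button.
small_icon = Box(100, 200, 16, 16)
wide_button = Box(100, 300, 120, 32)
```

With these sizes, a ten-pixel horizontal error from the center still lands on `wide_button` but misses `small_icon`, which is the failure mode the paragraph above describes.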

02

The six task areas

The 369 tasks are distributed unevenly across six task areas, with web-and-office accounting for roughly half the corpus.

| Task area | What it tests |
| --- | --- |
| Office productivity | LibreOffice Writer/Calc/Impress tasks: edit documents, build spreadsheets, compose presentations from given data. Sensitive to UI navigation precision. |
| Web + browser | Firefox or Chrome tasks: similar to WebArena but with native browser navigation rather than DOM access. Tests visual grounding more than HTML parsing. |
| File management | GNOME Files / Nautilus: locate, move, rename, archive, extract. The most reliable signal of basic computer-use competence. |
| Code editing | VS Code tasks: open project, edit specific file, save, run. Overlaps with SWE-bench in spirit but tests the IDE workflow rather than the patch. |
| Image editing | GIMP tasks: crop, resize, adjust, export. The most VLM-heavy category; vision capability dominates. |
| Terminal | Bash tasks: navigate, run scripts, manage processes. Closest to AgentBench OS environment but with a real terminal UI rather than a clean tool interface. |

Per-area scores vary widely. File management is the easiest category (clear UI elements, deterministic outcomes) and image editing is the hardest (continuous parameter spaces, subjective output). For most frontier agents the relative ordering of the areas has stayed the same across model generations: file management at the top, then browser, then office, then code, then terminal, with image editing at the bottom.

03

Screenshot-grounded scoring

Each OSWorld task ships with a success function that runs after the agent declares completion. The success function reads the file system or application state and checks whether the expected outcome was achieved: a file exists at a specific path, a spreadsheet cell contains the expected value, a presentation has the expected number of slides with the expected titles. The check is programmatic; there is no LLM-as-judge in the success path.

This is the same approach WebArena and GAIA take, and it is a major reason these three benchmarks are more trustworthy than agent benchmarks that rely on LLM-graded outputs. The downside is that the success function must be carefully written: a slightly mis-specified check can either over-credit (a partial completion scores success because the check is too permissive) or under-credit (a correct answer in an unexpected format scores failure). The OSWorld authors have iterated on the success functions since launch, and the third-party reproducibility tests confirm they are now broadly stable.
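The over-credit and under-credit failure modes are easy to reproduce in miniature. The sketch below uses CSV text as a stand-in for a spreadsheet and three hypothetical checks for the same task ("the total for widgets should be 42"); none of these functions come from the OSWorld codebase.

```python
import csv
import io


def read_cell(csv_text: str, row: int, col: int) -> str:
    """Return one cell from CSV text (a stand-in for reading a spreadsheet)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[row][col]


def strict_check(csv_text: str) -> bool:
    # Too strict: an exact string match rejects "42.0" even though the value is right.
    return read_cell(csv_text, 1, 1) == "42"


def permissive_check(csv_text: str) -> bool:
    # Too permissive: any occurrence of "42" anywhere in the file passes.
    return "42" in csv_text


def tolerant_check(csv_text: str) -> bool:
    # Compares the target cell numerically, so formatting differences do not matter.
    try:
        return abs(float(read_cell(csv_text, 1, 1)) - 42.0) < 1e-9
    except (ValueError, IndexError):
        return False


correct_other_format = "item,total\nwidgets,42.0\n"   # right value, unexpected format
wrong_value = "item,total\nwidgets,142\n"             # wrong value containing "42"
```

Here `strict_check` under-credits `correct_other_format`, `permissive_check` over-credits `wrong_value`, and `tolerant_check` gets both right. Real success functions face the same trade-off against much richer application state.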

One nuance: a task can succeed by accident. If an agent fumbles around and happens to click the right thing, the success function will credit the trajectory. This is consistent with the WebArena methodology and matches the practical reality that users care about outcomes, not trajectories. Per-action accuracy is sometimes reported alongside task success in research papers to give a more granular view.
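The difference between the two metrics can be shown with a toy calculation. The trajectory records below are invented for illustration, and the per-action labels assume some reference annotation of which actions were correct, which is an assumption rather than something the benchmark itself provides.

```python
from typing import Dict, List


def task_success_rate(trajectories: List[Dict]) -> float:
    """Fraction of trajectories whose final state passed the success function."""
    return sum(t["success"] for t in trajectories) / len(trajectories)


def per_action_accuracy(trajectories: List[Dict]) -> float:
    """Fraction of individual actions judged correct, pooled over all trajectories."""
    flags = [ok for t in trajectories for ok in t["actions_ok"]]
    return sum(flags) / len(flags)


# Invented data: the first agent fumbles but ends in the right state,
# the second acts cleanly until a final mistake fails the task.
trajectories = [
    {"success": True, "actions_ok": [True, False, False, True]},
    {"success": False, "actions_ok": [True, True, True, False]},
]
```

On this data the task success rate is 0.5 while per-action accuracy is 0.625, and the trajectory with the worse per-action record is the one that gets credit.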

04

Harness design and what it adds

The OSWorld baseline harness is intentionally simple: take a screenshot, call a vision-language model with the task instruction and screenshot, parse the response into a mouse-or-keyboard action, execute, repeat. Strong 2026 scaffolds add four ingredients on top of this baseline:

First, set-of-marks (SoM) annotation: overlay numbered boxes on UI elements detected in the screenshot, so the model can reference them by index rather than coordinate. This alone typically adds 5 to 8 points to a model that previously had to specify pixel coordinates. Second, a planning loop: decompose the task into sub-goals and execute each, with re-planning on failure. Third, action verification: after each action, take a fresh screenshot and check whether the expected change happened before proceeding. Fourth, OCR over the screenshot to give the model text-content access in addition to visual grounding.
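Two of these ingredients are simple enough to sketch. The code below assumes a hypothetical detector has already produced numbered bounding boxes; `resolve_mark` turns a mark index into click coordinates, and `screen_changed` is a deliberately crude form of action verification that only detects that something changed, not that the expected change happened.

```python
import hashlib
from typing import Dict, Tuple

# (left, top, width, height) in screen pixels
BoxTuple = Tuple[int, int, int, int]


def resolve_mark(marks: Dict[int, BoxTuple], index: int) -> Tuple[int, int]:
    """Map a set-of-marks index chosen by the model to the center of its box."""
    if index not in marks:
        raise KeyError(f"model referenced unknown mark {index}")
    left, top, width, height = marks[index]
    return (left + width // 2, top + height // 2)


def screen_changed(before: bytes, after: bytes) -> bool:
    """Crude verification: did the screenshot bytes change at all after the action?"""
    return hashlib.sha256(before).digest() != hashlib.sha256(after).digest()


# Hypothetical detector output for one screenshot.
marks = {
    1: (10, 10, 80, 24),
    2: (100, 10, 80, 24),
    3: (100, 200, 16, 16),
}
```

With marks in place the model only has to answer "click mark 3"; the scaffold supplies the coordinates, which removes the pixel-estimation step where small errors are most costly. A production verifier would compare against the expected change rather than any change.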

Stacking all four ingredients on top of a strong frontier vision-language model is what separates the leading submissions from the baseline. The same underlying model can score very differently in the baseline harness versus a strong scaffold. As with all agentic benchmarks, harness disclosure is essential for meaningful comparison.
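For reference, the baseline loop described at the start of this section (screenshot, model call, parse, execute, repeat) can be sketched as follows. Everything here is a stand-in: the `CLICK x y` / `TYPE text` / `DONE` response format, the `FakeDesktop` environment, the scripted model, and the 15-step cap are assumptions for illustration, not the official harness.

```python
from typing import Callable, List, Tuple

Action = Tuple  # ("click", x, y), ("type", text), or ("done",)


def parse_action(response: str) -> Action:
    """Parse a hypothetical text response format into an action tuple."""
    verb, _, rest = response.strip().partition(" ")
    verb = verb.upper()
    if verb == "CLICK":
        x, y = rest.split()
        return ("click", int(x), int(y))
    if verb == "TYPE":
        return ("type", rest)
    if verb == "DONE":
        return ("done",)
    raise ValueError(f"unparseable model response: {response!r}")


class FakeDesktop:
    """Toy environment: one text field that must be clicked before typing."""

    def __init__(self) -> None:
        self.focused = False
        self.typed = ""

    def screenshot(self) -> bytes:
        # A real harness would return image bytes; this encodes the state as text.
        return f"focused={self.focused};typed={self.typed}".encode()

    def execute(self, action: Action) -> None:
        if action[0] == "click":
            self.focused = True
        elif action[0] == "type" and self.focused:
            self.typed += action[1]


def scripted_model(instruction: str, screenshot: bytes) -> str:
    """Stand-in for a vision-language model call."""
    state = screenshot.decode()
    if "focused=False" in state:
        return "CLICK 120 48"
    if state.endswith("typed="):
        return "TYPE hello"
    return "DONE"


def run_baseline(instruction: str, env: FakeDesktop,
                 model: Callable[[str, bytes], str],
                 max_steps: int = 15) -> List[Action]:
    """Observe, ask the model, parse, execute; stop on DONE or the step cap."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = parse_action(model(instruction, env.screenshot()))
        history.append(action)
        if action[0] == "done":
            break
        env.execute(action)
    return history
```

Each scaffold ingredient slots into this loop at a specific point: annotation and OCR enrich the observation before the model call, planning wraps the whole loop, and verification sits between `env.execute` and the next iteration.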

05

Reading the scores

The original OSWorld paper reported a GPT-4V agent at 12.2% on the 369 Linux-core tasks against a human baseline of roughly 72%, which is the one cleanly sourced anchor for the benchmark. Scores have climbed since, but progress is slow because computer use is genuinely hard, and OSWorld has the largest remaining headroom of any active agent benchmark. We do not reprint a per-model progression table: the leaderboard moves continuously, scores swing on harness choice, and vendor self-reports are not comparable to public submissions.

For current results, read the official OSWorld leaderboard and the harness used by each submission together. To choose the right benchmark for your use case rather than chase a single figure, start from the homepage task picker. Quote the scaffold and whether the number is Linux-core whenever you cite an OSWorld score.

06

Strengths and limitations

OSWorld's strengths are clear: real OS, real applications, real screenshots, real keyboard-and-mouse interactions, programmatic success checking, six task areas, large headroom. It is the closest public benchmark to the production conditions facing consumer-grade computer-use agents.

The main limitations: Linux-core means Windows and macOS performance is not directly measured (the extension sets help but are not always reported); the task corpus is modest (369 tasks) so per-area noise is real; the action space is more granular than most production agents need (modern computer-use models often have higher-level actions like "type into the focused field", which the benchmark partly accommodates but not uniformly); and the success functions are sensitive to small format variations that human users would not notice.
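The per-area noise point can be quantified with a confidence interval. The sketch below computes a Wilson score interval for a success rate; the counts used in the example (10 of 25 tasks in one area, 148 of 369 overall) are hypothetical and chosen only to show how the interval widens as the task count shrinks.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts with a similar observed rate of about 40%.
area_low, area_high = wilson_interval(10, 25)      # one small task area
full_low, full_high = wilson_interval(148, 369)    # the whole corpus
```

At the same observed rate, the interval for the 25-task area is several times wider than the one for the full corpus, so a difference of a few points between two agents on a single area is rarely meaningful on its own.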

The benchmark also has a contamination concern that is harder to quantify than for text benchmarks. Vision-language models trained on screenshots scraped from the web may have seen OSWorld-adjacent UIs, including LibreOffice and GIMP layouts. The risk is not symbolic recall (the test answers are not in pre-training data) but rather familiarity with the specific UI layouts, which lifts scores on this benchmark relative to genuinely novel UIs. The wider contamination concern applies here in a softer form.

07

When to use OSWorld in 2026

OSWorld is the right headline benchmark for general computer-use agents. If your agent takes screenshots, controls a real OS, operates across multiple native applications, and produces file-system or application-state changes, OSWorld is closer to your task than any other public benchmark. Quote the Linux-core 369-task number as the primary headline; quote per-area scores for credibility.

For more constrained tasks, look elsewhere: WebArena for browser-only DOM-based agents, SWE-bench Verified for engineering, Tau-Bench for tool-using customer-service dialogue, GAIA for research-assistant workflows, and Terminal-Bench for shell-only tasks. OSWorld is the union-of-everything benchmark, with the corresponding difficulty.

Editor's verdict: OSWorld is the hardest agent benchmark in active use, the most realistic, and the most informative for the computer-use agent class. Quote it for any vendor claiming general desktop competence. Treat single-shot baseline numbers as floors, agentic-scaffold numbers as the headline.
Reader Questions
Q.01 What is OSWorld?
OSWorld is a benchmark for computer-use agents released in April 2024 by researchers at the University of Hong Kong, Salesforce, and Carnegie Mellon. Its core set contains 369 real computer-task scenarios on Ubuntu Linux; extension sets add Windows and macOS task variants. Tasks involve real applications: web browsers, file managers, terminals, code editors, office suites, image editors. The agent receives screenshots and controls the OS via keyboard and mouse actions. Success is checked programmatically against the file system or application state.
Q.02 How is OSWorld different from WebArena or GAIA?
OSWorld covers the whole desktop. WebArena is browser-only, and GAIA mostly resolves via web search and file reading. OSWorld tasks routinely involve switching between several native applications: download a file in the browser, open it in a spreadsheet, paste a chart into a slide deck, save the deck. The action space (mouse and keyboard) is also lower-level than WebArena (DOM actions) or GAIA (tool calls). This is what makes OSWorld harder and more useful for evaluating general computer-use agents like Anthropic Computer Use, OpenAI Operator, or Google Project Mariner.
Q.03 Where can I see current OSWorld scores?
The live ranking is the official OSWorld leaderboard at os-world.github.io. We do not reprint a per-model table here: that board moves continuously, scores depend heavily on the harness (a baseline single-shot agent and a strong agentic scaffold can be the same underlying model), and vendor self-reports are not comparable to public submissions. Whatever the headline, OSWorld success rates remain well below the human baseline (around 72 percent in the original paper) and well below WebArena scores for comparable agents. OSWorld is one of the agent benchmarks with the most headroom remaining.
Q.04 Why are OSWorld scores so much lower than browser-agent scores?
Three reasons. First, the action space is lower-level: every click, keystroke, and scroll is a separate action, and precision matters. Second, the task surface is broader: a single OSWorld task might span four distinct native applications with different UIs, whereas WebArena is one site at a time. Third, the screenshot-grounding requirement is unforgiving: misreading a button position by ten pixels can derail a 30-step trajectory. The combination is what makes OSWorld a harder, more realistic computer-use evaluation.
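A back-of-the-envelope model shows why long trajectories are so fragile. It assumes every step succeeds independently with the same probability and that a single failed step ends the task, which ignores recovery and is therefore pessimistic; the per-step rates are hypothetical.

```python
def trajectory_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, unrecoverable steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Hypothetical per-step reliabilities over a 30-step trajectory.
at_98_percent = trajectory_success(0.98, 30)
at_95_percent = trajectory_success(0.95, 30)
```

Under these assumptions a 98% per-step agent completes a 30-step task only a little over half the time, and a 95% per-step agent completes it roughly one time in five. This is why re-planning and action verification in the scaffold matter as much as raw grounding accuracy.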
Q.05 Does OSWorld test Windows, macOS, or only Linux?
The core 369-task benchmark uses Ubuntu Linux as the test environment for reproducibility. The OSWorld team has released extension sets that include Windows and macOS task variants. Most published scores report the Ubuntu-core number. When a vendor claims Windows performance, ask whether they're quoting the OSWorld Windows subset specifically or a separate internal benchmark. The naming is sometimes ambiguous.
Q.06 What harness is standard for OSWorld?
The official OSWorld code provides a baseline agent that takes screenshots, calls a VLM (vision-language model), and emits mouse-and-keyboard actions. Modern scaffolds add: a planning loop, a set-of-marks annotation overlay (numbered boxes on UI elements), a memory of past actions, and OCR over the screenshot. The strongest submissions combine all four. As with WebArena, harness disclosure is essential when quoting an OSWorld number: the same underlying model can score very differently in a baseline single-shot harness versus a strong agentic scaffold, so a score without its harness is close to meaningless.

Sources

[1] Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972.
[2] OSWorld project site and leaderboard. os-world.github.io.
[3] OSWorld reproducibility repository. github.com/xlang-ai/OSWorld.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
