OSWorld: Computer-Use Agents on Real Operating Systems
The hardest agent benchmark in active use in 2026. Real applications, real screenshots, low-level mouse-and-keyboard actions, 369 tasks across office, browser, files, code, image, and terminal. Frontier agents still trail humans by 30 points; this is where the next two years of agent capability will be measured.
What OSWorld measures
OSWorld, introduced by Xie et al. in April 2024, evaluates agents on real computer-use tasks across an actual Ubuntu Linux desktop. The benchmark contains 369 task scenarios, each defined by an initial system state, a natural-language instruction, and a programmatic success function that checks the file system or application state after the agent finishes. Tasks span six broad areas (see below) and routinely require the agent to switch between native applications, handle real UIs, and reason over rendered screenshots rather than parsed text.
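The three-part task definition (initial state, instruction, success function) can be sketched as a small data structure. This is a minimal illustration, not the OSWorld task schema: the `Task` class, the `run_task` helper, and the file-rename task are all hypothetical, and a temporary directory stands in for the VM.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for one benchmark task."""
    task_id: str
    instruction: str                   # natural-language goal shown to the agent
    setup: Callable[[Path], None]      # puts the sandbox into the initial state
    check: Callable[[Path], bool]      # inspects the final state, returns success


def _setup_rename(root: Path) -> None:
    (root / "draft.txt").write_text("quarterly numbers\n")


def _check_rename(root: Path) -> bool:
    # Success is judged on the resulting state only, not on how it was reached.
    return (root / "report.txt").exists() and not (root / "draft.txt").exists()


RENAME_TASK = Task(
    task_id="demo-rename",
    instruction="Rename draft.txt to report.txt.",
    setup=_setup_rename,
    check=_check_rename,
)


def run_task(task: Task, agent: Callable[[str, Path], None]) -> bool:
    """Set up the initial state, let the agent act, then run the check."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        task.setup(root)
        agent(task.instruction, root)
        return task.check(root)
```

An `agent` here is any callable that receives the instruction and the sandbox root; one that does nothing fails the check, and one that renames the file passes it.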
The defining choice is that OSWorld uses a real OS, not a simulated environment. The agent runs against an actual Ubuntu VM with actual Chrome, LibreOffice, GNOME Files, GIMP, and so on. Screenshots are real renders; mouse and keyboard actions are sent to real applications; failures are real (windows can crash, dialogs can appear, file dialogs can have unexpected defaults). This is the closest public benchmark to the production conditions Anthropic Computer Use, OpenAI Operator, and Google Project Mariner are trying to solve. It is also why scores are so much lower than on cleaner abstractions like WebArena.
OSWorld's second defining choice is that the action space is keyboard and mouse, not DOM actions or tool calls. The agent decides what to click and where, what to type, where to scroll. Pixel-precision matters: misreading the position of a button by ten pixels on a small UI element can derail a long trajectory. This puts OSWorld squarely in the territory of vision-language agents; text-only models cannot meaningfully attempt the benchmark.
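A small sketch shows why a ten-pixel error matters at this level of the action space. The action classes and the 16-by-16 button geometry below are hypothetical, chosen only to illustrate the point; they are not the OSWorld action format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int


# A low-level action is one of these primitives; there is no "click Save" shortcut.
Action = Union[Click, TypeText, Scroll]


@dataclass(frozen=True)
class Box:
    """Screen-space bounding box of a UI element, in pixels."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, click: Click) -> bool:
        return (self.left <= click.x < self.left + self.width
                and self.top <= click.y < self.top + self.height)


# Hypothetical 16x16 toolbar button.
small_button = Box(left=1000, top=8, width=16, height=16)

on_target = Click(x=1008, y=16)    # centre of the button
ten_px_off = Click(x=1018, y=16)   # same row, ten pixels to the right

print(small_button.contains(on_target))   # True
print(small_button.contains(ten_px_off))  # False: the click lands outside the button
```

On a large dialog button the same ten-pixel error would still hit; on small toolbar icons it does not, and every later step then runs against a screen the agent did not expect.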
The six task areas
The 369 tasks are distributed unevenly across six task areas, with web-and-office accounting for roughly half the corpus.
Per-area scores vary widely. File management is the easiest category (clear UI elements, deterministic outcomes) and image editing is the hardest (continuous parameter spaces, subjective output). For most frontier agents the relative ordering of the areas stays the same from one model generation to the next: file management at the top, then browser, then office, then code, then terminal, with image editing at the bottom.
Screenshot-grounded scoring
Each OSWorld task ships with a success function that runs after the agent declares completion. The success function reads the file system or application state and checks whether the expected outcome was achieved: a file exists at a specific path, a spreadsheet cell contains the expected value, a presentation has the expected number of slides with the expected titles. The check is programmatic; there is no LLM-as-judge in the success path.
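As a rough illustration of what such a check looks like, the sketch below verifies that an exported file exists and that one of its cells holds an expected value. The file path, the cell position, and the helper names are assumptions made for the example; real OSWorld checkers inspect application formats, while this one reads a CSV to stay within the standard library.

```python
import csv
import tempfile
from pathlib import Path


def csv_cell_equals(csv_path: Path, row: int, col: int, expected: str) -> bool:
    """True if the cell at (row, col) exists and holds exactly `expected`."""
    with csv_path.open(newline="") as handle:
        rows = list(csv.reader(handle))
    try:
        return rows[row][col] == expected
    except IndexError:
        return False


def check_export_task(root: Path) -> bool:
    """Hypothetical check: export/totals.csv exists and cell (1, 1) is '42'."""
    target = root / "export" / "totals.csv"
    return target.is_file() and csv_cell_equals(target, row=1, col=1, expected="42")


# Demo against a throwaway directory standing in for the VM's file system.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "export").mkdir()
    (root / "export" / "totals.csv").write_text("item,total\nwidgets,42\n")
    print(check_export_task(root))  # True
```

The function never looks at screenshots or the agent's reasoning; it reads the final state and returns a boolean, which is what keeps the score free of model judgment.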
OSWorld shares this outcome-based, programmatic style of scoring with WebArena and GAIA (GAIA by matching a final answer rather than inspecting system state), and it is a major reason these three benchmarks are more trustworthy than agent benchmarks that rely on LLM-graded outputs. The downside is that the success function must be carefully written: a slightly mis-specified check can either over-credit (a partial completion scores success because the check is too permissive) or under-credit (a correct answer in an unexpected format scores failure). The OSWorld authors have iterated on the success functions since launch, and third-party reproducibility tests indicate they are now broadly stable.
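The over-credit and under-credit failure modes are easy to reproduce with three toy checks on a numeric cell. The function names and sample values are hypothetical; the point is only how the strictness of the comparison changes the verdict.

```python
from decimal import Decimal, InvalidOperation


def strict_check(actual: str, expected: str) -> bool:
    """Exact string equality: rejects correct values in another format."""
    return actual == expected


def permissive_check(actual: str, expected: str) -> bool:
    """Substring match: accepts wrong values that merely contain the expected text."""
    return expected in actual


def _to_decimal(text: str) -> Decimal:
    # Drop surrounding whitespace and thousands separators before parsing.
    return Decimal(text.strip().replace(",", ""))


def normalized_check(actual: str, expected: str) -> bool:
    """Compare as numbers, so formatting differences do not matter."""
    try:
        return _to_decimal(actual) == _to_decimal(expected)
    except InvalidOperation:
        return False


expected = "1234.5"

# Correct value, different formatting: the strict check under-credits.
print(strict_check("1,234.50", expected))      # False
print(normalized_check("1,234.50", expected))  # True

# Wrong value that contains the expected digits: the permissive check over-credits.
print(permissive_check("11234.5", expected))   # True
print(normalized_check("11234.5", expected))   # False
```

Normalizing before comparing is one common remedy, but each task needs its own decision about which variations count as the same answer.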
One nuance: a task can succeed by accident. If an agent fumbles around and happens to click the right thing, the success function will credit the trajectory. This is consistent with the WebArena methodology and matches the practical reality that users care about outcomes, not trajectories. Per-action accuracy is sometimes reported alongside task success in research papers to give a more granular view.
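The difference between the two metrics can be shown with a pair of made-up trajectories. The `Trajectory` record and the per-action correctness flags are assumptions for illustration; producing such flags in practice requires a reference annotation that OSWorld's success functions do not provide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trajectory:
    succeeded: bool              # verdict of the task's success function
    actions_correct: List[bool]  # one flag per action, from a hypothetical annotation


def task_success_rate(trajectories: List[Trajectory]) -> float:
    return sum(t.succeeded for t in trajectories) / len(trajectories)


def per_action_accuracy(trajectories: List[Trajectory]) -> float:
    flags = [flag for t in trajectories for flag in t.actions_correct]
    return sum(flags) / len(flags)


runs = [
    # Fumbled two of four actions but still ended in the right state.
    Trajectory(succeeded=True, actions_correct=[True, False, False, True]),
    # Three clean actions, then one mistake that sank the task.
    Trajectory(succeeded=False, actions_correct=[True, True, True, False]),
]

print(task_success_rate(runs))    # 0.5
print(per_action_accuracy(runs))  # 0.625
```

The first run is the accidental success described above: it counts fully toward task success even though half of its actions were wrong.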
Harness design and what it adds
The OSWorld baseline harness is intentionally simple: take a screenshot, call a vision-language model with the task instruction and screenshot, parse the response into a mouse-or-keyboard action, execute, repeat. Strong 2026 scaffolds add four ingredients on top of this baseline:
First, set-of-marks (SoM) annotation: overlay numbered boxes on UI elements detected in the screenshot, so the model can reference them by index rather than coordinate. This alone typically adds 5 to 8 points to a model that previously had to specify pixel coordinates. Second, a planning loop: decompose the task into sub-goals and execute each, with re-planning on failure. Third, action verification: after each action, take a fresh screenshot and check whether the expected change happened before proceeding. Fourth, OCR over the screenshot to give the model text-content access in addition to visual grounding.
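Two of these ingredients are simple enough to sketch. The code below assumes a UI-element detector has already produced numbered boxes, and it uses the weakest possible form of verification (did the screen change at all); the `Mark` class and both helper names are hypothetical, and real scaffolds check for the specific expected change rather than any change.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Mark:
    """One numbered box overlaid on the screenshot."""
    index: int
    left: int
    top: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        return (self.left + self.width // 2, self.top + self.height // 2)


def resolve_mark(marks: List[Mark], index: int) -> Tuple[int, int]:
    """Turn the model's 'click mark N' into pixel coordinates."""
    for mark in marks:
        if mark.index == index:
            return mark.center()
    raise KeyError(f"no mark with index {index}")


def screen_changed(before: bytes, after: bytes) -> bool:
    """Crude action verification: did the action alter the screenshot at all?"""
    return before != after


# Hypothetical detector output for two toolbar buttons.
marks = [
    Mark(index=1, left=100, top=40, width=80, height=24),
    Mark(index=2, left=200, top=40, width=80, height=24),
]

print(resolve_mark(marks, 2))                   # (240, 52)
print(screen_changed(b"frame-a", b"frame-a"))   # False: the click had no visible effect
```

With marks, the model only has to choose the right index; the scaffold supplies the coordinates, which removes most of the pixel-precision burden from the model itself.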
Stacking all four ingredients on top of a strong frontier vision-language model is what separates the leading submissions from the baseline. The same underlying model can score very differently in the baseline harness versus a strong scaffold. As with all agentic benchmarks, harness disclosure is essential for meaningful comparison.
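For reference, the baseline loop that these ingredients wrap (screenshot, model call, parse, execute, repeat) fits in a few lines. This is a sketch under stated assumptions, not the OSWorld harness: the `Environment` interface, the action strings (`CLICK x y`, `TYPE text`, `DONE`), and the `FakeEnv` and `scripted_model` stand-ins are all invented for the example.

```python
from typing import Callable, List, Protocol


class Environment(Protocol):
    def screenshot(self) -> bytes: ...
    def execute(self, action: str) -> None: ...


Model = Callable[[str, bytes], str]  # (instruction, screenshot) -> raw reply


def run_baseline(env: Environment, model: Model, instruction: str,
                 max_steps: int = 15) -> int:
    """Observe, ask the model, act, repeat. Returns the number of actions executed."""
    for step in range(max_steps):
        reply = model(instruction, env.screenshot())
        action = reply.strip()          # "parsing" is trivial in this sketch
        if action == "DONE":
            return step
        env.execute(action)
    return max_steps                    # step budget exhausted without DONE


class FakeEnv:
    """In-memory stand-in for the VM: records actions, fakes screenshots."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: str) -> None:
        self.log.append(action)


def scripted_model(instruction: str, screenshot: bytes) -> str:
    """Stand-in for a vision-language model: replies from a fixed script."""
    script = {b"frame-0": "CLICK 120 48", b"frame-1": "TYPE hello"}
    return script.get(screenshot, "DONE")


env = FakeEnv()
print(run_baseline(env, scripted_model, "Type hello into the search box."))  # 2
print(env.log)  # ['CLICK 120 48', 'TYPE hello']
```

Everything a stronger scaffold adds sits either before the model call (annotating or transcribing the screenshot) or after `execute` (checking the result and re-planning), which is why the same model can land at very different scores.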
Reading the scores
The original OSWorld paper reported a GPT-4V agent at 12.2% on the 369 Linux-core tasks against a human baseline of roughly 72%, which is the one cleanly sourced anchor for the benchmark. Scores have climbed since, but progress is slow because computer use is genuinely hard, and OSWorld has the largest remaining headroom of any active agent benchmark. We do not reprint a per-model progression table: the leaderboard moves continuously, scores swing on harness choice, and vendor self-reports are not comparable to public submissions.
For current results, read the official OSWorld leaderboard and the harness used by each submission together. To choose the right benchmark for your use case rather than chase a single figure, start from the homepage task picker. Quote the scaffold and whether the number is Linux-core whenever you cite an OSWorld score.
Strengths and limitations
OSWorld's strengths are clear: real OS, real applications, real screenshots, real keyboard-and-mouse interactions, programmatic success checking, six task areas, large headroom. It is the closest public benchmark to the production conditions facing consumer-grade computer-use agents.
The main limitations: Linux-core means Windows and macOS performance is not directly measured (the extension sets help but are not always reported); the task corpus is modest (369 tasks) so per-area noise is real; the action space is more granular than most production agents need (modern computer-use models often have higher-level actions like "type into the focused field", which the benchmark partly accommodates but not uniformly); and the success functions are sensitive to small format variations that human users would not notice.
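To put a rough number on the per-area noise, the sketch below computes a 95% confidence half-width for a success rate under a normal approximation to the binomial. The 40% success rate and the 30-task area size are hypothetical values picked for illustration; only the 369-task total comes from the benchmark.

```python
import math


def half_width_95(success_rate: float, n_tasks: int) -> float:
    """Half-width of a 95% interval, normal approximation to the binomial."""
    standard_error = math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)
    return 1.96 * standard_error


# Full benchmark: 369 tasks, hypothetical 40% success rate.
print(round(100 * half_width_95(0.40, 369), 1))  # 5.0 points

# Hypothetical single area with 30 tasks at the same success rate.
print(round(100 * half_width_95(0.40, 30), 1))   # 17.5 points
```

Under these assumptions a headline score is good to about plus or minus five points, while a per-area score on a few dozen tasks can move by more than fifteen points from sampling noise alone, before any run-to-run variance in the agent is counted.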
The benchmark also has a contamination concern that is harder to quantify than for text benchmarks. Vision-language models trained on screenshots scraped from the web may have seen OSWorld-adjacent UIs, including LibreOffice and GIMP layouts. The risk is not symbolic recall (the test answers are not in pre-training data) but rather familiarity with the specific UI layouts, which lifts scores on this benchmark relative to genuinely novel UIs. The wider contamination concern applies here in a softer form.
When to use OSWorld in 2026
OSWorld is the right headline benchmark for general computer-use agents. If your agent takes screenshots, controls a real OS, operates across multiple native applications, and produces file-system or application-state changes, OSWorld is closer to your task than any other public benchmark. Quote the Linux-core 369-task number as the primary headline; quote per-area scores for credibility.
For more constrained tasks, look elsewhere: WebArena for browser-only DOM-based agents, SWE-bench Verified for engineering, Tau-Bench for tool-using customer-service dialogue, GAIA for research-assistant workflows, and Terminal-Bench for shell-only tasks. OSWorld is the union-of-everything benchmark, with the corresponding difficulty.
Sources
- [1] Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972.
- [2] OSWorld project site and leaderboard. os-world.github.io.
- [3] OSWorld reproducibility repository. github.com/xlang-ai/OSWorld.