OSWorld: Computer-Use Agents on Real Operating Systems
The hardest agent benchmark in active use in 2026. Real applications, real screenshots, low-level mouse-and-keyboard actions, 369 tasks across office, browser, files, code, image, and terminal. Frontier agents still trail humans by 30 points; this is where the next two years of agent capability will be measured.
What OSWorld measures
OSWorld, introduced by Xie et al. in April 2024, evaluates agents on real computer-use tasks across an actual Ubuntu Linux desktop. The benchmark contains 369 task scenarios, each defined by an initial system state, a natural-language instruction, and a programmatic success function that checks the file system or application state after the agent finishes. Tasks span six broad areas (see below) and routinely require the agent to switch between native applications, handle real UIs, and reason over rendered screenshots rather than parsed text.
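Concretely, each task bundles three things: setup, an instruction, and a machine-checkable goal. The sketch below illustrates that shape in Python; the field names and values are hypothetical, not the actual OSWorld task schema.

```python
# Illustrative shape of an OSWorld-style task definition. Field names and
# values are hypothetical, not the actual OSWorld JSON schema.
task = {
    "id": "libreoffice_calc_042",  # hypothetical identifier
    "instruction": "Add a Total column to budget.xlsx that sums columns B through D.",
    "initial_state": {
        # State restored into the VM before the agent starts.
        "files": ["/home/user/budget.xlsx"],
        "open_apps": ["libreoffice-calc"],
    },
    "evaluator": {
        # Programmatic success function run after the agent finishes:
        # it inspects the saved file, not the agent's trajectory.
        "func": "check_spreadsheet_cell",
        "args": {"path": "/home/user/budget.xlsx", "cell": "E2", "expected": 1500},
    },
}
```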
The defining choice is that OSWorld uses a real OS, not a simulated environment. The agent runs against an actual Ubuntu VM with actual Firefox, LibreOffice, GNOME Files, GIMP, and so on. Screenshots are real renders; mouse and keyboard actions are sent to real applications; failures are real (windows can crash, dialogs can appear, file dialogs can have unexpected defaults). This is the closest public benchmark to the production conditions that Anthropic's Computer Use, OpenAI's Operator, and Google's Project Mariner are built to handle. It is also why scores are so much lower than on cleaner abstractions like WebArena.
OSWorld's second defining choice is that the action space is keyboard and mouse, not DOM actions or tool calls. The agent decides what to click and where, what to type, where to scroll. Pixel-precision matters: misreading the position of a button by ten pixels on a small UI element can derail a long trajectory. This puts OSWorld squarely in the territory of vision-language agents; text-only models cannot meaningfully attempt the benchmark.
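In the official harness the agent emits pyautogui-style commands that are executed inside the VM, so a single step looks like ordinary desktop-automation code. The snippet below is illustrative; the coordinates and filename are invented.

```python
import pyautogui

# One agent step is a handful of low-level desktop-automation calls.
# Coordinates are invented for illustration; the agent must derive them
# from the screenshot, and a ten-pixel error can derail the trajectory.
pyautogui.click(412, 236)            # open the "File" menu
pyautogui.click(438, 310)            # choose "Save As..." from the dropdown
pyautogui.write("report_final.odt")  # type into the filename field
pyautogui.press("enter")             # confirm the dialog
```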
The six task areas
The 369 tasks are distributed unevenly across six task areas, with web-and-office accounting for roughly half the corpus.
Per-area scores vary widely. File management is the easiest category (clear UI elements, deterministic outcomes) and image editing is the hardest (continuous parameter spaces, subjective output). Most frontier agents show the same relative ordering across areas, and it holds across model generations: file management at the top, then browser, then office, then code, then terminal, with image editing at the bottom.
Programmatic success checking
Each OSWorld task ships with a success function that runs after the agent declares completion. The success function reads the file system or application state and checks whether the expected outcome was achieved: a file exists at a specific path, a spreadsheet cell contains the expected value, a presentation has the expected number of slides with the expected titles. The check is programmatic; there is no LLM-as-judge in the success path.
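For a spreadsheet task, such a success function might look like the sketch below. This is a minimal illustration of the state-checking pattern, not an actual OSWorld evaluator; the path, sheet name, and expected value are invented.

```python
from pathlib import Path

import openpyxl  # third-party .xlsx reader


def check_success(path: str = "/home/user/budget.xlsx") -> bool:
    """State-based check: inspect the artifact the agent left behind."""
    # The output file must exist at the expected location...
    if not Path(path).exists():
        return False
    # ...and the expected cell must contain the expected computed value.
    workbook = openpyxl.load_workbook(path, data_only=True)
    sheet = workbook["Sheet1"]
    return sheet["E2"].value == 1500
```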
This is the same approach WebArena and GAIA take, and it is a major reason these three benchmarks are more trustworthy than agent benchmarks that rely on LLM-graded outputs. The downside is that the success function must be carefully written: a slightly mis-specified check can either over-credit (a partial completion scores success because the check is too permissive) or under-credit (a correct answer in an unexpected format scores failure). The OSWorld authors have iterated on the success functions since launch, and the third-party reproducibility tests confirm they are now broadly stable.
One nuance: a task can succeed by accident. If an agent fumbles around and happens to click the right thing, the success function will credit the trajectory. This is consistent with the WebArena methodology and matches the practical reality that users care about outcomes, not trajectories. Per-action accuracy is sometimes reported alongside task success in research papers to give a more granular view.
Harness design and what it adds
The OSWorld baseline harness is intentionally simple: take a screenshot, call a vision-language model with the task instruction and screenshot, parse the response into a mouse-or-keyboard action, execute, repeat. Strong 2026 scaffolds add four ingredients on top of this baseline:
- Set-of-marks (SoM) annotation: overlay numbered boxes on UI elements detected in the screenshot, so the model can reference them by index rather than by coordinate. This alone typically adds 5 to 8 points over making the model specify raw pixel coordinates.
- A planning loop: decompose the task into sub-goals and execute each, re-planning on failure.
- Action verification: after each action, take a fresh screenshot and check that the expected change happened before proceeding.
- OCR over the screenshot: give the model text-content access in addition to visual grounding.
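Put together, the four ingredients wrap the baseline loop roughly as in the sketch below. Every helper here (plan, take_screenshot, annotate_with_marks, run_ocr, call_vlm, execute, verify_effect) is a hypothetical placeholder standing in for a real component, not OSWorld's API.

```python
from collections import deque

# Structural sketch of a 2026-style scaffold. All helpers (plan,
# take_screenshot, annotate_with_marks, run_ocr, call_vlm, execute,
# verify_effect) are hypothetical placeholders, not OSWorld's API.


def run_task(instruction: str, max_steps: int = 50) -> bool:
    subgoals = deque(plan(instruction))  # ingredient 2: planning loop
    for _ in range(max_steps):
        if not subgoals:
            return True  # every sub-goal completed
        screenshot = take_screenshot()
        marked, elements = annotate_with_marks(screenshot)  # ingredient 1: SoM
        text = run_ocr(screenshot)                          # ingredient 4: OCR
        action = call_vlm(instruction, subgoals[0], marked, text)
        if action.kind == "subgoal_done":
            subgoals.popleft()
            continue
        execute(action, elements)  # map a mark index back to coordinates
        # Ingredient 3: confirm the expected change landed before moving on.
        if not verify_effect(action, take_screenshot()):
            subgoals = deque(plan(instruction))  # re-plan from current state
    return not subgoals
```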
Stacking all four ingredients on top of a strong frontier vision-language model is what produces the 40-percent-plus scores in May 2026. The same underlying model in the baseline harness might score 25 percent. As with all agentic benchmarks, harness disclosure is essential for meaningful comparison.
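Of the four ingredients, set-of-marks is the easiest to show concretely. The sketch below draws numbered boxes with Pillow; where the boxes come from (an accessibility tree, OCR, or a detector model) is an assumed input, and the function name is ours.

```python
from PIL import Image, ImageDraw


def draw_set_of_marks(
    screenshot: Image.Image,
    boxes: list[tuple[int, int, int, int]],  # (left, top, right, bottom) from any detector
) -> Image.Image:
    """Overlay numbered boxes so the model can answer 'click [3]' instead
    of emitting raw pixel coordinates."""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for index, box in enumerate(boxes):
        draw.rectangle(box, outline="red", width=2)
        draw.text((box[0] + 3, box[1] + 3), str(index), fill="red")
    return marked
```

The harness keeps the index-to-box mapping, so when the model answers with a mark index it can resolve the click to the box's center.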
SOTA progression Apr 2024 to May 2026
OSWorld has progressed steadily but slowly. The headline 2026 numbers are still around 40 percent, compared to launch baselines around 12 percent: roughly a 3x improvement over two years, against a human baseline (around 72 percent) that is by definition flat. That pace is slower than text-only benchmarks of similar age have managed, and the slowness is a feature: computer use is genuinely hard, and the headroom on this benchmark is the largest of any active agent benchmark.
Strengths and limitations
OSWorld's strengths are clear: real OS, real applications, real screenshots, real keyboard-and-mouse interactions, programmatic success checking, six task areas, large headroom. It is the closest public benchmark to the production conditions facing consumer-grade computer-use agents.
The main limitations:
- Linux-core: Windows and macOS performance is not directly measured; the extension sets help but are not always reported.
- Modest corpus: at 369 tasks, per-area noise is real.
- Overly granular action space: modern computer-use models often expose higher-level actions like "type into the focused field", which the benchmark partly accommodates but not uniformly.
- Brittle success functions: they are sensitive to small format variations that human users would not notice.
The benchmark also has a contamination concern that is harder to quantify than for text benchmarks. Vision-language models trained on screenshots scraped from the web may have seen OSWorld-adjacent UIs, including LibreOffice and GIMP layouts. The risk is not verbatim recall (the test answers are not in pre-training data) but familiarity with the specific UI layouts, which lifts scores on this benchmark relative to genuinely novel UIs. The wider contamination concern applies here in a softer form.
When to use OSWorld in 2026
OSWorld is the right headline benchmark for general computer-use agents. If your agent takes screenshots, controls a real OS, operates across multiple native applications, and produces file-system or application-state changes, OSWorld is closer to your task than any other public benchmark. Quote the Linux-core 369-task number as the primary headline, and quote per-area scores for credibility.
For more constrained tasks, look elsewhere: WebArena for browser-only DOM-based agents, SWE-bench Verified for engineering, Tau-Bench for tool-using customer-service dialogue, GAIA for research-assistant workflows, and Terminal-Bench for shell-only tasks. OSWorld is the union-of-everything benchmark, with the corresponding difficulty.
Sources
- [1] Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972.
- [2] OSWorld project site and leaderboard, accessed May 2026. os-world.github.io.
- [3] OSWorld reproducibility repository. github.com/xlang-ai/OSWorld.