Independent reference. Not affiliated with OpenAI, Anthropic, Google DeepMind, Meta, Mistral, xAI, Papers With Code, HuggingFace, Langfuse, LangSmith, Braintrust, Arize, Humanloop, or HoneyHive. Scores cited with source and capture date.
Last verified April 2026

AI Agent Benchmarks 2026 - SWE-bench, WebArena, AgentBench, Terminal-Bench, OSWorld, Tau-Bench

Agent benchmarks measure something fundamentally different from LLM benchmarks. The question is not "can the model answer this question?" but "can the model complete this multi-step task that requires tools, state management, and error recovery?" The distinction matters because high scores on MMLU, GPQA, and HumanEval do not predict agentic capability.

A model that scores 94% on MMLU-Pro can still fail at a multi-step WebArena task that requires synthesising information across several pages, entering it correctly into a form, and verifying the result. Agentic success requires planning, tool-use discipline, error detection, and recovery - capabilities that static question-answering benchmarks do not stress. As of April 2026, SOTA on the major agent benchmarks ranges from 38% (OSWorld) to 74.5% (SWE-bench Verified), far below the levels seen on knowledge benchmarks.

SWE-bench Verified

2026 SOTA: 74.5% (Claude 4.5 Opus)
Captured: April 2026
Origin: Jimenez, Yang et al., 2023 - Princeton + University of Chicago

Construction

500 human-verified tasks drawn from real GitHub issues in 12 Python repositories (Django, SymPy, Astropy, matplotlib, and others). Each task: given the repo state, the issue, and a failing test, the agent must produce a passing patch.

Methodology

Agent has access to the repository, the issue description, and a test runner. It produces a patch. The patch is applied and the test suite runs. Success: all fail_to_pass tests pass AND all pass_to_pass tests still pass.
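The success rule reduces to a pure check over test outcomes. A minimal sketch (the real harness runs each repository's suite inside a dedicated Docker image; `results` here is a hypothetical mapping of test ID to outcome after the patch is applied):

```python
def is_resolved(fail_to_pass, pass_to_pass, results):
    """SWE-bench Verified success rule: every previously failing
    test now passes AND every previously passing test still
    passes. `results` maps test ID -> bool after the patch."""
    return (all(results.get(t, False) for t in fail_to_pass) and
            all(results.get(t, False) for t in pass_to_pass))

# A patch that fixes the bug but breaks an unrelated test fails:
is_resolved({"test_fix"}, {"test_other"},
            {"test_fix": True, "test_other": False})   # False
```

The regression half of the conjunction is what separates SWE-bench from simpler "make the test pass" setups: a patch cannot buy the fix by disabling other behavior.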

Strengths

Grounded in real engineering tasks. Pass/fail objective. Verified subset eliminates ambiguous tasks.

Known Limitations

Python-only. Test-flakiness exists. Solutions in git history create contamination risk. A patch that passes tests is not always production-ready.

WebArena

2026 SOTA: 47.2% (Claude 4.5 Opus)
Captured: April 2026
Origin: Zhou et al., 2023 - CMU, MIT, Princeton

Construction

812 tasks across 5 fully functional web applications: Reddit clone (Postmill), e-commerce (OneStopShop), collaborative coding (GitLab), maps (OpenStreetMap), and a content management system. All applications run in a sandboxed environment.

Methodology

Agent uses a browser. Tasks range from navigation (find the top-voted post this week) to transaction (buy a specific product) to information extraction. Success is evaluated programmatically via page state or API calls.
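Programmatic checks of this kind can be sketched as follows (illustrative spec keys loosely modeled on WebArena-style exact-match and must-include evaluators, not the benchmark's actual API):

```python
def evaluate(answer, page_text, spec):
    """Return True if the episode satisfies the task spec.
    `spec` is an illustrative dict: either the agent's final
    answer must exactly match a string, or the resulting page
    must contain every required substring."""
    if "exact_match" in spec:
        return answer.strip() == spec["exact_match"]
    if "must_include" in spec:
        return all(s in page_text for s in spec["must_include"])
    return False
```

Transaction tasks typically use the state-containment form (did the order confirmation appear?), while information-extraction tasks use the exact-match form on the agent's answer.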

Strengths

Multi-application, realistic web tasks. Functional applications not screenshots. Task variety.

Known Limitations

Web applications are static snapshots - the real web is dynamic. Some tasks have multiple valid solution paths that the checkers do not all capture. Coverage is limited to the 5 provided applications.

AgentBench

2026 SOTA: 54.3% (GPT-5)
Captured: April 2026
Origin: Liu et al., 2023 - Tsinghua University

Construction

1,091 tasks across 8 distinct agent environments: OS shell interaction, database management (MySQL/PostgreSQL), knowledge graph querying, a digital card game, lateral thinking puzzles, House-Holding (household tasks), web browsing, and web shopping.

Methodology

Each environment has its own success criterion. OS tasks: did the command achieve the goal? Database: did the query return the correct result? Games: did the agent win? Shopping: did the agent purchase the correct item at the best price?
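Because each environment defines its own success predicate, scoring is essentially a dispatch table. A sketch with hypothetical episode fields (not AgentBench's real harness or normalization):

```python
CHECKS = {
    "os":       lambda ep: ep["goal_reached"],
    "database": lambda ep: ep["query_result"] == ep["expected"],
    "game":     lambda ep: ep["won"],
    "shopping": lambda ep: (ep["item"] == ep["target_item"]
                            and ep["price"] <= ep["best_price"]),
}

def env_pass_rate(env, episodes):
    """Fraction of episodes in one environment that satisfy
    that environment's own success criterion."""
    check = CHECKS[env]
    return sum(check(ep) for ep in episodes) / len(episodes)
```

Note that AgentBench's headline number averages across environments, so a model can trade weakness in one environment against strength in another.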

Strengths

Broad environmental coverage. Tests different agent capabilities. OS and database tasks highly relevant for engineering agents.

Known Limitations

Eight environments remain a narrow slice of real-world tasks. Game and puzzle tasks may not predict useful agentic capability. The more common task types carry contamination risk.

Terminal-Bench

2026 SOTA: 61.4% (Claude 4.5 Opus)
Captured: April 2026
Origin: Brecht et al., 2024 - independent research

Construction

Tasks focused on terminal/shell-based workflows: file system operations, process management, network configuration, package management, scripting, and system debugging. Runs in isolated Docker containers.

Methodology

Agent has access to a bash terminal in a containerized environment. Tasks are evaluated by checking the resulting system state (file contents, process list, network config) against an expected outcome. No test runner - pure state evaluation.
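State evaluation of this kind reduces to comparing the post-episode filesystem against a specification. A minimal sketch (the real benchmark runs such checks inside each task's Docker container; file-content equality stands in for the broader state checks it performs):

```python
from pathlib import Path

def check_state(expected_files):
    """Pure state check: every expected path must exist with
    exactly the expected contents. No test runner involved -
    the agent's shell session either left the system in the
    right state or it did not."""
    for path, contents in expected_files.items():
        p = Path(path)
        if not p.is_file() or p.read_text() != contents:
            return False
    return True
```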

Strengths

Highly relevant for SRE and DevOps use cases. Terminal tasks are atomic and objectively evaluable. Container isolation means reproducibility.

Known Limitations

Shell tasks vary widely in ambiguity. Some tasks have multiple correct solutions. Less published academic coverage than SWE-bench or WebArena.

OSWorld

2026 SOTA: 38.1% (Claude 4.5 Opus)
Captured: April 2026
Origin: Xie et al., 2024 - multiple institutions

Construction

369 tasks spanning three operating systems (Ubuntu, Windows, and macOS) and real applications: web browsers (Chrome, Firefox), office and media software (LibreOffice Writer, Calc, GIMP, VLC), IDEs (VS Code), and system tools.

Methodology

Agent sees the screen (screenshot or accessibility tree). It issues actions: clicks, keystrokes, scroll. Task completion evaluated by checking the application state against a ground-truth specification.
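The interaction pattern is an observe-act loop over the desktop. A sketch with hypothetical `env`/`agent` interfaces (OSWorld's real controller drives a virtual machine and scores against a ground-truth spec):

```python
def run_episode(env, agent, max_steps=15):
    """Observe-act loop: the agent sees an observation
    (screenshot or accessibility tree), emits an action
    (click, keystroke, scroll), and the episode ends when
    the agent signals DONE or the step budget runs out.
    Success is judged afterwards from application state,
    not from anything the agent claims."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        if action == "DONE":
            break
        obs = env.step(action)
    return env.evaluate()
```

The step budget matters: an agent that loops without detecting its own errors exhausts the budget and fails, which is one reason desktop SOTA sits so far below knowledge-benchmark scores.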

Strengths

Covers real computer use. Not web-confined. Tests cross-application workflows. Evaluation is objective.

Known Limitations

Virtual machine infrastructure overhead. Applications and task definitions are static - real desktop environments are more dynamic. Lower SOTA than other benchmarks suggests the field is early here.

Tau-Bench

2026 SOTA: 58.7% (GPT-5)
Captured: April 2026
Origin: Yao et al., 2024

Construction

Tasks simulating tool-augmented agent interaction in realistic scenarios: customer service interactions, research assistance, information retrieval. Agent has access to tools (search, calculator, database lookup) and must complete multi-turn tasks.

Methodology

Evaluation is by task completion accuracy and tool-use efficiency. The agent must identify which tools to call, in what order, with what parameters. Both end-to-end success and intermediate tool-use quality are evaluated.
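Intermediate tool-use quality can be scored by aligning the agent's calls against a reference trajectory. An illustrative metric (in-order overlap of (tool, parameters) pairs; not Tau-Bench's exact scoring):

```python
def tool_use_score(agent_calls, reference_calls):
    """Fraction of reference (tool, params) calls the agent
    reproduced in order. End-to-end success is checked
    separately against the final task state; this metric
    captures whether the right tools were called with the
    right arguments along the way."""
    matched = 0
    for call in agent_calls:
        if matched < len(reference_calls) and call == reference_calls[matched]:
            matched += 1
    return matched / len(reference_calls) if reference_calls else 1.0
```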

Strengths

Covers NLP-heavy tool-use scenarios. Customer service tasks have clear commercial relevance. Multi-turn interaction is realistic.

Known Limitations

Less physically grounded than OSWorld or WebArena. Customer service simulation may not capture adversarial user behavior. Smaller task set than AgentBench.

Comparison Table

Benchmark | Domain | Tasks | Success Criterion | SOTA | Leak Risk
SWE-bench Verified | Software engineering | 500 | Tests pass + no regression | 74.5% | Medium (public git history)
WebArena | Web navigation | 812 | Page state / API check | 47.2% | Low (sandboxed apps)
AgentBench | Multi-environment | 1,091 | Per-environment check | 54.3% | Low-Medium
Terminal-Bench | Shell / DevOps | ~300 | File system state | 61.4% | Low (Docker isolated)
OSWorld | Computer use | 369 | App state check | 38.1% | Low (VM isolated)
Tau-Bench | Tool-augmented NLP | ~200 | Task completion accuracy | 58.7% | Medium
All SOTA scores captured April 2026. Sources: official leaderboards, Papers With Code, published papers.

The Agent Benchmark Methodology Crisis

Agent benchmarks are less stable than LLM benchmarks for four structural reasons. First, environment drift: web pages change, APIs evolve, and applications update - the sandboxed environment in WebArena 2023 is not the same as the web environment a deployed agent faces in 2026. Benchmark environments are frozen; real environments are not.

Second, test flakiness: agentic tasks often have non-deterministic environments. A task that passes 70% of the time in isolation might pass 90% of the time with one random seed and 50% with another. Benchmark scores aggregate over this variance but rarely report confidence intervals.
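Reporting that uncertainty is cheap. A standard Wilson score interval over n evaluation runs makes the variance visible (the 70/100 numbers below are illustrative):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a benchmark pass rate -
    the seed-to-seed spread that headline agent scores
    usually hide."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# 70/100 passes is not "70%": it is roughly 60.4%-78.1%.
lo, hi = wilson_interval(70, 100)
```

Two models separated by a few points on a few hundred non-deterministic tasks may be statistically indistinguishable.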

Third, solution leakage: SWE-bench tasks come from public GitHub issues whose solutions are in the same repository's git history. Any model trained after the tasks were created may have seen the solutions. The Verified subset partially addresses this, but the risk is structural.

Fourth, and most importantly, the benchmark-to-production gap: a 70% SWE-bench Verified score does not mean the agent can handle 70% of real software engineering work. The benchmark task distribution (12 Python repos, specific issue types) does not match the distribution of a real engineering team's backlog. A model that excels at SymPy bugs may struggle with Django authentication.

Use agent benchmarks as a floor test, not a capability certificate. High scores are necessary but not sufficient.


Frequently Asked Questions

Which agent benchmark matters most in 2026?
SWE-bench Verified is the most widely cited and has the clearest real-world grounding - can the agent write a patch for a real GitHub issue? WebArena matters for web-navigation agents. OSWorld for computer-use agents. For NLP-heavy agents, Tau-Bench is the most relevant. The right answer depends on what your agent does.
How does SWE-bench Verified differ from original SWE-bench?
Original SWE-bench had 2,294 tasks but was noisy - some tasks were unsolvable or under-specified, making scores unstable. SWE-bench Verified narrows this to 500 human-screened tasks, each confirmed solvable and clearly specified. Verified is now the canonical version that serious leaderboards track.
Do agent benchmarks predict real-world agent performance?
Partially. A 70% SWE-bench score predicts that the agent can handle certain categories of software engineering tasks in controlled conditions. But real-world agents face environment drift, error recovery requirements, and cost constraints that benchmarks do not capture. Use benchmarks as a floor, not a ceiling.
Can I run these benchmarks myself?
Yes, but it requires non-trivial infrastructure. SWE-bench Verified requires Docker containers for each repository. WebArena requires a server running the sandboxed web applications. OSWorld requires a virtual machine environment. Plan for 2-4 weeks of engineering work to run any of these reliably.
What is OSWorld trying to measure?
OSWorld measures whether an AI agent can control a desktop computer to complete real tasks - searching for files, editing spreadsheets, using creative software, and configuring settings. It runs in a virtual machine with real applications and evaluates task completion objectively. SOTA is around 38% as of April 2026.