Terminal-Bench: Shell Agents on Containerised Linux Tasks
The cleanest shell-agent benchmark to emerge from the 2024-2025 cohort. Docker-isolated, three difficulty tiers, six task categories, programmatic success checking. The right headline for any agent that lives in a terminal: DevOps, SRE, coding agents in their non-IDE work, security-tooling agents.
What Terminal-Bench measures
Terminal-Bench, released by Stanford and collaborators across 2024 and 2025, evaluates language-model agents on real shell tasks inside Docker containers. Each task ships with an initial container state, a natural-language instruction, and a programmatic success function. The agent connects to the container via a shell harness, runs commands, reads output, and iterates until either reaching the goal or running out of its turn budget. Success is determined by checking the container's final state: files exist, services run, configuration is valid, expected outputs match.
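As a rough illustration of the task shape described above, the sketch below mocks up a single task record and its programmatic success check. The field names and the specific checks are assumptions made for illustration, not the actual Terminal-Bench task schema.

```python
# Illustrative only: field names and checks are hypothetical, not the real Terminal-Bench schema.
import subprocess

TASK = {
    "instruction": "Install nginx and make sure it is running and listening on port 80.",
    "image": "ubuntu:24.04",   # defined starting image for the container
    "max_turns": 30,           # the agent's turn budget
}

def in_container(container: str, cmd: str) -> subprocess.CompletedProcess:
    """Run a single command inside the task container."""
    return subprocess.run(
        ["docker", "exec", container, "bash", "-lc", cmd],
        capture_output=True, text=True,
    )

def check_success(container: str) -> bool:
    """Programmatic success check against the container's final state."""
    checks = [
        "dpkg -s nginx",             # package installed
        "pgrep -x nginx",            # process running
        "ss -ltn | grep -q ':80 '",  # something listening on port 80
    ]
    return all(in_container(container, c).returncode == 0 for c in checks)
```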
What separates Terminal-Bench from earlier shell benchmarks (AgentBench OS, the OSWorld terminal subset) is breadth and freshness. The task corpus covers genuine system-administration work that previous benchmarks treated as out of scope: configuring services, managing packages, debugging cron jobs, hardening SSH configuration, troubleshooting log entries. The tasks are also recently written, which keeps the benchmark resistant to memorisation from older training data that included AgentBench-derived discussion.
Terminal-Bench is also the cleanest evaluation for a specific kind of agent: the shell-living coding agent. SWE-bench Verified measures patch authorship; Terminal-Bench measures the rest of the engineering job, the parts that happen in a terminal between IDE sessions. For DevOps and SRE roles, Terminal-Bench is closer to the work than any other public benchmark. We expect it to displace AgentBench OS as the standard shell-agent reference within the next year.
Six task categories
Tasks fall into six broad categories. Each category exercises a different slice of shell competence, and per-category scoring is the honest read on capability.
The relative difficulty across categories is reasonably stable: scripting and data-manipulation tasks are easiest, security and DevOps tasks are hardest. Frontier models in May 2026 might score 75 percent on scripting and 35 percent on security in the same evaluation run, which is a real capability gap rather than benchmark noise. Per-category disclosure is essential for comparison.
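Because the headline number hides this spread, per-category reporting is worth automating. A minimal sketch, assuming a flat list of per-task results; the result format here is hypothetical, not Terminal-Bench's actual output schema.

```python
# Hypothetical per-category aggregation; the results format is illustrative.
from collections import defaultdict

results = [
    {"task": "rotate-logs",     "category": "scripting", "passed": True},
    {"task": "harden-sshd",     "category": "security",  "passed": False},
    {"task": "fix-cron-mailer", "category": "devops",    "passed": True},
]

by_category = defaultdict(list)
for r in results:
    by_category[r["category"]].append(r["passed"])

for category, outcomes in sorted(by_category.items()):
    rate = 100 * sum(outcomes) / len(outcomes)
    print(f"{category:<12} {rate:5.1f}%  ({sum(outcomes)}/{len(outcomes)})")
```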
Container isolation and reproducibility
Every Terminal-Bench task runs in a fresh Docker container with a defined image, initial files, and environment. This is the methodological move that makes the benchmark reproducible: the entire evaluation environment is specified in Dockerfile-and-config, and anyone running the benchmark gets exactly the same starting state.
This matters in two ways. First, it eliminates a class of environment-dependent failures that plagued earlier shell benchmarks: a script that worked in one Ubuntu version but failed in another, a package version that drifted, a config file that differed between sites. Second, it makes the benchmark resistant to overfitting via environment-specific tricks: an agent cannot exploit knowledge of a particular machine's state, because the state is defined per task and reset between runs.
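A minimal sketch of that isolation loop, assuming the docker CLI is available; the helper names and image handling are illustrative rather than the benchmark's actual harness code.

```python
# Sketch of the isolation model: one fresh container per task, torn down afterwards.
import subprocess
import uuid

def run_task_in_fresh_container(image: str, setup_cmds: list[str]) -> str:
    """Start a clean container from a pinned image and apply the task's initial state."""
    name = f"tb-{uuid.uuid4().hex[:8]}"
    subprocess.run(
        ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"],
        check=True,
    )
    for cmd in setup_cmds:
        subprocess.run(["docker", "exec", name, "bash", "-lc", cmd], check=True)
    return name

def teardown(name: str) -> None:
    """Remove the container so no state leaks into the next task."""
    subprocess.run(["docker", "rm", "-f", name], check=True)
```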
The trade-off is that the containers are slim, single-purpose Linux environments, which is not always representative of real production systems with their accumulated cruft, custom services, and historical drift. Scores on Terminal-Bench overstate readiness for real-world DevOps by some margin; we estimate the gap at around 10 to 15 points for typical infrastructure work, though this is not formally quantified.
Three difficulty tiers
Tier 1 tasks are single-tool, short-trajectory work: install a package, copy a file, create a user, set a permission. Frontier agents reach 75 to 85 percent on tier 1 in May 2026; this is the closest thing Terminal-Bench has to a saturation zone, and the gap between frontier and mid-tier models is narrow here.
Tier 2 tasks involve multi-step reasoning and basic debugging: a service that does not start, a script that produces wrong output, a process management task that requires reading systemd logs. Frontier agents score around 50 to 60 percent on tier 2 in May 2026. This is the most useful tier for comparing capable models, because the score range discriminates well across the frontier.
Tier 3 tasks are system-administration and security work with non-obvious failure modes: an SSH configuration audit, a database performance problem, a firewall rule that should permit one specific flow, a security finding that requires understanding the threat model. Frontier agents score 25 to 35 percent on tier 3 in May 2026. This is where the next two years of capability gain will be visible, and where the gap between humans (around 70 percent for competent SREs) and the best agents remains largest.
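For a feel of what tier 3 grading can look like, here is a hedged sketch of an sshd_config audit check; the specific directives and expected values are assumptions for illustration, not any real task's rubric.

```python
# Illustrative tier-3-style check; the graded directives are assumptions, not a real rubric.
import re

EXPECTED = {
    "PermitRootLogin": "no",
    "PasswordAuthentication": "no",
    "MaxAuthTries": "3",
}

def audit_sshd_config(text: str) -> dict[str, bool]:
    """For each directive, report whether the effective value matches the expected one."""
    findings = {}
    for key, expected in EXPECTED.items():
        # sshd uses the first uncommented occurrence of a keyword, so grade that one.
        matches = re.findall(rf"^\s*{key}\s+(\S+)", text, flags=re.MULTILINE | re.IGNORECASE)
        findings[key] = bool(matches) and matches[0].lower() == expected.lower()
    return findings

if __name__ == "__main__":
    with open("/etc/ssh/sshd_config") as f:
        print(audit_sshd_config(f.read()))
```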
SOTA progression early 2025 to May 2026
Terminal-Bench scores have moved fast since the public release. The benchmark is fresh enough that contamination from earlier training data is relatively low, which means score gains reflect genuine capability improvement rather than memorisation. This is one of the cleanest signals in the agent-benchmark family in 2026.
Harness design
A capable Terminal-Bench harness in 2026 typically includes: a robust shell-execution wrapper that handles long output, interactive prompts, and timeouts; a planning loop that decomposes the task into sub-goals; access to documentation (man pages, configuration references) the agent can read; and a verification step that checks intermediate state before declaring success. The strongest published submissions use frameworks like Aider or SWE-agent extended with shell-specific scaffolding.
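A minimal sketch of the shell-execution wrapper piece, assuming the agent talks to the task container through docker exec; the names, limits, and result shape are illustrative, not any published harness's API.

```python
# Sketch only: a bounded, timeout-guarded shell step for an agent loop.
import subprocess
from dataclasses import dataclass

MAX_OUTPUT_CHARS = 10_000  # keep very long output from flooding the model's context

@dataclass
class ShellResult:
    command: str
    exit_code: int
    output: str
    timed_out: bool

def run_shell(container: str, command: str, timeout_s: int = 60) -> ShellResult:
    """Run one command in the task container; return a truncated, structured result."""
    try:
        proc = subprocess.run(
            ["docker", "exec", container, "bash", "-lc", command],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ShellResult(command, -1, "", timed_out=True)
    output = proc.stdout + proc.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        output = output[:MAX_OUTPUT_CHARS] + "\n[... output truncated ...]"
    return ShellResult(command, proc.returncode, output, timed_out=False)
```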
The harness sensitivity on Terminal-Bench is real but smaller than on browser benchmarks like WebArena. The same underlying model in a minimal harness and a strong harness might differ by 5 to 8 points on Terminal-Bench, compared to 20 or more on WebArena. The benchmark is closer to a pure capability test than a scaffold benchmark, which makes raw-score comparisons more meaningful.
When to use Terminal-Bench in 2026
Terminal-Bench is the right benchmark for shell-living agents: DevOps, SRE, internal tooling, coding agents in their non-IDE work, security tooling. If your agent operates primarily through a terminal and produces side effects on a real system, Terminal-Bench is closer to your task than any other benchmark in this site's coverage. Quote the overall score and the per-tier breakdown.
For other agent shapes, look elsewhere: SWE-bench Verified for software-issue patch authorship, WebArena for browser, OSWorld for general desktop, Tau-Bench for tool-using customer-service dialogue. Coding-agent benchmarks travel in pairs in 2026: SWE-bench Verified plus Terminal-Bench together give the most complete picture of a coding agent's production readiness. See our coding-agent benchmark comparison for the full landscape.
Frequently asked questions
- What is Terminal-Bench?
- How is Terminal-Bench different from AgentBench OS or OSWorld terminal tasks?
- What are the difficulty tiers?
- Is Terminal-Bench widely used in 2026?
- Can the same agent harness work for SWE-bench and Terminal-Bench?
- Is Terminal-Bench gameable?
Sources
- [1] Terminal-Bench project site. terminal-bench.dev. Accessed May 2026.
- [2] Laude Institute Terminal-Bench repository. github.com/laude-institute/terminal-bench.
- [3] Aider coding-agent project for harness reference. aider.chat.