Terminal-Bench: Shell Agents on Containerised Linux Tasks
The cleanest shell-agent benchmark to emerge from the 2024-2025 cohort. Docker-isolated, three difficulty tiers, six task categories, programmatic success checking. The right headline for any agent that lives in a terminal: DevOps, SRE, coding agents in their non-IDE work, security-tooling agents.
What Terminal-Bench measures
Terminal-Bench, released by Stanford and collaborators across 2024 and 2025, evaluates language-model agents on real shell tasks inside Docker containers. Each task ships with an initial container state, a natural-language instruction, and a programmatic success function. The agent connects to the container via a shell harness, runs commands, reads output, and iterates until either reaching the goal or running out of its turn budget. Success is determined by checking the container's final state: files exist, services run, configuration is valid, expected outputs match.
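As a rough illustration of the task shape described above, the sketch below mocks up a single task record and its programmatic success check. The field names and the specific checks are assumptions made for illustration, not the actual Terminal-Bench task schema.

```python
# Illustrative only: field names and checks are hypothetical, not the real Terminal-Bench schema.
import subprocess

TASK = {
    "instruction": "Install nginx and make sure it is running and listening on port 80.",
    "image": "ubuntu:24.04",   # defined starting image for the container
    "max_turns": 30,           # the agent's turn budget
}

def in_container(container: str, cmd: str) -> subprocess.CompletedProcess:
    """Run a single command inside the task container."""
    return subprocess.run(
        ["docker", "exec", container, "bash", "-lc", cmd],
        capture_output=True, text=True,
    )

def check_success(container: str) -> bool:
    """Programmatic success check against the container's final state."""
    checks = [
        "dpkg -s nginx",             # package installed
        "pgrep -x nginx",            # process running
        "ss -ltn | grep -q ':80 '",  # something listening on port 80
    ]
    return all(in_container(container, c).returncode == 0 for c in checks)
```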
What separates Terminal-Bench from earlier shell benchmarks (AgentBench OS, the OSWorld terminal subset) is breadth and freshness. The task corpus covers genuine system-administration work that previous benchmarks treated as out of scope: configuring services, managing packages, debugging cron jobs, hardening SSH configuration, troubleshooting log entries. The tasks are also recently written, which keeps the benchmark resistant to memorisation from older training data that included AgentBench-derived discussion.
Terminal-Bench is also the cleanest evaluation for a specific kind of agent: the shell-living coding agent. SWE-bench Verified measures patch authorship; Terminal-Bench measures the rest of the engineering job, the parts that happen in a terminal between IDE sessions. For DevOps and SRE roles, Terminal-Bench is closer to the work than any other public benchmark. We expect it to displace AgentBench OS as the standard shell-agent reference within the next year.
Six task categories
Tasks fall into six broad categories. Each category exercises a different slice of shell competence, and per-category scoring is the honest read on capability.
The relative difficulty across categories is reasonably stable: scripting and data-manipulation tasks are easiest, security and DevOps tasks are hardest. Frontier models in May 2026 might score 75 percent on scripting and 35 percent on security in the same evaluation run, which is a real capability gap rather than benchmark noise. Per-category disclosure is essential for comparison.
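Because the headline number hides this spread, per-category reporting is worth automating. A minimal sketch, assuming a flat list of per-task results; the result format here is hypothetical, not Terminal-Bench's actual output schema.

```python
# Hypothetical per-category aggregation; the results format is illustrative.
from collections import defaultdict

results = [
    {"task": "rotate-logs",     "category": "scripting", "passed": True},
    {"task": "harden-sshd",     "category": "security",  "passed": False},
    {"task": "fix-cron-mailer", "category": "devops",    "passed": True},
]

by_category = defaultdict(list)
for r in results:
    by_category[r["category"]].append(r["passed"])

for category, outcomes in sorted(by_category.items()):
    rate = 100 * sum(outcomes) / len(outcomes)
    print(f"{category:<12} {rate:5.1f}%  ({sum(outcomes)}/{len(outcomes)})")
```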
Container isolation and reproducibility
Every Terminal-Bench task runs in a fresh Docker container with a defined image, initial files, and environment. This is the methodological move that makes the benchmark reproducible: the entire evaluation environment is specified in Dockerfile-and-config, and anyone running the benchmark gets exactly the same starting state.
This matters in two ways. First, it eliminates a class of environment-dependent failures that plagued earlier shell benchmarks: a script that worked in one Ubuntu version but failed in another, a package version that drifted, a config file that differed between sites. Second, it makes the benchmark resistant to overfitting via environment-specific tricks: an agent cannot exploit knowledge of a particular machine's state, because the state is defined per task and reset between runs.
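A minimal sketch of that isolation loop, assuming the docker CLI is available; the helper names and image handling are illustrative rather than the benchmark's actual harness code.

```python
# Sketch of the isolation model: one fresh container per task, torn down afterwards.
import subprocess
import uuid

def run_task_in_fresh_container(image: str, setup_cmds: list[str]) -> str:
    """Start a clean container from a pinned image and apply the task's initial state."""
    name = f"tb-{uuid.uuid4().hex[:8]}"
    subprocess.run(
        ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"],
        check=True,
    )
    for cmd in setup_cmds:
        subprocess.run(["docker", "exec", name, "bash", "-lc", cmd], check=True)
    return name

def teardown(name: str) -> None:
    """Remove the container so no state leaks into the next task."""
    subprocess.run(["docker", "rm", "-f", name], check=True)
```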
The trade-off is that the containers are slim, single-purpose Linux environments, which is not always representative of real production systems with their accumulated cruft, custom services, and historical drift. Scores on Terminal-Bench overstate readiness for real-world DevOps by some margin; we estimate the gap at around 10 to 15 points for typical infrastructure work, though this is not formally quantified.
Three difficulty tiers
Tier 1 tasks are single-tool, short-trajectory work: install a package, copy a file, create a user, set a permission. Frontier agents reach 75 to 85 percent on tier 1 in May 2026; this is the closest thing Terminal-Bench has to a saturation zone, and the gap between frontier and mid-tier models is narrow here.
Tier 2 tasks involve multi-step reasoning and basic debugging: a service that does not start, a script that produces wrong output, a process management task that requires reading systemd logs. Frontier agents score around 50 to 60 percent on tier 2 in May 2026. This is the most useful tier for comparing capable models, because the score range discriminates well across the frontier.
Tier 3 tasks are system-administration and security work with non-obvious failure modes: an SSH configuration audit, a database performance problem, a firewall rule that should permit one specific flow, a security finding that requires understanding the threat model. Frontier agents score 25 to 35 percent on tier 3 in May 2026. This is where the next two years of capability gain will be visible, and where the gap between humans (around 70 percent for competent SREs) and the best agents remains largest.
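For a feel of what tier 3 grading can look like, here is a hedged sketch of an sshd_config audit check; the specific directives and expected values are assumptions for illustration, not any real task's rubric.

```python
# Illustrative tier-3-style check; the graded directives are assumptions, not a real rubric.
import re

EXPECTED = {
    "PermitRootLogin": "no",
    "PasswordAuthentication": "no",
    "MaxAuthTries": "3",
}

def audit_sshd_config(text: str) -> dict[str, bool]:
    """For each directive, report whether the effective value matches the expected one."""
    findings = {}
    for key, expected in EXPECTED.items():
        # sshd uses the first uncommented occurrence of a keyword, so grade that one.
        matches = re.findall(rf"^\s*{key}\s+(\S+)", text, flags=re.MULTILINE | re.IGNORECASE)
        findings[key] = bool(matches) and matches[0].lower() == expected.lower()
    return findings

if __name__ == "__main__":
    with open("/etc/ssh/sshd_config") as f:
        print(audit_sshd_config(f.read()))
```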
SOTA progression early 2025 to May 2026
Terminal-Bench scores have moved fast since the public release. The benchmark is fresh enough that contamination from earlier training data is relatively low, which means score gains reflect genuine capability improvement rather than memorisation. This is one of the cleanest signals in the agent-benchmark family in 2026.
Harness design
A capable Terminal-Bench harness in 2026 typically includes: a robust shell-execution wrapper that handles long output, interactive prompts, and timeouts; a planning loop that decomposes the task into sub-goals; access to documentation (man pages, configuration references) the agent can read; and a verification step that checks intermediate state before declaring success. The strongest published submissions use frameworks like Aider or SWE-agent extended with shell-specific scaffolding.
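A minimal sketch of the shell-execution wrapper piece, assuming the agent talks to the task container through docker exec; the names, limits, and result shape are illustrative, not any published harness's API.

```python
# Sketch only: a bounded, timeout-guarded shell step for an agent loop.
import subprocess
from dataclasses import dataclass

MAX_OUTPUT_CHARS = 10_000  # keep very long output from flooding the model's context

@dataclass
class ShellResult:
    command: str
    exit_code: int
    output: str
    timed_out: bool

def run_shell(container: str, command: str, timeout_s: int = 60) -> ShellResult:
    """Run one command in the task container; return a truncated, structured result."""
    try:
        proc = subprocess.run(
            ["docker", "exec", container, "bash", "-lc", command],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ShellResult(command, -1, "", timed_out=True)
    output = proc.stdout + proc.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        output = output[:MAX_OUTPUT_CHARS] + "\n[... output truncated ...]"
    return ShellResult(command, proc.returncode, output, timed_out=False)
```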
The harness sensitivity on Terminal-Bench is real but smaller than on browser benchmarks like WebArena. The same underlying model in a minimal harness and a strong harness might differ by 5 to 8 points on Terminal-Bench, compared to 20 or more on WebArena. The benchmark is closer to a pure capability test than a scaffold benchmark, which makes raw-score comparisons more meaningful.
When to use Terminal-Bench in 2026
Terminal-Bench is the right benchmark for shell-living agents: DevOps, SRE, internal tooling, coding agents in their non-IDE work, security tooling. If your agent operates primarily through a terminal and produces side effects on a real system, Terminal-Bench is closer to your task than any other benchmark in this site's coverage. Quote the overall score and the per-tier breakdown.
For other agent shapes, look elsewhere: SWE-bench Verified for software-issue patch authorship, WebArena for browser, OSWorld for general desktop, Tau-Bench for tool-using customer-service dialogue. Coding-agent benchmarks travel in pairs in 2026: SWE-bench Verified plus Terminal-Bench together give the most complete picture of a coding agent's production readiness. See our coding-agent benchmark comparison for the full landscape.
Frequently asked questions
- What is Terminal-Bench?
- How is Terminal-Bench different from AgentBench OS or OSWorld terminal tasks?
- What are the difficulty tiers?
- Is Terminal-Bench widely used in 2026?
- Can the same agent harness work for SWE-bench and Terminal-Bench?
- Is Terminal-Bench gameable?
Sources
- [1] Terminal-Bench project site. terminal-bench.dev. Accessed May 2026.
- [2] Laude Institute Terminal-Bench repository. github.com/laude-institute/terminal-bench.
- [3] Aider coding-agent project for harness reference. aider.chat.