Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Eight-environment agent eval: OS, DB, KG, DCG, LTP, HouseHold, WebShop, WebBrowsing
Who: Liu et al., Tsinghua & Microsoft Research, 2023 (arXiv:2308.03688)
2026 Tier: Frontier mean around 7.0 (out of 10) where reported; few fresh entries since 2025.
Leaderboard: llmbench.ai/agent
Section II.i · Agent Benchmarks | Last verified April 2026

AgentBench: The Multi-Environment Agent Benchmark

The first multi-environment agent benchmark with a public leaderboard. Less central in 2026 than it was in 2024, but the eight-environment framing it introduced shaped every agent benchmark that followed. Read it as foundational rather than current.

01

What AgentBench measures

AgentBench, introduced by Liu et al. in August 2023, was the first published benchmark to evaluate language models as autonomous agents across multiple, structurally different environments. The eight environments span an operating-system shell, a database, a knowledge graph, a card-game opponent, lateral-thinking puzzles, a simulated household, a web-shopping store, and free-form web browsing. Each environment defines its own state, action space, and success criterion, and the agent must complete a goal end to end. In most environments a trajectory either succeeds or fails outright; partial credit is rare.

The headline number reported on the leaderboard is the unweighted mean of per-environment success rates, rescaled to a 10-point presentation (the original paper instead computes its overall score with environment-specific weights, so leaderboard and paper figures are not directly interchangeable). The intent was to force a model to be competent across all eight domains rather than overfit to a single style of task. In practice the mean has been criticised because environments differ in difficulty by an order of magnitude. A score of 7.0 might mean the model is genuinely strong everywhere, or strong on the easier four and merely passable elsewhere. The per-environment breakdown is the honest read.
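
To make the aggregation concrete, here is a minimal sketch with invented per-environment numbers (illustrative only, not leaderboard values; the real leaderboard applies environment-specific weights rather than this plain mean):

    # Illustrative per-environment success rates (fractions, 0-1) for a hypothetical model.
    # The numbers are invented to show how the average hides a weak environment.
    env_success = {
        "OS": 0.75, "DB": 0.80, "KG": 0.85, "DCG": 0.78,
        "LTP": 0.05, "HH": 0.90, "WS": 0.82, "WB": 0.65,
    }

    # Unweighted mean, rescaled to the 10-point presentation used in this article.
    overall = 10 * sum(env_success.values()) / len(env_success)
    print(f"headline: {overall:.1f} / 10")  # 7.0, despite a 5% success rate on LTP

    # The honest read: print the per-environment breakdown next to the headline.
    for env, rate in sorted(env_success.items(), key=lambda kv: kv[1]):
        print(f"{env:>3}: {rate:.0%}")

A 7.0 produced this way and a 7.0 produced by eight roughly equal environments describe very different models; only the breakdown tells them apart.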

What distinguishes AgentBench from the single-turn public benchmarks that preceded it (MMLU, HumanEval, MBPP) is the requirement that the model take actions, observe results, and adapt across many turns. A one-shot answer is not enough. This is the same shift the field made from question-answering to agentic capability across 2023 and 2024, and AgentBench was the most ambitious early attempt to put a number on it.

02

The eight environments

Each environment has its own action space, its own observation space, and its own success criterion. The trajectories range from short (single-digit step counts in the household environment) to long (often 20-plus turns in OS and database tasks).

Code | Environment | What it tests
OS | Operating System | Bash shell tasks: file system, permissions, package management. Hardest environment for most models because correct command syntax matters and partial credit is zero.
DB | Database | SQL query generation against a schema the agent has to discover. Tests structured reasoning and JOIN composition; reliably distinguishes model tiers.
KG | Knowledge Graph | Multi-hop queries against a Wikidata-like KG. Closer to retrieval reasoning; the action space is bounded, which helps weaker models.
DCG | Digital Card Game | Aquawar card-game opponent. Tests exploration and strategy; smaller models occasionally do better here because they explore more randomly.
LTP | Lateral Thinking Puzzles | Open-ended riddle solving with a judge. The most subjective environment; scores are low across the board, frontier models included.
HH | Digital Home (HouseHold) | ALFWorld-derived simulated household tasks: navigate, find objects, achieve goals. Action space is small; the most game-able environment.
WS | Web Shopping | WebShop environment: search, filter, add to cart on a simulated retailer. Reward function is task-specific and can be exploited by aggressive shopping strategies.
WB | Web Browsing | Mind2Web-style real-page navigation, the spiritual ancestor of WebArena. Tests grounded browser action selection.

OS and Database are the most discriminating environments. They reward precise reasoning, are insensitive to brute-force search, and have well-defined success criteria. Lateral thinking, by contrast, is judged by an LLM rather than a programmatic check, which makes it sensitive to evaluator drift between submissions.

03

SOTA progression 2023 to 2026

AgentBench scores climbed steadily through 2024, but the submission rate slowed dramatically once SWE-bench Verified (Aug 2024) and OSWorld (Apr 2024) became the headline agent benchmarks. The most recent leaderboard entries are from mid-2025; few teams currently rerun AgentBench when releasing a new model.

Date | Frontier tier | Note
Aug 2023 | GPT-4 baseline at 4.41 (overall mean) | Initial AgentBench paper, eight environments, GPT-4 sets the bar.
Nov 2023 | Claude 2 around 3.0; Llama-2-70B around 1.8 | Open-weight gap to frontier was about 2.5x at launch.
Mar 2024 | GPT-4-Turbo around 4.8 | Tool-use environments improve; OS and DB remain hardest.
Jul 2024 | Frontier closed-source 5.5 to 6.5 (overall) | Mainstream model-card claim range, environment-dependent.
Apr 2025 | Frontier above 7.0 with strong harness | Open-weight catches up to ~5.5; closed gap narrows.
May 2026 | Newer scores rare; benchmark losing freshness | Most teams have moved to SWE-bench, WebArena, OSWorld for headline claims.

The progression illustrates a pattern visible across all agent benchmarks: closed-source frontier models open up an early lead, open-weight models catch up within roughly nine to twelve months, and the gap closes from 2.5x to about 1.3x by the time the benchmark loses freshness. AgentBench is now in the late phase: frontier closed-source still leads, but the marginal headline-number gain from running it is small enough that most labs prioritise newer benchmarks.

04

Methodology caveats

Three methodology caveats appear in almost every honest discussion of AgentBench scores. First, the unweighted-mean issue described above: per-environment performance varies more than the headline number suggests, and a strong overall score can mask near-zero performance on one or two environments.

Second, the harness used to expose tools to the model has a large effect on OS and DB scores. The original paper used a simple ReAct-style harness with a fixed tool inventory. More recent papers use agentic scaffolding with retrieval, planning, and self-correction loops, which can add 1.5 to 2.0 points to the overall score for a model that was already strong without them. A score from a custom harness is not comparable to a score from the original harness; the leaderboard does not always make this distinction obvious.
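
For orientation, the sketch below shows the shape of a plain ReAct-style loop of the kind the original paper used. The names (run_episode, call_model, env.step, MAX_TURNS) are placeholders for this article, not AgentBench APIs, and real harnesses add stricter parsing, retries, and per-environment tool inventories:

    # A minimal ReAct-style harness sketch, assuming a generic chat-completion client
    # (call_model) and a text environment (env). Placeholder names, not AgentBench code.
    MAX_TURNS = 20

    def run_episode(env, call_model, system_prompt):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": env.reset()},   # initial task description / observation
        ]
        for _ in range(MAX_TURNS):
            reply = call_model(messages)                # expected to contain "Thought: ... Action: ..."
            messages.append({"role": "assistant", "content": reply})
            observation, done = env.step(parse_action(reply))
            if done:
                return env.success()                    # binary success criterion in most environments
            messages.append({"role": "user", "content": observation})
        return False                                    # turn budget exhausted counts as failure

    def parse_action(reply):
        # Naive extraction of the last "Action:" line; real harnesses validate much harder.
        lines = [l for l in reply.splitlines() if l.lower().startswith("action:")]
        return lines[-1].split(":", 1)[1].strip() if lines else reply.strip()

The comparison point is everything added on top of this loop: retrieval, planning, and self-correction steps change what is being measured, which is why harness disclosure matters whenever two scores are put side by side.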

Third, the lateral-thinking environment uses an LLM-as-judge, which means scores depend on which model is used as judge. Submissions from 2023 and 2024 mostly used GPT-4 as judge; submissions from 2025 onwards have varied. Cross-period comparisons on this environment are noisier than the absolute numbers suggest. See our LLM-as-judge methodology page for the underlying drift mechanism.
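
If you do rerun the lateral-thinking environment, one way to keep cross-period comparisons meaningful is to pin and publish the judge configuration alongside the score. A minimal illustration, with made-up field names and a placeholder model identifier:

    # Illustrative only: record the exact judge setup next to every LTP score so readers
    # can tell same-judge from cross-judge comparisons. Field names are placeholders.
    JUDGE_CONFIG = {
        "judge_model": "gpt-4-0613",       # a pinned snapshot, never a rolling alias
        "temperature": 0.0,                # reduce run-to-run judging variance
        "prompt_version": "ltp-judge-v2",  # hypothetical internal prompt identifier
        "captured": "2026-04",
    }

    def report_ltp(score):
        # Attach the judge metadata to the reported score.
        return {"ltp_score": score, **JUDGE_CONFIG}

    print(report_ltp(0.18))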

05

Where AgentBench is brittle

The WebShop environment is among the most easily gamed. The reward function scores specific cart contents within a turn budget, which has spawned a small literature on prompt strategies that lift WebShop scores through aggressive search and filtering rather than genuine reasoning improvements. The 2024 follow-up by the original authors flagged this directly.

The HouseHold environment uses ALFWorld scenarios. The action space is small enough that random exploration occasionally completes tasks, which means weaker models post higher numbers here than their general capability would predict. It is a useful capability probe (does the model understand spatial navigation in text?) but a noisy comparison signal.

The Knowledge Graph environment uses a bounded Wikidata subset, which models trained on Wikipedia and Wikidata-derived corpora have effectively seen. The contamination risk is real but harder to quantify than for MMLU or HumanEval because KG access patterns differ from training-time exposure. We treat KG scores as suggestive rather than definitive. The wider contamination concern across all agent benchmarks applies here too.

06

When to use AgentBench in 2026

Three use cases still make sense. First, historical comparison: if you want to position a new model against 2023 and 2024 baselines, AgentBench has wider published coverage than most newer benchmarks. Second, multi-environment capability sniff-test: the eight environments are diverse enough that strong overall performance is a real signal even if the individual scores are noisy. Third, OS and DB specifically: these two environments remain useful in their own right, and several teams now report per-environment AgentBench scores without claiming the overall.

For headline 2026 comparisons we recommend SWE-bench Verified for engineering, WebArena for browser, OSWorld for general computer use, GAIA for assistant work, and Tau-Bench for tool-use dialogue. AgentBench remains the foundational paper everyone cites; it is not the current frontline benchmark.

Editor's verdict: AgentBench is the textbook foundational benchmark for agentic LLM evaluation, with a leaderboard that runs cold in 2026. Use it for historical context and per-environment capability probing. Quote SWE-bench Verified, OSWorld, GAIA, and Tau-Bench for current frontier claims.
Reader Questions
Q.01 What does AgentBench actually measure?
AgentBench measures end-to-end task completion across eight isolated environments: an operating-system shell, a database, a knowledge graph, a card-game opponent, a digital home, web shopping, web browsing, and lateral-thinking puzzles. In most environments a run scores 1 if the agent completes the goal and 0 if it does not. The headline AgentBench score is an unweighted mean of per-environment success rates, which means near-zero performance in one environment drags down the overall figure even when the other environments score highly.
Q.02 Is AgentBench still relevant in 2026?
AgentBench is more historically important than practically used in 2026. Most teams comparing frontier agents quote SWE-bench Verified for engineering, WebArena or OSWorld for browser and computer use, GAIA for general assistant work, and Tau-Bench for tool-use dialogue. AgentBench is still cited because it was the first multi-environment agent benchmark with a published leaderboard, and its eight-environment framing influenced everything that came after. New leaderboard submissions have slowed considerably since 2024.
Q.03 Why are AgentBench scores so much lower than other benchmarks?
AgentBench measures task completion in environments where partial credit is rare. An agent either reaches the goal or it does not. Frontier models that score above 90 on MMLU post AgentBench success rates only in the high 30s to mid 40s, in percentage terms, because failure modes compound across multi-step tasks. A single early misstep in a 12-step trajectory fails the whole task. This is the same property that makes SWE-bench Verified hard, but with even less margin for recovery in most AgentBench environments.
Q.04 Which AgentBench environments matter most?
The OS-shell, web-browsing, and database environments correlate best with practical agent capability and have received the most attention. The lateral-thinking and card-game environments score poorly even for frontier models and are more useful as failure-mode probes than capability measures. The digital-home environment is the most game-able because the action space is small enough that brute search performs reasonably.
Q.05 How do I compare AgentBench scores fairly?
Always quote (a) which environments are included, (b) the harness used, and (c) the date. Some published numbers report only the average across environments where the model scored non-zero, which inflates the headline. Others report a strict mean including the worst environment. The Tsinghua leaderboard at llmbench.ai/agent documents the per-environment breakdown; rely on that rather than blog-post summaries.
Q.06 Is AgentBench gameable through prompt engineering?
Partly. The OS-shell environment is sensitive to system-prompt choices about which commands to attempt. Web shopping rewards aggressive cart-manipulation strategies that may not generalise to real e-commerce. The card-game environment is largely about exploration policy and can be improved with a few-shot game-theoretic prompt. None of these break the benchmark as a research artefact, but they make blog-post numbers hard to trust without methodology disclosure.
Related: Agent Benchmarks Overview · SWE-bench Verified · WebArena Methodology · OSWorld Benchmark · GAIA Benchmark · Tau-Bench · What Benchmarks Miss

Sources

  [1] Liu, X. et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688.
  [2] Tsinghua / Microsoft AgentBench leaderboard, accessed May 2026. llmbench.ai/agent.
  [3] Shridhar, M. et al. (2020). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv:2010.03768. Source of the HouseHold environment.
  [4] Yao, S. et al. (2022). WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv:2207.01206. Source of the WebShop environment.
  [5] Deng, X. et al. (2023). Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070. Source of the WebBrowsing environment.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.

Book a 30-min scoping call · Digital Signet →

30 minutes, free, independent.·1-page action plan within 48h.·Honest if not the right fit.