AgentBench: The Multi-Environment Agent Benchmark
The first multi-environment agent benchmark with a public leaderboard. Less central in 2026 than it was in 2024, but the eight-environment framing it introduced shaped every agent benchmark that followed. Read it as foundational rather than current.
What AgentBench measures
AgentBench, introduced by Liu et al. in August 2023, was the first published benchmark that evaluated language models as autonomous agents across multiple, structurally different environments. The eight environments span an operating-system shell, a database, a knowledge graph, a card-game opponent, lateral-thinking puzzles, a simulated household, a web-shopping store, and free-form web browsing. Each environment defines its own state, action space, and success criterion, and the agent must complete a goal end to end. A trajectory either succeeds or fails; there is no partial credit.
The headline number reported on the leaderboard is an overall score out of 10: each environment's success rate is normalised by a fixed, environment-specific weight (as in the original paper), and the eight normalised scores are then averaged with equal weight. The intent was to force a model to be competent across all eight domains rather than overfit to a single style of task. In practice the mean has been criticised because environments differ in difficulty by an order of magnitude. A score of 7.0 might mean the model is genuinely strong everywhere, or strong on the easier four and merely passable elsewhere. The per-environment breakdown is the honest read.
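Assuming uniform environment weights for illustration (the leaderboard's actual weights differ), a short sketch of the aggregation shows how a near-zero environment barely dents the headline mean. The per-environment rates below are invented, not real model scores:

```python
# Made-up per-environment success rates (fractions); environment names follow
# the AgentBench paper, the numbers are invented for illustration.
success = {
    "OS": 0.42, "Database": 0.38, "KnowledgeGraph": 0.55,
    "CardGame": 0.61, "LateralThinking": 0.70, "HouseHold": 0.78,
    "WebShop": 0.66, "WebBrowsing": 0.02,
}

def headline_score(rates: dict[str, float]) -> float:
    """Equal-weight mean of success rates, rescaled to the 10-point headline."""
    return 10 * sum(rates.values()) / len(rates)

print(round(headline_score(success), 2))  # → 5.15
```

Despite an essentially failing WebBrowsing score, the headline lands above the midpoint of the pack, which is exactly the masking effect the per-environment breakdown exists to expose.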
What distinguishes AgentBench from the public benchmarks that preceded it (MMLU, HumanEval, MBPP) is the requirement that the model take actions, observe results, and adapt across many turns. A single-shot answer is not enough. This is the same shift the field made from question answering to agentic capability around 2023 and 2024, and AgentBench was the most ambitious early attempt to put a number on it.
The eight environments
Each environment has its own action space, its own observation space, and its own success criterion. The trajectories range from short (single-digit step counts in the household environment) to long (often 20-plus turns in OS and database tasks).
OS and Database are the most discriminating environments. They reward precise reasoning, are insensitive to brute-force search, and have well-defined success criteria. Lateral thinking, by contrast, is judged by an LLM rather than a programmatic check, which makes it sensitive to evaluator drift between submissions.
SOTA progression 2023 to 2026
AgentBench scores climbed steadily through 2024, but the submission rate slowed dramatically once SWE-bench Verified (Aug 2024) and OSWorld (Apr 2024) became the headline agent benchmarks. The most recent leaderboard entries are from mid-2025; few teams currently rerun AgentBench when releasing a new model.
The progression illustrates a pattern visible across all agent benchmarks: closed-source frontier models open up an early lead, open-weight models catch up within roughly nine to twelve months, and the gap closes from 2.5x to about 1.3x by the time the benchmark loses freshness. AgentBench is now in the late phase: frontier closed-source still leads, but the marginal headline-number gain from running it is small enough that most labs prioritise newer benchmarks.
Methodology caveats
Three methodology caveats appear in almost every honest discussion of AgentBench scores. First, the unweighted-mean issue described above: per-environment performance varies more than the headline number suggests, and a strong overall score can mask near-zero performance on one or two environments.
Second, the harness used to expose tools to the model has a large effect on OS and DB scores. The original paper used a simple ReAct-style harness with a fixed tool inventory. More recent papers use agentic scaffolding with retrieval, planning, and self-correction loops, which can add 1.5 to 2.0 points to the overall score for a model that was already strong without them. A score from a custom harness is not comparable to a score from the original harness; the leaderboard does not always make this distinction obvious.
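The original fixed-inventory setup can be sketched as a plain ReAct loop. Everything below is a stand-in, not the AgentBench harness API: `model` is any callable that returns either an "Action: tool[arg]" step or a final "Answer: ..." line, and `tools` is the fixed tool inventory.

```python
from typing import Callable

def react_loop(task: str,
               model: Callable[[str], str],
               tools: dict[str, Callable[[str], str]],
               max_turns: int = 20) -> str:
    """Minimal ReAct-style loop: act, observe, repeat until answer or budget."""
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        step = model(transcript)                 # "Action: tool[arg]" or "Answer: ..."
        transcript += step + "\n"
        if step.strip().startswith("Answer:"):   # agent terminates itself
            return step.strip()
        # Parse "Action: tool[arg]" and execute the named tool.
        name, _, arg = step.split("Action: ")[-1].partition("[")
        observation = tools[name.strip()](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return "Answer: (turn budget exhausted)"      # hard stop, counts as failure
```

Agentic scaffolds replace the single `model` call with retrieval, planning, and self-correction passes, which is why scores from such harnesses are not comparable to this baseline loop.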
Third, the lateral-thinking environment uses an LLM-as-judge, which means scores depend on which model is used as judge. Submissions from 2023 and 2024 mostly used GPT-4 as judge; submissions from 2025 onwards have varied. Cross-period comparisons on this environment are noisier than the absolute numbers suggest. See our LLM-as-judge methodology page for the underlying drift mechanism.
Where AgentBench is brittle
The WebShop environment is the most game-able. Its reward function scores specific cart contents within a turn budget, which has spawned a small literature on prompt strategies that lift WebShop scores through aggressive search and filtering rather than genuine reasoning improvements. The 2024 follow-up by the original authors flagged this directly.
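A loose sketch of the failure mode, with a hypothetical attribute-matching reward (the real WebShop reward code differs in detail):

```python
# Hypothetical cart reward, illustrating why turn-limited attribute matching
# invites search-heavy strategies: any tactic that fills the cart with the
# right attributes inside the budget scores, reasoning or not.
# NOT the actual WebShop reward implementation.
def cart_reward(cart_attrs: set[str], target_attrs: set[str],
                turns_used: int, budget: int = 15) -> float:
    if turns_used > budget:    # over budget: the episode scores nothing
        return 0.0
    # Fraction of target attributes present in the final cart.
    return len(cart_attrs & target_attrs) / len(target_attrs)
```

Under any reward of this shape, broad filtering that surfaces a roughly-matching item quickly beats careful deliberation that runs out the turn budget.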
The HouseHold environment uses ALFWorld scenarios. The action space is small enough that random exploration occasionally completes tasks, which means weaker models post higher numbers here than their general capability would predict. It is a useful capability probe (does the model understand spatial navigation in text?) but a noisy comparison signal.
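A toy Monte Carlo (not ALFWorld itself; all parameters are invented) shows how a purely random policy completes a short task chain in a small action space often enough to inflate weak-model scores:

```python
import random

def random_success_rate(k_actions: int = 8, chain_len: int = 3,
                        budget: int = 10, trials: int = 50_000,
                        seed: int = 0) -> float:
    """Fraction of episodes where a uniform-random policy finishes the task.

    The task is a chain of `chain_len` states; exactly one of `k_actions`
    advances the chain, any other action wastes the turn.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        progress = 0
        for _ in range(budget):
            if rng.randrange(k_actions) == 0:   # stumbled on the right action
                progress += 1
                if progress == chain_len:
                    wins += 1
                    break
    return wins / trials

print(f"{random_success_rate():.1%}")  # roughly 12% success from pure noise
```

With these toy numbers a model that understands nothing still posts a double-digit success rate, which is why HouseHold scores for weak models should be read with suspicion.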
The Knowledge Graph environment uses a bounded Wikidata subset, which models trained on Wikipedia and Wikidata-derived corpora have effectively seen. The contamination risk is real but harder to quantify than for MMLU or HumanEval because KG access patterns differ from training-time exposure. We treat KG scores as suggestive rather than definitive. The wider contamination concern across all agent benchmarks applies here too.
When to use AgentBench in 2026
Three use cases still make sense. First, historical comparison: if you want to position a new model against 2023 and 2024 baselines, AgentBench has wider published coverage than most newer benchmarks. Second, multi-environment capability sniff-test: the eight environments are diverse enough that strong overall performance is a real signal even if the individual scores are noisy. Third, OS and DB specifically: these two environments remain useful in their own right, and several teams now report per-environment AgentBench scores without claiming the overall.
For headline 2026 comparisons we recommend SWE-bench Verified for engineering, WebArena for browser, OSWorld for general computer use, GAIA for assistant work, and Tau-Bench for tool-use dialogue. AgentBench remains the foundational paper everyone cites; it is not the current frontline benchmark.
Sources
- [1] Liu, X. et al. (2023). AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688.
- [2] Tsinghua / Microsoft AgentBench leaderboard, accessed May 2026. llmbench.ai/agent.
- [3] Shridhar, M. et al. (2020). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. arXiv:2010.03768. The HouseHold environment source.
- [4] Yao, S. et al. (2022). WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. arXiv:2207.01206. The WebShop environment source.
- [5] Deng, X. et al. (2023). Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070. The WebBrowsing environment source.