LangGraph Benchmarks: Stateful Graphs for Multi-Step Agents
The LangChain successor to ReAct-style agent loops. Models agent workflows as directed graphs of state transitions, with explicit branching, conditional re-entry, and long-lived loops. Where it shines, where it does not, and where community submissions land on the major agent benchmarks.
What LangGraph is
LangGraph is the agent-orchestration framework released by LangChain in early 2024. It models agent workflows as directed graphs: nodes are functions or LLM calls, edges are conditional transitions, and the runtime tracks accumulated state across the graph. The model is more expressive than the earlier AgentExecutor pattern (which essentially ran a ReAct-style loop with limited state management) and has become the LangChain ecosystem's recommended pattern for building production-grade multi-step agents in 2026.
The framework's defining advantage is explicit state. A LangGraph application defines a typed state object that flows through the graph; each node reads and writes specific fields. This makes it possible to build agents that branch on intermediate results, retry specific sub-tasks without losing earlier work, and maintain long-lived loops with bounded memory. These patterns are awkward to express in pure ReAct loops and are central to many production agent designs.
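A minimal sketch of that pattern using the public StateGraph API. The state fields, node bodies, and retry budget below are illustrative rather than taken from any published configuration, and exact import paths can shift between langgraph releases:

```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    task: str
    # Each node appends to this list instead of overwriting it,
    # so earlier work survives retries of later nodes.
    findings: Annotated[list[str], operator.add]
    attempts: int
    done: bool


def research(state: AgentState) -> dict:
    # Placeholder for an LLM or tool call; returns a partial state update.
    return {"findings": [f"note about {state['task']}"],
            "attempts": state["attempts"] + 1}


def check(state: AgentState) -> dict:
    # Placeholder verification step; a real node would call a model or run tests.
    return {"done": len(state["findings"]) >= 2}


def route(state: AgentState) -> str:
    # Conditional re-entry with a bounded loop: retry until done or the budget is spent.
    if state["done"] or state["attempts"] >= 3:
        return "finish"
    return "retry"


builder = StateGraph(AgentState)
builder.add_node("research", research)
builder.add_node("check", check)
builder.add_edge(START, "research")
builder.add_edge("research", "check")
builder.add_conditional_edges("check", route, {"retry": "research", "finish": END})

graph = builder.compile()
result = graph.invoke({"task": "summarize issue", "findings": [],
                       "attempts": 0, "done": False})
```

The reducer on findings is what lets a retried node append to earlier work instead of overwriting it, which is the "retry without losing earlier work" property described above.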
LangGraph integrates with the wider LangChain ecosystem: tracing through LangSmith, deployment through LangGraph Cloud, human-in-the-loop interrupts, persistent state via checkpoint stores. For teams already invested in LangChain, LangGraph is the natural progression. For teams starting fresh, the choice is between LangGraph, AutoGen, CrewAI, OpenAI Agents SDK, and a handful of smaller frameworks; LangGraph is the most popular but not always the best fit.
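The persistence and human-in-the-loop features hang off the same compile step. A sketch, reusing the builder and initial state from the example above; MemorySaver is LangGraph's in-memory checkpointer, and a production deployment would use a durable checkpoint store:

```python
from langgraph.checkpoint.memory import MemorySaver

# Pause before the "check" node so a human can inspect or edit state,
# and persist state per thread so the run can be resumed later.
graph = builder.compile(
    checkpointer=MemorySaver(),        # swap in a durable checkpoint store for production
    interrupt_before=["check"],
)

config = {"configurable": {"thread_id": "ticket-42"}}  # illustrative thread id
graph.invoke({"task": "summarize issue", "findings": [], "attempts": 0, "done": False}, config)

# ... a human reviews the paused run, e.g. via graph.get_state(config) or LangSmith ...

graph.invoke(None, config)  # None plus the same thread_id resumes from the saved checkpoint
```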
Benchmark coverage
There is no single official LangGraph benchmark score; the framework's performance depends on the underlying model, the graph design, and the tools made available. Community submissions to public leaderboards (SWE-bench Verified, GAIA, WebArena, Tau-Bench) include LangGraph-based configurations, and LangChain's own blog posts publish reference numbers. The picture below is assembled from those community submissions and LangChain blog disclosures, not from a single official source.
The general pattern: LangGraph submissions sit in the strong-but-not-top tier of open agent frameworks. They trail the proprietary scaffolds from Anthropic and OpenAI on engineering benchmarks like SWE-bench Verified by 15-25 points, but they are competitive with other open frameworks (SWE-agent, Aider, CrewAI) and outperform plain ReAct-style harnesses by a meaningful margin on structured workflows.
SWE-bench Verified performance
SWE-bench Verified is the most-watched coding-agent benchmark in 2026, and LangGraph submissions on the public leaderboard land in the 35-55 percent range depending on the underlying model and graph design. The strongest published LangGraph SWE-bench configurations use frontier models (Claude 4-class or GPT-5-class), explicit issue-decomposition graphs (separate nodes for issue understanding, code search, patch authoring, test verification, and self-correction), and access to the same tool inventory as SWE-agent (file search, terminal, test runner, editor).
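As a rough sketch, that kind of issue-decomposition graph can be wired up as follows. The node names, stub bodies, and five-revision budget are illustrative; published configurations differ in their prompts, tool inventories, and retry logic:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class RepairState(TypedDict):
    issue: str
    relevant_files: list[str]
    patch: str
    tests_passed: bool
    revisions: int


# Stub node bodies; a real harness wraps an LLM call or a tool
# (file search, editor, test runner) in each one and returns a partial state update.
def understand_issue(s: RepairState) -> dict:
    return {}

def search_code(s: RepairState) -> dict:
    return {"relevant_files": ["src/example.py"]}

def author_patch(s: RepairState) -> dict:
    return {"patch": "...", "revisions": s["revisions"] + 1}

def run_tests(s: RepairState) -> dict:
    return {"tests_passed": False}

def self_correct(s: RepairState) -> dict:
    return {}


builder = StateGraph(RepairState)
for name, fn in [("understand_issue", understand_issue), ("search_code", search_code),
                 ("author_patch", author_patch), ("run_tests", run_tests),
                 ("self_correct", self_correct)]:
    builder.add_node(name, fn)

builder.add_edge(START, "understand_issue")
builder.add_edge("understand_issue", "search_code")
builder.add_edge("search_code", "author_patch")
builder.add_edge("author_patch", "run_tests")
builder.add_conditional_edges(
    "run_tests",
    # Loop back through self-correction until tests pass or the revision budget runs out.
    lambda s: "done" if s["tests_passed"] or s["revisions"] >= 5 else "revise",
    {"revise": "self_correct", "done": END},
)
builder.add_edge("self_correct", "author_patch")

graph = builder.compile()
result = graph.invoke({"issue": "example bug report", "relevant_files": [],
                       "patch": "", "tests_passed": False, "revisions": 0})
```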
The 15-25 point gap to the absolute frontier (low-to-mid 70s) reflects two things. First, the proprietary scaffolds from Anthropic and OpenAI have been tuned specifically for SWE-bench Verified across many iterations; LangGraph is general-purpose and has not been over-fit to this single benchmark. Second, the LangGraph submissions usually use simpler graphs than the proprietary scaffolds; the latter sometimes include best-of-n inference, extended reasoning budgets, and multi-agent verification loops that the open submissions do not match.
The practical implication is that LangGraph is a credible production choice for coding agents but is not the way to win a SWE-bench Verified leaderboard race. Teams that need 60+ percent on SWE-bench typically use a more bespoke harness; teams that need 50 percent on SWE-bench plus deployment, tracing, and human-in-the-loop tooling typically use LangGraph. See our SWE-bench Verified deep dive for the wider context and our coding-agent benchmark comparison for framework alternatives.
GAIA performance
GAIA, the general-assistant benchmark from Meta, has seen growing LangGraph adoption since 2024. The graph-state model fits naturally with multi-step assistant workflows: planning nodes decompose the task, retrieval nodes gather information from web search or files, verification nodes check intermediate answers, and a synthesis node produces the final response. Published LangGraph GAIA configurations score 40-55 percent overall in May 2026, with stronger performance on Level 1 and Level 2 than on Level 3.
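The same construction pattern applies here. A compact, illustrative version of that plan / retrieve / verify / synthesize loop, with one-line stubs standing in for the real LLM and search calls:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class AssistantState(TypedDict):
    question: str
    evidence: list[str]
    answer: str
    verified: bool
    rounds: int


# One-line stubs; real nodes wrap an LLM call, a web-search tool, or a file reader.
def plan(s: AssistantState) -> dict:       return {}
def retrieve(s: AssistantState) -> dict:   return {"evidence": s["evidence"] + ["snippet"], "rounds": s["rounds"] + 1}
def verify(s: AssistantState) -> dict:     return {"verified": len(s["evidence"]) >= 2}
def synthesize(s: AssistantState) -> dict: return {"answer": "final response"}


builder = StateGraph(AssistantState)
for name, fn in [("plan", plan), ("retrieve", retrieve),
                 ("verify", verify), ("synthesize", synthesize)]:
    builder.add_node(name, fn)

builder.add_edge(START, "plan")
builder.add_edge("plan", "retrieve")
builder.add_edge("retrieve", "verify")
builder.add_conditional_edges(
    "verify",
    # Re-enter retrieval until the draft answer verifies or the budget is spent.
    lambda s: "write" if s["verified"] or s["rounds"] >= 4 else "more",
    {"more": "retrieve", "write": "synthesize"},
)
builder.add_edge("synthesize", END)

graph = builder.compile()
```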
The Level 3 ceiling (around 25-35 percent for LangGraph submissions) reflects the same headroom seen across all frameworks: Level 3 questions require multi-source research with synthesis and verification across long horizons, where any framework will struggle without strong reasoning models and careful tool integration. LangGraph's state model helps but does not fundamentally change the difficulty.
Browser and computer-use benchmarks
For browser work, LangGraph integrates with BrowserGym and similar libraries to provide a structured action space over rendered HTML. Published configurations score 30-45 percent on WebArena depending on the underlying model and whether vision grounding is included. The LangGraph + BrowserGym + frontier model combination is competitive with custom browser-agent scaffolds but trails the absolute frontier achieved by purpose-built browser agents like Anthropic Computer Use or OpenAI Operator.
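Roughly, that integration wraps the browser environment's standard gymnasium reset/step interface in graph nodes. The sketch below makes several assumptions worth flagging: the environment id, task kwargs, and the noop() action string are illustrative, and the real decide node would be an LLM call that reads the page observation; check the BrowserGym documentation for the ids and action syntax registered by the packages you have installed.

```python
from typing import TypedDict

import gymnasium as gym
import browsergym.core  # noqa: F401  (importing registers the browsergym environments)

from langgraph.graph import StateGraph, START, END


class BrowserState(TypedDict):
    goal: str
    observation: dict
    action: str
    steps: int
    done: bool


# Illustrative environment id and task kwargs; the ids actually available
# depend on which browsergym packages (core, webarena, ...) are installed.
env = gym.make("browsergym/openended", task_kwargs={"start_url": "https://example.com"})
obs, info = env.reset()


def decide(state: BrowserState) -> dict:
    # Stand-in for an LLM call that reads the observation and emits one action
    # string in BrowserGym's action syntax; "noop()" is just a placeholder.
    return {"action": "noop()"}


def act(state: BrowserState) -> dict:
    # Standard gymnasium step; BrowserGym executes the action in the live browser.
    obs, reward, terminated, truncated, info = env.step(state["action"])
    return {"observation": obs, "steps": state["steps"] + 1,
            "done": terminated or truncated}


builder = StateGraph(BrowserState)
builder.add_node("decide", decide)
builder.add_node("act", act)
builder.add_edge(START, "decide")
builder.add_edge("decide", "act")
builder.add_conditional_edges(
    "act",
    # Small step budget so the sketch always terminates.
    lambda s: "stop" if s["done"] or s["steps"] >= 10 else "continue",
    {"continue": "decide", "stop": END},
)
graph = builder.compile()

# graph.invoke({"goal": "find the pricing page", "observation": obs,
#               "action": "", "steps": 0, "done": False})
```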
For OSWorld-style computer-use benchmarks, LangGraph is less commonly used than vendor-native scaffolds. The low-level keyboard-and-mouse action space is awkward to express as graph nodes, and community submissions are sparse. We have not seen credible OSWorld scores using LangGraph; the framework was not designed for that workload, which is a reasonable scoping decision rather than a weakness.
When LangGraph helps and when it does not
LangGraph adds the most value when (a) the agent workflow has meaningful structure that benefits from explicit state and branching, (b) the deployment context wants tracing, observability, and human-in-the-loop tooling, and (c) the team is already invested in the LangChain ecosystem. For these cases, the framework adds 5-15 points compared to a plain ReAct loop with the same model and tools.
LangGraph adds less value, or even subtracts value, when (a) the agent workflow is a simple single-loop ReAct pattern (a lighter framework or hand-rolled loop is faster to develop and equally performant), (b) the goal is pure benchmark score where every point matters (purpose-built scaffolds outperform), or (c) the team is starting fresh with no LangChain investment and the rest of the stack is more naturally aligned with a different framework. We have seen production agents that switched from LangGraph to a simpler design after determining that the graph abstraction added complexity without proportional benefit.
A useful heuristic: if you can sketch your agent workflow as a graph on a whiteboard with at least three branches and a loop, LangGraph likely fits. If you sketch it as "think, act, observe, repeat until done", a lighter framework likely fits. The graph-vs-loop distinction is the right framing for the framework choice.
How to read a LangGraph benchmark number
Three checks before quoting a LangGraph benchmark score. First, what model is underneath? A LangGraph + Claude 4 score and a LangGraph + Llama-3-70B score are not directly comparable. The framework adds a small bonus over the underlying model; the bigger driver is the model itself. Second, what graph topology? Submissions vary widely in graph design, and a sophisticated multi-stage graph can outperform a simple linear graph by 10+ points on the same model. Third, is the score on the official leaderboard or a blog post? Blog-post numbers are often best-case under unusually generous configurations; leaderboard numbers are reproducible.
When citing LangGraph benchmark performance, the honest pattern is "LangGraph + [model] reaches X percent on [benchmark] with [graph topology], according to [source]". Vague claims like "LangGraph scores 50 percent on SWE-bench" are misleading because they hide the model and configuration. The framework is a meaningful contributor to the score but rarely the dominant one.
Sources
- [1] LangGraph documentation. langchain-ai.github.io/langgraph. Accessed May 2026.
- [2] LangGraph repository. github.com/langchain-ai/langgraph.
- [3] LangChain Blog: SWE-bench Verified configurations and reference graphs. blog.langchain.dev.
- [4] SWE-bench Verified leaderboard with framework annotations. swebench.com.
- [5] LangSmith reference benchmarks. smith.langchain.com.