Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What · Multi-agent conversation framework; agents talk to each other through structured messages with role-based system prompts
Who · Microsoft Research (original) + AG2 community fork, 2023 (arXiv:2308.08155)
Bench Tier · SWE-bench Verified 30-45% (community); MATH +5-12 points multi-agent over single-agent
Repository · github.com/ag2ai/ag2
Section II.viii · Agent Frameworks · Last verified April 2026

AutoGen and AG2: Multi-Agent Conversations as a First-Class Pattern

The framework that made multi-agent collaboration a first-class production pattern. Roles, system prompts, structured conversations between agents. Where the pattern adds 5-15 points to benchmark scores, where it adds latency without value, and where the AG2 community fork stands relative to the original Microsoft Research project.

01

What AutoGen is

AutoGen, introduced by Wu et al. at Microsoft Research in late 2023, is a multi-agent conversation framework. The defining abstraction is the conversational agent: each agent has a role (planner, coder, reviewer, user-proxy), a system prompt that defines its responsibilities, and a tool inventory it can call. Agents communicate through structured messages, and the framework orchestrates the conversation flow with optional human-in-the-loop steps.
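
In code, the minimal form of that abstraction is a user-proxy agent paired with an assistant agent. The sketch below uses the classic AG2 / AutoGen 0.2-style Python API; the model name, API key, working directory, and prompts are placeholders, and parameter names may differ slightly between versions.

```python
# Minimal user-proxy + assistant pair, sketched with the classic AG2 /
# AutoGen 0.2-style Python API ("pip install ag2", imported as `autogen`).
# The model name, API key, working directory, and prompts are placeholders.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

# The assistant's role and responsibilities live in its system message.
assistant = AssistantAgent(
    name="assistant",
    system_message="You are a careful coding assistant. Reply TERMINATE when the task is done.",
    llm_config=llm_config,
)

# The user proxy drives the conversation and can execute code the assistant proposes.
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # "ALWAYS" turns this into the human-in-the-loop variant
    code_execution_config={"work_dir": "scratch", "use_docker": False},
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or ""),
)

# The framework orchestrates the structured message exchange between the two roles.
user_proxy.initiate_chat(
    assistant,
    message="Write a function that checks whether a string is a palindrome, then test it.",
)
```

Setting human_input_mode to "ALWAYS" on the user proxy turns the same loop into the optional human-in-the-loop variant mentioned above.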

The framework's design philosophy is that many real-world workflows are easier to model as conversations between specialised roles than as a single monolithic agent. A code-generation task can be expressed as a coder agent and a reviewer agent exchanging proposals. A research task can be expressed as a planner agent decomposing the question and an executor agent answering each sub-query. A complex enterprise workflow can be expressed as a crew of specialists each handling their domain.

AutoGen has gone through a notable governance shift. The original Microsoft Research project transitioned much of its open-source community work to AG2 (formerly AutoGen 2.0), a community-stewarded fork at ag2.ai. Microsoft continues to maintain its own distribution under the original AutoGen name. The two distributions are largely API-compatible but have diverged in some advanced features (typed message contracts, observability hooks, deployment patterns). Most published benchmarks since mid-2024 use AG2; for new projects starting in 2026, AG2 is the more actively maintained option.

02

Multi-agent patterns

AutoGen and AG2 support several multi-agent patterns. The patterns differ in how agents are organised, how they take turns, and how the framework decides when the conversation is complete. Choosing the right pattern is the single biggest determinant of whether multi-agent adds value or only adds cost.

Pattern · When it helps
Coder + Reviewer · Two agents: one writes code, the other reviews and requests changes. The most-used pattern for SWE-bench-style work; adds 5-10 points over single-agent.
Planner + Executor · One agent decomposes the task into sub-steps; another executes each sub-step. Useful for complex multi-step workflows like GAIA Level 3.
Multi-agent debate · Several agents argue for different positions; a judge selects the winner. The original AutoGen paper used this for math; useful for ambiguous problems with multiple plausible approaches.
Specialised crew · Each agent has a specific domain expertise (database expert, API expert, security expert). Useful for cross-functional tasks that benefit from explicit specialisation.
User-proxy + assistant · The minimum viable multi-agent setup: a user-proxy agent drives interaction with an assistant agent. The default AutoGen pattern; useful when the workflow needs a clear separation between user input and assistant action.

The five patterns above account for the majority of production AutoGen deployments. The right pattern depends on the work: code generation suits coder-and-reviewer; research suits planner-and-executor; ambiguous problems suit multi-agent debate; cross-functional tasks suit specialised crew; the rest typically default to user-proxy-and-assistant. Choosing badly can add latency and cost without lifting the result; choosing well can add 5-15 points on benchmarks that genuinely benefit from the additional structure.
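
As a concrete illustration of one of these patterns, the sketch below wires a planner and an executor into a GroupChat, again using the classic AG2 / AutoGen 0.2-style API. Agent names, prompts, round limits, and the model entry are illustrative, not a published configuration.

```python
# Planner + executor sketch using GroupChat turn-taking, with the classic
# AG2 / AutoGen 0.2-style API. Names, prompts, and the model entry are illustrative.
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

planner = AssistantAgent(
    name="planner",
    system_message="Decompose the task into numbered sub-steps. Do not execute anything yourself.",
    llm_config=llm_config,
)
executor = AssistantAgent(
    name="executor",
    system_message="Carry out one sub-step at a time and report the result back to the planner.",
    llm_config=llm_config,
)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,  # no local code execution in this sketch
)

# The manager selects the next speaker each round until max_round is reached.
group_chat = GroupChat(agents=[user_proxy, planner, executor], messages=[], max_round=12)
manager = GroupChatManager(groupchat=group_chat, llm_config=llm_config)

user_proxy.initiate_chat(manager, message="Compare three open-source agent frameworks and summarise the trade-offs.")
```

Swapping the planner and executor for a coder and a reviewer, or for a panel of debaters and a judge, changes the pattern without changing the orchestration machinery.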

03

Benchmark coverage

AutoGen does not have a single official benchmark headline; the original paper reported per-task improvements on math, coding, and reasoning, and community submissions cover SWE-bench Verified, GAIA, RAG benchmarks, and various other agent benchmarks. The picture below is assembled from the original paper and community submissions, not from a single official source.

Benchmark · Reported tier · Note
SWE-bench Verified · 30-45% (community submissions) · Multi-agent coder + reviewer configurations reach the high 40s; single-agent configurations trail by 5-10 points.
MATH (mathematics) · Multi-agent adds 5-12 points · Original AutoGen paper baseline; multi-agent debate improves over single-agent on hard math problems.
HumanEval · Saturated; +1-3 points from multi-agent · Easy single-function code; multi-agent overhead is not worth it on simple problems.
GAIA · 35-50% with multi-agent · Planner-and-researcher decomposition helps on Level 2 and Level 3 tasks.
RAG benchmarks · Variable; framework neutral · Multi-agent helps on multi-hop retrieval; the underlying retrieval quality dominates.

04

Multi-agent on SWE-bench Verified

SWE-bench Verified is the most-watched coding-agent benchmark, and AutoGen submissions cluster in the 30-45 percent range depending on configuration. The strongest published AutoGen SWE-bench configurations use a coder-plus-reviewer pattern with frontier models: the coder proposes a patch, the reviewer (often using the same underlying model with a different system prompt) evaluates the patch and either approves or requests specific changes. This pattern adds roughly 5-10 points over a single-agent baseline on the same model.

The reviewer-pattern improvement reflects something genuine about how SWE-bench tasks fail. The single most common failure mode in single-agent submissions is patch authoring that introduces test regressions: the patch fixes the failing test but breaks other tests in the suite. A reviewer agent with explicit instructions to check for regressions catches a meaningful fraction of these before submission. The improvement is largest on tasks with tight test suites; on tasks with loose tests, the reviewer pattern adds latency without much benefit.
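
A hedged sketch of how that regression-checking reviewer might be expressed: the reviewer's system prompt carries the explicit regression instruction, and the coder stops once it receives an approval token. Classic AG2 / AutoGen 0.2-style API assumed; the prompts, the APPROVED token, and the model entry are illustrative, and this is not a published SWE-bench harness.

```python
# Coder + reviewer sketch: the reviewer is explicitly instructed to check for
# test regressions, and the chat ends once the coder receives an approval token.
# Prompts, the APPROVED token, and the model entry are illustrative.
from autogen import AssistantAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

coder = AssistantAgent(
    name="coder",
    system_message="Propose a minimal patch that makes the failing test pass. Revise when the reviewer requests changes.",
    llm_config=llm_config,
    max_consecutive_auto_reply=6,  # bound the number of revision rounds
    # Stop once the reviewer's reply contains the approval token.
    is_termination_msg=lambda m: "APPROVED" in (m.get("content") or ""),
)

reviewer = AssistantAgent(
    name="reviewer",
    system_message=(
        "Review the proposed patch. Check specifically whether it could break other tests in the "
        "suite, not only the failing one. Reply APPROVED only when no regression risk remains; "
        "otherwise list the specific changes required."
    ),
    llm_config=llm_config,
)

# The coder opens the exchange with the task; the two agents then alternate.
coder.initiate_chat(reviewer, message="Failing test and candidate patch follow:\n<task context here>")
```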

AutoGen SWE-bench scores trail the leaderboard frontier (low-to-mid 70s) by 20-30 points. The gap reflects the same pattern seen across open frameworks: proprietary scaffolds from Anthropic and OpenAI win on absolute score; open frameworks win on production-deployable structure. AutoGen with strong multi-agent patterns is competitive with LangGraph and CrewAI in this tier.

05

Multi-agent on MATH

The original AutoGen paper documented its largest single-task improvement on MATH (the math-competition dataset). Multi-agent debate configurations (where two or more agents argue for different solutions and a judge picks the winner) improved over single-agent baselines by 5-12 points on hard math problems. The improvement was largest on problems where the model's first-attempt reasoning was wrong but at least one of several alternative reasoning paths was correct; the debate format allowed the framework to surface the correct path.
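
A minimal sketch of the debate pattern in that spirit, assuming the classic AG2 / AutoGen 0.2-style API with round-robin turn-taking: two solvers argue, a judge decides, and a moderator proxy ends the chat on the judge's termination token. Prompts and the model entry are illustrative.

```python
# Multi-agent debate sketch: two solvers argue, a judge decides, a moderator
# proxy ends the chat on the judge's termination token. Prompts and the model
# entry are illustrative.
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

solver_a = AssistantAgent(
    name="solver_a",
    system_message="Solve the problem step by step, then defend your answer against solver_b's critique.",
    llm_config=llm_config,
)
solver_b = AssistantAgent(
    name="solver_b",
    system_message="Solve the problem independently, then critique solver_a's reasoning.",
    llm_config=llm_config,
)
judge = AssistantAgent(
    name="judge",
    system_message="Once both solvers have argued, state which answer is correct and why, then say TERMINATE.",
    llm_config=llm_config,
)
moderator = UserProxyAgent(
    name="moderator",
    human_input_mode="NEVER",
    code_execution_config=False,
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or ""),
)

# Fixed turn order: solver_a, solver_b, judge, repeated until TERMINATE or max_round.
debate = GroupChat(
    agents=[moderator, solver_a, solver_b, judge],
    messages=[],
    max_round=8,
    speaker_selection_method="round_robin",
)
manager = GroupChatManager(groupchat=debate, llm_config=llm_config)

moderator.initiate_chat(manager, message="If x + 1/x = 3, what is x^3 + 1/x^3?")
```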

The MATH improvement was one of the original motivations for multi-agent frameworks generally; subsequent papers have reproduced similar improvements on related math and reasoning tasks. The improvement is smaller on saturated benchmarks where the underlying model is already near-perfect (HumanEval): the model gets it right first time, and the additional rounds add cost without value. The pattern generalises: multi-agent helps where the underlying model is competent but error-prone, and helps less where the model is either uniformly correct or uniformly wrong.

06

When to use AutoGen and when not

Use AutoGen / AG2 when your workflow naturally decomposes into roles that can communicate. The clearest fit is multi-agent debate, planner-and-executor, and coder-and-reviewer patterns. The framework adds the most value when the underlying model is competent but error-prone in ways that a different perspective can catch (verification, decomposition, specialised expertise).

Avoid AutoGen when (a) the workflow is a simple single-loop pattern, (b) latency is critical and multi-agent adds round-trips, (c) the team is starting from scratch and wants the smallest viable framework, or (d) the underlying model is uniformly strong on the task and the multi-agent overhead does not improve the result. We have seen production teams switch from AutoGen to a single-agent framework after determining that the multi-agent overhead was not paying for itself.

The framework competes with LangGraph and CrewAI in roughly the same problem space; choose based on how you naturally model the workflow (conversation, graph, or crew). The OpenAI Agents SDK covers similar ground for OpenAI-first stacks.

07

How to read an AutoGen benchmark number

Three checks before quoting an AutoGen benchmark score. First, is it the AG2 fork or the original Microsoft AutoGen? Most current submissions are AG2; the two are mostly compatible but have diverged in some advanced features. Second, what multi-agent pattern is configured? A single-agent AutoGen score is essentially a baseline-model score; a multi-agent score shows the framework's genuine contribution. Third, what underlying model? AutoGen + Claude 4 and AutoGen + Llama-3-70B are different starting points.

The honest pattern when citing AutoGen performance is: "AutoGen [or AG2] in [pattern] configuration with [model] reaches X percent on [benchmark], according to [source]". Vague claims hide the configuration that does most of the work. The framework is one ingredient in the score, not the whole dish.

Editor's verdict · AutoGen and AG2 are the canonical multi-agent frameworks. Use the multi-agent patterns when the work decomposes naturally into roles; skip them when single-agent suffices. Benchmarks confirm that multi-agent adds 5-15 points on tasks that benefit from verification, decomposition, or specialisation; less elsewhere.
Reader Questions
Q.01 · What is AutoGen?
AutoGen is the multi-agent conversation framework released by Microsoft Research in late 2023. The defining abstraction is conversational agents: each agent has a role (planner, coder, reviewer, user-proxy) and a system prompt that defines its responsibilities. Agents talk to each other through structured messages, and the framework orchestrates the conversation flow with optional human-in-the-loop steps. AutoGen's design emphasises multi-agent collaboration as a first-class pattern; many production agents that use the framework run two-to-five-agent teams rather than single agents.
Q.02 · What is AG2 and how does it relate to AutoGen?
AG2 (formerly AutoGen 2.0) is the community-driven fork of the original Microsoft AutoGen project, hosted at ag2.ai. The original AutoGen team transitioned the project's open-source steward role to a community foundation in early 2024; AG2 is the actively developed fork that most contributors and users now follow. Microsoft continues to maintain its own AutoGen distribution under the original name. The two distributions are largely compatible at the API level but have diverged in some advanced features. Most published benchmarks since mid-2024 use AG2.
Q.03 · What benchmarks does AutoGen target?
AutoGen does not have a single official benchmark target. The original Microsoft Research paper at arXiv:2308.08155 reported multi-agent improvements on math (MATH dataset), coding (HumanEval), and reasoning (Q&A datasets), with multi-agent configurations adding 5-15 points over single-agent baselines on the same models. Community benchmarks since then have focused on SWE-bench Verified, GAIA, and various RAG datasets. AG2 publishes reference benchmarks on its documentation site for new releases.
Q.04 · Where does AutoGen sit on SWE-bench Verified?
AutoGen submissions on SWE-bench Verified in 2026 typically score 30-45 percent depending on the underlying model and the multi-agent configuration. Strong configurations (multi-agent with explicit coder-and-reviewer roles, frontier models, structured tool inventory) reach the high 40s. AutoGen is in the same broad tier as LangGraph and CrewAI for SWE-bench coding work; the framework choice matters less than the underlying model and the agent topology. The proprietary single-agent scaffolds from Anthropic and OpenAI lead the leaderboard with low-to-mid 70s.
Q.05 · Do multi-agent patterns help on benchmarks?
Sometimes. The original AutoGen paper showed that multi-agent configurations (e.g. coder + reviewer + tester) improve scores by 5-15 points over single-agent baselines on math and coding tasks. The improvement is real but not universal. On simple tasks (where the bottleneck is one model's reasoning), multi-agent adds latency and cost without lifting the score. On complex tasks (where the bottleneck is verification, decomposition, or specialised expertise), multi-agent adds meaningful capability. The pattern is most useful when the work naturally decomposes into clear sub-roles.
Q.06 · When should I use AutoGen vs LangGraph or CrewAI?
Use AutoGen / AG2 when your workflow is naturally conversational and benefits from multi-agent patterns: agents arguing for and against a position, a planner-and-executor pair, a code-generator-plus-code-reviewer team. Use LangGraph when your workflow is naturally graph-shaped with explicit state and branching. Use CrewAI when your workflow is naturally crew-shaped with specialised roles working through a sequence of tasks. The three frameworks cover overlapping but distinct design spaces; the right choice depends on how you naturally model the work.

Sources

1. Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
2. AG2 (community fork) project site. ag2.ai. Accessed May 2026.
3. AG2 repository. github.com/ag2ai/ag2.
4. Original Microsoft AutoGen repository. github.com/microsoft/autogen.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
