Independent reference. No vendor affiliation. Scores cited with source and capture date.
Abstract
What: Multi-agent conversation framework; agents talk to each other through structured messages with role-based system prompts
Who: Microsoft Research (original) + AG2 community fork, 2023 (arXiv:2308.08155)
Bench Tier: Same broad tier as other open frameworks; scores are configuration-specific. Original paper showed multi-agent lifts over single-agent on math and reasoning.
Repository: github.com/ag2ai/ag2
Section II.viii · Agent Frameworks | Reviewed 2026

AutoGen and AG2: Multi-Agent Conversations as a First-Class Pattern

The framework that made multi-agent collaboration a first-class production pattern. Roles, system prompts, structured conversations between agents. Where the pattern adds 5-15 points to benchmark scores, where it adds latency without value, and where the AG2 community fork stands relative to the original Microsoft Research project.

01

What AutoGen is

AutoGen, introduced by Wu et al. at Microsoft Research in late 2023, is a multi-agent conversation framework. The defining abstraction is the conversational agent: each agent has a role (planner, coder, reviewer, user-proxy), a system prompt that defines its responsibilities, and a tool inventory it can call. Agents communicate through structured messages, and the framework orchestrates the conversation flow with optional human-in-the-loop steps.
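The abstraction can be sketched in plain Python. This is not the AutoGen or AG2 API: the `Message` and `Agent` classes, the `run_chat` loop, the `TERMINATE` stop word as used here, and the canned reply functions are all hypothetical stand-ins for model calls, and tools are left out. The sketch shows only how role, system prompt, and turn-taking fit together.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    sender: str   # name of the agent that produced the message
    content: str  # message text


@dataclass
class Agent:
    name: str           # role label, e.g. "user_proxy" or "assistant"
    system_prompt: str  # describes the agent's responsibilities
    # Hypothetical stand-in for an LLM call: (system_prompt, history) -> reply text.
    reply_fn: Callable[[str, List[Message]], str]

    def reply(self, history: List[Message]) -> Message:
        return Message(self.name, self.reply_fn(self.system_prompt, history))


def run_chat(first: Agent, second: Agent, opening: str, max_turns: int = 6) -> List[Message]:
    """Alternate between two agents until one says TERMINATE or turns run out."""
    history = [Message(first.name, opening)]
    speakers = [second, first]  # the second agent answers the opening message
    for turn in range(max_turns):
        message = speakers[turn % 2].reply(history)
        history.append(message)
        if "TERMINATE" in message.content:
            break
    return history


# Canned replies so the sketch runs without a model.
def assistant_reply(system_prompt: str, history: List[Message]) -> str:
    return f"Draft answer to: {history[-1].content}"


def user_proxy_reply(system_prompt: str, history: List[Message]) -> str:
    return "Looks good. TERMINATE"


assistant = Agent("assistant", "You answer the user's request.", assistant_reply)
user_proxy = Agent("user_proxy", "You relay the request and accept or reject answers.", user_proxy_reply)

transcript = run_chat(user_proxy, assistant, "Summarise the release notes.")
for m in transcript:
    print(f"{m.sender}: {m.content}")
```

A human-in-the-loop step would fit where `user_proxy_reply` sits: the function could prompt a person instead of returning a canned string.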

The framework's design philosophy is that many real-world workflows are easier to model as conversations between specialised roles than as a single monolithic agent. A code-generation task can be expressed as a coder agent and a reviewer agent exchanging proposals. A research task can be expressed as a planner agent decomposing the question and an executor agent answering each sub-query. A complex enterprise workflow can be expressed as a crew of specialists, each handling its own domain.
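The research example can be made concrete with a small sketch. The `plan`, `execute`, and `planner_executor` functions are hypothetical: a real planner and executor would be model calls, and here the planner is a string split and the executor is a dictionary lookup, chosen only so the decomposition step is visible and the code runs without a model.

```python
from typing import Dict, List, Tuple


def plan(question: str) -> List[str]:
    """Hypothetical planner: split a compound question into sub-queries.

    Splitting on ' and ' is a stand-in for an LLM decomposition step.
    """
    return [part.strip().rstrip("?") + "?" for part in question.split(" and ")]


def execute(sub_query: str, knowledge: Dict[str, str]) -> str:
    """Hypothetical executor: answer one sub-query from a lookup table."""
    return knowledge.get(sub_query, "unknown")


def planner_executor(question: str, knowledge: Dict[str, str]) -> List[Tuple[str, str]]:
    """Run the planner once, then hand each sub-query to the executor."""
    return [(q, execute(q, knowledge)) for q in plan(question)]


knowledge = {
    "Who wrote the AutoGen paper?": "Wu et al. (arXiv:2308.08155)",
    "where is the AG2 repository?": "github.com/ag2ai/ag2",
}
results = planner_executor(
    "Who wrote the AutoGen paper and where is the AG2 repository?", knowledge
)
for sub_query, answer in results:
    print(f"{sub_query} -> {answer}")
```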

AutoGen has gone through a notable governance shift. The original Microsoft Research project transitioned much of its open-source community work to AG2 (formerly AutoGen), a community-stewarded fork at ag2.ai. Microsoft continues to maintain its own distribution under the original AutoGen name. The two distributions are largely API-compatible but have diverged in some advanced features (typed message contracts, observability hooks, deployment patterns). Most published benchmarks since mid-2024 use AG2; for new projects starting in 2026, AG2 is the more actively maintained option.

02

Multi-agent patterns

AutoGen and AG2 support several multi-agent patterns. The patterns differ in how agents are organised, how they take turns, and how the framework decides when the conversation is complete. Choosing the right pattern is the single biggest determinant of whether multi-agent adds value or only adds cost.

| Pattern | When it helps |
|---|---|
| Coder + Reviewer | Two agents: one writes code, the other reviews and requests changes. The most-used pattern for SWE-bench-style work; a reviewer that checks for test regressions catches a meaningful fraction of single-agent failures. |
| Planner + Executor | One agent decomposes the task into sub-steps; another executes each sub-step. Useful for complex multi-step workflows like GAIA Level 3. |
| Multi-agent debate | Several agents argue for different positions; a judge selects the winner. The original AutoGen paper used this for math; useful for ambiguous problems with multiple plausible approaches. |
| Specialised crew | Each agent has a specific domain expertise (database expert, API expert, security expert). Useful for cross-functional tasks that benefit from explicit specialisation. |
| User-proxy + assistant | The minimum viable multi-agent: a user-proxy agent drives interaction with an assistant agent. The default AutoGen pattern; useful when the workflow needs a clear separation between user-input and assistant-action. |

The five patterns above account for the majority of production AutoGen deployments. The right pattern depends on the work: code generation suits coder-and-reviewer; research suits planner-and-executor; ambiguous problems suit multi-agent debate; cross-functional tasks suit specialised crew; the rest typically default to user-proxy-and-assistant. Choosing badly can add latency and cost without lifting the result; choosing well genuinely helps on tasks that benefit from the additional structure.
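The mapping in the paragraph above is small enough to write down as a lookup. The names `PATTERN_BY_WORK`, `DEFAULT_PATTERN`, and `pick_pattern` are illustrative, and the category strings are this sketch's own labels for the kinds of work the paragraph lists; treat it as a starting default rather than a rule.

```python
# Kind of work -> suggested multi-agent pattern, as listed in the text.
PATTERN_BY_WORK = {
    "code generation": "coder-and-reviewer",
    "research": "planner-and-executor",
    "ambiguous problem": "multi-agent debate",
    "cross-functional task": "specialised crew",
}

# Everything else falls back to the minimal two-agent setup.
DEFAULT_PATTERN = "user-proxy-and-assistant"


def pick_pattern(kind_of_work: str) -> str:
    """Return the suggested pattern for a kind of work, or the default."""
    return PATTERN_BY_WORK.get(kind_of_work.strip().lower(), DEFAULT_PATTERN)


print(pick_pattern("Research"))
print(pick_pattern("customer support"))
```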

03

Benchmark coverage

AutoGen does not have a single official benchmark headline; the original paper reported per-task improvements on math, coding, and reasoning, and community submissions cover SWE-bench Verified, GAIA, RAG benchmarks, and various other agent benchmarks. We do not reprint a per-benchmark score table here, because any AutoGen score is a property of a specific configuration (underlying model, multi-agent pattern, tool inventory) rather than of the framework, and a bare grid of percentages strips that context away and goes stale as the frontier moves. For the original per-task figures, read the AutoGen paper; for current leaderboard standings, the framework-annotated SWE-bench Verified leaderboard shows the model and configuration behind each AutoGen-using submission. To pick a benchmark to start from, use the task picker on the homepage.

04

Multi-agent on SWE-bench Verified

SWE-bench Verified is the most-watched coding-agent benchmark. The strongest published AutoGen SWE-bench configurations use a coder-plus-reviewer pattern with frontier models: the coder proposes a patch, the reviewer (often using the same underlying model with a different system prompt) evaluates the patch and either approves or requests specific changes. This pattern measurably outperforms a single-agent baseline on the same model, though the exact margin depends on the model and the task set.

The reviewer-pattern improvement reflects something genuine about how SWE-bench tasks fail. The single most common failure mode in single-agent submissions is patch authoring that introduces test regressions: the patch fixes the failing test but breaks other tests in the suite. A reviewer agent with explicit instructions to check for regressions catches a meaningful fraction of these before submission. The improvement is largest on tasks with tight test suites; on tasks with loose tests, the reviewer pattern adds latency without much benefit.
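A toy version of the regression check shows why the reviewer helps. Everything here is an assumption made for illustration: the "patch" is a Python function, the suite has two tests, the `coder` returns canned candidates instead of calling a model, and the `reviewer` simply runs the whole suite. It is not how any published AutoGen SWE-bench configuration is implemented.

```python
from typing import Callable, Dict, List, Optional, Tuple

Patch = Callable[[int], int]    # a candidate implementation of absolute value
Test = Callable[[Patch], bool]  # returns True when the patch passes

# The originally failing test plus one test that passed before the patch.
SUITE: Dict[str, Test] = {
    "test_negative_input": lambda f: f(-3) == 3,  # the bug being fixed
    "test_positive_input": lambda f: f(4) == 4,   # must not regress
}


def coder(attempt: int) -> Patch:
    """Hypothetical coder: canned patches standing in for model output."""
    candidates: List[Patch] = [
        lambda x: -x,                  # fixes the target test, breaks the other
        lambda x: -x if x < 0 else x,  # fixes the target test without regressions
    ]
    return candidates[min(attempt, len(candidates) - 1)]


def reviewer(patch: Patch, suite: Dict[str, Test]) -> List[str]:
    """Run the full suite; an empty list of failures means 'approve'."""
    return [name for name, test in suite.items() if not test(patch)]


def coder_reviewer_loop(
    suite: Dict[str, Test], max_rounds: int = 3
) -> Tuple[Optional[Patch], List[List[str]]]:
    """Ask for patches until the reviewer approves or rounds run out."""
    review_log: List[List[str]] = []
    for attempt in range(max_rounds):
        patch = coder(attempt)
        failures = reviewer(patch, suite)
        review_log.append(failures)
        if not failures:
            return patch, review_log
    return None, review_log


approved, log = coder_reviewer_loop(SUITE)
print(log)
```

A single-agent run would have submitted the first candidate, which passes the target test and fails `test_positive_input`. With a loose suite containing only the target test, the reviewer would approve that same candidate and add a round-trip for nothing.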

AutoGen SWE-bench scores trail the leaderboard frontier set by the purpose-built proprietary scaffolds. The gap reflects the same pattern seen across open frameworks: the proprietary scaffolds win on absolute score; open frameworks win on production-deployable structure. AutoGen with strong multi-agent patterns is competitive with LangGraph and CrewAI in this tier.

05

Multi-agent on MATH

The original AutoGen paper documented its largest single-task improvement on MATH (the math-competition dataset). Multi-agent debate configurations (where two or more agents argue for different solutions and a judge picks the winner) improved over single-agent baselines by 5-12 points on hard math problems. The improvement was largest on problems where the model's first-attempt reasoning was wrong but at least one of several alternative reasoning paths was correct; the debate format allowed the framework to surface the correct path.
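The mechanism described above, a wrong first attempt rescued by an alternative path, can be sketched as follows. The three solvers, the toy problem, and the substitution-based `judge` are assumptions for illustration; in a real debate configuration the solvers and the judge would be model calls, and the judge would not usually have an exact check available.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, int]  # (agent name, proposed answer)

# Toy problem: find the positive integer n with n * (n + 1) == 56.


def solver_a() -> int:
    return 8  # first attempt, off by one


def solver_b() -> int:
    return 7  # alternative reasoning path that happens to be right


def solver_c() -> int:
    return 6  # another wrong path


def judge(candidates: List[Candidate], check: Callable[[int], bool]) -> Candidate:
    """Pick the first candidate that passes the check, else the first attempt."""
    for candidate in candidates:
        if check(candidate[1]):
            return candidate
    return candidates[0]


def is_solution(n: int) -> bool:
    return n > 0 and n * (n + 1) == 56


candidates = [("solver_a", solver_a()), ("solver_b", solver_b()), ("solver_c", solver_c())]
single_agent_answer = candidates[0][1]  # what a single agent would have submitted
winner = judge(candidates, is_solution)

print("single agent:", single_agent_answer, "| debate winner:", winner)
```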

The MATH improvement was one of the original motivations for multi-agent frameworks generally; subsequent papers have reproduced similar improvements on related math and reasoning tasks. The improvement is smaller on saturated benchmarks such as HumanEval, where the underlying model is already near-perfect: the model gets it right the first time, and the additional rounds add cost without value. The pattern generalises: multi-agent helps where the underlying model is competent but error-prone, and helps less where the model is either uniformly correct or uniformly wrong.
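A back-of-envelope model makes the shape of this claim visible. It rests on two strong assumptions that real systems do not satisfy: the k attempts are independent, each correct with probability p, and the judge always recognises a correct attempt. Under those assumptions the success rate is 1 - (1 - p)^k, so the figure below is an upper bound on the lift, not a number from the AutoGen paper. The function name `best_of_k_lift` is illustrative.

```python
def best_of_k_lift(p: float, k: int) -> float:
    """Upper-bound lift from k independent attempts with a perfect judge.

    p is the single-attempt success probability; the return value is the
    gain in success probability over a single attempt.
    """
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return (1.0 - (1.0 - p) ** k) - p


for p in (0.02, 0.50, 0.98):
    lift_points = 100.0 * best_of_k_lift(p, k=3)
    print(f"p = {p:.2f}: at most {lift_points:.1f} points of lift with 3 attempts")
```

The lift peaks for mid-range p and shrinks towards zero at both ends, which matches the text: little to gain when the model is nearly always right, and little to recover when it is nearly always wrong.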

06

When to use AutoGen and when not

Use AutoGen / AG2 when your workflow naturally decomposes into roles that can communicate. The clearest fits are the multi-agent debate, planner-and-executor, and coder-and-reviewer patterns. The framework adds the most value when the underlying model is competent but error-prone in ways that a different perspective can catch (verification, decomposition, specialised expertise).

Avoid AutoGen when (a) the workflow is a simple single-loop pattern, (b) latency is critical and multi-agent adds round-trips, (c) the team is starting from scratch and wants the smallest viable framework, or (d) the underlying model is uniformly strong on the task and the multi-agent overhead does not improve the result. We have seen production teams switch from AutoGen to a single-agent framework after determining that the multi-agent overhead was not paying for itself.
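One way to decide whether the overhead is paying for itself is to run both configurations on the same task set and compare lift against extra model calls. The `RunSummary` fields and the `lift_report` function are hypothetical, and the numbers in the example are placeholders chosen for the arithmetic, not measured results.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunSummary:
    solved: int       # tasks solved
    total: int        # tasks attempted
    model_calls: int  # total model calls made across the run


def lift_report(single: RunSummary, multi: RunSummary) -> Dict[str, float]:
    """Compare a multi-agent run against a single-agent run on the same tasks."""
    if single.total != multi.total or single.total == 0:
        raise ValueError("both runs must cover the same non-empty task set")
    lift_points = 100.0 * (multi.solved - single.solved) / single.total
    extra_calls = multi.model_calls - single.model_calls
    extra_solved = multi.solved - single.solved
    # Infinite cost per extra solve when multi-agent solved nothing extra.
    calls_per_extra_solve = extra_calls / extra_solved if extra_solved > 0 else float("inf")
    return {
        "lift_points": lift_points,
        "call_multiplier": multi.model_calls / single.model_calls,
        "calls_per_extra_solve": calls_per_extra_solve,
    }


# Placeholder numbers, not measurements.
single_run = RunSummary(solved=40, total=100, model_calls=100)
multi_run = RunSummary(solved=46, total=100, model_calls=320)
report = lift_report(single_run, multi_run)
print(report)
```

Whether a given lift justifies a given call multiplier is a product decision; the sketch only puts the two numbers side by side.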

The framework competes with LangGraph and CrewAI in roughly the same problem space; choose based on how you naturally model the workflow (conversation, graph, or crew). The OpenAI Agents SDK covers similar ground for OpenAI-first stacks.

07

How to read an AutoGen benchmark number

Run three checks before quoting an AutoGen benchmark score. First, is it the AG2 fork or the original Microsoft AutoGen? Most current submissions are AG2; the two are mostly compatible but have diverged in some advanced features. Second, what multi-agent pattern is configured? A single-agent AutoGen score is essentially a baseline-model score; a multi-agent score shows the framework's genuine contribution. Third, what is the underlying model? Two AutoGen scores on different underlying models start from different baselines and are not directly comparable.

The honest pattern when citing AutoGen performance is: "AutoGen [or AG2] in [pattern] configuration with [model] reaches X percent on [benchmark], according to [source]". Vague claims hide the configuration that does most of the work. The framework is one ingredient in the score, not the whole dish.
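The citation template can be enforced mechanically. The `ScoreCitation` class below is a hypothetical helper, not part of AutoGen or AG2: it refuses to build a citation when any configuration detail is missing and renders the sentence in the form quoted above. The values in the example are placeholders, not a real result.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ScoreCitation:
    framework: str        # "AutoGen" or "AG2"
    pattern: str          # e.g. "coder-and-reviewer"
    model: str            # underlying model name
    score_percent: float  # reported score, in percent
    benchmark: str        # e.g. "SWE-bench Verified"
    source: str           # where the score was published

    def __post_init__(self) -> None:
        if self.framework not in ("AutoGen", "AG2"):
            raise ValueError("framework must be 'AutoGen' or 'AG2'")
        missing = [f.name for f in fields(self) if getattr(self, f.name) in ("", None)]
        if missing:
            raise ValueError(f"missing configuration detail(s): {', '.join(missing)}")

    def sentence(self) -> str:
        return (
            f"{self.framework} in {self.pattern} configuration with {self.model} "
            f"reaches {self.score_percent:g} percent on {self.benchmark}, "
            f"according to {self.source}"
        )


# Placeholder values only; not a real benchmark result.
citation = ScoreCitation(
    framework="AG2",
    pattern="coder-and-reviewer",
    model="MODEL_NAME",
    score_percent=42.0,
    benchmark="SWE-bench Verified",
    source="SOURCE_URL",
)
print(citation.sentence())
```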

Editor's verdict: AutoGen and AG2 are the canonical multi-agent frameworks. Use the multi-agent patterns when the work decomposes naturally into roles; skip them when single-agent suffices. The original paper and subsequent work show multi-agent helps on tasks that benefit from verification, decomposition, or specialisation, and adds only cost elsewhere; measure the lift on your own task rather than trusting a headline number.
Reader Questions
Q.01 What is AutoGen?
AutoGen is the multi-agent conversation framework released by Microsoft Research in late 2023. The defining abstraction is conversational agents: each agent has a role (planner, coder, reviewer, user-proxy) and a system prompt that defines its responsibilities. Agents talk to each other through structured messages, and the framework orchestrates the conversation flow with optional human-in-the-loop steps. AutoGen's design emphasises multi-agent collaboration as a first-class pattern; many production agents that use the framework run two-to-five-agent teams rather than single agents.
Q.02 What is AG2 and how does it relate to AutoGen?
AG2 (formerly AutoGen) is the community-driven fork of the original Microsoft AutoGen project, hosted at ag2.ai. The original AutoGen team transitioned the project's open-source stewardship to a community foundation in early 2024; AG2 is the actively developed fork that most contributors and users now follow. Microsoft continues to maintain its own AutoGen distribution under the original name. The two distributions are largely compatible at the API level but have diverged in some advanced features. Most published benchmarks since mid-2024 use AG2.
Q.03 What benchmarks does AutoGen target?
AutoGen does not have a single official benchmark target. The original Microsoft Research paper at arXiv:2308.08155 reported multi-agent improvements on math (MATH dataset), coding (HumanEval), and reasoning (Q&A datasets), with multi-agent configurations adding 5-15 points over single-agent baselines on the same models. Community benchmarks since then have focused on SWE-bench Verified, GAIA, and various RAG datasets. AG2 publishes reference benchmarks on its documentation site for new releases.
Q.04 Where does AutoGen sit on SWE-bench Verified?
We do not quote a headline range, because an AutoGen SWE-bench score is a property of the configuration (underlying model, multi-agent pattern, tool inventory) rather than of the framework. Qualitatively, AutoGen is in the same broad tier as LangGraph and CrewAI for SWE-bench coding work; the framework choice matters less than the underlying model and the agent topology, and the purpose-built proprietary scaffolds lead the leaderboard. For current figures, the framework-annotated SWE-bench leaderboard at swebench.com shows the model and configuration behind each AutoGen-using submission.
Q.05 Do multi-agent patterns help on benchmarks?
Sometimes. The original AutoGen paper showed that multi-agent configurations (e.g. coder + reviewer + tester) improve scores by 5-15 points over single-agent baselines on math and coding tasks. The improvement is real but not universal. On simple tasks (where the bottleneck is one model's reasoning), multi-agent adds latency and cost without lifting the score. On complex tasks (where the bottleneck is verification, decomposition, or specialised expertise), multi-agent adds meaningful capability. The pattern is most useful when the work naturally decomposes into clear sub-roles.
Q.06 When should I use AutoGen vs LangGraph or CrewAI?
Use AutoGen / AG2 when your workflow is naturally conversational and benefits from multi-agent patterns: agents arguing for and against a position, a planner-and-executor pair, a code-generator-plus-code-reviewer team. Use LangGraph when your workflow is naturally graph-shaped with explicit state and branching. Use CrewAI when your workflow is naturally crew-shaped with specialised roles working through a sequence of tasks. The three frameworks cover overlapping but distinct design spaces; the right choice depends on how you naturally model the work.

Sources

  [1] Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
  [2] AG2 (community fork) project site. ag2.ai.
  [3] AG2 repository. github.com/ag2ai/ag2.
  [4] Original Microsoft AutoGen repository. github.com/microsoft/autogen.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
