AutoGen and AG2: Multi-Agent Conversations as a First-Class Pattern
The framework that made multi-agent collaboration a first-class production pattern. Roles, system prompts, structured conversations between agents. Where the pattern adds 5-15 points to benchmark scores, where it adds latency without value, and where the AG2 community fork stands relative to the original Microsoft Research project.
What AutoGen is
AutoGen, introduced by Wu et al. at Microsoft Research in 2023, is a multi-agent conversation framework. The defining abstraction is the conversational agent: each agent has a role (planner, coder, reviewer, user-proxy), a system prompt that defines its responsibilities, and an inventory of tools it can call. Agents communicate through structured messages, and the framework orchestrates the conversation flow with optional human-in-the-loop steps.
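The abstraction is small enough to show directly. A minimal sketch of a two-agent conversation using the classic AutoGen / AG2 Python API; the model name, API key, and task are placeholders, and parameter details may vary across versions:

```python
# Minimal two-agent conversation: an assistant with a role-defining system
# prompt, and a user proxy that can execute the code the assistant writes.
from autogen import AssistantAgent, UserProxyAgent

# Placeholder model and key; any supported LLM config works here.
llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

coder = AssistantAgent(
    name="coder",
    system_message="You are a careful Python coder. Reply TERMINATE when done.",
    llm_config=llm_config,
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",        # fully automated; "ALWAYS" adds a human step
    max_consecutive_auto_reply=5,
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or ""),
    code_execution_config={"work_dir": "scratch", "use_docker": False},
)

# The framework orchestrates the structured message exchange from here.
user_proxy.initiate_chat(coder, message="Write and test a function that reverses a string.")
```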
The framework's design philosophy is that many real-world workflows are easier to model as conversations between specialised roles than as a single monolithic agent. A code-generation task can be expressed as a coder agent and a reviewer agent exchanging proposals. A research task can be expressed as a planner agent decomposing the question and an executor agent answering each sub-query. A complex enterprise workflow can be expressed as a crew of specialists each handling their domain.
AutoGen has gone through a notable governance shift. The original Microsoft Research project transitioned much of its open-source community work to AG2 (formerly AutoGen), a community-stewarded fork at ag2.ai. Microsoft continues to maintain its own distribution under the original AutoGen name. The two distributions are largely API-compatible but have diverged in some advanced features (typed message contracts, observability hooks, deployment patterns). Most published benchmarks since late 2024 use AG2; for new projects starting in 2026, AG2 is the more actively maintained option.
Multi-agent patterns
AutoGen and AG2 support several multi-agent patterns, which differ in how agents are organised, how they take turns, and how the framework decides when a conversation is complete. Choosing the right pattern is the single biggest determinant of whether multi-agent adds value or only adds cost.
Five patterns account for the majority of production AutoGen deployments: user-proxy-and-assistant, coder-and-reviewer, planner-and-executor, multi-agent debate, and specialised crew. The right pattern depends on the work: code generation suits coder-and-reviewer; research suits planner-and-executor; ambiguous problems suit multi-agent debate; cross-functional tasks suit a specialised crew; the rest typically default to user-proxy-and-assistant. Choosing badly adds latency and cost without lifting the result; choosing well can add 5-15 points on benchmarks that genuinely benefit from the additional structure.
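As a concrete illustration of the specialised-crew pattern, a sketch using GroupChat and GroupChatManager; the agent names, prompts, and task are illustrative, not a recommended production configuration:

```python
# Specialised-crew sketch: a planner, a coder, and a reviewer take turns
# under a manager agent. Names and system prompts are illustrative.
from autogen import AssistantAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

planner = AssistantAgent("planner", system_message="Decompose the task into steps.", llm_config=llm_config)
coder = AssistantAgent("coder", system_message="Implement the current step in Python.", llm_config=llm_config)
reviewer = AssistantAgent("reviewer", system_message="Review the code for bugs and regressions.", llm_config=llm_config)

groupchat = GroupChat(
    agents=[planner, coder, reviewer],
    messages=[],
    max_round=12,                      # hard cap on conversation length
    speaker_selection_method="auto",   # the manager's LLM picks the next speaker
)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

planner.initiate_chat(manager, message="Add retry logic to the HTTP client module.")
```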
Benchmark coverage
AutoGen does not have a single official benchmark headline. The original paper reported per-task improvements on math, coding, and reasoning; community submissions since then cover SWE-bench Verified, GAIA, RAG benchmarks, and various other agent evaluations. The picture below is assembled from those sources, not from a single official leaderboard.
Multi-agent on SWE-bench Verified
SWE-bench Verified is the most-watched coding-agent benchmark, and AutoGen submissions cluster in the 30-45 percent range depending on configuration. The strongest published AutoGen SWE-bench configurations use a coder-plus-reviewer pattern with frontier models: the coder proposes a patch, the reviewer (often using the same underlying model with a different system prompt) evaluates the patch and either approves or requests specific changes. This pattern adds roughly 5-10 points over a single-agent baseline on the same model.
The reviewer-pattern improvement reflects something genuine about how SWE-bench tasks fail. The single most common failure mode in single-agent submissions is patch authoring that introduces test regressions: the patch fixes the failing test but breaks other tests in the suite. A reviewer agent with explicit instructions to check for regressions catches a meaningful fraction of these before submission. The improvement is largest on tasks with tight test suites; on tasks with loose tests, the reviewer pattern adds latency without much benefit.
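A sketch of the coder-plus-reviewer pattern described above: the same underlying model plays both roles with different system prompts, and the reviewer is explicitly instructed to check for regressions. The prompts and the APPROVE termination convention are illustrative, not a published SWE-bench configuration:

```python
# Coder-plus-reviewer sketch. The reviewer hands the task to the coder;
# the two exchange proposals until the reviewer approves or the turn
# cap is hit. Illustrates the pattern only.
from autogen import AssistantAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

coder = AssistantAgent(
    name="coder",
    system_message="Propose a minimal patch that makes the failing test pass.",
    llm_config=llm_config,
    # Stop once the reviewer approves the patch.
    is_termination_msg=lambda m: "APPROVE" in (m.get("content") or ""),
)
reviewer = AssistantAgent(
    name="reviewer",
    system_message=(
        "Review the proposed patch. Check specifically whether it could break "
        "other tests in the suite. Reply APPROVE, or list the required changes."
    ),
    llm_config=llm_config,
)

reviewer.initiate_chat(
    coder,
    message="Failing test and repository context: ...",
    max_turns=6,  # round cap; supported in recent pyautogen / ag2 releases
)
```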
AutoGen SWE-bench scores trail the leaderboard frontier (low-to-mid 70s) by roughly 25-45 points, depending on configuration. The gap reflects the same pattern seen across open frameworks: proprietary scaffolds from Anthropic and OpenAI win on absolute score; open frameworks win on production-deployable structure. AutoGen with strong multi-agent patterns is competitive with LangGraph and CrewAI in this tier.
Multi-agent on MATH
The original AutoGen paper documented its largest single-task improvement on MATH (the math-competition dataset). Multi-agent debate configurations (where two or more agents argue for different solutions and a judge picks the winner) improved over single-agent baselines by 5-12 points on hard math problems. The improvement was largest on problems where the model's first-attempt reasoning was wrong but at least one of several alternative reasoning paths was correct; the debate format allowed the framework to surface the correct path.
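A minimal debate sketch in the same API: two solver agents and a judge in a round-robin group chat. This illustrates the shape of the pattern, not the paper's exact configuration; prompts and the example problem are placeholders:

```python
# Multi-agent debate sketch: two solvers argue for their solutions and a
# judge states the final answer. Illustrative only.
from autogen import AssistantAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_KEY"}]}

solver_a = AssistantAgent("solver_a", system_message="Solve the problem step by step.", llm_config=llm_config)
solver_b = AssistantAgent("solver_b", system_message="Solve independently; challenge solver_a if you disagree.", llm_config=llm_config)
judge = AssistantAgent("judge", system_message="Compare the arguments and state the final answer.", llm_config=llm_config)

debate = GroupChat(
    agents=[solver_a, solver_b, judge],
    messages=[],
    max_round=7,
    speaker_selection_method="round_robin",  # fixed turn order: A, B, judge, ...
)
manager = GroupChatManager(groupchat=debate, llm_config=llm_config)

solver_a.initiate_chat(manager, message="If x + y = 10 and xy = 21, find x^2 + y^2.")
```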
The MATH improvement was one of the original motivations for multi-agent frameworks generally; subsequent papers have reproduced similar improvements on related math and reasoning tasks. The improvement is smaller on saturated benchmarks where the underlying model is already near-perfect (HumanEval): the model gets it right first time, and the additional rounds add cost without value. The pattern generalises: multi-agent helps where the underlying model is competent but error-prone, and helps less where the model is either uniformly correct or uniformly wrong.
When to use AutoGen and when not
Use AutoGen / AG2 when your workflow naturally decomposes into roles that can communicate. The clearest fit is multi-agent debate, planner-and-executor, and coder-and-reviewer patterns. The framework adds the most value when the underlying model is competent but error-prone in ways that a different perspective can catch (verification, decomposition, specialised expertise).
Avoid AutoGen when (a) the workflow is a simple single-loop pattern, (b) latency is critical and multi-agent adds round-trips, (c) the team is starting from scratch and wants the smallest viable framework, or (d) the underlying model is uniformly strong on the task and the multi-agent overhead does not improve the result. We have seen production teams switch from AutoGen to a single-agent framework after determining that the multi-agent overhead was not paying for itself.
The framework competes with LangGraph and CrewAI in roughly the same problem space; choose based on how you naturally model the workflow (conversation, graph, or crew). The OpenAI Agents SDK covers similar ground for OpenAI-first stacks.
How to read an AutoGen benchmark number
Three checks before quoting an AutoGen benchmark score. First, is it the AG2 fork or the original Microsoft AutoGen? Most current submissions are AG2; the two are mostly compatible but have diverged in some advanced features. Second, what multi-agent pattern is configured? A single-agent AutoGen score is essentially a baseline-model score; a multi-agent score shows the framework's genuine contribution. Third, what underlying model? AutoGen + Claude 4 and AutoGen + Llama-3-70B are different starting points.
The honest pattern when citing AutoGen performance is: "AutoGen [or AG2] in [pattern] configuration with [model] reaches X percent on [benchmark], according to [source]". Vague claims hide the configuration that does most of the work. The framework is one ingredient in the score, not the whole dish.
Sources
- [1] Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
- [2] AG2 (community fork) project site. ag2.ai. Accessed May 2026.
- [3] AG2 repository. github.com/ag2ai/ag2.
- [4] Original Microsoft AutoGen repository. github.com/microsoft/autogen.