Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract
What: Programmatic prompt-optimisation framework. Write pipelines as typed programs, then compile them to optimised prompts.
Who: Khattab et al., Stanford NLP, 2023 (arXiv:2310.03714)
Bench Tier: Reports improvement over hand-prompted baselines; lift is task- and configuration-specific, not a fixed headline number.
Repository: github.com/stanfordnlp/dspy
Section II.xi · Agent Frameworks | Reviewed 2026

DSPy: Programmatic Prompt Optimisation, Not Just Orchestration

The framework that treats prompt engineering as a compilation step rather than a hand-tuning step. Stanford-led, optimisation-first design, real lifts on multi-hop QA and structured reasoning. A different category from LangGraph and AutoGen: not orchestration, but programmatic prompt and pipeline optimisation that you use alongside an orchestration framework or on its own.

01

What DSPy is

DSPy, introduced by Khattab et al. at Stanford NLP in late 2023, is a framework for programming language model pipelines as typed Python programs. The defining abstraction is the signature: a declaration of input fields, output fields, and an optional task description. A DSPy program composes signatures into a pipeline; the framework then compiles the pipeline by generating and optimising prompts using one of several optimiser algorithms.
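To make the signature idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the dspy package: `ToySignature` and `render_prompt` are hypothetical stand-ins, not DSPy's API. They only illustrate the shape of the abstraction, which is named input fields, named output fields, and an optional task description from which a prompt can be generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySignature:
    """Hypothetical stand-in for a signature: what goes in, what comes out."""
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    instructions: str = ""  # optional task description


def render_prompt(sig: ToySignature, values: dict[str, str]) -> str:
    """Turn a signature plus concrete input values into a prompt string."""
    missing = [name for name in sig.inputs if name not in values]
    if missing:
        raise ValueError(f"missing input fields: {missing}")
    lines = []
    if sig.instructions:
        lines.append(sig.instructions)
    for name in sig.inputs:
        lines.append(f"{name}: {values[name]}")
    # Output fields are left blank for the model to fill in.
    for name in sig.outputs:
        lines.append(f"{name}:")
    return "\n".join(lines)


qa = ToySignature(
    inputs=("context", "question"),
    outputs=("answer",),
    instructions="Answer the question using only the context.",
)
prompt = render_prompt(
    qa, {"context": "Paris is in France.", "question": "Where is Paris?"}
)
print(prompt)
```

The point of the design is that the program author commits only to the field names and the task description. The wording of the final prompt is generated, which is what leaves room for an optimiser to change it later.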

The key conceptual shift is treating prompt engineering as compilation. Other frameworks ask you to write the prompt by hand, sometimes with a few-shot template, and then call the language model. DSPy asks you to write the program structure, then runs an optimisation step that produces the prompts automatically using your training data and chosen optimiser. The result is prompts you would not have written by hand, often substantially better than hand-tuned baselines.

DSPy is a different category from agent-orchestration frameworks. LangGraph, AutoGen, CrewAI, and the OpenAI Agents SDK are about defining agents, tools, and execution loops; the prompts inside each step are written by you. DSPy is about generating those prompts programmatically. The two categories compose: production teams increasingly use LangGraph or AutoGen for orchestration and DSPy for prompt optimisation inside each reasoning step.

02

The optimisation surface

DSPy ships with several optimisers, each suited to different program shapes and data budgets. Choosing the right optimiser is the most important decision when starting with the framework; the wrong choice can compile slowly without much improvement, while the right choice can lift scores significantly.

| Optimiser | When it fits |
| --- | --- |
| BootstrapFewShot | Generates few-shot examples by running the program on training data and bootstrapping correct trajectories. The simplest and most-used DSPy optimiser; a good baseline for any new program. |
| BootstrapFewShotWithRandomSearch | Adds random search over few-shot example selection. Modest improvement over plain BootstrapFewShot at moderate compilation cost. |
| MIPROv2 | Jointly optimises instructions and few-shot demonstrations, using LLM-generated instruction candidates and Bayesian optimisation to search over combinations. The strongest general-purpose DSPy optimiser in 2026; recommended for serious applications. |
| COPRO | Generates and iteratively refines the instruction for each step, scoring candidates against the metric. Useful when the task benefits from instruction tuning rather than few-shot example tuning. |
| BootstrapFinetune | Bootstraps trajectories, then fine-tunes the model on them (rather than only tuning prompts). Most powerful, but requires a tunable model and a meaningful compute budget. |

MIPROv2 is the recommended default for serious applications in 2026. BootstrapFewShot is the right starting point when you are evaluating the framework or want a quick baseline. BootstrapFinetune is the most powerful but requires a fine-tunable model and substantial compute, which most teams will not invest in. The choice depends on your data budget and your tolerance for compilation latency.
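The bootstrapping idea behind the BootstrapFewShot family can be sketched in a few lines of plain Python. This illustrates the principle only (run the program on training data, keep the trajectories the metric accepts, reuse them as demonstrations) and is not DSPy's implementation. `toy_program`, `exact_match`, and `bootstrap_demos` are hypothetical names, and the "program" is a deterministic stub rather than a language model call.

```python
from typing import Callable

Example = dict[str, str]


def toy_program(question: str) -> str:
    """Stub standing in for an LM-backed program; wrong on one input on purpose."""
    canned = {"2+2": "4", "3+5": "8", "7-4": "2"}  # "7-4" is deliberately wrong
    return canned.get(question, "")


def exact_match(example: Example, prediction: str) -> bool:
    """Metric: accept a trajectory only if the prediction equals the label."""
    return prediction.strip() == example["answer"].strip()


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[Example],
    metric: Callable[[Example, str], bool],
    max_demos: int,
) -> list[Example]:
    """Run the program on training data and keep only accepted trajectories."""
    demos: list[Example] = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) >= max_demos:
            break
    return demos


trainset = [
    {"question": "7-4", "answer": "3"},
    {"question": "2+2", "answer": "4"},
    {"question": "3+5", "answer": "8"},
]
demos = bootstrap_demos(toy_program, trainset, exact_match, max_demos=2)
print(demos)
```

The failed trajectory for "7-4" never becomes a demonstration, which is why the quality of the metric matters so much: it is the only filter between the program's raw outputs and the examples that end up in the prompt.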

03

Benchmark coverage

DSPy results are reported as improvement over a hand-prompted baseline rather than as absolute state-of-the-art, because the framework's contribution is the optimisation step on top of whatever prompt you would otherwise write. We deliberately do not reprint a per-benchmark lift table here. The lift on any given benchmark is a property of a specific configuration (which optimiser, which underlying model, how many training examples, what baseline you are measuring against), not of the framework in the abstract. A bare grid of percentages strips that context away and goes stale as soon as the model changes. For the original figures, read the DSPy paper and the MIPROv2 paper directly, and measure the lift on your own task before relying on it. To pick a benchmark to start from, use the task picker on the homepage.

04

Where DSPy wins

DSPy adds the most value on three task types. First, multi-hop QA and RAG: HotPotQA and MultiHopRAG have prompt shapes that are non-obvious to hand-tune (how to phrase the sub-query, how to format the retrieved context, how to chain the inference), and DSPy's optimisation often finds prompts that beat hand-tuned baselines.

Second, structured reasoning with verifiable outputs. This covers math benchmarks like GSM8K, code benchmarks where output correctness can be checked programmatically, and any task where the optimiser has an automated success signal to compile against. The verifiability matters because DSPy's optimisers need a reliable evaluation function; tasks where success is judged by a noisy LLM or by humans are harder to compile against.
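As an illustration of what an automated success signal can look like for a GSM8K-style task, the sketch below compares the final number in a model's answer with the gold label. It is plain Python with hypothetical helper names (`last_number`, `numeric_match`, `accuracy`), the number parsing is deliberately naive, and the example strings are invented for the demonstration.

```python
import re
from typing import Optional


def last_number(text: str) -> Optional[float]:
    """Return the last number in the text, ignoring thousands separators."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def numeric_match(gold: str, predicted: str, tol: float = 1e-6) -> bool:
    """True when the final number in the prediction equals the gold answer."""
    gold_value = last_number(gold)
    predicted_value = last_number(predicted)
    if gold_value is None or predicted_value is None:
        return False
    return abs(gold_value - predicted_value) <= tol


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (gold, predicted) pairs the metric accepts."""
    return sum(numeric_match(g, p) for g, p in pairs) / len(pairs)


# Invented (gold, predicted) pairs: two correct, one with no number, one wrong.
pairs = [
    ("18", "She sells 16 - 3 - 4 = 9 eggs, earning 9 * 2 = 18 dollars."),
    ("1,250", "The total is 1250."),
    ("7", "I am not sure."),
    ("3", "The answer is 4."),
]
score = accuracy(pairs)
print(score)
```

A metric like this is cheap, deterministic, and hard to game, which is the property an optimiser needs when it scores many candidate prompts against the same training set.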

Third, pipelines with multiple LLM calls. A single-call task often benefits little from DSPy because there is one prompt to optimise and the gain over careful hand-tuning is marginal. A pipeline with three or four LLM calls (decompose, retrieve, reason, synthesise) benefits substantially because joint optimisation across calls captures interactions between steps that are impractical to tune by hand.
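The interaction effect can be shown with a toy two-step pipeline. The score table below is made up purely for illustration and is not a measurement of any model or benchmark, and `tune_one_step_at_a_time` and `tune_jointly` are hypothetical names. The exhaustive joint search is only feasible here because the toy space has four combinations; real optimisers use smarter search strategies.

```python
from itertools import product

RETRIEVE = ["terse query", "detailed query"]
ANSWER = ["answer briefly", "cite sources"]

# Made-up end-to-end scores for each (retrieve, answer) instruction pair.
# "detailed query" only pays off when the answer step also cites sources.
SCORES = {
    ("terse query", "answer briefly"): 0.60,
    ("terse query", "cite sources"): 0.55,
    ("detailed query", "answer briefly"): 0.50,
    ("detailed query", "cite sources"): 0.80,
}


def tune_one_step_at_a_time(default: tuple[str, str]) -> tuple[str, str]:
    """Tune each step separately while the other stays at its default."""
    default_retrieve, default_answer = default
    best_retrieve = max(RETRIEVE, key=lambda r: SCORES[(r, default_answer)])
    best_answer = max(ANSWER, key=lambda a: SCORES[(default_retrieve, a)])
    return best_retrieve, best_answer


def tune_jointly() -> tuple[str, str]:
    """Score every combination end to end and keep the best pair."""
    return max(product(RETRIEVE, ANSWER), key=lambda pair: SCORES[pair])


separate = tune_one_step_at_a_time(("terse query", "answer briefly"))
joint = tune_jointly()
print(separate, SCORES[separate])
print(joint, SCORES[joint])
```

Tuning each step in isolation never leaves the default pair, because changing either step alone makes the end-to-end score worse. Only scoring combinations end to end finds the better pair.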

05

Where DSPy adds less

DSPy adds less value on three task types. First, single-call tasks where the prompt is well-understood. A simple summarisation or classification task with a hand-tuned prompt is often near-saturation; the framework's lift is small and the operational cost (training data, compilation latency, deployment complexity) is not justified.

Second, tasks without good evaluation functions, such as open-ended creative writing, free-form Q&A judged by humans, or anything graded by an unreliable LLM-as-judge. DSPy's optimisation depends on a stable evaluation signal; without one, the framework can compile to prompts that maximise the noisy signal without genuinely improving the task.
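A small simulation shows why a noisy evaluation signal is dangerous to optimise against. In this sketch every candidate prompt has the same true quality by construction, and a hypothetical `noisy_judge` adds Gaussian noise to each score. The noise level, the candidate count, and the seed are arbitrary assumptions chosen for the demonstration.

```python
import random


def noisy_judge(true_quality: float, rng: random.Random, noise: float) -> float:
    """Simulated unreliable grader: true quality plus Gaussian noise."""
    return true_quality + rng.gauss(0.0, noise)


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_QUALITY = 0.5  # every candidate prompt is equally good by construction
candidates = [f"prompt-{i}" for i in range(50)]

# Score each candidate once with the noisy judge, then pick the apparent best.
observed = {
    name: noisy_judge(TRUE_QUALITY, rng, noise=0.1) for name in candidates
}
winner = max(observed, key=observed.get)
apparent_lift = observed[winner] - TRUE_QUALITY
print(winner, round(apparent_lift, 3))
```

The selected prompt appears to beat the others even though none of them is actually better. Picking the maximum over many noisy scores selects for the noise, which is the failure mode described above.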

Third, agent workflows where the bottleneck is orchestration rather than prompt design. If the work decomposes into clear roles (use CrewAI), needs explicit state and branching (use LangGraph), or benefits from multi-agent debate (use AutoGen), the orchestration framework solves the bigger problem and DSPy adds marginal benefit on top.

06

Production patterns and operational considerations

Three patterns recur in production DSPy deployments. First, compile-once-deploy-many: run the optimisation step in CI when the model or training data changes, then deploy the compiled prompts as static templates. This avoids paying the optimisation cost in the request path. Second, compile-on-update: when the underlying model version changes, re-compile the program against the new model because optimised prompts are model-specific. Third, hybrid hand-and-compile: hand-tune the high-level program structure and let DSPy optimise the prompts inside each step.
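The first two patterns can be sketched together: store the compiled prompts as a static artefact tagged with the model they were compiled for, and refuse to serve them against a different model. This is a hand-rolled JSON file in plain Python so that it runs without the dspy package. The function names, file name, model labels, and prompt text are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_compiled(path: Path, model: str, prompts: dict[str, str]) -> None:
    """CI step: persist optimised prompts with the model they were compiled for."""
    path.write_text(json.dumps({"model": model, "prompts": prompts}, indent=2))


def load_compiled(path: Path, serving_model: str) -> dict[str, str]:
    """Serving step: load the static prompts, refusing a model mismatch."""
    artefact = json.loads(path.read_text())
    if artefact["model"] != serving_model:
        raise RuntimeError(
            f"prompts compiled for {artefact['model']!r}, "
            f"serving {serving_model!r}: re-compile first"
        )
    return artefact["prompts"]


with tempfile.TemporaryDirectory() as tmp:
    artefact_path = Path(tmp) / "compiled_prompts.json"
    # Compile once (in CI), outside the request path.
    save_compiled(
        artefact_path, "model-v1", {"answer": "Answer using only the context."}
    )
    # Deploy many: every server just reads the static file.
    prompts = load_compiled(artefact_path, "model-v1")
    # Compile on update: a new model version must trigger re-compilation.
    try:
        load_compiled(artefact_path, "model-v2")
        needs_recompile = False
    except RuntimeError:
        needs_recompile = True

print(prompts, needs_recompile)
```

Failing loudly on a version mismatch is one possible design choice. A team could instead log a warning and fall back to a hand-written prompt while re-compilation runs in CI.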

Operational considerations: the framework requires labelled training data (typically a few hundred to a few thousand examples); compilation can take minutes to hours depending on the optimiser and program size; compiled prompts can be brittle when the model is updated, requiring re-compilation; the framework's mental model takes time to learn (the typed-signature abstraction is unusual). For teams that have the data, time, and patience to learn the abstractions, DSPy's lift is genuine and worth the investment.

07

When to use DSPy in 2026

Use DSPy when (a) the task has structured outputs amenable to programmatic verification, (b) you have a few hundred or more labelled training examples, (c) the prompt design is non-obvious enough that hand-tuning would be a substantial effort, and (d) you are comfortable with the framework's mental model. The clearest wins are on multi-hop QA, RAG pipelines, and structured-reasoning tasks where the prompt shape is genuinely hard to design.

Skip DSPy when the task is simple enough to hand-tune to near-saturation, when you lack labelled training data, when the evaluation signal is too noisy to optimise against, or when the bottleneck is orchestration rather than prompt design. The framework is the right tool for a specific class of problem; it is not a general agent-orchestration framework and should not be evaluated against LangGraph or AutoGen on the same axis.

Editor's verdict: DSPy is a different category from agent-orchestration frameworks: it compiles prompts programmatically rather than asking you to write them by hand. The lift is genuine on multi-hop QA, structured reasoning, and complex RAG, but it is configuration-specific rather than a fixed headline number, so measure it on your own task. Use alongside LangGraph or AutoGen, not as a replacement.
Reader Questions
Q.01 What is DSPy?
DSPy is the Stanford-led framework for programming language model pipelines, introduced by Khattab et al. in 2023 (arXiv:2310.03714) and steadily expanded since. The defining idea is to write LM pipelines as Python programs of typed signatures (input fields, output fields) rather than as prompts, then compile the program into optimised prompts using training data and explicit optimisation algorithms. The framework treats prompt engineering as a compilation step, not as a hand-tuning step.
Q.02 How is DSPy different from other agent frameworks?
DSPy is fundamentally a different category. LangGraph, AutoGen, CrewAI, and the OpenAI Agents SDK are agent-orchestration frameworks: they provide primitives for defining agents, tools, and execution loops, but the prompts inside each agent step are written by hand. DSPy is a programmatic-prompt-optimisation framework: it generates and optimises the prompts programmatically using algorithms like BootstrapFewShot, MIPROv2, or COPRO. You can use DSPy alongside an orchestration framework, but DSPy's distinct value is the optimisation, not the orchestration.
Q.03 What benchmarks does DSPy publish?
The original DSPy paper (Khattab et al., 2023) reported two case studies, GSM8K (math word problems) and HotPotQA (multi-hop QA), framing results as improvement over a hand-prompted baseline rather than absolute state-of-the-art, because the framework's contribution is the optimisation step on top of whatever prompt you would otherwise write. For exact figures, read the paper and the optimiser papers directly; we do not reprint a per-benchmark lift table here because the numbers are configuration-specific (optimiser, model, training-set size) and a bare table of percentages strips away the harness that makes them meaningful.
Q.04 How much does DSPy improve scores in practice?
It depends heavily on the task and the baseline, and no single number generalises. The framework adds the most value when (a) the task has structured outputs amenable to programmatic verification, (b) you have a few hundred to a few thousand training examples to optimise against, and (c) the baseline prompt is moderately sophisticated but not exhaustively tuned. On tasks where you can hand-tune a prompt to near-saturation, the framework's lift is small; on tasks where prompt design is non-obvious, the lift is more meaningful. Measure it on your own task rather than trusting a headline percentage.
Q.05 Is DSPy production-ready in 2026?
Yes, with caveats. The framework has matured significantly since its 2023 introduction, has clean documentation, and is used in production by several teams. The main caveats: (a) the framework's value depends on having labelled training data, which adds operational complexity; (b) optimised prompts can be brittle when the underlying model is updated, requiring re-compilation; (c) the framework's abstractions take some time to learn and the mental model is genuinely different from agent-orchestration frameworks. For teams that have the data and the time, DSPy's optimisation lift is worth the investment.
Q.06 Can I combine DSPy with LangGraph or AutoGen?
Yes. The combination is increasingly common in production. DSPy compiles individual reasoning steps (e.g. 'given a research question, generate a sub-query plan'); LangGraph or AutoGen orchestrates the broader workflow that calls these compiled steps. The result is an agent that uses the orchestration framework's structural strengths (state, conversation, crew) and DSPy's optimisation strengths (programmatically tuned prompts inside each step). This pattern is documented in the DSPy + LangChain integration examples.

Sources

  [1] Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714.
  [2] DSPy project site. dspy.ai.
  [3] DSPy repository. github.com/stanfordnlp/dspy.
  [4] Opsahl-Ong, K. et al. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. arXiv:2406.11695. Introduces MIPRO, the optimiser that MIPROv2 builds on.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.
