Independent reference. No vendor affiliation. Scores cited with source and capture date. Affiliate disclosure.
Abstract

What: Programmatic prompt-optimisation framework; write pipelines as typed programs, compile to optimised prompts.
Who: Khattab et al., Stanford NLP, 2023 (arXiv:2310.03714).
Bench tier: Lift of +5-25 points over hand-prompted baselines on multi-hop RAG and structured reasoning.
Repository: github.com/stanfordnlp/dspy

Section II.xi · Agent Frameworks · Last verified April 2026

DSPy: Programmatic Prompt Optimisation, Not Just Orchestration

The framework that treats prompt engineering as a compilation step rather than a hand-tuning step. Stanford-led, optimisation-first design, real lifts on multi-hop QA and structured reasoning. A different category from LangGraph and AutoGen: not orchestration, but programmatic prompt and pipeline optimisation that you use alongside an orchestration framework or on its own.

01

What DSPy is

DSPy, introduced by Khattab et al. at Stanford NLP in late 2023, is a framework for programming language model pipelines as typed Python programs. The defining abstraction is the signature: a declaration of input fields, output fields, and an optional task description. A DSPy program composes signatures into a pipeline; the framework then compiles the pipeline by generating and optimising prompts using one of several optimiser algorithms.
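The signature idea can be sketched in plain Python without the library. The names below (`Signature`, `render_prompt`, the `qa` example) are invented for this illustration, not DSPy's actual API; in real DSPy you would declare a `dspy.Signature` and the framework would own, and later optimise, the rendering step shown here as a fixed template.

```python
from dataclasses import dataclass

@dataclass
class Signature:
    """Toy stand-in for a DSPy-style signature: declared inputs,
    declared outputs, and an optional task description."""
    inputs: list[str]
    outputs: list[str]
    description: str = ""

def render_prompt(sig: Signature, values: dict[str, str]) -> str:
    """Turn a signature plus concrete input values into a prompt.
    In DSPy this rendering is handled (and optimised) by the
    framework; here it is a fixed template for illustration."""
    lines = []
    if sig.description:
        lines.append(sig.description)
    for name in sig.inputs:
        lines.append(f"{name}: {values[name]}")
    for name in sig.outputs:
        lines.append(f"{name}:")  # the model is asked to fill this in
    return "\n".join(lines)

qa = Signature(inputs=["question"], outputs=["answer"],
               description="Answer the question concisely.")
print(render_prompt(qa, {"question": "Who wrote DSPy?"}))
```

The point of the abstraction is that the program declares *what* flows in and out of each step; how that declaration becomes prompt text is left to the compiler.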

The key conceptual shift is treating prompt engineering as compilation. Other frameworks ask you to write the prompt by hand, sometimes with a few-shot template, and then call the language model. DSPy asks you to write the program structure, then runs an optimisation step that produces the prompts automatically using your training data and chosen optimiser. The result is prompts you would not have written by hand, often substantially better than hand-tuned baselines.

DSPy is a different category from agent-orchestration frameworks. LangGraph, AutoGen, CrewAI, and the OpenAI Agents SDK are about defining agents, tools, and execution loops; the prompts inside each step are written by you. DSPy is about generating those prompts programmatically. The two frameworks compose: production teams increasingly use LangGraph or AutoGen for orchestration and DSPy for prompt optimisation inside each reasoning step.

02

The optimisation surface

DSPy ships with several optimisers, each suited to different program shapes and data budgets. Choosing the right optimiser is the most important decision when starting with the framework; the wrong choice can compile slowly without much improvement, while the right choice can lift scores significantly.

Optimiser: when it fits

BootstrapFewShot: Generates few-shot examples by running the program on training data and bootstrapping correct trajectories. The simplest and most-used DSPy optimiser; a good baseline for any new program.

BootstrapFewShotWithRandomSearch: Adds random search over few-shot example selection. A modest improvement over plain BootstrapFewShot at moderate compilation cost.

MIPROv2: Multi-step prompt optimisation using LLM-generated prompt candidates and Bayesian optimisation. The strongest general-purpose DSPy optimiser in 2026; recommended for serious applications.

COPRO: Coordinated prompt optimisation that iteratively refines instructions. Useful when the task benefits from instruction tuning rather than few-shot example tuning.

BootstrapFinetune: Bootstraps trajectories, then fine-tunes the model on them rather than only tuning prompts. The most powerful option, but it requires a tunable model and a meaningful compute budget.
MIPROv2 is the recommended default for serious applications in 2026. BootstrapFewShot is the right starting point when you are evaluating the framework or want a quick baseline. BootstrapFinetune is the most powerful but requires a fine-tunable model and substantial compute, which most teams will not invest in. The choice depends on your data budget and your tolerance for compilation latency.
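The core loop behind bootstrapped few-shot optimisation is simple enough to sketch in plain Python. Everything below is a stand-in for illustration, not DSPy's actual interfaces: `program` is any callable mapping input to prediction, `metric` is a deterministic pass/fail check, and the toy "program" is deliberately wrong on some inputs so the filter has something to do.

```python
import random

def bootstrap_fewshot(program, trainset, metric, max_demos=4, seed=0):
    """Toy sketch of the BootstrapFewShot idea: run the un-optimised
    program over training examples, keep only the trajectories the
    metric accepts, and use them as few-shot demonstrations."""
    rng = random.Random(seed)
    demos = []
    for example in rng.sample(trainset, len(trainset)):
        prediction = program(example["input"])
        if metric(prediction, example["gold"]):
            demos.append({"input": example["input"], "output": prediction})
        if len(demos) >= max_demos:
            break
    return demos  # in DSPy these demos are woven into the compiled prompt

# Tiny usage example with a deliberately imperfect "program".
trainset = [{"input": x, "gold": x * 2} for x in range(10)]
program = lambda x: x * 2 if x % 3 else -1   # wrong on multiples of 3
metric = lambda pred, gold: pred == gold
demos = bootstrap_fewshot(program, trainset, metric)
print(len(demos))  # 4 verified demonstrations
```

The verified-trajectory filter is the whole trick: only demonstrations the metric accepts make it into the compiled prompt, so a mediocre zero-shot program can still supply good few-shot evidence.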

03

Benchmark coverage

DSPy benchmark numbers are usually reported as "X points of lift over baseline" rather than absolute SOTA, because the framework's contribution is the optimisation on top of whatever prompt you would otherwise write. The summary below covers the best-documented lifts from the original paper, follow-up work, and the community.

Benchmark: reported lift

GSM8K (mathematics): +5-15 points over baseline with DSPy compilation. Original paper benchmark; optimised chain-of-thought programs reliably improve over hand-prompted baselines.

HotPotQA (multi-hop QA): +10-20 points for optimised retrieve-then-reason programs. DSPy's strongest published wins; multi-hop RAG benefits substantially from programmatic optimisation.

MultiHopRAG: +10-25 points for compiled retriever-plus-reader pipelines. Optimisation impact is largest on multi-step RAG, where the prompt shape is non-obvious.

BIG-Bench Hard: Variable; small improvements over a chain-of-thought baseline. Mixed-task benchmark; DSPy's lift varies by sub-task and is best on structured reasoning tasks.

SWE-bench Verified: Less commonly reported; the framework is not specifically tuned for code generation. DSPy is rarely the primary framework for SWE-bench and is usually used inside other frameworks for prompt optimisation.

04

Where DSPy wins

DSPy adds the most value on three task types. First, multi-hop QA and RAG: HotPotQA and MultiHopRAG have prompt shapes that are non-obvious to hand-tune (how to phrase the sub-query, how to format the retrieved context, how to chain the inference), and DSPy's optimisation reliably finds prompts better than hand-tuned baselines. Lifts of 10-25 points are routine.

Second, structured reasoning with verifiable outputs. Math benchmarks like GSM8K, code benchmarks where output correctness can be checked programmatically, and any task where the optimiser has an automated success signal to compile against. The verifiability matters because DSPy's optimisers need a reliable evaluation function; tasks where success is judged by a noisy LLM or by humans are harder to compile against.
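What a "reliable evaluation function" looks like in practice can be made concrete. The sketch below is a typical exact-match metric for GSM8K-style numeric answers; the function names and the last-number extraction trick are this sketch's own conventions, not a DSPy built-in, but the shape (deterministic, prediction versus gold, returns pass/fail) is exactly what a DSPy optimiser compiles against.

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a model's free-text answer,
    e.g. 'The answer is 42.' -> 42.0. A common trick for
    GSM8K-style grading."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def exact_match_metric(prediction: str, gold: str) -> bool:
    """A verifiable success signal: does the predicted final number
    equal the gold final number? Deterministic, so an optimiser can
    trust every pass/fail it returns."""
    pred = extract_final_number(prediction)
    return pred is not None and pred == extract_final_number(gold)

print(exact_match_metric("So the total is 1,234 apples.", "1234"))  # True
print(exact_match_metric("Roughly forty-two.", "42"))               # False
```

Contrast this with an LLM-as-judge metric, which returns a different verdict on different runs: the optimiser then chases noise rather than signal.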

Third, pipelines with multiple LLM calls. A single-call task often benefits little from DSPy because there is one prompt to optimise and the gain over careful hand-tuning is marginal. A pipeline with three or four LLM calls (decompose, retrieve, reason, synthesise) benefits substantially because the joint optimisation across calls captures interactions that would be impossible to hand-tune.
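Why joint optimisation across calls beats per-step tuning can be shown with a toy search. The scores below are fabricated purely to illustrate an interaction effect between two steps (they are not measured results), and the two functions are this sketch's own, not DSPy's API; the point is that tuning each step in isolation can miss the best combination.

```python
from itertools import product

# Toy "pipeline score" where the two steps interact: variant A of the
# decompose prompt only works well with variant Y of the reason prompt.
SCORES = {("A", "X"): 0.55, ("A", "Y"): 0.80,
          ("B", "X"): 0.70, ("B", "Y"): 0.60}

def best_independent():
    """Tune each step alone while holding the other at its first
    variant, the way per-step hand-tuning tends to work."""
    step1 = max("AB", key=lambda s1: SCORES[(s1, "X")])    # picks B
    step2 = max("XY", key=lambda s2: SCORES[(step1, s2)])  # picks X
    return (step1, step2), SCORES[(step1, step2)]

def best_joint():
    """Search both steps together, as a DSPy-style compiler does."""
    combo = max(product("AB", "XY"), key=SCORES.get)
    return combo, SCORES[combo]

print(best_independent())  # (('B', 'X'), 0.7): misses the interaction
print(best_joint())        # (('A', 'Y'), 0.8): joint search finds it
```

With three or four real LLM calls the search space is prompts rather than two letters, but the failure mode of independent tuning is the same.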

05

Where DSPy adds less

DSPy adds less value on three task types. First, single-call tasks where the prompt is well-understood. A simple summarisation or classification task with a hand-tuned prompt is often near-saturation; the framework's lift is small and the operational cost (training data, compilation latency, deployment complexity) is not justified.

Second, tasks without good evaluation functions. Open-ended creative writing, free-form Q&A judged by humans, or anything graded by an unreliable LLM-as-judge: DSPy's optimisation depends on a stable evaluation signal; without one, the framework can compile to prompts that maximise the noisy signal without genuinely improving the task.

Third, agent workflows where the bottleneck is orchestration rather than prompt design. If the work decomposes into clear roles (use CrewAI), needs explicit state and branching (use LangGraph), or benefits from multi-agent debate (use AutoGen), the orchestration framework solves the bigger problem and DSPy adds marginal benefit on top.

06

Production patterns and operational considerations

Three patterns recur in production DSPy deployments. First, compile-once-deploy-many: run the optimisation step in CI when the model or training data changes, then deploy the compiled prompts as static templates. This avoids paying the optimisation cost in the request path. Second, compile-on-update: when a model version changes (e.g. moving from Claude 4 to Claude 4.5), re-compile the program against the new model because optimised prompts are model-specific. Third, hybrid hand-and-compile: hand-tune the high-level program structure and let DSPy optimise the prompts inside each step.
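The compile-once-deploy-many pattern reduces to persisting the compiled artifact in CI and loading it at startup. DSPy programs provide their own save/load for compiled state; the sketch below shows only the pattern, with a hand-built artifact and invented field names (`model`, `instructions`, `demos`) standing in for whatever the real compiled program contains.

```python
import json, os, tempfile

def save_compiled(artifact: dict, path: str) -> None:
    """CI step: persist the compiled prompt artifact so the request
    path never pays the compilation cost."""
    with open(path, "w") as f:
        json.dump(artifact, f, indent=2)

def load_compiled(path: str) -> dict:
    """Serving step: load the static compiled prompts at startup."""
    with open(path) as f:
        return json.load(f)

# Hypothetical compiled artifact; with real DSPy you would call the
# compiled program's own save method rather than building this by hand.
artifact = {
    "model": "example-model-v4.5",  # re-compile when this changes
    "instructions": "Answer the question concisely.",
    "demos": [{"question": "2+2?", "answer": "4"}],
}
path = os.path.join(tempfile.mkdtemp(), "compiled_qa.json")
save_compiled(artifact, path)
assert load_compiled(path) == artifact
```

Storing the model identifier alongside the prompts is what makes the compile-on-update pattern mechanical: a model bump shows up as a mismatch and triggers re-compilation in CI rather than silent drift in production.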

Operational considerations: the framework requires labelled training data (typically a few hundred to a few thousand examples); compilation can take minutes to hours depending on the optimiser and program size; compiled prompts can be brittle when the model is updated, requiring re-compilation; the framework's mental model takes time to learn (the typed-signature abstraction is unusual). For teams that have the data, time, and patience to learn the abstractions, DSPy's lift is genuine and worth the investment.

07

When to use DSPy in 2026

Use DSPy when (a) the task has structured outputs amenable to programmatic verification, (b) you have a few hundred or more labelled training examples, (c) the prompt design is non-obvious enough that hand-tuning would be a substantial effort, and (d) you are comfortable with the framework's mental model. The clearest wins are on multi-hop QA, RAG pipelines, and structured-reasoning tasks where the prompt shape is genuinely hard to design.

Skip DSPy when the task is simple enough to hand-tune to near-saturation, when you lack labelled training data, when the evaluation signal is too noisy to optimise against, or when the bottleneck is orchestration rather than prompt design. The framework is the right tool for a specific class of problem; it is not a general agent-orchestration framework and should not be evaluated against LangGraph or AutoGen on the same axis.

Editor's verdict: DSPy is a different category from agent-orchestration frameworks: it compiles prompts programmatically rather than asking you to write them by hand. Real lifts of 5-25 points on multi-hop QA, structured reasoning, and complex RAG. Use alongside LangGraph or AutoGen, not as a replacement.
Reader Questions
Q.01 What is DSPy?
DSPy is the Stanford-led framework for programming language model pipelines, introduced by Khattab et al. in 2023 (arXiv:2310.03714) and steadily expanded since. The defining idea is to write LM pipelines as Python programs of typed signatures (input fields, output fields) rather than as prompts, then compile the program into optimised prompts using training data and explicit optimisation algorithms. The framework treats prompt engineering as a compilation step, not as a hand-tuning step.
Q.02 How is DSPy different from other agent frameworks?
DSPy is fundamentally a different category. LangGraph, AutoGen, CrewAI, and the OpenAI SDK are agent-orchestration frameworks: they provide primitives for defining agents, tools, and execution loops, but the prompts inside each agent step are written by hand. DSPy is a programmatic-prompt-optimisation framework: it generates and optimises the prompts programmatically using algorithms like BootstrapFewShot, MIPROv2, or COPRO. You can use DSPy alongside an orchestration framework, but DSPy's distinct value is the optimisation, not the orchestration.
Q.03 What benchmarks does DSPy publish?
The original DSPy paper reported on GSM8K (mathematics), HotPotQA (multi-hop QA), and a few smaller datasets. The team has expanded coverage since to include MultiHopRAG, BIG-Bench Hard tasks, and the SCONE multi-step reasoning benchmark. DSPy's value proposition is typically reported as 'X percent improvement over hand-prompted baseline' rather than absolute SOTA, because the framework's contribution is the optimisation lift on top of whatever prompt you would otherwise write by hand.
Q.04 How much does DSPy improve scores in practice?
Reported lifts range from 5 to 25 points depending on the task and the baseline. The framework adds the most value when (a) the task has structured outputs amenable to programmatic verification, (b) you have a few hundred to a few thousand training examples to optimise against, and (c) the baseline prompt is moderately sophisticated but not exhaustively tuned. On tasks where you can hand-tune a prompt to near-saturation, the framework's lift is small; on tasks where prompt design is non-obvious, the lift is meaningful.
Q.05 Is DSPy production-ready in 2026?
Yes, with caveats. The framework has matured significantly since its 2023 introduction, has clean documentation, and is used in production by several teams. The main caveats: (a) the framework's value depends on having labelled training data, which adds operational complexity; (b) optimised prompts can be brittle when the underlying model is updated, requiring re-compilation; (c) the framework's abstractions take some time to learn and the mental model is genuinely different from agent-orchestration frameworks. For teams that have the data and the time, DSPy's optimisation lift is worth the investment.
Q.06 Can I combine DSPy with LangGraph or AutoGen?
Yes. The combination is increasingly common in production. DSPy compiles individual reasoning steps (e.g. 'given a research question, generate a sub-query plan'); LangGraph or AutoGen orchestrates the broader workflow that calls these compiled steps. The result is an agent that uses the orchestration framework's structural strengths (state, conversation, crew) and DSPy's optimisation strengths (programmatically tuned prompts inside each step). This pattern is documented in the DSPy + LangChain integration examples.

Sources

  1. Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714.
  2. DSPy project site. dspy.ai. Accessed May 2026.
  3. DSPy repository. github.com/stanfordnlp/dspy.
  4. Opsahl-Ong, K. et al. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (the MIPROv2 paper). arXiv:2406.11695.
From the editor

Benchmarking Agents Review is published by Digital Signet, an independent firm that builds and ships AI agents in production for mid-market companies. If you are evaluating, designing, or productionising LLM agents and want a working second opinion, get in touch.

Book a 30-minute scoping call with Digital Signet. 30 minutes, free, independent. 1-page action plan within 48h. Honest if not the right fit.