DSPy: Programmatic Prompt Optimisation, Not Just Orchestration
The framework that treats prompt engineering as a compilation step rather than a hand-tuning step. Stanford-led, optimisation-first design, real lifts on multi-hop QA and structured reasoning. A different category from LangGraph and AutoGen: not orchestration, but programmatic prompt and pipeline optimisation that you use alongside an orchestration framework or on its own.
What DSPy is
DSPy, introduced by Khattab et al. at Stanford NLP in late 2023, is a framework for programming language model pipelines as typed Python programs. The defining abstraction is the signature: a declaration of input fields, output fields, and an optional task description. A DSPy program composes signatures into a pipeline; the framework then compiles the pipeline by generating and optimising prompts using one of several optimiser algorithms.
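A minimal sketch of the abstraction, assuming a recent DSPy 2.x API; the model name and field names here are illustrative, not prescribed by the framework:

```python
import dspy

# Configure a language model once per process (model name is illustrative).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class AnswerWithContext(dspy.Signature):
    """Answer the question using the supplied context."""
    context: str = dspy.InputField(desc="passages retrieved for the question")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="a short, factual answer")

# A module pairs the signature with a prompting strategy. The prompt itself is
# generated by the framework and later rewritten by the optimiser at compile time.
qa = dspy.ChainOfThought(AnswerWithContext)
prediction = qa(context="Paris is the capital of France.",
                question="What is the capital of France?")
print(prediction.answer)
```

The same program can also be declared with the shorthand string form, `dspy.ChainOfThought("context, question -> answer")`, which is common in quick experiments.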
The key conceptual shift is treating prompt engineering as compilation. Other frameworks ask you to write the prompt by hand, sometimes with a few-shot template, and then call the language model. DSPy asks you to write the program structure, then runs an optimisation step that produces the prompts automatically using your training data and chosen optimiser. The result is prompts you would not have written by hand, often substantially better than hand-tuned baselines.
DSPy is a different category from agent-orchestration frameworks. LangGraph, AutoGen, CrewAI, and the OpenAI Agents SDK are about defining agents, tools, and execution loops; the prompts inside each step are written by you. DSPy is about generating those prompts programmatically. The two categories compose: production teams increasingly use LangGraph or AutoGen for orchestration and DSPy for prompt optimisation inside each reasoning step.
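A hedged sketch of that composition, with a compiled DSPy module supplying the reasoning inside a single LangGraph node; the state shape, the node name, and the stand-in `compiled_program` are assumptions for illustration:

```python
from typing import TypedDict

import dspy
from langgraph.graph import StateGraph, START, END

# Stand-in for a program compiled offline by a DSPy optimiser; assumes an LM
# has already been configured with dspy.configure(...) as in the earlier sketch.
compiled_program = dspy.ChainOfThought("question -> answer")

class QAState(TypedDict):
    question: str
    answer: str

def reason(state: QAState) -> dict:
    # LangGraph owns control flow and state; DSPy owns the prompt inside this step.
    pred = compiled_program(question=state["question"])
    return {"answer": pred.answer}

graph = StateGraph(QAState)
graph.add_node("reason", reason)
graph.add_edge(START, "reason")
graph.add_edge("reason", END)
app = graph.compile()

result = app.invoke({"question": "What does DSPy compile?"})
```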
The optimisation surface
DSPy ships with several optimisers, each suited to different program shapes and data budgets. Choosing the right optimiser is the most important decision when starting with the framework; the wrong choice can compile slowly without much improvement, while the right choice can lift scores significantly.
MIPROv2 is the recommended default for serious applications in 2026. BootstrapFewShot is the right starting point when you are evaluating the framework or want a quick baseline. BootstrapFinetune is the most powerful but requires a fine-tunable model and substantial compute, which most teams will not invest in. The choice depends on your data budget and your tolerance for compilation latency.
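A sketch of the two common entry points, assuming a small labelled trainset and a deterministic exact-match metric; the names are illustrative and the exact keyword arguments vary somewhat across DSPy releases:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot, MIPROv2

trainset = [
    dspy.Example(question="Who introduced DSPy?",
                 answer="Khattab et al.").with_inputs("question"),
    # ...a few hundred examples in practice
]

def exact_match(example, pred, trace=None):
    # Called by the optimiser to score every candidate prompt on every example.
    return example.answer.strip().lower() == pred.answer.strip().lower()

program = dspy.ChainOfThought("question -> answer")

# Quick baseline: bootstrap a handful of few-shot demonstrations from the trainset.
baseline_opt = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
baseline = baseline_opt.compile(program, trainset=trainset)

# Heavier default for serious use: jointly optimises instructions and demonstrations.
mipro = MIPROv2(metric=exact_match, auto="medium")
compiled = mipro.compile(program, trainset=trainset)
```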
Benchmark coverage
DSPy benchmark numbers are usually reported as "X percent lift over baseline" rather than absolute SOTA, because the framework's contribution is the optimisation on top of whatever prompt you would otherwise write. The sections that follow summarise the best-documented lifts from the original paper, follow-up work, and the community.
Where DSPy wins
DSPy adds the most value on three task types. First, multi-hop QA and RAG: HotPotQA and MultiHopRAG have prompt shapes that are non-obvious to hand-tune (how to phrase the sub-query, how to format the retrieved context, how to chain the inference), and DSPy's optimisation reliably finds prompts better than hand-tuned baselines. Lifts of 10-25 points are routine.
Second, structured reasoning with verifiable outputs. Math benchmarks like GSM8K, code benchmarks where output correctness can be checked programmatically, and any task where the optimiser has an automated success signal to compile against. The verifiability matters because DSPy's optimisers need a reliable evaluation function; tasks where success is judged by a noisy LLM or by humans are harder to compile against.
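The evaluation function this implies is short and entirely programmatic. A sketch for a GSM8K-style numeric answer; the field names are assumptions:

```python
import re

def numeric_answer_match(example, pred, trace=None):
    """Score a prediction by comparing final numbers, not prose.

    Because the check is deterministic, every candidate prompt the optimiser
    proposes is scored the same way on every example, which is exactly the
    stable signal compilation needs.
    """
    def last_number(text: str):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return float(numbers[-1]) if numbers else None

    gold = last_number(example.answer)
    predicted = last_number(pred.answer)
    return gold is not None and predicted is not None and abs(gold - predicted) < 1e-6
```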
Third, pipelines with multiple LLM calls. A single-call task often benefits little from DSPy because there is one prompt to optimise and the gain over careful hand-tuning is marginal. A pipeline with three or four LLM calls (decompose, retrieve, reason, synthesise) benefits substantially because the joint optimisation across calls captures interactions that would be impossible to hand-tune.
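A sketch of such a pipeline as one DSPy module, so that compilation tunes the prompts for all the calls jointly; the retriever and the field names are placeholders:

```python
import dspy

class MultiHopQA(dspy.Module):
    """Decompose, retrieve, reason, synthesise: compiled as one program."""

    def __init__(self, retrieve):
        super().__init__()
        self.retrieve = retrieve  # any callable: query string -> list of passages
        self.decompose = dspy.ChainOfThought("question -> subquery")
        self.reason = dspy.ChainOfThought("context, question -> notes")
        self.synthesise = dspy.ChainOfThought("notes, question -> answer")

    def forward(self, question):
        subquery = self.decompose(question=question).subquery
        passages = self.retrieve(subquery)
        notes = self.reason(context="\n".join(passages), question=question).notes
        return self.synthesise(notes=notes, question=question)
```

When an optimiser compiles this module, it tunes the instructions and demonstrations of all three predictors against a single end-to-end metric, which is where the joint-optimisation gain over hand-tuning each prompt separately comes from.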
Where DSPy adds less
DSPy adds less value on three task types. First, single-call tasks where the prompt is well-understood. A simple summarisation or classification task with a hand-tuned prompt is often near-saturation; the framework's lift is small and the operational cost (training data, compilation latency, deployment complexity) is not justified.
Second, tasks without good evaluation functions. Open-ended creative writing, free-form Q&A judged by humans, or anything graded by an unreliable LLM-as-judge: DSPy's optimisation depends on a stable evaluation signal; without one, the framework can compile to prompts that maximise the noisy signal without genuinely improving the task.
Third, agent workflows where the bottleneck is orchestration rather than prompt design. If the work decomposes into clear roles (use CrewAI), needs explicit state and branching (use LangGraph), or benefits from multi-agent debate (use AutoGen), the orchestration framework solves the bigger problem and DSPy adds marginal benefit on top.
Production patterns and operational considerations
Three patterns recur in production DSPy deployments. First, compile-once-deploy-many: run the optimisation step in CI when the model or training data changes, then deploy the compiled prompts as static templates. This avoids paying the optimisation cost in the request path. Second, compile-on-update: when a model version changes (e.g. moving from Claude 4 to Claude 4.5), re-compile the program against the new model because optimised prompts are model-specific. Third, hybrid hand-and-compile: hand-tune the high-level program structure and let DSPy optimise the prompts inside each step.
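A sketch of the compile-once-deploy-many pattern, reusing the illustrative names from the sketches above; the file path and the CI/serving split are assumptions:

```python
from dspy.teleprompt import MIPROv2

# CI job: re-run whenever the model, the program, or the trainset changes.
optimizer = MIPROv2(metric=exact_match, auto="medium")
compiled = optimizer.compile(MultiHopQA(retrieve), trainset=trainset)
compiled.save("compiled_multihop.json")  # optimised prompts as a static artefact

# Serving path: no optimisation cost per request; load the artefact and run.
program = MultiHopQA(retrieve)
program.load("compiled_multihop.json")
answer = program(question="...").answer
```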
Operational considerations: the framework requires labelled training data (typically a few hundred to a few thousand examples); compilation can take minutes to hours depending on the optimiser and program size; compiled prompts can be brittle when the model is updated, requiring re-compilation; the framework's mental model takes time to learn (the typed-signature abstraction is unusual). For teams that have the data, time, and patience to learn the abstractions, DSPy's lift is genuine and worth the investment.
When to use DSPy in 2026
Use DSPy when (a) the task has structured outputs amenable to programmatic verification, (b) you have a few hundred or more labelled training examples, (c) the prompt design is non-obvious enough that hand-tuning would be a substantial effort, and (d) you are comfortable with the framework's mental model. The clearest wins are on multi-hop QA, RAG pipelines, and structured-reasoning tasks where the prompt shape is genuinely hard to design.
Skip DSPy when the task is simple enough to hand-tune to near-saturation, when you lack labelled training data, when the evaluation signal is too noisy to optimise against, or when the bottleneck is orchestration rather than prompt design. The framework is the right tool for a specific class of problem; it is not a general agent-orchestration framework and should not be evaluated against LangGraph or AutoGen on the same axis.
Sources
- [1] Khattab, O. et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714.
- [2] DSPy project site. dspy.ai. Accessed May 2026.
- [3] DSPy repository. github.com/stanfordnlp/dspy.
- [4] Opsahl-Ong, K. et al. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (the MIPROv2 paper). arXiv:2406.11695.