DSPy: programming instead of prompting – the Stanford approach to LLM pipelines

As of May 2026, DSPy (v2.5+) is an MIT-licensed framework from Stanford. Instead of writing prompts, you define tasks and the system optimises the prompts automatically. It is production-capable for complex multi-step pipelines.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is DSPy?

DSPy is an open-source framework that breaks with the classic prompt-engineering approach to LLM applications. It has been developed since 2023 at the Stanford NLP Lab under Omar Khattab (creator of ColBERT), is MIT-licensed, and is Python only. As of May 2026 it is at version 2.5+ with "production ready" status: the academic phase is over and the library has proven itself at several large companies.

The fundamental difference from LangChain and LlamaIndex: in DSPy you do not write prompts by hand. You define tasks (signatures) as Python classes or as strings in the format input -> output. The DSPy optimisers then tune the prompts automatically, using few-shot examples drawn from a small training dataset. Instead of spending hours on prompt tuning, you write task definitions and let the optimisation run.
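To make the input -> output notation concrete, here is a minimal sketch of how such a string can be read as a list of input fields and a list of output fields. It is a plain-Python illustration, not DSPy's own parser; the helper name parse_signature is hypothetical.

```python
def parse_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split an 'inputs -> outputs' string into two lists of field names."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")
    # Field names are comma-separated on each side of the arrow.
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return inputs, outputs


print(parse_signature("question -> answer"))
print(parse_signature("context, question -> answer"))
```

The point of the notation is that it names what goes in and what comes out, and says nothing about how the prompt is worded. That wording is what the optimiser is left to decide.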

Core concepts: Signature (input -> output definition, e.g. "question -> answer"), Module (wraps a signature in a prompting strategy, e.g. Predict, ChainOfThought, ReAct; modules can be composed into larger programs), Optimizer (algorithms like BootstrapFewShot, MIPROv2, BootstrapFinetune), Compiler (the compile step that runs an optimiser over a module and produces the final pipeline state). In May 2026 MIPROv2 is the state-of-the-art optimiser and delivers the best results in most cases.

The learning curve is the main barrier. DSPy thinks differently from the prompt-engineering tradition. Anyone coming from LangChain/LlamaIndex typically needs 1-2 weeks of onboarding before the DSPy world clicks. Anyone from the ML field (PyTorch experience) feels at home faster, because DSPy is conceptually built like PyTorch-for-LLMs.

In May 2026 the main users are Stanford itself (research pipelines), some hedge funds (complex financial-analysis pipelines), pharma companies (literature research and structuring), and some specialised AI consultancies. In the typical Swiss SME segment DSPy is still rare, because the learning curve does not match the fast-pilot model.

Why it matters

For complex multi-step LLM pipelines, DSPy is in May 2026 the most technically interesting framework. The reason: classic prompt engineering scales poorly. Anyone building a pipeline with 5 LLM calls (e.g. research -> fact check -> synthesis -> language review -> final formatting) must write, test, and iterate on a prompt for each step. A change of LLM (gpt-4o -> gpt-4o-mini -> Mistral) means all prompts need re-tuning. Maintenance hell.

DSPy abstracts this problem. The task signatures stay the same; on a model switch the prompts are regenerated by re-running the compile step rather than rewritten by hand. Conceptually, a model switch is as simple as changing the compile config and recompiling. For long-lived pipelines this is a significant advantage, even if the initial learning curve is steeper.

For Swiss fiduciaries the question is: when does this learning curve pay off? Three situations speak for DSPy. First: complex pipelines with 5+ LLM steps (e.g. a tax check with multiple sources, validation, summary). Second: production pipelines with high accuracy requirements (e.g. invoice recognition with structuring, where every percentage point of improvement counts). Third: frequent model switches (e.g. a pilot starts with GPT-4o and later switches to Mistral EU or a local Llama).

For simple RAG pipelines DSPy is disproportionate: LlamaIndex gets you there faster and the accuracy gain is marginal. In May 2026 many teams combine the two, using DSPy for the "hard" pipeline stages (complex reasoning steps) and LlamaIndex for standard RAG retrieval. A common pattern is DSPy as an optimisation layer over a LlamaIndex retrieval.

Critical note: DSPy optimisation needs a training set of at least 30-50 real input examples with desired outputs. Anyone who does not have that set, or is unwilling to build it, should not use DSPy, because the optimisers have nothing to learn from.

How it works

The DSPy workflow follows three phases: define, compile, deploy.

Phase 1: define the task. Example for a simple Q&A pipeline:

    import os

    import dspy

    # Configure the language model once; DSPy 2.5+ uses the unified dspy.LM client.
    lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
    dspy.configure(lm=lm)


    class GenerateAnswer(dspy.Signature):
        """Answers a client question based on context."""

        context = dspy.InputField(desc="Relevant knowledge chunks")
        question = dspy.InputField()
        answer = dspy.OutputField(desc="Precise answer in German")


    class RAG(dspy.Module):
        def __init__(self, num_passages=5):
            super().__init__()
            # dspy.Retrieve needs a retrieval model configured via dspy.configure(rm=...).
            self.retrieve = dspy.Retrieve(k=num_passages)
            self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

        def forward(self, question):
            # Fetch passages first, then answer with chain-of-thought over them.
            context = self.retrieve(question).passages
            return self.generate_answer(context=context, question=question)

Phase 2: compile. An optimisation with MIPROv2 needs a training set:

    from dspy.teleprompt import MIPROv2

    trainset = [
        dspy.Example(question="What AHV contributions apply in 2026?", answer="...").with_inputs("question"),
        # ... at least 30 examples
    ]


    def validate(example, pred, trace=None):
        # Exact match between the expected and the predicted answer.
        return dspy.evaluate.answer_exact_match(example, pred)


    teleprompter = MIPROv2(metric=validate, auto="medium")
    compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

The compile phase takes minutes to hours. MIPROv2 generates several prompt variants, evaluates them against the metric, and keeps the best. The result is an optimised module whose prompts contain tuned instructions and few-shot examples.
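The "generate variants, try them, keep the best" loop can be sketched in a few lines. This is a toy illustration only: the real optimiser's search strategy is considerably more involved, and the names pick_best and average_score, the stand-in variants, and the arithmetic examples are all hypothetical.

```python
from typing import Callable

Predictor = Callable[[str], str]


def average_score(predict: Predictor, examples: list[dict], metric) -> float:
    """Mean metric value of one candidate over all examples."""
    scores = [metric(ex, predict(ex["question"])) for ex in examples]
    return sum(scores) / len(scores)


def pick_best(candidates: dict[str, Predictor], examples: list[dict], metric):
    """Score every candidate on the same examples and return the winner."""
    scored = {name: average_score(fn, examples, metric) for name, fn in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored


def exact_match(example: dict, prediction: str) -> float:
    return float(prediction.strip() == example["answer"])


# Toy data and stand-ins for "pipeline + prompt variant"; no LLM is called.
examples = [
    {"question": "2+2", "answer": "4"},
    {"question": "3+5", "answer": "8"},
    {"question": "10+1", "answer": "11"},
]
candidates = {
    "variant_a": lambda q: "4",  # weak variant: always gives the same answer
    "variant_b": lambda q: str(sum(int(part) for part in q.split("+"))),
}

best, scored = pick_best(candidates, examples, exact_match)
print(best, scored)
```

The sketch also shows why the metric and the example set matter so much: the optimiser can only prefer a variant to the extent that the metric can tell the variants apart on the data it is given.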

Phase 3: deploy. The compiled module is called like a normal Python object:

    response = compiled_rag("What AHV contributions apply in 2026?")
    print(response.answer)

The module can be serialised via compiled_rag.save("rag_v1.json") and deployed to production.

Advanced modules: ReAct for tool calling and ProgramOfThought for code generation. Multi-hop reasoning for complex research pipelines is typically written as a custom module that chains several retrieval and generation steps. DSPy ships retriever adapters for Pinecone, Qdrant, Weaviate, ColBERT, and Chroma.

DSPy tracing integrates via MLflow or Phoenix. Each pipeline execution is recorded with module calls, prompts, token usage, and latency. In May 2026 Phoenix (Arize AI, open source) is the recommended tracing solution.
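What "recording module calls with latency" means can be shown with a small stand-alone sketch. It does not use the MLflow or Phoenix APIs; the decorator traced, the TRACE list, and the two stub steps are hypothetical, and output length stands in for token usage.

```python
import time
from functools import wraps

TRACE: list[dict] = []  # one record per module call, in call order


def traced(name: str):
    """Decorator that records latency and output size for each call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "module": name,
                "latency_s": time.perf_counter() - start,
                "output_chars": len(str(result)),  # crude stand-in for token usage
            })
            return result
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(question: str) -> list[str]:
    return ["passage one", "passage two"]  # stub, no real retrieval


@traced("generate_answer")
def generate_answer(question: str, passages: list[str]) -> str:
    return f"Answer based on {len(passages)} passages"  # stub, no LLM call


answer = generate_answer("example question", retrieve("example question"))
for record in TRACE:
    print(record["module"], f"{record['latency_s']:.6f}s", record["output_chars"])
```

A dedicated tracing tool adds what this sketch lacks: the full prompts, real token counts, nesting of calls, and a UI to inspect them.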

Important: DSPy-compiled pipelines are not model-portable. When you switch LLM you must recompile – the optimised few-shot examples often do not transfer directly. That is the flip side of the optimisation approach.

DSPy setup in 5 steps

01 Check the use case: 5+ LLM steps? Accuracy critical? Training data available (30-50 real examples)? If the answer is not yes to all three, use LlamaIndex instead of DSPy.
02 Define signatures: each LLM step as input -> output. Write clear descriptions and choose ChainOfThought or ReAct as the module wrapper.
03 Build the training set: 30-50 example inputs with desired outputs. Define a validation metric (exact match, F1, custom logic).
04 Compile with MIPROv2: auto="light" for fast tests, auto="medium" for production optimisation, auto="heavy" for maximum quality. Mind the compile duration (10 minutes to several hours).
05 Set up tracing: Phoenix (Arize AI) or MLflow for pipeline visualisation. Monitor token usage and latency per module call. On a model switch, recompile with the current training set.
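For step 03, exact match is often too strict for free-text answers. A token-level F1 metric is one common alternative; the sketch below is a plain-Python version under stated assumptions: whitespace tokenisation, a hypothetical threshold of 0.8, and SimpleNamespace objects standing in for the example and prediction objects a DSPy metric would receive.

```python
from collections import Counter
from types import SimpleNamespace


def token_f1(expected: str, predicted: str) -> float:
    """Harmonic mean of token precision and recall (case-insensitive)."""
    exp_tokens = expected.lower().split()
    pred_tokens = predicted.lower().split()
    if not exp_tokens or not pred_tokens:
        return float(exp_tokens == pred_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs.
    overlap = sum((Counter(exp_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def validate(example, pred, trace=None, threshold=0.8):
    """Metric with the (example, pred, trace) shape; threshold is an assumption."""
    return token_f1(example.answer, pred.answer) >= threshold


example = SimpleNamespace(answer="the form is due in march")
pred = SimpleNamespace(answer="form is due in march")
print(token_f1(example.answer, pred.answer), validate(example, pred))
```

Whether a threshold or the raw score works better depends on the optimiser and the task; the threshold here only shows how a graded score can be turned into a pass/fail decision.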

When to use DSPy

DSPy is the right choice when (a) the pipeline covers 5+ LLM steps, (b) accuracy is critical and optimisation effort pays off, or (c) the team is research-oriented with PyTorch experience.

Concrete cases: a financial analyst builds a pipeline from research sources + fact check + table extraction + report synthesis + quality check – DSPy with MIPROv2 optimises the steps jointly. A law office wants contract analysis with clause classification, risk rating, and recommendation – DSPy as multi-step pipeline with few-shot examples from existing data. An SME receives 100 invoices per day and wants to process them into structured posting records – DSPy for extraction + validation + categorisation.

For academic and research projects, DSPy is in May 2026 the default framework. Most new Stanford and MIT papers on LLM reasoning use DSPy as a basis. Anyone working in research will encounter it.

For A/B tests of different models (gpt-4o vs. Mistral vs. Llama) DSPy is elegant – same signature, three compile configs, three comparison pipelines with consistent optimisation.

When not to use

For simple RAG pipelines with one or two LLM steps, DSPy is oversized. LlamaIndex gets you there faster and the accuracy gain is minimal.

For teams without Python depth or an ML background, the DSPy learning curve is disproportionate. Anyone who needs a bot in production within 2-3 weeks should choose LlamaIndex or LangChain.

For use cases without training data (e.g. a pilot without example outputs) DSPy cannot optimise. At least 30-50 example pairs are needed; anyone who does not have them should collect the data first.

For agent workflows with complex state management and multi-tool calls, LangGraph is stronger. DSPy's ReAct module covers this too, but LangGraph is more mature in May 2026.

For no-code setups DSPy is unsuited: everything is Python code and there is no visual builder. Flowise or Langflow are the better choice.

For extremely latency-critical applications (sub-second), the DSPy pipeline through its module layer is slightly slower than a direct LLM call. At critical latency, a custom build with the raw LLM SDK is faster.

When LLM providers change frequently and pipeline recompilation is not acceptable, DSPy is also a poor fit: its pipelines are not model-portable. On a model switch a recompile with the training set is needed, typically 10-60 minutes of compile time plus validation.

Trade-offs

STRENGTHS

Optimisation instead of manual prompt tuning – scales better for complex pipelines
Clear separation of task definition and prompt implementation
Stanford backing, academically grounded, production-ready in May 2026
Elegant A/B testing of different models via shared signature

WEAKNESSES

Steep learning curve – 1-2 weeks onboarding from a LangChain background
Training data (30-50 examples) mandatory – without it there is nothing to optimise
Pipelines not model-portable – recompile on model switch
Compile phase takes minutes to hours – no fast iteration cycle

FAQ

Is DSPy worth it for a Swiss SME?

Rarely. DSPy is academically very interesting and production-capable in May 2026, but the learning curve does not fit the fast SME pilot model. It is worth it for complex multi-step pipelines with high accuracy requirements, e.g. a tax check with 5 steps. For standard RAG, LlamaIndex reaches production faster.

How many training examples are needed?

At least 30-50, ideally 100-200. With fewer than 30, MIPROv2 cannot optimise reliably. The examples must cover the variety of real cases: if the use case has 5 client types, the training set should contain all 5.
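Both conditions, enough examples and coverage of every case type, are easy to check before a compile run. A minimal sketch, assuming each example carries a hypothetical client_type label; the function name check_trainset and the toy type names are illustrative, not part of DSPy.

```python
from collections import Counter

MIN_EXAMPLES = 30  # lower bound named in the text


def check_trainset(examples: list[dict], required_types: set[str]) -> list[str]:
    """Return a list of human-readable problems; empty means the set looks usable."""
    problems = []
    if len(examples) < MIN_EXAMPLES:
        problems.append(f"only {len(examples)} examples, need at least {MIN_EXAMPLES}")
    counts = Counter(ex["client_type"] for ex in examples)
    for missing in sorted(required_types - set(counts)):
        problems.append(f"no examples for client type '{missing}'")
    return problems


# Toy data: 28 examples that cover only two of three required types.
toy_set = (
    [{"client_type": "private", "question": "..."} for _ in range(14)]
    + [{"client_type": "sme", "question": "..."} for _ in range(14)]
)
required = {"private", "sme", "association"}

for problem in check_trainset(toy_set, required):
    print(problem)
```

A check like this can run as a precondition of the compile step, so that an undersized or lopsided training set fails early instead of producing a weakly optimised pipeline.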

Are compiled pipelines model-portable?

No. DSPy optimisation is bound to the model. On a switch from gpt-4o to Mistral you must recompile. That is also the point: DSPy optimises few-shot prompts model-specifically. Pragmatically, the signatures stay the same and the compile step becomes part of the CI/CD pipeline.
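If the compile step lives in CI/CD, the pipeline needs a way to decide when a recompile is due. One possible approach, sketched here under assumptions, is to fingerprint everything the optimisation depends on (model name, optimiser, training set) and recompile when the fingerprint changes. The function names and the model strings are placeholders, and nothing here is a DSPy feature.

```python
import hashlib
import json
from typing import Optional


def compile_fingerprint(model: str, trainset: list[dict], optimizer: str = "MIPROv2") -> str:
    """Stable SHA-256 hash over the inputs that determine a compiled pipeline."""
    payload = json.dumps(
        {"model": model, "optimizer": optimizer, "trainset": trainset},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_recompile(stored: Optional[str], model: str, trainset: list[dict]) -> bool:
    """True when no fingerprint is stored or any compile input has changed."""
    return stored != compile_fingerprint(model, trainset)


trainset = [{"question": "example question", "answer": "example answer"}]
stored = compile_fingerprint("gpt-4o", trainset)

print(needs_recompile(stored, "gpt-4o", trainset))   # same model, same data
print(needs_recompile(stored, "mistral", trainset))  # model switched
```

Storing the fingerprint next to the saved pipeline file keeps the two in sync: a CI job can then skip the expensive compile when nothing relevant has changed and trigger it when the model or the training set does.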

DSPy plus LlamaIndex – does that work?

Yes, and in May 2026 it is a common pattern: LlamaIndex for data loading, chunking, embedding, and retrieval; DSPy for the heavy reasoning steps after retrieval (answer generation, fact check, synthesis). Both are Python libraries and integrate via shared data structures.
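The division of labour can be sketched without either library: the retrieval side only has to hand over a list of text passages, and the reasoning side only has to accept one. In the illustration below all names (RetrieveThenReason, keyword_retrieve, stub_generate) and the toy corpus are hypothetical; a real setup would plug a LlamaIndex retriever and a DSPy module into the two slots.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]   # (question, k) -> passages
Generator = Callable[[str, list[str]], str]   # (question, passages) -> answer


class RetrieveThenReason:
    """Pipeline that couples any retriever to any generator via plain strings."""

    def __init__(self, retrieve: Retriever, generate: Generator, k: int = 2):
        self.retrieve = retrieve
        self.generate = generate
        self.k = k

    def __call__(self, question: str) -> str:
        passages = self.retrieve(question, self.k)
        return self.generate(question, passages)


CORPUS = [
    "Signatures declare inputs and outputs.",
    "Optimisers tune prompts against a metric.",
    "Retrievers return passages for a query.",
]


def keyword_retrieve(question: str, k: int) -> list[str]:
    """Toy retriever: rank passages by number of shared words."""
    words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda p: len(words & set(p.lower().rstrip(".").split())),
        reverse=True,
    )
    return ranked[:k]


def stub_generate(question: str, passages: list[str]) -> str:
    """Toy generator: no LLM call, just reports what it was given."""
    return f"{len(passages)} passages used; top: {passages[0]}"


pipeline = RetrieveThenReason(keyword_retrieve, stub_generate, k=2)
print(pipeline("what do optimisers tune"))
```

Because the interface between the two halves is just a list of strings, either half can be replaced or optimised on its own, which is what makes the combination practical.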

Sources

stanfordnlp/dspy – GitHub repository and releases · 2026-05
DSPy documentation – signatures, modules, optimisers · 2026-05
DSPy paper – Compiling Declarative Language Model Calls · 2026-04
MIPROv2 paper – Optimizing Instructions and Demonstrations · 2026-03
