By Adrien Payong and Shaoni Mukherjee

LLM application development has grown past simple prompt engineering. As systems become more complex, you need a stronger mental model to structure reasoning, retrieval, tool use, evaluation, and optimization within one maintainable workflow. DSPy was designed to help with that. Rather than manually tuning lengthy prompt templates, you define signatures, compose modules, and then optimize the entire program against a metric. This makes LLM development feel less like prompt trial and error and more like building a measurable, improvable software pipeline.
This article covers practical DSPy use cases you will encounter when building production-quality applications. We dive into how DSPy enables question answering, retrieval-augmented generation, multi-step reasoning agents, text classification, and much more. Along the way, you’ll learn about DSPy’s approach to metric evaluation, assertion-style constraints, and choosing an optimizer. By the end, you should have a clearer view of how DSPy can help you move from isolated prompts to scalable, structured, production-ready LLM pipelines.
DSPy’s design philosophy is to write declarative LM programs (signatures, modules, and control flow) and then compile them against a metric, rather than hand-engineering long prompt templates.
The authors of DSPy reframe this as compiling declarative LM calls into self-improving pipelines, as in the original paper. The compile step searches for better instructions, few-shot demonstrations, and (in some modes) finetuned weights. In practice, working with DSPy tends to feel more like “lightweight ML” than prompt engineering.
DSPy is often compared to orchestration frameworks, such as LangChain, and data-centric RAG frameworks, like LlamaIndex. A helpful way to think about the difference: those frameworks focus on orchestration and data connectors, while DSPy focuses on defining the LM program itself and optimizing it against a metric.
Many real-world production stacks combine these approaches: use LlamaIndex (or another retriever) to power ingestion and retrieval, then utilize DSPy to wrap the generation and routing logic to optimize prompts and typed outputs.
Signatures describe what the model should do: input fields, output fields, and their semantic names. You can optionally specify types and instructions. Field names matter because they signal each field’s role (“question” vs “answer”, “context” vs “summary”, etc).
Modules define how to solve it. Key ones include dspy.Predict, dspy.ChainOfThought, and dspy.ReAct, all of which appear in the examples below.
Adapters determine how structured your LM I/O is. ChatAdapter is DSPy’s default field-marker format. JSONAdapter asks models that support structured output formatting to emit JSON so that you can reliably parse typed outputs.
This code implements a small but realistic “router” program that brings together Predict, RAG with ChainOfThought, and ReAct in one end-to-end flow:
```python
# pip install -U dspy (or: pip install -U dspy-ai)
import dspy
from typing import Literal

# 1) Configure the language model once near the top of your app.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())

# 2) A small intent classifier (Predict) to route requests.
class Route(dspy.Signature):
    """Route the user request to the best handler."""
    query: str = dspy.InputField()
    intent: Literal["rag_qa", "tool_agent", "direct_qa"] = dspy.OutputField()

router = dspy.Predict(Route)

# 3) A RAG-style answerer (we'll implement it fully later).
class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")

rag_answerer = dspy.ChainOfThought(RagAnswer)

# 4) A ReAct agent with tools (we'll implement tools later).
def add(a: float, b: float) -> float:
    return a + b

agent = dspy.ReAct(signature="question -> answer", tools=[add], max_iters=8)

# 5) Tie it together as a program.
class UnifiedAssistant(dspy.Module):
    def forward(self, query: str, retrieved_passages: list[str] | None = None):
        route = router(query=query).intent
        if route == "rag_qa":
            ctx = retrieved_passages or []
            return rag_answerer(context=ctx, question=query)
        if route == "tool_agent":
            return agent(question=query)
        # Default: direct QA, still using a CoT-style module for robustness.
        direct = dspy.ChainOfThought("question -> answer")
        return direct(question=query)

assistant = UnifiedAssistant()
```
The above script builds a lightweight DSPy assistant capable of serving multiple types of user queries within a single workflow. After setting up an LLM and JSON adapter, it creates a Predict router that classifies which of three intents a new query belongs to: RAG-based question answering, tool-based agent reasoning, or direct question answering. Queries that require external knowledge are routed to a ChainOfThought RAG module that answers the question given retrieved passages, and returns citations. Queries that require tool usage are routed to a ReAct agent coupled with an add tool; all other queries fall back to a direct ChainOfThought answer module. This program demonstrates how DSPy can orchestrate routing, retrieval, reasoning, and tool use within a single modular assistant.
By default, the DSPy ChainOfThought module is designed for problems where producing intermediate reasoning improves correctness. Consider the following code:
```python
import dspy
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match

# Configure once per process.
# (OPENAI_API_KEY must be set in your environment.)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A minimal CoT QA module.
qa_cot = dspy.ChainOfThought("question -> answer")

# A tiny devset (start small, then grow).
devset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
]

# Metric: exact match on the final answer field.
def em_metric(example, pred, trace=None):
    return answer_exact_match(example, pred)

evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
baseline = evaluator(qa_cot, metric=em_metric)
print("Baseline score:", baseline)
```
This program sets up a small DSPy question-answering evaluation pipeline. It initializes DSPy with the openai/gpt-4o-mini model, then defines a simple ChainOfThought module that accepts a question and generates an answer. The program defines a small development dataset of two example QA pairs and builds an exact-match metric for comparing each predicted answer against the expected one. It then launches DSPy’s Evaluate utility to apply the module to each question in the dataset in parallel, and prints the baseline score, indicating how accurately the unoptimized Chain-of-Thought QA module answered those sample questions.
If you only have a few examples, BootstrapFewShot is a good starting point. This optimizer composes demos from labeled examples + bootstrapped demos created by a teacher, filtering to only keep demos that pass your metric.
```python
from dspy.teleprompt import BootstrapFewShot

# A very small trainset is acceptable (DSPy is designed to start small).
trainset = devset

teleprompter = BootstrapFewShot(
    metric=em_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)
qa_optimized = teleprompter.compile(student=qa_cot, trainset=trainset)

optimized_score = evaluator(qa_optimized, metric=em_metric)
print("Optimized score:", optimized_score)
```
Here, we improve the original qa_cot question-answering module with DSPy’s BootstrapFewShot optimizer. We use the small trainset as a source of few-shot demonstrations, then compile an optimized version of the module with up to 2 bootstrapped demos and 2 labeled demos. Finally, we evaluate the new module with the same exact-match metric and print the optimized score to show whether performance improved over the baseline.
Retrieval-Augmented Generation solves a major pain point. Without RAG, LLMs can’t access your private or continuously changing knowledge unless you directly supply it at inference time. A typical end-to-end RAG pipeline consists of ingestion/chunking, embeddings, storage + retrieval, and final generation grounded on retrieved documents.
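The ingestion/chunking step can be as simple as a sliding character window. Here is a minimal, stdlib-only sketch (the function name and sizes are illustrative defaults, not part of DSPy); the resulting chunks would then be embedded and stored for retrieval:

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window reached the end of the text
        start += max_chars - overlap
    return chunks

chunks = chunk_text("word " * 120)  # ~600 characters of toy text
```

In practice you would usually chunk on sentence or token boundaries rather than raw characters, but the overlap idea carries over unchanged.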
In the following program, we define a typed signature (lists and ints), use JSONAdapter, and return citations as indices into retrieved passages.
```python
import dspy

# Configure LM with JSONAdapter so lists (like citations)
# are parsed reliably from model output.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())

# Minimal local corpus for demo; replace with your documents or a vector DB.
corpus = [
    "Linux divides memory into regions; on 32-bit systems highmem is not permanently mapped.",
    "Low memory is directly addressable by the kernel; high memory is mapped on demand.",
    "Unrelated passage about iPhone apps.",
]

# Embedder for dense retrieval.
embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=2)

class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField(desc="retrieved passages")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="final answer grounded in context")
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.respond = dspy.ChainOfThought(RagAnswer)

    def forward(self, question: str):
        # Retrieve top-k passages.
        retrieved = search(question)
        ctx = retrieved.passages
        # Generate answer and citations.
        pred = self.respond(context=ctx, question=question)
        # Lightweight validation of citation indices.
        citations = pred.citations or []
        pred.citations = [i for i in citations if 0 <= i < len(ctx)]
        # Return a structured prediction.
        return dspy.Prediction(
            context=ctx,
            answer=pred.answer,
            citations=pred.citations,
            reasoning=pred.reasoning,
        )

# Instantiate the RAG module.
rag = RAG()

# Run a demo question.
out = rag(question="What are high memory and low memory in Linux?")
print("Answer:")
print(out.answer)
print("\nCitations (indices into context):")
print(out.citations)
```
Here we retrieve information from a small knowledge base in order to answer a question. The language model is configured with JSONAdapter to reliably parse structured output (citation lists). An embedding-based retriever finds the most relevant passages from the corpus. A typed signature defines the structured RAG task with fields for context, question, answer, and citations. The RAG module uses ChainOfThought to produce a grounded answer from the retrieved passages. Lastly, the citation indices are validated before returning a structured prediction, and a demo query about Linux memory is run.
Here’s a small example of a composite metric. It checks if the label matches and whether the predicted answer was found in the retrieved context. It returns a float for evaluation and a boolean for bootstrapping.
```python
from dspy.evaluate import Evaluate

def grounded_answer_metric(example, pred, trace=None):
    # Case-insensitive substring match on the answer.
    answer_match = example.answer.lower() in pred.answer.lower()
    # Answer should appear in at least one retrieved passage.
    context_match = any(pred.answer.lower() in c.lower() for c in pred.context)
    if trace is None:
        # For evaluation: soft score between 0 and 1.
        return (answer_match + context_match) / 2.0
    # For bootstrapping / optimization: require both.
    return answer_match and context_match

devset = [
    dspy.Example(
        question="What is low memory in Linux?",
        answer="directly addressable by the kernel",
    ).with_inputs("question")
]

evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
print(evaluator(rag, metric=grounded_answer_metric))
```
This code computes a custom metric to score how well a DSPy RAG pipeline answers a question with grounded answers. grounded_answer_metric checks two things: 1) whether the predicted answer matches the expected answer, and 2) whether that answer can be grounded in the retrieved context passages. Then, Evaluate runs that metric on a small development set to validate whether your RAG pipeline returns grounded, correct answers before you use it for optimization or production.
Here we use DSPy’s MIPROv2 optimizer to improve the original RAG program against your custom grounding metric, then recompile the module with a small demo set and evaluate whether the optimized version performs better.
```python
from dspy.teleprompt import MIPROv2

# Set up the MIPROv2 optimizer with your custom metric.
tp = MIPROv2(
    metric=grounded_answer_metric,
    auto="light",  # or "medium" / "heavy"
    num_threads=4,
)

# Compile the original RAG module using the dev/train set.
rag_optimized = tp.compile(
    rag,
    trainset=devset,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)

# Re-evaluate the optimized RAG module.
print("Evaluation after MIPROv2 optimization:")
print(evaluator(rag_optimized, metric=grounded_answer_metric))
```
When you have tasks that require tool use (whether that’s doing calculations, calling internal APIs, fetching knowledge, or taking actions), DSPy provides dspy.ReAct, which implements the ReAct (“Reasoning and Acting”) paradigm: the model reasons, chooses which tool to call, observes the result, and repeats until it can output a final answer. ReAct generalizes over any signature and accepts either plain functions or dspy.Tool objects as tools.
The script below implements a small DSPy ReAct agent that answers questions by using tools as needed. It sets up an LLM, defines two tools - one that returns the current UTC time and another that multiplies numbers - and passes those tools to dspy.ReAct. The agent reasons about whether it should use a tool, calls one if needed, and then returns the final answer.
```python
import dspy
from datetime import datetime, timezone

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

def utc_now() -> str:
    """Return the current UTC time in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

# Create a ReAct agent that can use utc_now and multiply.
agent = dspy.ReAct(
    signature="question -> answer",
    tools=[utc_now, multiply],
    max_iters=6,
)

# Example queries.
print(agent(question="What time is it in UTC right now?"))
print(agent(question="What is 19.5 * 4.2?"))
```
Without guardrails and observability, agent loops can silently accumulate cost (repeated LLM calls, repeated tool calls) or hallucinate invalid actions. A reasonable set of guardrails includes capping iterations (max_iters), tightening tool schemas and permissions, and validating on realistic, traffic-like prompts before rollout.
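One lightweight way to implement the tool-side guardrails is a wrapper that enforces a call budget and validates arguments before the underlying tool runs. This is an illustrative pattern, not a DSPy API; the wrapped function can then be handed to dspy.ReAct like any other tool:

```python
def guarded_tool(fn, max_calls: int = 10, validate=None):
    """Wrap a tool with a call budget and optional argument validation."""
    state = {"calls": 0}

    def wrapper(*args, **kwargs):
        state["calls"] += 1
        if state["calls"] > max_calls:
            raise RuntimeError(f"{fn.__name__}: call budget of {max_calls} exceeded")
        if validate is not None and not validate(*args, **kwargs):
            raise ValueError(f"{fn.__name__}: rejected arguments {args} {kwargs}")
        return fn(*args, **kwargs)

    # Preserve name/docstring so the agent's tool description stays readable.
    wrapper.__name__ = fn.__name__
    wrapper.__doc__ = fn.__doc__
    return wrapper

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

safe_multiply = guarded_tool(
    multiply,
    max_calls=3,
    validate=lambda a, b: abs(a) < 1e6 and abs(b) < 1e6,
)
```

Raising exceptions (rather than returning error strings) keeps runaway loops visible in your traces instead of letting the agent quietly keep retrying.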
DSPy optimizers can optimize entire programs, including complex end-to-end multi-module systems (such as agents, retrieval, and extraction), as long as you specify a metric to improve. For many teams, a pattern that works well is to stabilize each module first, then compile the whole program against a single end-to-end metric.
Classification is an ideal DSPy use case because while success metrics (accuracy, F1) are straightforward, you can still take advantage of DSPy’s programmatic structure, typed outputs, and optimizers.
Here’s code that builds a simple DSPy text classifier for support tickets. It sets up the model, declares a signature with one input (ticket) and one constrained output (label), then calls dspy.Predict to classify the ticket as one of four types: billing, bug, feature, or security. In this example, the “I was charged twice” complaint is correctly classified as billing.
```python
import dspy
from typing import Literal

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

class TicketLabel(dspy.Signature):
    """Classify a support ticket into a fixed taxonomy."""
    ticket: str = dspy.InputField()
    label: Literal["billing", "bug", "feature", "security"] = dspy.OutputField()

clf = dspy.Predict(TicketLabel)
example = clf(ticket="I was charged twice for my subscription this month.")
print(example.label)
```
Metrics are ordinary Python functions. They should follow the signature (example, pred, trace=None); for complex outputs, metrics can use AI feedback via additional predictor calls.
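Here is a sketch of that AI-feedback pattern: a metric that tries exact match first and falls back to a judge. The judge is any callable you supply; in a real pipeline you might back it with something like dspy.Predict("question, gold, pred -> is_correct: bool") (hypothetical wiring), while a plain Python heuristic works for testing:

```python
from types import SimpleNamespace

def make_judged_metric(judge=None):
    """Build a metric: exact match first, then optionally defer to a judge.

    `judge` is any callable(question, gold, pred) -> bool.
    """
    def metric(example, pred, trace=None):
        if example.answer.strip().lower() == pred.answer.strip().lower():
            return 1.0
        if judge is None:
            return 0.0  # strict: no judge, no credit for paraphrases
        return 1.0 if judge(example.question, example.answer, pred.answer) else 0.0
    return metric

# A cheap stand-in judge: accept answers containing the gold string.
lenient = make_judged_metric(judge=lambda q, gold, pred: gold.lower() in pred.lower())
```

Because the metric is just a closure, you can swap the judge for a stronger (or cheaper) one without touching the rest of the evaluation code.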
The code below uses DSPy’s Evaluate utility to test a classifier, clf, on a small labeled dataset of support tickets. The trainset has three examples; each ticket’s text is labeled with the correct category (billing, bug, or feature). Calling .with_inputs("ticket") tells DSPy that the model should receive only the ticket text as input. The accuracy_metric function checks whether the classifier’s predicted label matches the true label, returning 1.0 if the prediction is correct and 0.0 otherwise. Evaluate runs clf on the dataset with 2 threads, displays progress while running, and print(evaluator(clf, metric=accuracy_metric)) prints the final result: the model’s accuracy on those examples.
```python
from dspy.evaluate import Evaluate

trainset = [
    dspy.Example(ticket="I was charged twice.", label="billing").with_inputs("ticket"),
    dspy.Example(ticket="The app crashes on launch.", label="bug").with_inputs("ticket"),
    dspy.Example(ticket="Please add export to CSV.", label="feature").with_inputs("ticket"),
]

def accuracy_metric(example, pred, trace=None):
    return float(example.label == pred.label)

evaluator = Evaluate(devset=trainset, num_threads=2, display_progress=True)
print(evaluator(clf, metric=accuracy_metric))
In production, people often ask for verification-style constraints (sometimes called assertion testing): the label must be one of a fixed set, the JSON must parse, citations must be in range.
dspy.Refine is a best-of-N refinement loop driven by a reward_fn and a threshold. It calls the module up to N times and returns the best prediction, generating feedback between attempts when needed. Here’s a real-world constraint-enforcement wrapper: retry until the output taxonomy is respected. Consider the following code:
```python
import dspy
from typing import Set

allowed: Set[str] = {"billing", "bug", "feature", "security"}

def label_is_valid(args, pred):
    # Reward 1.0 only when the predicted label is in the allowed taxonomy.
    return 1.0 if pred.label in allowed else 0.0

robust_clf = dspy.Refine(module=clf, N=3, reward_fn=label_is_valid, threshold=1.0)
print(robust_clf(ticket="Please add SSO support.").label)
```
This code wraps the original classifier with dspy.Refine, which lets DSPy retry up to 3 times and keep only outputs that pass reward_fn. The reward function ensures the predicted label is one of the allowed categories, and threshold=1.0 means only a fully valid label is accepted before returning the result.
DSPy now refers to these algorithms as optimizers (previously teleprompters). According to the optimizer documentation, an optimizer is an algorithm that tunes a DSPy program’s parameters (prompts and/or LM weights) to maximize your metrics using your program, metric, and training inputs. The training inputs are often a small set of examples.
This table covers four widely used optimizers: BootstrapFewShot, MIPROv2, and COPRO, plus BootstrapFewShotWithRandomSearch, which DSPy recommends once you have more data.
| Optimizer | What it does and when to use it | Data guidance and key config knobs |
|---|---|---|
| BootstrapFewShot | Tunes few-shot demos assembled from labeled and bootstrapped examples validated by the metric. It works well for fast wins on small datasets and is a strong first compile option. | Start here when you have around 10 examples. Knobs: max_labeled_demos, max_bootstrapped_demos, teacher_settings |
| BootstrapFewShotWithRandomSearch | Tunes few-shot demos like BootstrapFewShot, but tests multiple candidate demo sets and keeps the best one. It is better for a more robust few-shot selection while staying relatively simple. | Best when you have around 50 or more examples. Knobs: num_candidate_programs, plus the BootstrapFewShot knobs |
| COPRO | Tunes prompt instructions through iterative search, documented as coordinate ascent in the optimizer guide. It is useful when you want instruction tuning without focusing heavily on demos. | Usually needs a train set and a metric. Knobs: breadth, depth, init_temperature |
| MIPROv2 | Jointly tunes instructions and few-shot examples using Bayesian optimization. It is the strongest choice when you want higher-quality prompt optimization and have enough budget and data. | Best for longer runs, such as 40 or more trials, with around 200 or more examples to reduce overfitting risk. Knobs: auto (“light/medium”), num_threads, plus demo knobs in compile() |
Deployment should provide you with two things: (1) infrastructure to run your DSPy program (a stable runtime) and (2) reliable access to LLMs, along with whatever you need to run retrieval and add guardrails.
Deploy your DSPy service to a Virtual Machine (VM) or GPU instance if you want full control of everything in your stack (vector DB, embeddings, model runtime). Building a RAG application on GPU Droplets is covered in step-by-step detail with DigitalOcean’s RAG tutorials.
Use fully managed model access for simpler operations. The DigitalOcean Gradient platform describes serverless inference (no infrastructure management) and API access to models hosted by major vendors (OpenAI, Anthropic, etc), as well as managed scalability and security features for open-source models hosted directly in-platform.
Build agentic apps with managed agent features. DigitalOcean’s Gradient AI Platform quickstart describes fully managed agents with knowledge bases for retrieval-augmented generation, multi-agent routing, and guardrails.
Q1: What is a DSPy use case best suited for?
A DSPy use case is best suited for when: (a) you make repeated LM calls as part of a pipeline, (b) you can define an automatic metric for whether a call succeeded, and (c) you want a systematic way to iteratively improve prompts, demos, or weights over time as models, data, or requirements change.
Q2: How does DSPy ChainOfThought differ from the Predict module?
dspy.Predict is the basic module that maps inputs to outputs using an LM and a signature. dspy.ChainOfThought is a specialized module that explicitly “reasons step by step” by prepending a reasoning field to the signature, then predicting outputs—useful when intermediate reasoning improves correctness.
Q3: What is a DSPy teleprompter or optimizer?
With terminology recently updated, what were previously called “teleprompters” are now optimizers. An optimizer is an algorithm that takes your DSPy program (which includes its prompts + underlying LM weights) + your metric + training inputs, and tunes the program’s parameters to maximize that metric.
Q4: Can DSPy be used for RAG pipelines?
Yes! DSPy’s RAG tutorial walks through building a retriever module (e.g., embeddings-based top-K search) and composing it with a generation module inside a dspy.Module, so the whole RAG program can itself be evaluated and optimized.
Q5: What is the difference between BootstrapFewShot and MIPROv2?
BootstrapFewShot is about assembling few-shot demonstrations (labeled and bootstrapped) and evaluating them against the metric. MIPROv2 jointly optimizes both instructions + demonstrations using bootstrapped traces along with a Bayesian optimization search over the space of candidate instructions/demos. DSPy recommends starting with BootstrapFewShot when examples are scarce and graduating to MIPROv2 when you have enough data and budget to mitigate overfitting concerns.
Q6: How do DSPy assertions and metric functions work?
DSPy metrics are implemented as Python functions that accept (example, pred, trace=None) and return a score. During optimization, the metric can optionally use the trace to validate intermediate steps or enforce stricter rules. The legacy dspy.Assert / dspy.Suggest constructs are deprecated and no longer supported; today you can use dspy.Refine (or dspy.BestOfN) for guided self-correction and constraint-based refinement.
Q7: Does DSPy support structured output?
Yes! When defining your DSPy signatures, you can declare typed fields (including nested, non-primitive types). Adapters are responsible for formatting those prompts appropriately and parsing outputs. JSONAdapter prompts models to respond with JSON when possible for more robust structured parsing. ChatAdapter prompts include field markers to delimit typed fields, and can also attach a JSON schema when you use custom non-primitive field types.
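To see the idea behind the parsing step, here is a deliberately simplified, stdlib-only sketch (not DSPy’s actual implementation) of turning a JSON model reply into typed output fields, the way a JSON-style adapter conceptually does:

```python
import json

def parse_typed_output(raw: str, fields: dict[str, type]) -> dict:
    """Parse a JSON model reply and check each declared output field's type."""
    data = json.loads(raw)
    out = {}
    for name, typ in fields.items():
        if name not in data:
            raise ValueError(f"missing output field: {name}")
        if not isinstance(data[name], typ):
            raise TypeError(f"field {name!r} should be {typ.__name__}")
        out[name] = data[name]
    return out

reply = '{"answer": "low memory is kernel-addressable", "citations": [0, 1]}'
parsed = parse_typed_output(reply, {"answer": str, "citations": list})
```

DSPy’s real adapters handle nested and custom types via schemas, but the contract is the same: declared fields in, validated typed values out.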
Q8: What LLM providers does DSPy support?
DSPy supports dozens of LLM providers through its dspy.LM wrapper around LiteLLM, using simple {provider}/{model} syntax like openai/gpt-4o-mini or anthropic/claude-3-5-sonnet-20240620.
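As a small configuration sketch of that provider syntax (the model identifiers are examples; availability depends on your accounts and API keys):

```python
import dspy

# Default LM for the whole process.
default_lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=default_lm)

# Temporarily route one block of calls to a different provider.
claude = dspy.LM("anthropic/claude-3-5-sonnet-20240620")
with dspy.context(lm=claude):
    pass  # modules invoked here run against the Anthropic model
```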
DSPy represents a meaningful shift in how modern LLM systems are built. Instead of viewing prompts as static strings, DSPy treats them as components of a larger program composed of signatures, modules, metrics, and control flow. This approach really shines when you graduate from simple completions to authoring tangible application patterns such as ChainOfThought QA, RAG with structured outputs, ReAct-based tool use, and classification pipelines with integrated quality checks.
The larger point here is that DSPy isn’t simply a playground for prompt engineering. DSPy is a practical foundation for building, validating, iterating, and scaling your LLM systems with more rigor. As engineering teams require better guarantees around reliability, observability, and control over agentic behavior, DSPy will be ready to take on a larger role in production AI stacks. The future will belong to those engineers who build LLM workflows that are modular, testable, and optimization-driven from the start.