Featured AI Products
Compute
Build, deploy, and scale cloud compute resources
Containers and Images
Safely store and manage containers and backups
Managed Databases
Fully managed resources running popular database engines
Management and Dev Tools
Control infrastructure and gather insights
Networking
Secure and control traffic to apps
Security
Help protect your account and resources with these security features
Storage
Store and access any amount of data reliably in the cloud
Browse all products
AI/ML
CMS
Data and IoT
Developer Tools
Gaming and Media
Hosting
Security and Networking
Startups and SMBs
Web and App Platforms
See all solutions
Community
Documentation
Developer Tools
Get Involved
Utilities and Help
Become a Partner
Marketplace
Pricing

- Community
- DigitalOcean
- Community
- DigitalOcean

DSPy Use Cases: Build Optimized LLM Pipelines

Published on April 8, 2026

AI/ML

Write for DO

By Adrien Payong and Shaoni Mukherjee

DSPy Use Cases: Build Optimized LLM Pipelines

LLM application development has grown past simple prompt engineering. As systems become more complex, you need a stronger mental model to structure reasoning, retrieval, tool use, evaluation, and optimization within one maintainable workflow. DSPy was designed to help with that. Rather than manually tuning lengthy prompt templates, you define signatures, compose modules, and then optimize the entire program against a metric. This makes LLM development feel less like prompt trial and error and more like building a measurable, improvable software pipeline.

This article covers practical DSPy use cases you will encounter when building production-quality applications. We dive into how DSPy enables question answering, retrieval-augmented generation, multi-step reasoning agents, text classification, and much more. Along the way, you’ll learn about DSPy’s approach to metric evaluation, assertion-style constraints, and choosing an optimizer. By the end, you should have a clearer view of how DSPy can help you move from isolated prompts to scalable, structured, production-ready LLM pipelines.

Key Takeaways

DSPy turns LLM development into a programmable workflow by using signatures, modules, metrics, and optimizers instead of relying on manual prompt tweaking alone.
It is especially useful for production-style pipelines that combine routing, retrieval, reasoning, tool use, structured output, and evaluation inside one maintainable system.
Core DSPy modules such as Predict, ChainOfThought, ReAct, and Module let you build practical applications like QA systems, RAG pipelines, multi-step agents, and classifiers.
DSPy optimizers such as BootstrapFewShot, MIPROv2, and COPRO help improve program quality automatically by tuning instructions and demonstrations against a metric.
For reliable deployment, DSPy works best when paired with evaluation, grounding checks, typed outputs, constraint enforcement, and stable infrastructure such as DigitalOcean for hosting models, retrieval, and agent pipelines.

What Is DSPy and Why Use It for LLM Pipelines

DSPy’s design philosophy is to program declarative LM programs (signatures, modules, and control flow), then compile them towards a metric, rather than manually engineering long prompt templates.

The authors of DSPy reframe this as compiling declarative LM calls into self-improving pipelines, as in the original paper. The compile step searches for better instructions, few-shot demonstrations, (in some modes) finetuned weights. Doing DSPy in practice tends to look more like “lightweight ML” than prompt engineering:

Define your interface: a DSPy prompt signature (inputs/outputs + types).
Implement the pipeline logic as modules (DSPy Predict module, DSPy ChainOfThought module, DSPy ReAct module, etc) + Python control flow with dspy.Module.
Define a metric function to measure quality (often calling an LLM for metric evaluation, sometimes via a DSPy “judge” program).
Run an optimizer (previously known as “teleprompters”) such as DSPy BootstrapFewShot optimizer or MIPROv2 optimizer to DSPy to improve your score.

Where DSPy fits versus LangChain and LlamaIndex

DSPy is often compared to orchestration frameworks, such as LangChain, and data-centric RAG frameworks, like LlamaIndex. One helpful way to think about their differences is:

LangChain centers around composing chains together, agents, tools, and integrations (extensive tooling for “wiring things together”).
LlamaIndex centers around data ingestion, building indexes, and querying LLM over your data (it’s built around RAG-style retrievers + query engines).
DSPy emphasizes programmatic optimization of the LM behavior within your stack: signatures, modules, metrics, and optimizers that can automatically improve your prompts/demos throughout the system.

Many real-world production stacks combine these approaches: use LlamaIndex (or another retriever) to power ingestion and retrieval, then utilize DSPy to wrap the generation and routing logic to optimize prompts and typed outputs.

DSPy core building blocks you will use in this tutorial

Signatures describe what the model should do: input fields, output fields, and their semantic names. Optionally specify types and instructions. Field names are important because they indicate the role (“question” vs “answer”, “context” vs “summary”, etc).

Modules define how to solve it. Key ones:

dspy.Predict: The basic building block that maps inputs → outputs using an LM. Configured by a signature.
dspy.ChainOfThought: A predictor that reasons step-by-step. Outputs are the same as your signature, but with an additional “reasoning” field prepended.
dspy.ReAct: An iterative “Reasoning and Acting” tool-using agent loop where the model chooses tools and produces final outputs.
dspy.Module: the base class for multi-step programs where you implement forward() and compose submodules.

Adapters determine how “structured” your LM I/O is. ChatAdapter is DSPy’s default field-marker format. JSONAdapter forces models that support structured output formatting to emit JSON so that you can reliably parse typed outputs.

Unified end-to-end pipeline example

This code implements a small but realistic “router” program which brings together Predict, RAG + ChainOfThought, and ReAct end-to-end flow:

# pip install -U dspy  (or: pip install -U dspy-ai)
import dspy
from typing import Literal
# 1) Configure the language model once near the top of your app.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())
# 2) A small intent classifier (Predict) to route requests.
class Route(dspy.Signature):
    """Route the user request to the best handler."""
    query: str = dspy.InputField()
    intent: Literal["rag_qa", "tool_agent", "direct_qa"] = dspy.OutputField()

router = dspy.Predict(Route)
# 3) A RAG-style answerer (we'll implement it fully later).
class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")

rag_answerer = dspy.ChainOfThought(RagAnswer)
# 4) A ReAct agent with tools (we'll implement tools later).
def add(a: float, b: float) -> float:
    return a + b

agent = dspy.ReAct(signature="question -> answer", tools=[add], max_iters=8)
# 5) Tie it together as a program.
class UnifiedAssistant(dspy.Module):
    def forward(self, query: str, retrieved_passages: list[str] | None = None):
        route = router(query=query).intent
        if route == "rag_qa":
            ctx = retrieved_passages or []
            return rag_answerer(context=ctx, question=query)
        if route == "tool_agent":
            return agent(question=query)
        # default: direct QA, still using a CoT-style module for robustness
        direct = dspy.ChainOfThought("question -> answer")
        return direct(question=query)
assistant = UnifiedAssistant()

The above script builds a lightweight DSPy assistant capable of serving multiple types of user queries within a single workflow. After setting up an LLM and JSON adapter, it creates a Predict router that classifies which of three intents a new query belongs to: RAG-based question answering, tool-based agent reasoning, or direct question answering. Queries that require external knowledge are routed to a ChainOfThought RAG module that answers the question given retrieved passages, and returns citations. Queries that require tool usage are routed to a ReAct agent coupled with an add tool; all other queries fall back to a direct ChainOfThought answer module. This program demonstrates how DSPy can orchestrate routing, retrieval, reasoning, and tool use within a single modular assistant.

Use Case 1: Question Answering with ChainOfThought

By default, the DSPy ChainOfThought module is designed towards problems where providing intermediate reasoning improves correctness. Let’s consider the following code:

import os
import dspy
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match
# Configure once per process.
# (OPENAI_API_KEY must be set in your environment.)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# A minimal CoT QA module.
qa_cot = dspy.ChainOfThought("question -> answer")
# A tiny devset (start small, then grow).
devset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
]
# Metric: exact match on the final answer field.
def em_metric(example, pred, trace=None):
    return answer_exact_match(example, pred)
evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
baseline = evaluator(qa_cot, metric=em_metric)
print("Baseline score:", baseline)

This program set up a small DSPy question-answering evaluation pipeline. It initializes DSPy with the openai/gpt-4o-mini model, then defines a simple ChainOfThought module which accepts a question and generates an answer. The program defines a small development dataset consisting of two example QA pairs and builds an exact-match metric for evaluating the predicted answer against the expected ones. It then launches DSPy’s Evaluate utility to apply that module to each question in the dataset in parallel. It computes and outputs the baseline score, indicating how accurately the unoptimized Chain-of-Thought QA module answered those sample questions.

Improving QA with BootstrapFewShot

If you only have a few examples, BootstrapFewShot is a good starting point. This optimizer composes demos from labeled examples + bootstrapped demos created by a teacher, filtering to only keep demos that pass your metric.

from dspy.teleprompt import BootstrapFewShot
# A very small trainset is acceptable (DSPy is designed to start small).
trainset = devset
teleprompter = BootstrapFewShot(
    metric=em_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)
qa_optimized = teleprompter.compile(student=qa_cot, trainset=trainset)
optimized_score = evaluator(qa_optimized, metric=em_metric)
print("Optimized score:", optimized_score)

Here, we improved the original qa_cot question-answering module with DSPy’s BootstrapFewShot optimizer. We use the small trainset as learning examples for better few-shot demonstrations. Then we compiled an optimized version of the model using up to 2 bootstrapped demos + 2 labeled demos. Finally, we run an evaluation on the new model with the same exact-match metric and print out the optimized score to show whether the performance improved over the baseline.

Use Case 2: Retrieval Augmented Generation RAG Pipeline

Retrieval-Augmented Generation solves a major pain point. Without RAG, LLMs can’t access your private or continuously changing knowledge unless you directly supply it at inference time. A typical end-to-end RAG pipeline consists of ingestion/chunking, embeddings, storage + retrieval, and final generation grounded on retrieved documents.

Step-by-step RAG with typed outputs and structured JSON

In the following program, we define a typed signature (lists and ints), use JSONAdapter, and return citations as indices into retrieved passages.

import dspy

# Configure LM with JSONAdapter so lists (like citations)
# are parsed reliably from model output.
lm = dspy.LM("openai/gpt-4o-mini")  # reads OPENAI_API_KEY from env
dspy.configure(lm=lm, adapter=dspy.JSONAdapter())

# Minimal local corpus for demo; replace with your documents or a vector DB.
corpus = [
    "Linux divides memory into regions; on 32-bit systems highmem is not permanently mapped.",
    "Low memory is directly addressable by the kernel; high memory is mapped on demand.",
    "Unrelated passage about iPhone apps.",
]

# Embedder for dense retrieval.
embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=2)


class RagAnswer(dspy.Signature):
    """Answer using only the provided context passages."""
    context: list[str] = dspy.InputField(desc="retrieved passages")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="final answer grounded in context")
    citations: list[int] = dspy.OutputField(desc="indices of context passages used")


class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.respond = dspy.ChainOfThought(RagAnswer)

    def forward(self, question: str):
        # Retrieve top‑k passages.
        retrieved = search(question)
        ctx = retrieved.passages

        # Generate answer and citations.
        pred = self.respond(context=ctx, question=question)

        # Lightweight validation of citations indices.
        citations = pred.citations or []
        pred.citations = [i for i in citations if 0 <= i < len(ctx)]

        # Return a structured prediction.
        return dspy.Prediction(
            context=ctx,
            answer=pred.answer,
            citations=pred.citations,
            reasoning=pred.reasoning,
        )

# Instantiate the RAG module.
rag = RAG()

# Run a demo question.
out = rag(question="What are high memory and low memory in Linux?")

print("Answer:")
print(out.answer)
print("\nCitations (indices into context):")
print(out.citations)

Here we retrieve information from a small knowledge base in order to answer a question. The language model is configured with JSONAdapter to properly parse structured output (citation lists). An embedding-based retriever is created to find the most relevant passages from the corpus. Typed Signature defines a structured RAG task with fields for context, question, answer, and citations. The RAG module follows ChainOfThought to produce a grounded answer from the retrieved passages. Lastly, the citation indices are checked for validity before returning structured prediction, and a demo query is run about Linux memory.

Add a RAG metric that checks both correctness and grounding

Here’s a small example of a composite metric. It checks if the label matches and whether the predicted answer was found in the retrieved context. It returns a float for evaluation and a boolean for bootstrapping.

from dspy.evaluate import Evaluate
def grounded_answer_metric(example, pred, trace=None):
    # Case‑insensitive exact or near‑exact match on answer.
    answer_match = example.answer.lower() in pred.answer.lower()
    # Answer should appear in at least one retrieved passage.
    context_match = any(pred.answer.lower() in c.lower() for c in pred.context)
    if trace is None:
        # For evaluation: soft score between 0 and 1.
        return (answer_match + context_match) / 2.0
    # For bootstrapping / optimization: require both.
    return answer_match and context_match

devset = [
    dspy.Example(
        question="What is low memory in Linux?",
        answer="directly addressable by the kernel",
    ).with_inputs("question")
]

evaluator = Evaluate(devset=devset, num_threads=2, display_progress=True)
print(evaluator(rag, metric=grounded_answer_metric))

This code computes a custom metric to score how well a DSPy RAG pipeline is answering a question with grounded answers. grounded_answer_metric checks two things: 1) whether the predicted matches the expected answer, and 2) whether that answer can be grounded in the retrieved context passages. Then, Evaluate runs that metric on a small development set to validate whether your RAG pipeline returns grounded, correct answers before using it for optimization or production.

Optimize the RAG program with MIPROv2

Here we use DSPy’s MIPROv2 optimizer to improve the original RAG program against your custom grounding metric, then recompile the module with a small demo set and evaluate whether the optimized version performs better.

from dspy.teleprompt import MIPROv2
# Set up MIPROv2 optimizer with your custom metric.
tp = MIPROv2(
    metric=grounded_answer_metric,
    auto="light",          # or "medium" / "heavy"
    num_threads=4,
)
# Compile the original RAG module using the dev/train set.
rag_optimized = tp.compile(
    rag,
    trainset=devset,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
)
# Re‑evaluate the optimized RAG module.
print("Evaluation after MIPROv2 optimization:")
print(evaluator(rag_optimized, metric=grounded_answer_metric))

Use Case 3: Multi-Step Reasoning Agent with ReAct

When you have tasks that require tool use (whether that’s doing calculations, calling internal APIs, fetching knowledge, or taking actions), DSPy provides dspy.ReAct, which implements the ReAct (“Reasoning and Acting”) paradigm: the model reasons, chooses which tool to call, observes the results, and repeats until it can output final answers. ReAct can be generalized to function over any signature. It can accept either functions or dspy.Tool objects as tools.

A minimal ReAct agent with typed tools

The script below implements a small DSPy ReAct agent that answers questions by utilizing tools as needed. It sets up an LLM, defines two tools - one that returns the current UTC time and another that multiplies numbers - and passes those tools to dspy.ReAct. The agent will reason if it should use a tool, call it if needed, and then return the final answer.

import dspy
from datetime import datetime, timezone
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()

def multiply(a: float, b: float) -> float:
    return a * b

# Create a ReAct agent that can use utc_now and multiply.
agent = dspy.ReAct(
    signature="question -> answer",
    tools=[utc_now, multiply],
    max_iters=6,
)
# Example queries.
print(agent(question="What time is it in UTC right now?"))
print(agent(question="What is 19.5 * 4.2?"))

Production concern: agent reliability, costs, and guardrails

Agent loops can silently accumulate high costs (repeated LLM calls, repeated tool calls) or hallucinate invalid actions without guardrails and observability. A reasonable set of guardrails includes cap iterations (max_iters), tightening tool schemas and permissions, and validating on real traffic-like prompts before rollout.

Optimize a ReAct agent with DSPy optimizers

DSPy optimizers can optimize entire programs, including end-to-end complex multi-module systems (such as agents, retrieval, and extraction), as long as you specify a metric to improve. For many teams, a pattern that works well is:

Bootstrap a few demos with BootstrapFewShot(cheap);
Then, run MIPROv2 in auto=“light” or auto=“medium” depending on budget.

Use Case 4: Text Classification with LLM Metric Evaluation

Classification is an ideal DSPy use case because while success metrics (accuracy, F1) are straightforward, you can still take advantage of DSPy’s programmatic structure, typed outputs, and optimizers.

Build a typed classifier with Predict

Here’s code that builds a simple DSPy text classifier for support tickets. It sets up the model, declares a signature with one input (ticket) and one constrained output (label), then calls dspy.Predict to classify the ticket as one of four types: billing, bug, feature, or security. In this example, the “I was charged twice” complaint is correctly classified as billing.

import dspy
from typing import Literal
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())
class TicketLabel(dspy.Signature):
    """Classify a support ticket into a fixed taxonomy."""
    ticket: str = dspy.InputField()
    label: Literal["billing", "bug", "feature", "security"] = dspy.OutputField()
clf = dspy.Predict(TicketLabel)
example = clf(ticket="I was charged twice for my subscription this month.")
print(example.label)

Evaluate with a metric (and optionally build an LLM-judge metric)

Metrics are ordinary Python functions. They should follow the signature(example, pred, trace=None); for complex outputs, metrics can use AI feedback via additional predictor calls.

The code below uses DSPy’s Evaluate utility to test a classifier, clf, on a small labeled dataset of support tickets. The trainset has three examples. For each example, the text of a ticket is labeled with the correct category (billing, bug, or feature). Passing .with_inputs(“ticket”) specifies to DSPy that the model should only receive the ticket text as input. The accuracy_metric function checks if the classifier’s predicted label matches the true label. It returns 1.0 if the prediction is correct and 0.0 otherwise. Evaluate runs clf on the dataset with 2 threads, displays progress while running, and print(evaluator(clf, metric=accuracy_metric)) prints the final result, which is usually the accuracy of the model on those examples.

from dspy.evaluate import Evaluate
trainset = [
    dspy.Example(ticket="I was charged twice.", label="billing").with_inputs("ticket"),
    dspy.Example(ticket="The app crashes on launch.", label="bug").with_inputs("ticket"),
    dspy.Example(ticket="Please add export to CSV.", label="feature").with_inputs("ticket"),
]
def accuracy_metric(example, pred, trace=None):
    return float(example.label == pred.label)
evaluator = Evaluate(devset=trainset, num_threads=2, display_progress=True)
print(evaluator(clf, metric=accuracy_metric))

Assertion testing and constraint enforcement in modern DSPy

In production, people often ask for “verification” operations: (“assertion testing”; the label must be one of X; JSON must parse; citations must be in range).

dspy.Refine was purpose-built to be a best-of-N refinement loop with reward_fn and threshold. It repeatedly calls the module N times and returns the best prediction, generating feedback between attempts if necessary. Here’s a real-world “constraint enforcement” wrapper: retry until output taxonomy is respected. Let’s consider the following code:

import dspy
from typing import Set
allowed: Set[str] = {"billing", "bug", "feature", "security"}
def label_is_valid(args, pred):
    return 1.0 if pred.label in allowed else 0.0
robust_clf = dspy.Refine(module=clf, N=3, reward_fn=label_is_valid, threshold=1.0)
print(robust_clf(ticket="Please add SSO support.").label)

This code wraps the original classifier with dspy.Refine, which allows DSPy to retry up to 3 times and retain only outputs that passed reward_fn. The reward function ensures the predicted label is one of our allowed categories, and the threshold=1.0 means only a fully valid label will be accepted before returning the result.

Choosing the Right DSPy Optimizer

DSPy now refers to these algorithms as optimizers (previously teleprompters). According to the optimizer documentation, an optimizer is an algorithm that tunes a DSPy program’s parameters (prompts and/or LM weights) to maximize your metrics using your program, metric, and training inputs. The training inputs are often a small set of examples.

Practical decision criteria

This table lists the 3 optimizers your brief prioritizes—BootstrapFewShot, MIPROv2, and COPRO—as well as BootstrapFewShotWithRandomSearch, which DSPy recommends after you have more data.

Optimizer	What it does and when to use it	Data guidance and key config knobs
BootstrapFewShot	Tunes few-shot demos assembled from labeled and bootstrapped examples validated by the metric. It works well for fast wins on small datasets and is a strong first compile option.	Start here when you have around 10 examples. Knobs: `max_labeled_demos`, `max_bootstrapped_demos`, `teacher_settings`
BootstrapFewShotWithRandomSearch	Tunes few-shot demos like BootstrapFewShot, but tests multiple candidate demo sets and keeps the best one. It is better for a more robust few-shot selection while staying relatively simple.	Best when you have around 50 or more examples. Knobs: `num_candidate_programs`, plus the BootstrapFewShot knobs
COPRO	Tunes prompt instructions through iterative search, documented as coordinate ascent in the optimizer guide. It is useful when you want instruction tuning without focusing heavily on demos.	Usually needs a train set and a metric. Knobs: `breadth`, `depth`, `init_temperature`
MIPROv2	Jointly tunes instructions and few-shot examples using Bayesian optimization. It is the strongest choice when you want higher-quality prompt optimization and have enough budget and data.	Best for longer runs, such as 40 or more trials, with around 200 or more examples to reduce overfitting risk. Knobs: `auto` (“light/medium”), `num_threads`, plus demo knobs in `compile()`

Running DSPy on DigitalOcean

Deployment should provide you with two things: (1) infrastructure to run your DSPy program (stable runtime) and (2) access to LLMs you can reliably call to run retrieval and add guardrails.

Deployment patterns that map well to DSPy pipelines

Deploy your DSPy service to a Virtual Machine (VM) or GPU instance if you want full control of everything in your stack (vector DB, embeddings, model runtime). Building a RAG application on GPU Droplets is covered in step-by-step detail with DigitalOcean’s RAG tutorials.

Use a fully managed model access for simpler operations. The DigitalOcean Gradient platform describes serverless inference (no infrastructure management) and API access to models hosted by major vendors (OpenAI, Anthropic, etc) as well as managed scalability and security features for open-source models hosted directly in-platform.

Build agentic apps with managed agent features. DigitalOcean’s Gradient AI Platform quickstart describes fully managed agents with knowledge bases for retrieval-augmented generation, multi-agent routing, and guardrails.

FAQ SECTION

Q1: What is a DSPy use case best suited for?

A DSPy use case is best suited for when: (a) you will be making repeated LM calls as part of a pipeline, (b) you can define some automatic metric for whether a call was successful, and © you’d like a systematic approach to iteratively improve prompts/demos/weights over time as models, data, or requirements change.

Q2: How does DSPy ChainOfThought differ from the Predict module?

dspy.Predict is the basic module that maps inputs to outputs using an LM and a signature. dspy.ChainOfThought is a specialized module that explicitly “reasons step by step” by prepending a reasoning field to the signature, then predicting outputs—useful when intermediate reasoning improves correctness.

Q3: What is a DSPy teleprompter or optimizer?

With terminology recently updated, what were previously called “teleprompters” are now optimizers. An optimizer is an algorithm that takes your DSPy program (which includes its prompts + underlying LM weights) + your metric + training inputs, and tunes the program’s parameters to maximize that metric.

Q4: Can DSPy be used for RAG pipelines?

Yes! DSPy’s RAG tutorial walks through building a retriever module (e.g., embeddings-based top-K search) and then composes it with a generation module within DSPy*.Module* so the whole RAG program itself can be evaluated and optimized.

Q5: What is the difference between BootstrapFewShot and MIPROv2?

BootstrapFewShot is about assembling few-shot demonstrations (labeled and bootstrapped) and evaluating them against the metric. MIPROv2 jointly optimizes both instructions + demonstrations using bootstrapped traces along with a Bayesian optimization search over the space of candidate instructions/demos. DSPy recommends starting with BootstrapFewShot when examples are scarce and graduating to MIPROv2 when you have enough data and budget to mitigate overfitting concerns.

Q6: How do DSPy assertions and metric functions work?

DSPy metrics are implemented as Python functions that accept (example, pred, trace=None) and return a score. During optimization, the metric can optionally use the trace to validate intermediate steps or enforce stricter rules. Legacy dspy.Assert / dspy.Suggest constructs are deprecated/not supported in DSPy, but you can use dspy.Refine (or dspy.Suggest) today for guided self-correction and constraint-based refinements.

Q7: Does DSPy support structured output?

Yes! When defining your DSPy signatures, you can declare typed fields (including nested, non-primitive types). Adapters are responsible for formatting those prompts appropriately and parsing outputs. JSONAdapter will prompt models to “respond with JSON” when possible for more robust structured parsing. ChatAdapter prompts include field markers to delimitate typed fields, and also support adding a JSON schema if using custom non-primitive field types.

Q8: What LLM providers does DSPy support?

DSPy supports dozens of LLM providers through its dspy.LM wrapper around LiteLLM, using simple {provider}/{model} syntax like openai/gpt-4o-mini or anthropic/claude-3-5-sonnet-20240620.

Conclusion

SPy represents a meaningful shift in how modern LLM systems are built. Instead of viewing prompts as static strings, DSPy treats them as components of a larger program composed of signatures, modules, metrics, and control flow. This approach really shines when you graduate from simple completions to authoring tangible application patterns such as ChainOfThought QA, RAG with structured outputs, ReAct-based tool use, and classification pipelines with integrated quality checks.

The larger point here is that DSPy isn’t simply a playground for prompt engineering. DSPy is a practical foundation for building, validating, iterating, and scaling your LLM systems with more rigor. As engineering teams require better guarantees around reliability, observability, and control over agentic behavior, DSPy will be ready to take on a larger role in production AI stacks. The future will belong to those engineers who build LLM workflows that are modular, testable, and optimization-driven from the start.

References

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products

About the author(s)

Adrien Payong

Author

AI consultant and technical writer

See author profile

I am a skilled AI consultant and technical writer with over four years of experience. I have a master’s degree in AI and have written innovative articles that provide developers and researchers with actionable insights. As a thought leader, I specialize in simplifying complex AI concepts through practical content, positioning myself as a trusted voice in the tech community.

See author profile

Shaoni Mukherjee

Editor

AI Technical Writer

See author profile

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.

Category:

Tags:

Still looking for an answer?

Ask a question Search for more help

Was this helpful?

This textbox defaults to using Markdown to format your answer.

You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!

This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike 4.0 International License.

The developer cloud

Scale up as you grow — whether you're running one virtual machine or ten thousand.

View all products

Start building today

From GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.

Report this

DSPy Use Cases: Build Optimized LLM Pipelines

Key Takeaways

What Is DSPy and Why Use It for LLM Pipelines

Where DSPy fits versus LangChain and LlamaIndex

DSPy core building blocks you will use in this tutorial

Unified end-to-end pipeline example

Use Case 1: Question Answering with ChainOfThought

Improving QA with BootstrapFewShot

Use Case 2: Retrieval Augmented Generation RAG Pipeline

Step-by-step RAG with typed outputs and structured JSON

Add a RAG metric that checks both correctness and grounding

Optimize the RAG program with MIPROv2

Use Case 3: Multi-Step Reasoning Agent with ReAct

A minimal ReAct agent with typed tools

Production concern: agent reliability, costs, and guardrails

Optimize a ReAct agent with DSPy optimizers

Use Case 4: Text Classification with LLM Metric Evaluation

Build a typed classifier with Predict

Evaluate with a metric (and optionally build an LLM-judge metric)

Assertion testing and constraint enforcement in modern DSPy

Choosing the Right DSPy Optimizer

Practical decision criteria

Running DSPy on DigitalOcean

Deployment patterns that map well to DSPy pipelines

FAQ SECTION

Conclusion

References

About the author(s)

Still looking for an answer?

Join the Tech Talk

Become a contributor for community

DigitalOcean Documentation

Resources for startups and AI-native businesses

The developer cloud

Start building today