DSPy Expert Guide
DSPy is a declarative framework for programming language models. Instead of hand-writing prompts, you define what your program should do (via Signatures and Modules), and DSPy figures out how to prompt the LM to do it — including automatic optimization.
Core Mental Model
Signature → defines I/O schema (what to compute)
Module → implements a reasoning strategy (how to compute)
Optimizer → tunes prompts/weights automatically (how to improve)
Evaluate → measures quality (how to measure)
1. Language Model Setup
DSPy uses LiteLLM under the hood, so any provider LiteLLM supports will work.
import dspy
# OpenAI
lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_KEY')
# Anthropic
lm = dspy.LM('anthropic/claude-sonnet-4-5-20250929', api_key='YOUR_KEY')
# Google Gemini
lm = dspy.LM('gemini/gemini-2.5-pro-preview-03-25', api_key='YOUR_KEY')
# Ollama (local)
lm = dspy.LM('ollama_chat/llama3', api_base='http://localhost:11434')
# Custom OpenAI-compatible endpoint
lm = dspy.LM('openai/your-model', api_key='KEY', api_base='YOUR_URL')
# Configure globally
dspy.configure(lm=lm)
# Multiple LMs — use context managers for per-call override
with dspy.context(lm=dspy.LM('openai/gpt-4o')):
    result = my_module(question="...")
Key LM parameters:
Key LM parameters:
- `temperature` — sampling temperature (default: 0.0 for deterministic)
- `max_tokens` — max output tokens
- `cache=True` — enable response caching (default True)
- `num_retries=3` — retries on transient failures
- `model_type` — `"chat"`, `"text"`, or `"responses"` (use `"responses"` for OpenAI reasoning models like o3/o4)
- `use_developer_role=True` — for OpenAI reasoning models that require developer system prompts
- `finetuning_model` — specify a separate model name for fine-tuning (vs inference)
- `callbacks=[...]` — per-LM callback hooks
Reasoning models (o3, o4-mini, etc.):
lm = dspy.LM('openai/o3-mini', model_type='responses', temperature=1.0, max_tokens=16000)
dspy.configure(lm=lm, adapter=dspy.TwoStepAdapter(dspy.LM('openai/gpt-4o-mini')))
dspy.configure full options:
dspy.configure(
lm=lm,
adapter=dspy.ChatAdapter(), # global output adapter
callbacks=[...], # global callbacks
track_usage=True, # enable token usage tracking per prediction
allow_tool_async_sync_conversion=True, # allow async tools in sync context
experimental=True, # enable experimental features (BetterTogether, BootstrapFinetune)
max_errors=10, # stop Evaluate after N errors
)
Token usage tracking (requires track_usage=True):
dspy.configure(lm=dspy.LM('openai/gpt-4o-mini', cache=False), track_usage=True)
result = my_program(question="What is DSPy?")
print(result.get_lm_usage()) # {'openai/gpt-4o-mini': {'prompt_tokens': 120, 'completion_tokens': 45}}
2. Signatures
Signatures define the input/output schema of an LM call.
Inline (string) signatures
# Simple string: "inputs -> outputs"
predict = dspy.Predict("question -> answer")
cot = dspy.ChainOfThought("context, question -> answer")
Class-based signatures (recommended for typed, documented fields)
from typing import Literal

class Classify(dspy.Signature):
    """Classify sentiment of a product review."""
    review: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="confidence score 0-1")

class RAGAnswer(dspy.Signature):
    """Answer a question given retrieved context."""
    context: list[str] = dspy.InputField(desc="retrieved passages")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="concise factual answer")
    citations: list[int] = dspy.OutputField(desc="indices of supporting passages")
Signature manipulation
# Extend with new fields
ExtendedSig = MySignature.append("new_output", dspy.OutputField(desc="..."), type_=str)
# Update field descriptions
UpdatedSig = MySignature.with_updated_fields("answer", desc="detailed explanation")
# Add instructions
InstructedSig = MySignature.with_instructions("Always respond in Spanish.")
Supported field types: str, int, float, bool, list[T], dict[K,V], Optional[T], Union[T,U], Literal[...], dspy.Image, dspy.Audio, dspy.History, dspy.Code, dspy.File, custom Pydantic models.
dspy.Code — typed code output:
class CodeSig(dspy.Signature):
    task: str = dspy.InputField()
    solution: dspy.Code["python"] = dspy.OutputField()  # language-typed code
predictor = dspy.Predict(CodeSig)
result = predictor(task="Write a binary search function")
# result.solution is a Code object with the generated python code
dspy.File — file data in pipelines (3.1.0+):
class ProcessFile(dspy.Signature):
    file: dspy.File = dspy.InputField()
    summary: str = dspy.OutputField()
dspy.Reasoning — capture native reasoning from reasoning models:
# dspy.Reasoning is a type for capturing the native internal chain-of-thought
# from reasoning models (o3, o4-mini, DeepSeek-R1).
# Use it as an output field type to explicitly request reasoning capture.
class ReasonedAnswer(dspy.Signature):
    question: str = dspy.InputField()
    reasoning: dspy.Reasoning = dspy.OutputField()  # captures native CoT
    answer: str = dspy.OutputField()
# NOTE: In DSPy 3.1.3, the automatic injection of dspy.Reasoning into
# ChainOfThought for reasoning models was reverted. You must explicitly
# include dspy.Reasoning in your Signature if you want to capture it.
# For non-reasoning models, dspy.Reasoning behaves like a regular str field.
3. Built-in Modules
All modules inherit from dspy.Module. Use them directly or compose them.
| Module | Description | Typical Use |
|---|---|---|
| `dspy.Predict` | Single LM call | Simple extraction, classification |
| `dspy.ChainOfThought` | CoT reasoning | Multi-step reasoning, explanation |
| `dspy.ProgramOfThought` | Code-based reasoning (needs Deno) | Math, symbolic computation |
| `dspy.ReAct` | Tool-use agent loop | Search, APIs, multi-tool agents |
| `dspy.CodeAct` | Python code execution (pure fns only) | Complex computations via code |
| `dspy.RLM` | Recursive LM — explores large contexts via REPL | Long documents, complex analysis (3.1.1+) |
| `dspy.MultiChainComparison` | Ensemble of CoT chains | High-accuracy QA |
| `dspy.Refine` | Iterative refinement with feedback | Quality improvement loops |
| `dspy.BestOfN` | Sample N independently, pick best | Reliability via sampling |
| `dspy.Parallel` | Run modules in parallel | Batch processing |
dspy.BestOfN vs dspy.Refine:
- `BestOfN(module, N=5, reward_fn=fn, threshold=1.0)` — N independent runs, picks the best. No feedback between attempts.
- `Refine(module, N=3, reward_fn=fn, threshold=1.0)` — N runs with automatic feedback. After each failed attempt, DSPy generates hints ("Past Output" + "Instruction" fields) for the next run. Use `Refine` when each attempt can learn from the previous one.
dspy.CodeAct constraint: only pure Python functions as tools — no lambdas, callable objects, or external libraries:
from dspy.predict import CodeAct # note: not dspy.CodeAct directly
act = CodeAct("n -> factorial_result", tools=[factorial_fn], max_iters=3)
dspy.Parallel full API (3.1.2: timeout and straggler_limit now exposed):
parallel = dspy.Parallel(num_threads=8, timeout=120, straggler_limit=0.9, return_failed_examples=False)
results = parallel([(module, example1), (module, example2)])
# Convenience: every dspy.Module has .batch()
results = my_module.batch(examples=[ex1, ex2, ex3], num_threads=4, return_failed_examples=True)
# If return_failed_examples=True: returns (results, failed_examples, exceptions)
dspy.RLM — Recursive Language Model (3.1.1+): Explores large contexts via sandboxed Python REPL. Requires Deno.
rlm = dspy.RLM(
signature="context, query -> answer",
max_iterations=20, # maximum REPL loops
max_llm_calls=50, # maximum sub-LM calls
sub_lm=None, # optional cheaper model for sub-queries
tools=None, # list of custom tool functions
)
result = rlm(context="...very large document...", query="What is the revenue?")
print(result.answer)
print(result.trajectory) # list of {code, output} steps
Built-in REPL tools: llm_query(prompt), llm_query_batched(prompts), SUBMIT(...). Also supports aforward() for async.
dspy.LocalSandbox for code execution:
sandbox = dspy.LocalSandbox()
result = sandbox.execute("value = 2*5 + 4\nvalue") # returns 14
Usage examples
# Predict — basic
predictor = dspy.Predict("question -> answer")
result = predictor(question="What is 2+2?")
print(result.answer)
# ChainOfThought — adds step-by-step reasoning
cot = dspy.ChainOfThought("question -> answer")
result = cot(question="If a train travels 120km in 2h, what is its speed?")
print(result.reasoning, result.answer)
# ReAct — tool-using agent
def search_web(query: str) -> str:
    """Search the web for information."""
    ...  # your implementation
react = dspy.ReAct("question -> answer", tools=[search_web])
result = react(question="Who won the 2024 Olympics marathon?")
# ProgramOfThought — generates and executes Python code (requires Deno runtime)
pot = dspy.ProgramOfThought("question -> answer", max_iters=3)
result = pot(question="What is the sum of squares from 1 to 10?")
print(result.answer) # "385"
# ProgramOfThought writes Python code, executes it in a sandbox, and extracts the answer.
# Ideal for math, symbolic computation, and data manipulation tasks.
# BestOfN — pick best of multiple samples
bon = dspy.BestOfN(dspy.ChainOfThought("question -> answer"), N=5, reward_fn=my_metric)
# Refine — iterative improvement
refine = dspy.Refine(dspy.ChainOfThought("draft -> refined"), N=3, reward_fn=quality_check)
4. Custom Modules
Build complex programs by composing modules:
class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        self.num_docs = num_docs
        self.retrieve = dspy.Retrieve(k=num_docs)  # if using a retriever
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

class MultiHopRAG(dspy.Module):
    def __init__(self, hops=2):
        self.generate_query = [dspy.ChainOfThought("context, question -> query") for _ in range(hops)]
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for gen_q in self.generate_query:
            query = gen_q(context=context, question=question).query
            context += search(query)  # your search function
        return self.generate_answer(context=context, question=question)
Module API:
- `module.forward(**kwargs)` — main logic
- `module(...)` — calls forward
- `module.named_predictors()` — iterate over all sub-predictors
- `module.set_lm(lm)` — set LM for all predictors in this module
- `module.get_lm()` — get the LM currently used by the module's predictors
- `module.batch(examples, num_threads=2, return_failed_examples=False)` — run module on a list of examples in parallel (returns list of results; if `return_failed_examples=True`, returns `(results, failed_examples, exceptions)`)
- `module.deepcopy()` — deep copy the module
- `module.reset_copy()` — copy with reset state
- `module.save(path)` / `module.load(path)` — persistence
5. Data & Examples
# dspy.Example — structured training/eval data
example = dspy.Example(question="What is DSPy?", answer="A framework for LM programming")
# Specify which fields are inputs
example = example.with_inputs("question")
print(example.inputs()) # {'question': ...}
print(example.labels()) # {'answer': ...}
# dspy.Prediction — module output container
pred = dspy.Prediction(answer="42", reasoning="step by step...")
print(pred.answer)
# Load built-in datasets
from dspy.datasets import HotPotQA, GSM8K, MATH
hotpotqa = HotPotQA(train_seed=2024, train_size=500)
trainset = hotpotqa.train
devset = hotpotqa.dev
gsm8k = GSM8K() # grade-school math word problems
trainset = gsm8k.train # fields: question, answer (numeric)
math_ds = MATH() # competition-level math problems
trainset = math_ds.train # fields: question, answer
# DataLoader — load custom datasets from various sources
from dspy.datasets import DataLoader
dl = DataLoader()
# From HuggingFace
dataset = dl.from_huggingface(
"dataset_name",
fields=["question", "answer"],
input_keys=("question",),
split="train"
)
# Also supports:
# dl.from_csv("data.csv", fields=["q", "a"], input_keys=("q",))
# dl.from_json("data.json", fields=["q", "a"], input_keys=("q",))
# dl.from_pandas(df, fields=["q", "a"], input_keys=("q",))
# dl.from_parquet("data.parquet", fields=["q", "a"], input_keys=("q",))
6. Optimizers (Teleprompters)
See references/optimizers.md for full details. Quick reference:
| Optimizer | Best For | Data Needed | Notes |
|---|---|---|---|
| `BootstrapFewShot` | Few-shot demos, fast | 5-50 examples | |
| `BootstrapFewShotWithRandomSearch` | Better few-shot selection | ~50-200 examples | |
| `MIPROv2` | Full prompt + demo optimization | 50-300 examples | Default recommendation |
| `COPRO` | Instruction-only optimization | 20-100 examples | |
| `SIMBA` | Mini-batch stochastic optimization | 20-200 examples | Faster for large programs |
| `GEPA` | Evolutionary prompt optimization | 50+ examples | 5-arg metric required |
| `BetterTogether` | Prompt + weight joint optimization | 100+ examples | Requires experimental=True |
| `KNNFewShot` | Dynamic example retrieval | Training set | |
| `Ensemble` | Combine multiple programs | Multiple programs | |
| `BootstrapFinetune` | Fine-tuning LM weights | 100+ examples | Requires experimental=True |
| `ArborGRPO` | Reinforcement learning / GRPO | 100+ examples | `pip install arbor-ai`; multi-module RL |
GEPA critical note — its metric must accept 5 arguments:
def gepa_metric(gold, pred, trace, pred_name, pred_trace):
    return gold.answer.lower() == pred.answer.lower()
optimizer = dspy.GEPA(metric=gepa_metric, auto="medium", reflection_lm=dspy.LM('openai/gpt-4o'))
BetterTogether (experimental) — combines prompt + weight optimization:
dspy.settings.experimental = True
from dspy.teleprompt import BetterTogether
optimizer = BetterTogether(metric=my_metric)
optimized = optimizer.compile(program, trainset=trainset, strategy="p -> w -> p")
# Standard optimization pattern
optimizer = dspy.MIPROv2(metric=my_metric, auto="medium", num_threads=8)
optimized = optimizer.compile(my_program, trainset=trainset)
optimized.save("optimized.json")
7. Evaluation
See references/evaluation.md for full details.
# Define a metric
def my_metric(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()
# Run evaluation
evaluator = dspy.Evaluate(
devset=devset,
metric=my_metric,
num_threads=8,
display_progress=True,
display_table=5,
)
score = evaluator(my_program) # returns EvaluationResult
# Built-in metrics
dspy.evaluate.answer_exact_match # exact string match
dspy.evaluate.answer_passage_match # answer appears in passage
dspy.SemanticF1() # LM-based semantic F1 score
dspy.CompleteAndGrounded() # checks if answer is complete and grounded in context
8. Output Quality Enforcement (Refine / BestOfN)
dspy.Assert and dspy.Suggest were deprecated and removed in DSPy 3.1.x. Use dspy.Refine or dspy.BestOfN instead.
dspy.Refine — iterative refinement with automatic feedback
After each failed attempt, DSPy automatically generates feedback ("Past Output" + "Instruction" fields) and feeds it to the next attempt. Use when each retry can learn from the previous one.
def quality_check(example, pred, trace=None):
    """Return a float 0-1 or bool. Refine retries until threshold is met."""
    return len(pred.answer) > 10 and pred.answer[0].isupper()

class QAModule(dspy.Module):
    def __init__(self):
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.generate(question=question)
# Wrap any module with Refine
refined = dspy.Refine(
module=QAModule(),
N=3, # max attempts (default 3)
reward_fn=quality_check,
threshold=1.0, # reward must reach this to stop early
)
result = refined(question="What is DSPy?")
dspy.BestOfN — N independent samples, pick best
Runs N independent calls (no feedback between them) and returns the one with the highest reward. Simpler but uses more tokens.
def my_metric(example, pred, trace=None):
    return float(pred.answer.lower().count("dspy") > 0)
best = dspy.BestOfN(
module=dspy.ChainOfThought("question -> answer"),
N=5, # number of independent samples
reward_fn=my_metric,
threshold=1.0, # stop early if reward reaches this
)
result = best(question="Explain DSPy in one sentence.")
When to use which
- Refine: When feedback helps — e.g., "your answer was too short, try again with more detail"
- BestOfN: When outputs are independent — e.g., creative tasks where variety helps
- Typed `Literal` fields: When you just need to constrain outputs to a set of values — use `Literal["a", "b", "c"]` in the Signature (no retry needed, the adapter enforces it)
9. Tools & MCP
See references/advanced.md for full details.
# Define tools as Python functions (used in ReAct/CodeAct)
def calculator(expression: str) -> float:
    """Evaluate a mathematical expression."""
    # caution: eval() on untrusted input can execute arbitrary code — use a safe parser in production
    return eval(expression)

def search(query: str, k: int = 3) -> list[str]:
    """Search Wikipedia for passages."""
    results = dspy.ColBERTv2(url='http://...')(query, k=k)
    return [r['text'] for r in results]
agent = dspy.ReAct("question -> answer", tools=[calculator, search])
# MCP tool integration (async)
from mcp import ClientSession
dspy_tool = dspy.Tool.from_mcp_tool(session, mcp_tool_object)
agent = dspy.ReAct("question -> answer", tools=[dspy_tool])
10. Special Data Types
# Images (multimodal)
class DescribeImage(dspy.Signature):
    image: dspy.Image = dspy.InputField()
    description: str = dspy.OutputField()

img_from_url = dspy.Image.from_url("https://example.com/image.jpg")
img_from_file = dspy.Image.from_file("local.png")
img_from_b64 = dspy.Image(url="data:image/png;base64,...")

# Audio
class TranscribeAudio(dspy.Signature):
    audio: dspy.Audio = dspy.InputField()
    transcript: str = dspy.OutputField()

audio = dspy.Audio.from_file("speech.mp3")

# Conversation History
class Chat(dspy.Signature):
    history: dspy.History = dspy.InputField()
    message: str = dspy.InputField()
    response: str = dspy.OutputField()

history = dspy.History(messages=[
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
11. Save & Load
Two modes — state-only (recommended) vs full program:
# State-only (JSON) — saves signatures, demos, LM per predictor
optimized_program.save("my_program.json")
# State-only (pickle) — needed when state contains non-JSON-serializable objects (e.g., dspy.Image)
optimized_program.save("my_program.pkl", save_program=False)
# Full program (architecture + state) — saves to directory via cloudpickle
optimized_program.save("./my_program_dir/", save_program=True)
# With custom modules that need to be serialized by value:
optimized_program.save("./my_program_dir/", save_program=True, modules_to_serialize=[my_module])
# Load state-only (requires re-instantiating the class)
loaded = MyProgramClass()
loaded.load("my_program.json")
# Load full program (architecture + state) — no need to re-define the class
loaded = dspy.load("./my_program_dir/")
# Returns a fully functional module with all state, signatures, demos, and LM config restored.
Security: .pkl files and dspy.load() use cloudpickle/pickle and can execute arbitrary code on load — only load from trusted sources.
Quick Patterns
RAG pipeline
class RAG(dspy.Module):
    def __init__(self):
        self.respond = dspy.ChainOfThought('context, question -> answer')

    def forward(self, question):
        context = search(question)  # your retrieval function
        return self.respond(context=context, question=question)
rag = RAG()
tp = dspy.MIPROv2(metric=dspy.SemanticF1(), auto="medium")
optimized_rag = tp.compile(rag, trainset=trainset)
Classification with typed constraints
from typing import Literal

class Classify(dspy.Module):
    def __init__(self, classes):
        self.classes = classes
        # Literal type constrains the output — no runtime assertions needed
        self.predict = dspy.Predict(
            dspy.Signature("text -> label").with_updated_fields(
                "label", type_=Literal[tuple(classes)]
            )
        )

    def forward(self, text):
        return self.predict(text=text)
# For extra reliability, wrap with Refine
classifier = Classify(classes=["positive", "negative", "neutral"])
def valid_label(example, pred, trace=None):
    return pred.label in ["positive", "negative", "neutral"]
safe_classifier = dspy.Refine(classifier, N=3, reward_fn=valid_label)
Inspect LM calls
dspy.inspect_history(n=5) # show last 5 LM interactions
Reference Files
- `references/optimizers.md` — Deep dive: all optimizers with parameters and strategies
- `references/evaluation.md` — Evaluation, metrics, built-in datasets
- `references/advanced.md` — Adapters, callbacks, streaming, async, MCP, fine-tuning