sdk · Stanford NLP · applied

DSPy

Stanford's Python framework for programming — not prompting — language models. Replaces fragile prompt templates with typed Signatures, swappable Modules, and Optimizers that compile prompts against an eval set. Provider-agnostic (Claude, GPT, Ollama, Bedrock — one config swap). Runs every brain-layer module across the GL stack today.

Updated May 24, 2026

DSPy is the framework you reach for when the prompt has become the bug. Signatures name the input/output contract; the optimizer compiles the wording against examples. This page is for orientation and for picking the right Module; the official docs at dspy.ai own the API contracts.

What it is

A Python framework for building LM applications as composable, typed programs instead of strings. Released by Stanford NLP under the MIT license; install with pip install dspy. The unit of work is a Signature (a class declaring inputs, outputs, and the task in plain English) wrapped in a Module (e.g., Predict, ChainOfThought, ReAct). An Optimizer (e.g., BootstrapFewShot, MIPROv2) compiles the module against a small eval set, producing a prompt that ships.

The pitch: the same module compiles down to a different prompt per backing model (Claude 4.7, GPT-5, Llama 3.3, etc.) without rewriting the logic. The discipline the framework forces — typed I/O, explicit examples, compiled artifacts — is the production-readiness step that hand-rolled prompts keep trying to fake.

When to use it

Reach for it when:

The workload is multi-step or structured-output — classifiers, extractors, multi-field generators, agent loops. The Signature is doing real work.
The prompt has become the bug — you're patching a template every time the model bumps, and the patches are accumulating.
You need provider portability — same code against Claude in prod and Ollama locally / for fallback.
You have or can build a small eval set (10–30 labeled examples + a metric function). Without one, the optimizer can't earn its keep.
You're orchestrating multiple LM calls and want them composable — ChainOfThought, ReAct, custom Modules.

Skip it when:

It's a one-shot prompt with no measurable contract — a single completion where a string template is fine.
The actual workload is document-heavy RAG — that's the LlamaIndex layer. DSPy lives inside a query engine for answer generation, not as the whole stack.
You have no eval set and no path to one — typed Signatures still help, but the optimizer becomes dead weight.
The team has zero appetite for a compiled-artifact deploy step.

At a glance

Core primitives

Signature — typed I/O contract for one LM call. Defines InputFields, OutputFields, and a docstring describing the task. Voice and constraints live here, not in a separate system prompt.
Module — wraps a Signature with an execution strategy. Predict (direct), ChainOfThought (reasoning first), ReAct (tool use loop), ProgramOfThought (code-as-reasoning).
LM — the backend config. One config object selects the provider/model; the same Module runs unchanged against Claude, GPT, an Ollama-hosted local model, Bedrock, or vLLM.
Optimizer (a.k.a. teleprompter) — compiles a Module against an eval set. BootstrapFewShot (cheap, adds examples), MIPROv2 (joint instruction + demo search), BootstrapFinetune (bootstraps training data and fine-tunes the LM's weights).
Evaluate — runs a Module over a dataset with a metric function; the same metric drives the optimizer.

Provider matrix (where GL touches it)

Claude (Anthropic) — primary for Sage's high-quality reading flow.
Ollama (local) — used for cost-bounded operations and offline modes; same Module, one config flip.
OpenAI / Bedrock / Together — supported; not in GL prod today.

How to integrate

Default integration order for a new GL brain-layer module:

Write the Signature first. Inputs, outputs, docstring describing the contract — no implementation. This is the spec.
Pick a Module strategy. Default to Predict. Reach for ChainOfThought only when the output quality demonstrably improves on a held-out eval; the extra tokens aren't free.
Wire the LM config. One object — dspy.LM("anthropic/claude-opus-4-7") or dspy.LM("ollama_chat/qwen3:8b"). Set globally with dspy.configure(lm=...).
Build the eval set. Even 10–30 examples are enough to start optimizing meaningfully. The metric function is the contract: if you can't write the metric, the Signature isn't tight enough yet.
Compile. BootstrapFewShot first; graduate to MIPROv2 when the cheap pass plateaus. Save the compiled artifact (.save(...)); load it in prod.
Telemetry. Inputs, outputs, and metric scores go to the error-tracking log + Behavior nodes. Compilation is the eval surface; production calls are the regression set.

The optimizer matters less than the discipline. A non-compiled DSPy module with a typed Signature still beats a 200-line system prompt in maintainability terms. Compile when the eval shows headroom.

In the GL stack

Concrete places DSPy slots into the three active GL products today or next.

builddaily.io

Chat-bridge answer synthesizer. The chat at the top of the site currently dumps the full markdown corpus into the system prompt. Once the retrieval upgrade lands (LlamaIndex), the answer call becomes a DSPy Module compiled against ~30 known-good (question → answer + sources) pairs.
Build Daily voice writer. Module that takes raw bujo log fragments → polished post-style snippet in the publication voice. Removes a chunk of the post-draft editing loop.
Daily-log takeaway extractor. ChainOfThought over each day's entries → 1–2 sentence weekly-review takeaway. Feeds the week-in-review summary.

paiddaily.io

Pendle catalyst classifier. Signature: (raw_announcement_text) → (catalyst_type, urgency, market_impact, suggested_action). Compile against the 14-story Pendle epic data already on disk.
Aerodrome pool risk explainer. Natural-language summary of why a pool is or isn't deployable, given TVL / vote-weight / IL signals. Feeds the pool detail page and the morning brief.
Morning Brief writer. One Module synthesizes across Pendle / Aerodrome / Tickers in the Trading Almanac voice instead of three hand-written templates. Compiled against Neil-edited briefs.

sagedaily.io

Already running ~22 modules in production — readers, reflectors, theme extractors, recapers, deflectors, hashtag writers. The next-slice work is optimization, not greenfield.
Compile OracleReader against a graded set. Neil scores N readings as "great / fine / off-voice"; bootstrap-compile the Module against the great set; A/B vs the uncompiled version.
Split spread-chat-reflector into a multi-stage Module. Theme extract → reflective angle → write. Composes smaller compiles instead of one big one.

Gotchas

The Signature docstring is the system prompt. Vague docstrings produce vague prompts. Treat it like a one-paragraph spec, not a comment.
ChainOfThought is not always better. It adds a reasoning field that costs tokens and can leak voice. Default to Predict; promote only against a measured eval win.
Optimization needs an eval set, not vibes. A handful of labeled examples + a working metric function is the floor. Without them, the optimizer has nothing to optimize against, and you could not tell whether it helped.
Compiled artifacts are model-tied. A prompt compiled against Claude 4.6 may not be optimal on Claude 4.7. Re-compile on model bumps; the cost is one optimizer run.
Don't reach for DSPy as a one-size-fits-all answer. It's the brain layer. Document ingestion, hybrid retrieval, and reranking belong to the floor below (LlamaIndex).
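One way to keep the model-tied gotcha visible is to put the model identifier into the artifact's file name, so a model bump shows up as a missing file. This is a hypothetical convention built on the standard library only; the helper names and directory layout are illustrative.

```python
from pathlib import Path


def artifact_path(module_name: str, model: str, root: str = "compiled") -> Path:
    """Build a per-model artifact path under the given root directory."""
    # Slashes and colons in model identifiers are not safe in file names.
    safe_model = model.replace("/", "_").replace(":", "_")
    return Path(root) / f"{module_name}__{safe_model}.json"


def needs_recompile(module_name: str, model: str, root: str = "compiled") -> bool:
    """True when no artifact exists for this exact module/model pair."""
    return not artifact_path(module_name, model, root).exists()


print(artifact_path("oracle_reader", "ollama_chat/qwen3:8b"))
print(needs_recompile("oracle_reader", "ollama_chat/qwen3:8b", root="no_such_dir"))
```

A deploy check can then refuse to start, or fall back to the uncompiled Module, when needs_recompile returns True for the configured model.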

Risks

Single-maintainer concentration. DSPy is primarily a Stanford NLP research project, and it is still moving fast: the API has churned across 2.x → 3.x. Production code should pin a version and budget a re-compile per upgrade.
Optimizer cost. MIPROv2 and friends call the LM many times during a compile. Plan for the compile-time bill, not just the inference bill.
Lock-in is moderate but real. Migrating a fleet of DSPy Modules back to template-string prompts is a real porting cost. Worth it if the discipline is paying off; worth knowing before you start.
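A back-of-envelope way to budget the compile-time bill. Every number below is a placeholder, not a real price or a measured call count; the actual number of LM calls depends on the optimizer and its settings and is best read from a trial run.

```python
def compile_cost_usd(lm_calls, avg_input_tokens, avg_output_tokens,
                     usd_per_m_input, usd_per_m_output):
    """Estimate the cost of one compile run from call count and token averages."""
    # Cost of one call in units of "USD per million tokens" times tokens.
    per_call = avg_input_tokens * usd_per_m_input + avg_output_tokens * usd_per_m_output
    return lm_calls * per_call / 1_000_000


# Placeholder inputs: 1,000 calls, 1,500 input and 300 output tokens per call,
# priced at 3.00 and 15.00 USD per million tokens.
estimate = compile_cost_usd(1000, 1500, 300, 3.0, 15.0)
print(f"${estimate:.2f}")  # → $9.00
```

Multiplying that estimate by the number of Modules and the expected number of re-compiles per year gives a rough annual figure to set against the inference bill.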

Alternatives

Five substitutes. Pick DSPy unless one of these wins on your specific brief.

01
Instructor / Outlines
Structured-output libraries — typed responses without the compilation surface.
Wins when: you want typed outputs and one-shot completions, not the optimizer + compilation overhead. Lighter touch; no prompt-compilation story but no need for one either.
02
LangChain · LCEL
Broader chain & agent framework with declarative pipelines.
Wins when: the team is already on LangChain end-to-end and rewriting to DSPy isn't worth it. LCEL is less typed but more ubiquitous.
03
Guidance · Microsoft
Constrained generation — token-level control of the output structure.
Wins when: you want token-level control over the output shape. Different paradigm: it guides decoding rather than compiling prompts.
04
Raw prompts + Pydantic
A template plus a JSON-schema validator — the floor approach.
Wins when: one-shot completion, contract is "return this JSON shape," no eval surface needed. No compilation overhead, no framework footprint, no abstraction tax.
05
Hand-rolled
Template strings + your own eval loop — bring your own discipline.
Wins when: throwaway prototypes, demos, or workloads small enough that the framework is overhead. Graduates to DSPy when the prompt becomes the bug.
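For comparison, the "raw prompts + Pydantic" floor from option 04 can be sketched in a few lines. The CatalystReply shape, the prompt wording, and the parse_reply helper are hypothetical, and the model call itself is left out; the sketch assumes Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class CatalystReply(BaseModel):
    """Hypothetical JSON shape the model is asked to return."""

    catalyst_type: str
    urgency: str


PROMPT_TEMPLATE = (
    "Classify the announcement below. Reply with JSON only, using the keys "
    '"catalyst_type" and "urgency".\n\n{announcement}'
)


def parse_reply(raw: str) -> Optional[CatalystReply]:
    """Validate the model's raw reply; return None when it breaks the contract."""
    try:
        return CatalystReply.model_validate_json(raw)
    except ValidationError:
        return None


prompt = PROMPT_TEMPLATE.format(announcement="(announcement text goes here)")
ok = parse_reply('{"catalyst_type": "listing", "urgency": "high"}')
bad = parse_reply('{"urgency": "high"}')
print(ok, bad)
```

This covers the "return this JSON shape" contract and nothing more: there is no eval loop and no way to recompile the wording when the model changes.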

LlamaIndex — the document-layer counterpart. DSPy lives inside a LlamaIndex query engine when the workload is document-heavy RAG.