What is DSPy?

Saturday, June 27, 2026

If you've looked at DSPy and come away unsure what it's actually for, you're not alone — most explanations jump straight to optimizers and signatures before saying why you'd want any of it. The point is simpler than the jargon: DSPy lets you stop hand-writing prompts and start writing LLM code you can test and improve like real software.

That's the whole pitch. Everything else is mechanism.

The problem it removes

The default way to build with an LLM is to write a prompt — a paragraph of instructions with the inputs slotted in — send it, and parse whatever comes back. It works in a demo. Then reality shows up: the model returns malformed output on some input you didn't imagine, you reword the prompt, that fixes one case and quietly breaks another, and you're back to tweaking a wall of text by feel. Worse, the prompt is tuned to one model's quirks, so the day that model changes, you start over.

That loop — reword, eyeball, hope — is what DSPy is built to end. The insight is that prompt strings are the wrong unit of work. They're not testable, not composable, and not portable. So DSPy replaces them with something that is.

The two ideas

DSPy stands on two moves, and once you have them the rest follows.

Typed signatures. Instead of writing the prompt, you declare the contract: this step takes a question and returns an answer and a confidence score, with types. That declaration is the interface. The actual instruction text underneath is generated to satisfy it, and the output comes back already shaped. Your LLM calls start to look like ordinary functions with a signature you can reason about, instead of artisanal paragraphs you maintain by hand.
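To make the "contract" idea concrete, here is a minimal standard-library sketch. It is not DSPy's own API: in DSPy the declaration is a `dspy.Signature` subclass with `dspy.InputField` and `dspy.OutputField` fields. The `QAOutput` dataclass and `parse_output` helper are hypothetical names, used only to show what "the output comes back already shaped" can mean in practice.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class QAOutput:
    """Hypothetical contract: the step returns an answer and a confidence."""
    answer: str
    confidence: float


def parse_output(raw: str) -> QAOutput:
    """Turn raw model text into a typed result, or fail loudly."""
    data = json.loads(raw)
    answer = data["answer"]
    confidence = float(data["confidence"])
    if not isinstance(answer, str):
        raise ValueError("answer must be a string")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return QAOutput(answer=answer, confidence=confidence)


# Stand-in for text a model might return (assumed JSON for this sketch).
result = parse_output('{"answer": "Paris", "confidence": 0.92}')
print(result.answer, result.confidence)
```

The caller only ever sees a `QAOutput` or an exception, which is what lets the rest of the program treat the LLM step like an ordinary function.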

A compiler. This is the part that makes people sit up. (DSPy's own docs call these optimizers; "compile" is the verb you run.) You give DSPy a metric and a handful of training examples, and it optimizes your program: it searches for the instructions and few-shot demonstrations that make the metric go up, and bakes the winners in. The prompt becomes a build artifact, not something you hand-craft. You define what "good" means and let the optimizer do the wording.
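The core loop can be sketched as a search over candidate instructions scored by a metric. This is a toy illustration under stated assumptions, not how DSPy's optimizers are implemented: `fake_model` stands in for an LLM call, and the candidates, examples, and function names are all hypothetical. Real optimizers also generate candidates and pick few-shot demonstrations rather than choosing from a fixed list.

```python
from typing import Callable, Sequence

# Hypothetical training examples: (question, expected answer).
EXAMPLES = [("2 + 2", "4"), ("3 + 5", "8"), ("10 - 7", "3")]
ANSWERS = dict(EXAMPLES)

# Hypothetical candidate instructions for the search to choose between.
CANDIDATES = ["Answer the question.", "Answer with the number only."]

Metric = Callable[[str, str], float]


def fake_model(instruction: str, question: str) -> str:
    """Stand-in for an LLM call: it is terse only when told to be."""
    value = ANSWERS[question]
    return value if "number only" in instruction else f"The answer is {value}."


def exact_match(expected: str, predicted: str) -> float:
    """Metric: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if predicted.strip() == expected else 0.0


def average_score(instruction: str, examples: Sequence[tuple], metric: Metric) -> float:
    """Run the stand-in model on every example and average the metric."""
    total = sum(metric(expected, fake_model(instruction, q)) for q, expected in examples)
    return total / len(examples)


def compile_prompt(candidates: Sequence[str], examples: Sequence[tuple], metric: Metric) -> str:
    """Pick the instruction with the best average metric: the 'build artifact'."""
    return max(candidates, key=lambda c: average_score(c, examples, metric))


best = compile_prompt(CANDIDATES, EXAMPLES, exact_match)
print(best, average_score(best, EXAMPLES, exact_match))
```

Nobody chose the winning wording by feel; it won because it scored higher on the metric.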

Put together: you describe the behavior in typed code, and the framework writes and tunes the prompt against a measure you chose. That's the point of DSPy.

What it looks like in practice

The shift is from authoring prompts to defining contracts and metrics. You write the signatures, you write an eval — a set of example inputs and a way to score the output — and the compiler closes the gap. If you've read why evals matter, this is where they pay off: the optimizer needs a metric to optimize against, and the eval is that metric. No eval, nothing to compile toward.
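An eval in this sense is small: labeled examples plus a scoring function. The sketch below assumes a hypothetical ticket-classification step. `Example`, `DEV_SET`, `label_match`, and `keyword_baseline` are illustrative names, and the baseline is a plain function standing in for the LLM program under test.

```python
from typing import Callable, NamedTuple


class Example(NamedTuple):
    """One eval case: an input and the output we expect."""
    ticket: str
    expected_label: str


# Hypothetical labeled cases for a ticket-classification step.
DEV_SET = [
    Example("I was charged twice this month", "billing"),
    Example("The app crashes on login", "bug"),
    Example("My invoice total looks wrong", "billing"),
    Example("Export button does nothing", "bug"),
]


def label_match(example: Example, predicted: str) -> float:
    """Metric: 1.0 when the predicted label matches, else 0.0."""
    return float(predicted.strip().lower() == example.expected_label)


def evaluate(program: Callable[[str], str], examples: list[Example]) -> float:
    """Average metric over the examples; this is the number to push up."""
    return sum(label_match(ex, program(ex.ticket)) for ex in examples) / len(examples)


def keyword_baseline(ticket: str) -> str:
    """Stand-in for the LLM program under test."""
    return "billing" if "charged" in ticket or "refund" in ticket else "bug"


print(evaluate(keyword_baseline, DEV_SET))
```

The baseline misses the invoice ticket, so it scores 0.75. That gap between the current score and a perfect one is what an optimizer would be asked to close.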

It also changes what happens when the model changes. A program built on typed signatures survives a model swap: you re-compile against the new model and, in most cases, win back the quality your metric measures. A pile of hand-tuned prompts does not; a model deprecation means re-tuning every string. When a system needs to outlive any single model, that portability is the difference between a tweak and a rewrite.
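A toy version of that model swap, assuming two stand-in models with different quirks. `old_model`, `new_model`, and `recompile` are hypothetical names, and nothing here calls DSPy or a real LLM. The point it illustrates is that the examples and metric stay fixed while the chosen instruction is recomputed per model.

```python
from typing import Callable

Model = Callable[[str, str], str]  # (instruction, question) -> raw text

# Hypothetical eval data and candidate instructions, shared across models.
EXAMPLES = [("capital of France", "Paris"), ("capital of Japan", "Tokyo")]
FACTS = dict(EXAMPLES)
CANDIDATES = ["Be brief.", "Reply with only the answer."]


def old_model(instruction: str, question: str) -> str:
    """Stand-in for the original model: responds well to 'brief'."""
    answer = FACTS[question]
    return answer if "brief" in instruction.lower() else f"Sure! It is {answer}."


def new_model(instruction: str, question: str) -> str:
    """Stand-in for the replacement model: needs a different cue."""
    answer = FACTS[question]
    return answer if "only the answer" in instruction.lower() else f"{answer} is the capital."


def score(model: Model, instruction: str) -> float:
    """Fraction of examples the model answers exactly under this instruction."""
    hits = sum(model(instruction, q).strip() == expected for q, expected in EXAMPLES)
    return hits / len(EXAMPLES)


def recompile(model: Model) -> str:
    """Re-run the same search against whichever model is current."""
    return max(CANDIDATES, key=lambda instruction: score(model, instruction))


tuned_for_old = recompile(old_model)
carried_over = score(new_model, tuned_for_old)   # old wording on the new model
tuned_for_new = recompile(new_model)
recovered = score(new_model, tuned_for_new)      # after re-compiling
print(tuned_for_old, carried_over, tuned_for_new, recovered)
```

The wording tuned for the old model scores zero on the new one, and re-running the search recovers the score without touching the contract or the eval.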

Two examples from my own products

This isn't theory for me — two live products run on it:

builddaily.io — the chat agent. The assistant that answers visitors as "me" runs on a DSPy signature, not a hand-written prompt. It takes the question plus my background and returns typed fields — a reply, one concrete number, and an artifact to point at — so the answer always comes back shaped, instead of rambling or breaking format on an odd question.
paiddaily.io — market research. A DSPy module reads raw market data and returns a structured research readout that feeds the opportunity surface. Same move: declare what a "research result" looks like, let the model fill it, and the output stays consistent no matter which market goes in.

In both, the win is the same: the contract lives in code, so the behavior is testable and a malformed response fails loudly instead of drifting through unnoticed.

When to use it

Reach for DSPy when the LLM feature is going to live in production and be maintained:

The behavior has to be reliable and measurable — you can write down what "good output" means and score it. That's the precondition for the whole approach.
It's a real pipeline, not a one-off — multiple steps, structured outputs, something you'll maintain and want to keep improving on your own data.
Quality has to be defensible — you'd sooner point at a metric on a held-out set than at a prompt you got a good feeling about.
You expect to change models — a cheaper one, a local one, next quarter's release — and you don't want a rewrite each time.

That covers most of what I build, which is why it's my default. (More on that bet in "When to fine-tune an LLM" and the DSPy resource notes.)

When to skip it

Being honest about the wrong cases matters as much as the right ones:

A throwaway script. For a chain you'll run twice and delete, the optimization loop is overhead you don't need — write the prompt and move on.
The wording is the product. If you need a hand on every token — a specific voice, an exact phrasing — DSPy's "let the compiler write it" model fights you.
You have no way to measure quality. Without a metric there's nothing to compile toward, and you're using the heavy machinery for its typing alone. Worth it sometimes, but know that's what you're doing.
You're still learning what the feature even is. Early exploration is faster with a raw prompt. Reach for DSPy once the shape of the thing is clear and you're ready to make it solid.

The one-line answer

The point of DSPy is to treat building with an LLM like engineering instead of wordsmithing — typed contracts, a metric, and a compiler that tunes the prompt for you. Use it when the thing has to work reliably, be measured, and outlive the model it runs on. Skip it when you're prototyping, when the exact wording is the point, or when you can't yet say what "good" means.
