Building with AI is not the same as building an AI product
I confused two things for longer than I'd like to admit. They both get called "AI," and that single word hides a fork that changes how you build, how you test, and how you know when you're done.
Once I named the fork, a bunch of decisions that had felt like judgment calls turned into obvious ones. So before I write a single line about either, I want to put the distinction down plainly.
Two kinds of work
The first is AI-assisted development. Here, AI is a tool inside my workflow. It helps me design a system, write the code, review the code, catch the bug. But the thing that ships is still code — deterministic, mine, judged by whether it's well-built. The model never shows up in the product. It helped me make the product. My coding agent lives entirely in this world.
The second is the AI product. Here, the model runs in production, in front of a user, and its output is the thing they're paying for. A tarot-and-astrology app that writes you a reading. A tool that reads the market and surfaces opportunities worth a look. The model isn't helping me build the feature — the model is the feature.
Same two letters. Completely different jobs.
The question that tells them apart
If you want a clean test for which one you're looking at, ask: how do you know it's good?
For code, the answer is old and well-understood. The tests pass. The review is clean. It matches the architecture I intended. I can open the file and read every line. The quality bar is my own engineering taste, and I can inspect the work directly.
For an AI product, that falls apart immediately. You can't read every output a model will ever produce. A reading, a recommendation, a summary — there's no line of code to inspect, because the interesting behavior emerges at runtime, on inputs you've never seen. So you need a different instrument: an eval. At its core an eval is just two things — a set of example inputs, and a metric that scores how good an output is. You stop eyeballing and start measuring.
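To make "a set of example inputs, and a metric" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Example` fields, the keyword-match metric, and `stub_model`, which stands in for a real model call so the snippet runs on its own. A real metric would be richer than a substring check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    """One eval case: an input plus what a good output should contain."""
    prompt: str
    must_mention: str  # hypothetical expectation, chosen for illustration


def metric(example: Example, output: str) -> float:
    """Score one output: 1.0 if the expected term appears, else 0.0."""
    return 1.0 if example.must_mention.lower() in output.lower() else 0.0


def run_eval(model: Callable[[str], str], examples: list[Example]) -> float:
    """Run the model on every example and return the mean score."""
    scores = [metric(ex, model(ex.prompt)) for ex in examples]
    return sum(scores) / len(scores)


def stub_model(prompt: str) -> str:
    """Stand-in for a production model call (assumption, not a real API)."""
    return f"Reading for: {prompt}"


EXAMPLES = [
    Example(prompt="The Tower, upright", must_mention="tower"),
    Example(prompt="Three of Cups, reversed", must_mention="cups"),
    Example(prompt="The Moon, upright", must_mention="sun"),  # the stub fails this one
]

score = run_eval(stub_model, EXAMPLES)
print(f"eval score: {score:.2f}")
```

The point of the shape is that `score` is a single number you can track from one change to the next, which is what replaces reading every output by hand.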
That single question reorganizes everything. Test-driven development is the quality mechanism for building-with-AI. Evals are the quality mechanism for AI products. It's the same instinct (measure it before you ship it) pointed at two completely different artifacts. See them as the same discipline and a lot of the fog clears.
Why this isn't academic
Here's the moment it stopped being theory for me.
I keep a strict standard for the AI inside my products: every model call goes through a typed, optimizable program rather than a hand-tuned prompt string. That's the idea behind DSPy — you declare the intent of a step as a typed signature, and let an optimizer tune the actual prompt against a metric. Programming, not prompting. For my products, it's exactly right, because quality there is a number I'm trying to move.
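The "declare the intent, let an optimizer tune the prompt against a metric" idea can be sketched in plain Python. To be clear, this is not DSPy's API. It is a toy with hypothetical names (`Signature`, `render`, `optimize`, `stub_model`), and the "optimizer" only picks the best of a few candidate instructions, which is far simpler than what a real optimizer does.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Signature:
    """Declares what a step takes and returns, not how to prompt for it."""
    input_name: str
    output_name: str
    intent: str


def render(sig: Signature, instruction: str, value: str) -> str:
    """Turn a signature plus one candidate instruction into a prompt string."""
    return f"{instruction}\n{sig.intent}\n{sig.input_name}: {value}\n{sig.output_name}:"


def optimize(
    sig: Signature,
    candidates: list[str],
    model: Callable[[str], str],
    trainset: list[tuple[str, str]],
    metric: Callable[[str, str], float],
) -> str:
    """Toy optimizer: return the candidate instruction with the best mean score."""
    def mean_score(instruction: str) -> float:
        scores = [
            metric(expected, model(render(sig, instruction, value)))
            for value, expected in trainset
        ]
        return sum(scores) / len(scores)

    return max(candidates, key=mean_score)


def stub_model(prompt: str) -> str:
    """Stand-in for a model call (assumption): terse only when told to be."""
    note = next(
        line.split(": ", 1)[1]
        for line in prompt.splitlines()
        if line.startswith("note: ")
    )
    if "ticker only" in prompt.lower():
        return note.split()[0]
    return f"I think {note}"


def exact_match(expected: str, output: str) -> float:
    """Metric: 1.0 when the output is exactly the expected string."""
    return 1.0 if output.strip() == expected else 0.0


sig = Signature(
    input_name="note",
    output_name="ticker",
    intent="Extract the company ticker from a market note.",
)
trainset = [("ACME beat earnings", "ACME"), ("GLOBEX cut guidance", "GLOBEX")]
candidates = ["Be helpful.", "Answer with the ticker only."]

best = optimize(sig, candidates, stub_model, trainset, exact_match)
print(best)
```

What matters here is the division of labor: I write the signature and the metric by hand, and the prompt wording is whatever scores best.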
For a while I forced my coding agent to live by that same rule. And it was wrong — the two-bucket model is what showed me why.
My coding agent isn't a product. I don't want to optimize it against a metric. I want it to follow my taste: my architecture, my tests-first discipline, my opinions about how an app should be structured. There's no eval dataset for "wrote the code the way I'd have written it" — there's just me, reviewing. So I pulled the optimization machinery off the agent and let it reason directly, and I kept that machinery exactly where it belongs: in the products, where good is something I can score.
The decision felt arbitrary right up until I named the two buckets. Then it was obvious. That's the whole value of the distinction: it turns judgment calls into clear ones.
Where they touch
They're not sealed-off worlds, and pretending they are will trip you up too. Building-with-AI is how I build my AI products. One is the factory; the other is what comes off the line.
And inside a single product there are really two kinds of code living side by side. There's the plumbing — auth, the database, the API, the interface — which is ordinary software, held to my engineering standard like anything else. And there's the intelligence — the model calls, the eval harness, the retrieval — which lives by the product rules. A product like the one I'm building for full-time traders is built by my development process but contains AI-product machinery. Keep that seam in view and the model holds up everywhere.
So I'm writing about both, on purpose
This is why I'm splitting the build-in-public writing into two threads instead of one mush labeled "AI."
One thread is about how I build: the standard skeleton I want under every app I make, why I put the tests first, the opinions I hold about structure, and the coding agent I've taught to enforce them. Same skeleton every time, room to swap the stack inside it.
The other is about the product side: what an eval actually is and why it's the foundation everything else stands on, declarative LLM programming, the optimizers that tune these systems, and how an app decides — at runtime, for a stranger — what counts as a good answer.
If you take one thing from this post, take the question. Next time someone says "we're doing AI," ask which one they mean. Are they making the model help them build, or making the model the thing they sell? The answer changes the entire playbook — and most of the confusion in this space is two people using one word for two different jobs.
