You can't optimize what you can't measure

Tuesday, June 16, 2026

When I write ordinary code, I know whether it works. The test passes or it doesn't. I can open the file and read every line, and the behavior is the same every time I run it.

When the product is a model's output, "works" gets slippery fast. A recommendation, a score, a summary — there's no single line to inspect, because the interesting behavior shows up at runtime, on inputs I've never seen. So how do I know it's any good? The answer has a boring name, and it's the most important idea on the product side of AI: the eval.

What an eval actually is

Strip away the buzz and an eval is two things:

A set of example inputs, and a metric that scores how good the output is.

That's the whole concept. It's the language-model equivalent of a test suite — except instead of asserting result === 42, you're asking "on these hundred realistic cases, how good were the answers, by a measure I've defined?"

Everything fancy in this space is built on that plain foundation. Miss it and the fancy stuff has nothing to stand on.
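To make the two parts concrete, here is a minimal sketch of an eval in Python. Everything in it is hypothetical and for illustration only: the `Case` shape, the keyword-based metric, and `fake_system`, which stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One eval example: an input plus whatever the metric needs to score it."""
    prompt: str
    expected_keyword: str  # hypothetical ground truth for this toy metric


def keyword_metric(case: Case, output: str) -> float:
    """Toy metric: 1.0 if the output mentions the expected keyword, else 0.0."""
    return 1.0 if case.expected_keyword.lower() in output.lower() else 0.0


def run_eval(
    system: Callable[[str], str],
    cases: list[Case],
    metric: Callable[[Case, str], float],
) -> float:
    """Run the system on every case and return the mean score."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    scores = [metric(case, system(case.prompt)) for case in cases]
    return sum(scores) / len(scores)


def fake_system(prompt: str) -> str:
    """Stand-in for a model call; a real system would call an LLM here."""
    return "Liquidity is thin." if "liquidity" in prompt.lower() else "Looks fine."


CASES = [
    Case("Summarize the liquidity risk of pool A", "liquidity"),
    Case("What is the lockup period on vault B?", "lockup"),
]

if __name__ == "__main__":
    print(f"mean score: {run_eval(fake_system, CASES, keyword_metric):.2f}")
```

The dataset and the metric are passed in separately from the system under test, so swapping the prompt or the model changes nothing about how the score is computed.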

Why you can't just eyeball it

The tempting shortcut is to run a few examples by hand, read the outputs, decide they look fine, and ship. It feels like diligence. It isn't.

You can't read every output a model will ever produce, and the handful you do read will flatter you. The cases you think to try are the cases you already designed for. The failures live in the inputs you didn't imagine — the weird phrasing, the edge case, the user who asks something sideways. Eyeballing a few examples doesn't scale, and worse, it quietly lies to you about how good the system is.

A real eval forces honesty. You write down what good means, you collect cases that include the awkward ones, and you let the number tell you where you stand.

Scoring the unscoreable

The obvious objection: some outputs can't be checked with ===. Take a real one from paiddaily. The app scans for yield opportunities and tags each with an opportunity score — a fit rating with a line of reasoning ("strong monthly cashflow, but the liquidity is thin and the position unwinds in six weeks"). There's no answer key. Two careful people would score the same opportunity a 71 and a 79 and both be defensible. So how do you eval that?

You split it. Part of the output does have a right answer, and you check that part directly: the score lands in range, the risk it flags is one that actually exists, the math in the reasoning holds up. The genuinely fuzzy part (is this good judgment?) is where the common move is LLM-as-judge. You hand a model the opportunity, the score, and a written rubric ("does the reasoning name the dominant risk? does it weight monthly income over a long lockup? is the score directionally sane given the yield and the risk?") and let it grade. It feels circular, and the "who watches the watchmen" caveat is real: the judge is itself a model, so its grades deserve a spot-check against a few cases a human has scored. But a judge with a clear rubric is consistent and tireless in a way a human reviewer reading the thousandth opportunity is not.
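The directly checkable half might look something like the sketch below. The field names, the 0 to 100 score range, and the simple "monthly yield times twelve" arithmetic are all assumptions made for illustration, not paiddaily's actual schema or math.

```python
from dataclasses import dataclass


@dataclass
class ScoredOpportunity:
    """Hypothetical shape of one scored opportunity."""
    score: int
    flagged_risk: str
    known_risks: frozenset[str]      # risks actually present in the source data
    monthly_yield_pct: float
    claimed_annual_yield_pct: float  # the figure quoted in the reasoning


def hard_checks(opp: ScoredOpportunity, tolerance: float = 0.01) -> list[str]:
    """Return the deterministic checks that failed (empty list means all passed)."""
    failures = []
    # Assumed range: the post never states the scale, 0-100 is a guess.
    if not 0 <= opp.score <= 100:
        failures.append("score out of range")
    if opp.flagged_risk not in opp.known_risks:
        failures.append("flagged risk not present in source data")
    # Assumed arithmetic: simple annualization with no compounding.
    if abs(opp.monthly_yield_pct * 12 - opp.claimed_annual_yield_pct) > tolerance:
        failures.append("annualized yield does not match monthly yield")
    return failures


RISKS = frozenset({"thin liquidity", "six-week unwind"})
good = ScoredOpportunity(71, "thin liquidity", RISKS, 1.5, 18.0)
bad = ScoredOpportunity(140, "smart contract bug", RISKS, 1.5, 24.0)

if __name__ == "__main__":
    print(hard_checks(good))
    print(hard_checks(bad))
```

Returning the list of failures rather than a single pass/fail makes it easier to see which kind of mistake the system tends to make.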

The skill, it turns out, is mostly in writing the rubric. "Is this a good score?" gives you a meaningless number. "Does it name the dominant risk, respect the income-first thesis, and land within ten points of a sane range?" tells you something true.
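One way to keep the rubric honest is to write each question down as its own item and grade them separately. The sketch below does that with the three rubric questions quoted earlier. The `Judge` type and `stub_judge` are hypothetical: in a real setup the judge would be a call to a model, and the keyword matching here exists only so the example runs offline.

```python
from typing import Callable

RUBRIC = (
    "Does the reasoning name the dominant risk?",
    "Does it weight monthly income over a long lockup?",
    "Is the score directionally sane given the yield and the risk?",
)

# A judge takes (rubric question, reasoning text) and answers yes or no.
Judge = Callable[[str, str], bool]


def grade(reasoning: str, judge: Judge, rubric: tuple[str, ...] = RUBRIC) -> dict[str, bool]:
    """Ask the judge each rubric question separately and collect the verdicts."""
    return {question: judge(question, reasoning) for question in rubric}


def rubric_score(verdicts: dict[str, bool]) -> float:
    """Fraction of rubric questions the output passed."""
    return sum(verdicts.values()) / len(verdicts)


def stub_judge(question: str, reasoning: str) -> bool:
    """Keyword stand-in for an LLM call, only so the sketch runs offline."""
    text = reasoning.lower()
    if "dominant risk" in question:
        return "liquidity" in text
    if "monthly income" in question:
        return "monthly" in text
    return True  # the stub cannot judge sanity, so it passes by default


if __name__ == "__main__":
    reasoning = (
        "strong monthly cashflow, but the liquidity is thin "
        "and the position unwinds in six weeks"
    )
    verdicts = grade(reasoning, stub_judge)
    for question, passed in verdicts.items():
        print(f"{'PASS' if passed else 'FAIL'}  {question}")
    print(f"rubric score: {rubric_score(verdicts):.2f}")
```

Per-question verdicts show which part of the rubric failed, which a single "is this good?" grade cannot do.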

Why you hear "eval" everywhere

Part of what makes the word confusing is that it does four different jobs, and people rarely say which one they mean.

It's a test suite: does my system work, run before I ship. It's an optimization target: the number a tuning process tries to push up. It's a public benchmark: a shared test set like MMLU that models get ranked on. And it's a technique: "an eval" can mean the LLM-as-judge setup itself. Same word, four jobs. When someone says their evals improved, it's worth asking which of the four they're talking about.

It's the same instinct as tests-first

If you've followed this thread, the shape should feel familiar. On the building-with-AI side, I'm stubborn about writing tests before code. On the AI-product side, evals are the same instinct pointed at a different artifact.

Tests are evals for code. Evals are tests for intelligence. Both say the same thing: define what good means and measure against it before you put the thing in front of a person.

Why it's the foundation

Once you have a metric and a dataset, a lot opens up. You can measure where you stand. You can compare two prompts or two models honestly. You can gate releases so a change that drops quality never reaches a user. And — the part everyone's excited about — you can optimize automatically.
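Comparing and gating both reduce to a few lines once per-case scores exist. The sketch below assumes each variant has already been scored on the same cases. The function names, the sample scores, and the 0.02 allowed drop are made up for illustration.

```python
from statistics import mean


def compare(scores_a: list[float], scores_b: list[float]) -> float:
    """Mean per-case difference (b minus a) for two variants run on the same cases."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both variants must be scored on the same cases")
    return mean(b - a for a, b in zip(scores_a, scores_b))


def gate(baseline: list[float], candidate: list[float], max_drop: float = 0.02) -> bool:
    """Allow a release only if the candidate's mean is within max_drop of the baseline's."""
    return mean(candidate) >= mean(baseline) - max_drop


# Made-up per-case scores for a baseline prompt and two candidate prompts.
baseline = [0.8, 0.6, 1.0, 0.7]
improved = [0.9, 0.6, 1.0, 0.8]
regressed = [0.5, 0.6, 0.9, 0.7]

if __name__ == "__main__":
    print(f"improved vs baseline: {compare(baseline, improved):+.3f}")
    print(f"ship improved?  {gate(baseline, improved)}")
    print(f"ship regressed? {gate(baseline, regressed)}")
```

With only a handful of cases, a small difference in means can easily be noise, so a real gate would want a larger dataset before trusting the comparison.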

That last one is where tools like DSPy come in. They take your eval and tune the system to move the number for you, instead of you hand-editing prompts and hoping. But notice the order: the tool is the engine, and the eval is the steering wheel. The engine is useless pointed in no particular direction.

So if you're building anything where a model's output is the product, your first real artifact isn't the clever prompt. It's the eval. Next in this thread, I'll get into what "programming, not prompting" means in practice — and what a tool like DSPy does once you've handed it a metric worth chasing.

