Prompt engineering was never the bottleneck

Saturday, June 27, 2026

When something I build with an LLM misbehaves in production, my reflex is to go rewrite the prompt. Most of the time the prompt was fine, and what actually broke was something in the system around it.

That gap — between the words you send the model and the system that runs the model — is the difference between a demo and a product. The industry spent two years calling the first part "prompt engineering" and treating it as the skill. The part that decides whether your thing survives contact with real users has a less catchy name: the harness.


Two different jobs

Prompt engineering is the craft of wording the input — the instructions, the examples, the format you ask for. It's real, and it matters, and it's also the part that gets all the attention because it's the part you can see. You open a text box, you change some words, the output changes. The feedback loop is immediate and addictive.

Harness engineering is the craft of building the system the model runs inside: the control loop, the tool wiring, the validation that catches a malformed response, the retry that handles a flaky call, the fallback when the first model is down, the budget that stops an agent from looping forever, and the state it carries between steps. None of that is visible in a chat window, and all of it is what fails at 2am.

The prompt is the screenplay. The harness is the film crew, the editing room, and the projector. A good screenplay with no crew is a PDF.

What the harness is actually made of

When I say harness, here's the concrete list — the parts I find myself building and rebuilding on every real project:

The control loop. What runs, in what order, and when it stops. An agent without a hard stop condition is a way to spend money in your sleep.
Structured output handling. The model returns malformed JSON eventually — not often, but eventually. You validate against a schema, repair what you can, and fall back when you can't.
Retries and fallback chains. A flaky API call gets re-attempted. A model that times out gets a smaller, faster backup. The user never sees the wobble.
Tool contracts. When the model calls a function, the arguments can be wrong or invented. You validate them before anything executes, and you make the tools safe to call twice, because sooner or later a retry will call them twice.
Budgets and termination. Maximum steps, maximum tool calls, maximum spend. Guardrails, not vibes.
State and memory. What the system remembers between turns, and what it deliberately forgets.
Context assembly. What goes into the window on each call, in what order, trimmed to fit.
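
To make the first few items concrete, here is a minimal sketch of a control loop with a hard step budget, schema-style validation, retries, and a fallback chain. Everything named here is an assumption for illustration: `Model` is a hypothetical callable that takes a prompt and returns raw text, the "schema" is a single hand-written check, and `SAFE_DEFAULT` stands in for whatever safe answer a given product would degrade to.

```python
import json
from typing import Callable

Model = Callable[[str], str]  # hypothetical: takes a prompt, returns raw text

# Hypothetical safe answer used when no model produces valid output.
SAFE_DEFAULT = {"action": "escalate", "reason": "no valid model output"}


def parse_action(raw: str) -> dict:
    """Validate the reply against a tiny hand-written schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not isinstance(data.get("action"), str):
        raise ValueError("reply must be an object with a string 'action'")
    return data


def call_with_fallback(prompt: str, models: list[Model], retries: int = 2) -> dict:
    """Try each model in order, retrying on timeouts or malformed output."""
    for model in models:
        for _ in range(retries):
            try:
                return parse_action(model(prompt))
            except (ValueError, TimeoutError):
                continue  # a real harness would back off and log here
    return dict(SAFE_DEFAULT)  # degrade to a safe answer, not a stack trace


def run_agent(task: str, models: list[Model], max_steps: int = 5) -> list[dict]:
    """Control loop with a hard stop: at most max_steps model decisions."""
    history: list[dict] = []
    for _ in range(max_steps):
        step = call_with_fallback(f"{task}\nhistory={history}", models)
        history.append(step)
        if step["action"] in ("done", "escalate"):
            break
    return history
```

The point of the shape is that every exit is decided by the harness: the loop ends on `done`, on the safe default, or when `max_steps` runs out, whichever comes first. Nothing depends on the model choosing to stop.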

That last one deserves its own name, because the field gave it one: context engineering. It's often filed under prompting, but it isn't a wording problem. It's a systems problem — retrieval, ordering, compaction, deciding what earns a place in a limited window and what gets cut. You're managing a scarce resource under a hard budget. That's engineering, and it lives in the harness.
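
A minimal sketch of that budgeting problem, under loud assumptions: `estimate_tokens` is a crude word count standing in for a real tokenizer, and `priority` is a hypothetical score that some retrieval or recency logic would have to supply. The idea is to spend the budget on the highest-priority chunks first, then put the survivors back in their original order.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    priority: int  # higher means more important; hypothetical scoring


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def assemble_context(system: str, chunks: list[Chunk], budget: int) -> str:
    """Keep the system text, then admit chunks by priority until the budget is spent."""
    remaining = budget - estimate_tokens(system)
    kept: list[tuple[int, Chunk]] = []
    # Highest priority first; ties keep their original relative order.
    ranked = sorted(enumerate(chunks), key=lambda pair: -pair[1].priority)
    for index, chunk in ranked:
        cost = estimate_tokens(chunk.text)
        if cost <= remaining:
            kept.append((index, chunk))
            remaining -= cost
    kept.sort(key=lambda pair: pair[0])  # restore original document order
    return "\n".join([system] + [chunk.text for _, chunk in kept])
```

With a budget of 8 "tokens", a 2-word system line, and three chunks costing 5, 3, and 3, the low-priority 5-word chunk is the one that gets cut. Whether greedy selection is the right policy is a design choice; summarizing the dropped chunk instead of discarding it is an equally plausible one.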

Why the prompt gets the credit anyway

Two reasons. The prompt is visible, and the prompt is where the magic feels like it lives. You change a sentence and the model gets smarter in front of you, so your brain assigns the win to the sentence.

But watch where the time actually goes once something is live. The bug reports aren't "the wording was off." They're "it crashed on this one weird input," "it called the tool with a null and fell over," "it returned half a JSON object," "it ran for ninety seconds because nothing told it to stop." Every one of those is a harness failure wearing a prompt costume. You can rewrite the instructions all day and never touch the actual fault.
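
The "called the tool with a null" report is the kind of failure a tool contract is meant to catch. Here is a minimal sketch, assuming a hypothetical `spec` that maps argument names to expected Python types and a hypothetical `call_id` that the harness assigns to each tool call. A production version would more likely lean on a schema library, and would persist the replay cache rather than keep it in memory.

```python
from typing import Any, Callable


class ToolArgumentError(ValueError):
    """Raised when model-supplied arguments fail the tool's contract."""


def validate_args(args: Any, spec: dict[str, type]) -> dict[str, Any]:
    """Reject missing, null, mistyped, or unexpected arguments before executing."""
    if not isinstance(args, dict):
        raise ToolArgumentError("arguments must be an object")
    unexpected = set(args) - set(spec)
    if unexpected:
        raise ToolArgumentError(f"unexpected arguments: {sorted(unexpected)}")
    for name, expected in spec.items():
        value = args.get(name)  # missing and null both come back as None
        if value is None or not isinstance(value, expected):
            raise ToolArgumentError(f"{name!r} must be {expected.__name__}")
    return args


class IdempotentTool:
    """Wraps a tool so a repeated call with the same call_id runs only once."""

    def __init__(self, func: Callable[..., Any], spec: dict[str, type]):
        self.func = func
        self.spec = spec
        self._results: dict[str, Any] = {}  # in-memory only; an assumption

    def call(self, call_id: str, args: Any) -> Any:
        if call_id in self._results:  # replayed call: return the cached result
            return self._results[call_id]
        result = self.func(**validate_args(args, self.spec))
        self._results[call_id] = result
        return result
```

Validation runs before the wrapped function is touched, so a bad argument becomes a `ToolArgumentError` the loop can report back to the model instead of a crash inside the tool.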

The demo runs on the happy path, where the prompt is the whole story. Production runs on the long tail, where the harness is.

The test I use now

Here's the question that sorts the two cleanly: if you swapped the model underneath, would your thing still work?

If you can drop in a different model — a cheaper one, a local one, next quarter's release — and the system keeps doing its job because the validation, retries, and control flow hold the shape, then you built a harness. The model is a component, and components are replaceable.
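
One way to picture the swap test in code, as a sketch rather than a recipe: the harness depends only on a small interface, and the behavior that matters (normalizing the reply, enforcing the allowed labels) lives in the harness. The `Model` protocol, the label set, and both fake models below are hypothetical stand-ins, not any vendor's API.

```python
from typing import Protocol


class Model(Protocol):
    """The only thing the harness assumes about a model."""

    def complete(self, prompt: str) -> str: ...


LABELS = ("billing", "bug", "other")


def classify(ticket: str, model: Model) -> str:
    """Harness-owned behavior: normalize the reply and enforce the label set."""
    reply = model.complete(f"Label this ticket as one of {LABELS}: {ticket}")
    cleaned = reply.strip().strip(".\"'").lower()
    return cleaned if cleaned in LABELS else "other"


class TerseModel:
    """Fake model that answers with the bare label."""

    def complete(self, prompt: str) -> str:
        return "billing"


class ChattyModel:
    """Fake model with different formatting quirks: quotes, capitals, a period."""

    def complete(self, prompt: str) -> str:
        return ' "Billing." '
```

Both `classify("I was charged twice", TerseModel())` and `classify("I was charged twice", ChattyModel())` come back as `billing`, and a reply outside the label set falls through to `other`. The quirks differ; the contract the rest of the system sees does not.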

If swapping the model breaks everything because the whole behavior was balanced on one model's quirks and one carefully tuned prompt, then you didn't build a system. You built a prompt with extra steps, and you're one model deprecation away from starting over.

This is also why evals matter more than instinct: an eval measures the whole harness on real inputs, not the prompt on the inputs you happened to imagine. It's the only way to know whether a change helped or just moved the failure somewhere you weren't looking.
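
A minimal sketch of what "measures the whole harness" can mean, assuming the system under test is exposed as a single hypothetical callable from input text to output text, and that exact-match comparison is good enough for the cases at hand (for many tasks it is not, and a scoring function would replace the `!=`).

```python
from typing import Callable


def run_eval(system: Callable[[str], str], cases: list[tuple[str, str]]) -> dict:
    """Run the whole system on each (input, expected) case; a crash counts as a failure."""
    failures = []
    for given, expected in cases:
        try:
            actual = system(given)
        except Exception as exc:  # the eval itself must not crash on a bad case
            actual = f"crashed: {type(exc).__name__}"
        if actual != expected:
            failures.append((given, expected, actual))
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}
```

Running the same case list before and after a change gives two pass rates and two failure lists to compare, which is how a fix that only moved the failure shows up. The cases are worth more when they come from real inputs, including the weird ones from bug reports.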

Where this leaves me

I run a stack of agents that make decisions with real consequences, and almost none of the engineering that keeps them honest is in the prompts. It's in the layer that refuses to crash when a side input goes bad, that degrades to a safe answer instead of a stack trace, that stops itself before it runs the bill up. The prompts are maybe a fifth of the work. The harness is the rest, and the rest is what I'd put my name on.

So when people ask what skill to build for this wave, I don't say prompt engineering. I say learn to build the thing around the model — the retries, the validation, the loops, the budgets, the question of how many agents you even need. The prompt was never the bottleneck. It was just the part standing in the light.
