Build Daily

Tinley Park · June 28, 2026

Prompt engineering was never the bottleneck

When something I build with an LLM misbehaves in production, the reflex is to go rewrite the prompt. Most of the time the prompt was fine, and the thing that actually broke was everything around it.

That gap — between the words you send the model and the system that runs the model — is the difference between a demo and a product. The industry spent two years calling the first part "prompt engineering" and treating it as the skill. The part that decides whether your thing survives contact with real users has a less catchy name: the harness.

Prompt vs harness A small lit prompt card above the line — the visible fifth of the work — sitting over a large shaded harness panel below the line that holds the seven parts that run in production: control loop, structured output, retries and fallback, tool contracts, budgets and termination, state and memory, and context assembly. A footer states the test: swap the model and a harness keeps its shape. PROMPT vs HARNESS The model is a fifth of the work. The system around it is the rest. ▲ IN THE LIGHT · what you see · ≈ 1/5 THE PROMPT the wording you can see · instructions, examples, format the whole story in a demo · the part standing in the light PRODUCTION ▼ IN THE DARK · what fails at 2am · ≈ 4/5 THE HARNESS the system the model runs inside CONTROL LOOP what runs, in what order, and when it stops STRUCTURED OUTPUT validate the JSON · repair, or fall back RETRIES & FALLBACK re-attempt flaky calls · smaller backup model TOOL CONTRACTS check the arguments before anything runs BUDGETS & STOPS max steps · max calls · max spend STATE & MEMORY what it keeps between turns, and forgets CONTEXT ASSEMBLY what earns a place in a limited window THE TEST swap the model: a harness keeps its shape; a prompt with extra steps starts over.
↗ click to enlarge

Two different jobs

Prompt engineering is the craft of wording the input — the instructions, the examples, the format you ask for. It's real, and it matters, and it's also the part that gets all the attention because it's the part you can see. You open a text box, you change some words, the output changes. The feedback loop is immediate and addictive.

Harness engineering is the craft of building the system the model runs inside: the control loop, the tool wiring, the validation that catches a malformed response, the retry that handles a flaky call, the fallback when the first model is down, the budget that stops an agent from looping forever, and the state it carries between steps. None of that is visible in a chat window, and all of it is what fails at 2am.

The prompt is the screenplay. The harness is the film crew, the editing room, and the projector. A good screenplay with no crew is a PDF.

What the harness is actually made of

When I say harness, here's the concrete list — the parts I find myself building and rebuilding on every real project:

  • The control loop. What runs, in what order, and when it stops. An agent without a hard stop condition is a way to spend money in your sleep.
  • Structured output handling. The model returns malformed JSON eventually — not often, but eventually. You validate against a schema, repair what you can, and fall back when you can't.
  • Retries and fallback chains. A flaky API call gets re-attempted. A model that times out gets a smaller, faster backup. The user never sees the wobble.
  • Tool contracts. When the model calls a function, the arguments can be wrong or invented. You validate them before anything executes, and you make the tools safe to call twice.
  • Budgets and termination. Maximum steps, maximum tool calls, maximum spend. Guardrails, not vibes.
  • State and memory. What the system remembers between turns, and what it deliberately forgets.
  • Context assembly. What goes into the window on each call, in what order, trimmed to fit.

That last one deserves its own name, because the field gave it one: context engineering. It's often filed under prompting, but it isn't a wording problem. It's a systems problem — retrieval, ordering, compaction, deciding what earns a place in a limited window and what gets cut. You're managing a scarce resource under a hard budget. That's engineering, and it lives in the harness.

Why the prompt gets the credit anyway

Two reasons. The prompt is visible, and the prompt is where the magic feels like it lives. You change a sentence and the model gets smarter in front of you, so your brain assigns the win to the sentence.

But watch where the time actually goes once something is live. The bug reports aren't "the wording was off." They're "it crashed on this one weird input," "it called the tool with a null and fell over," "it returned half a JSON object," "it ran for ninety seconds because nothing told it to stop." Every one of those is a harness failure wearing a prompt costume. You can rewrite the instructions all day and never touch the actual fault.

The demo runs on the happy path, where the prompt is the whole story. Production runs on the long tail, where the harness is.

The test I use now

Here's the question that sorts the two cleanly: if you swapped the model underneath, would your thing still work?

If you can drop in a different model — a cheaper one, a local one, next quarter's release — and the system keeps doing its job because the validation, retries, and control flow hold the shape, then you built a harness. The model is a component, and components are replaceable.

If swapping the model breaks everything because the whole behavior was balanced on one model's quirks and one carefully-tuned prompt, then you didn't build a system. You built a prompt with extra steps, and you're one model deprecation away from starting over.

This is also why evals matter more than instinct: an eval measures the whole harness on real inputs, not the prompt on the inputs you happened to imagine. It's the only way to know whether a change helped or just moved the failure somewhere you weren't looking.

Where this leaves me

I run a stack of agents that make decisions with real consequences, and almost none of the engineering that keeps them honest is in the prompts. It's in the layer that refuses to crash when a side input goes bad, that degrades to a safe answer instead of a stack trace, that stops itself before it runs the bill up. The prompts are maybe a fifth of the work. The harness is the rest, and the rest is what I'd put my name on.

So when people ask what skill to build for this wave, I don't say prompt engineering. I say learn to build the thing around the model — the retries, the validation, the loops, the budgets, the question of how many agents you even need. The prompt was never the bottleneck. It was just the part standing in the light.

  • #agents
  • #ai
  • #llm
  • #building-in-public
  • #engineering

Continue reading