When to fine-tune an LLM — and when to skip it

Saturday, May 23, 2026

Fine-tuning is the first lever most teams reach for when an LLM isn't behaving. It's the right lever far less often than that reflex suggests.

Most workloads don't need it. Some absolutely do. The trick is knowing which one you're holding before you sink a quarter into a fine-tune that prompting could have done in a week — or worse, into a fine-tune that's obsolete the next time the base model jumps.

A decision matrix to start, so you can see the call before the prose:

The prose below expands each row.

When to skip it (the default)

Five tests. If any is true, fine-tuning is probably not where the win is.

The prompting path is unproven. A signature, a few typed examples, an optimizer like DSPy compiling against an eval set — that's a week of work and it gets you most of what people use fine-tuning for. Format compliance, domain language, smaller prompts, fewer retries. If you haven't compiled a prompted module yet, you don't know how much daylight is left.

The eval set doesn't exist. Fine-tuning without an eval is optimizing without profiling. You'll get a number — accuracy, BLEU, win rate, whatever — and you'll defend the number because you produced it. The eval set is the unblock. Without it, no technique you reach for is measurable.
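
A minimal sketch of what "having an eval set" can mean in practice, assuming a classification-style task scored by exact match. The names here (`EvalExample`, `exact_match_accuracy`, `keyword_baseline`) and the four toy examples are hypothetical; the stand-in predictor marks where a prompted or fine-tuned model call would go.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalExample:
    """One labeled example: an input and the output you would accept."""
    prompt: str
    expected: str


def exact_match_accuracy(
    predict: Callable[[str], str],
    examples: Sequence[EvalExample],
) -> float:
    """Fraction of examples where the prediction equals the label.

    Case and surrounding whitespace are normalized. Swap in a task-specific
    scorer (schema validation, rubric grading) where exact match is too strict.
    """
    if not examples:
        raise ValueError("eval set is empty; there is nothing to measure")
    hits = sum(
        predict(ex.prompt).strip().lower() == ex.expected.strip().lower()
        for ex in examples
    )
    return hits / len(examples)


# Hypothetical eval set, for illustration only.
EVAL_SET = [
    EvalExample("Card was charged twice", "billing"),
    EvalExample("App crashes on login", "bug"),
    EvalExample("How do I export my data?", "how-to"),
    EvalExample("Refund has not arrived", "billing"),
]


def keyword_baseline(prompt: str) -> str:
    """Toy predictor; replace with a prompted or fine-tuned model call."""
    text = prompt.lower()
    if "charged" in text or "refund" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "other"


if __name__ == "__main__":
    print(exact_match_accuracy(keyword_baseline, EVAL_SET))  # 0.75
```

The same function scores any candidate (prompted, compiled, or fine-tuned), which is the point: one fixed measurement, many techniques.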

The frontier is still moving fast. The gap between frontier model generations — Opus, Sonnet, Haiku, GPT, Gemini, the lot — is currently measured in months, and each generation shifts the prompt-to-output curve enough that a fine-tune on the prior generation has to be redone, or worse, gets beaten by a vanilla prompt on the new one. The price of waiting another quarter is usually negative.

Call volume is low. Inference on a frontier model is a known per-call cost. A fine-tune is a fixed setup cost, an ongoing hosting cost (or a per-token premium on the hosted version), and a maintenance tax every time the base model jumps. Below some threshold, vanilla wins on math alone. For most teams that threshold is higher than they think — somewhere north of 10M tokens/day on a single task.
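
The threshold reduces to one line of arithmetic. A rough sketch, assuming the fine-tuned model is cheaper per token and carries a fixed daily cost (hosting plus amortized setup). The function names are hypothetical and every figure in the example is a placeholder, not a real price.

```python
def amortized_fixed_usd_per_day(
    setup_usd: float,
    retrain_every_days: float,
    hosting_usd_per_day: float,
) -> float:
    """Spread the setup cost over one retraining cycle and add daily hosting."""
    return setup_usd / retrain_every_days + hosting_usd_per_day


def break_even_tokens_per_day(
    vanilla_usd_per_mtok: float,
    tuned_usd_per_mtok: float,
    fixed_usd_per_day: float,
) -> float:
    """Daily token volume above which the fine-tune is the cheaper option.

    Daily cost model (an assumption of this sketch):
        vanilla = volume * vanilla_price
        tuned   = volume * tuned_price + fixed_cost
    Equal when volume = fixed_cost / (vanilla_price - tuned_price).
    Returns infinity when the tuned model is not cheaper per token.
    """
    saving_per_mtok = vanilla_usd_per_mtok - tuned_usd_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")
    return fixed_usd_per_day / saving_per_mtok * 1_000_000


if __name__ == "__main__":
    # Placeholder numbers only: $900 setup redone every 90 days, $20/day hosting.
    fixed = amortized_fixed_usd_per_day(900.0, 90.0, 20.0)   # 30.0
    # Placeholder prices: $3.00 vs $1.00 per million tokens.
    print(break_even_tokens_per_day(3.0, 1.0, fixed))         # 15000000.0
```

Plug in your own prices; the shape of the result matters more than the placeholder output. A small per-token saving or a heavy fixed cost pushes the break-even volume up fast.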

The failure is upstream. When the agent goes sideways, look at the trace. Most of the time it's retrieval being wrong, the spec being loose, the routing being off, a tool returning something the prompt didn't anticipate. Fine-tuning addresses the model. The model is usually fine.

If any of those five describes your situation, the decision is easy: don't fine-tune. Spend the same hours on the eval set, the retrieval layer, and the spec, and you'll get a bigger lift.

When to reach for it

Four conditions. Any one of these is a real reason; two stacked is a slam dunk.

Latency. You need sub-100ms responses and a prompted 8B model can't get there. A 1B or 3B model fine-tuned on a single narrow task can. Real-time pipelines, interactive UX, on-device inference — this is where small-fine-tuned beats large-prompted on raw math.

Format compliance the prompt path can't reach. A domain-specific grammar — a query DSL, a structured output the upstream library refuses to parse, a function-calling surface where any deviation breaks the tool. You've tried prompting plus structured-output plus retries and the failure rate still sits above the bar the product needs. Fine-tuning shapes the output distribution at the weights level in a way prompting can only approximate.

Volume that flips the math. The per-call premium on a hosted fine-tune drops below the cost of running the vanilla model at the same throughput. Or you're at a scale where shaving 30% off prompt tokens is worth a fixed annual setup. Both pencil out somewhere past the 10M-tokens-per-day-per-task mark.

Labels falling out of normal operation. Every production call generates a clean labeled example as a side effect — a user accepted or edited the output, a downstream system reported success, a reviewer rated it. The eval set builds itself, the training set builds itself, and the marginal cost of the next fine-tune approaches zero. At that point the only remaining question is whether the lift is real, and you can measure that cheaply.
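
One way the "labels as a side effect" idea could look in code. This is a sketch under assumptions: the `ProductionEvent` shape, the `accepted`/`edited` label names, and the JSONL row layout are all hypothetical, not any provider's required training format.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProductionEvent:
    """One production call plus what the user did with the result."""
    prompt: str
    model_output: str
    final_output: Optional[str]  # text the user kept; None if they discarded it


def to_labeled_row(event: ProductionEvent) -> Optional[dict]:
    """Turn an event into a training row, or None when there is no usable target.

    Discarded outputs are dropped here because they carry no target text;
    a real pipeline might keep them as negative examples instead.
    """
    if event.final_output is None:
        return None
    label = "accepted" if event.final_output == event.model_output else "edited"
    return {
        "prompt": event.prompt,
        "completion": event.final_output,  # always train toward what shipped
        "label": label,
    }


def to_jsonl(events: Iterable[ProductionEvent]) -> str:
    """Serialize all usable events, one JSON object per line."""
    rows = (to_labeled_row(e) for e in events)
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows if r is not None)


if __name__ == "__main__":
    events = [
        ProductionEvent("Summarize ticket 1", "Login fails.", "Login fails."),
        ProductionEvent("Summarize ticket 2", "Slow page.", "Checkout page is slow."),
        ProductionEvent("Summarize ticket 3", "Unclear.", None),
    ]
    print(to_jsonl(events))
```

Holding out a slice of these rows gives the eval set; the rest is the training set.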

If two of these are true at the same time — say, latency and labels-falling-out — the case is overwhelming. One alone usually still wins.

How to actually do it

If you've decided yes, the work is more procedural than people assume. Seven steps:

01. Narrow the task (Scope)
Not "make the agent smarter." A specific scoped behavior: classify a ticket into one of fourteen routes; rewrite a paragraph in the brand voice; emit a SQL query against a fixed schema. The narrower the task, the better the fine-tune.

02. Pull a clean dataset (500–5,000 examples)
The surprising number. People assume fine-tuning needs millions; for most narrow tasks, low thousands is the sweet spot. Past that, diminishing returns hit fast. Quality of labels beats quantity every time.

03. Pick a small base model (Haiku-class)
Almost always a better return than fine-tuning a large one. A Haiku takes the shape of the task; an Opus already knew the task and you're paying to nudge it. (And in practice the largest models aren't fine-tunable on most providers; the option set tilts you toward small whether you wanted it or not.)

04. Hold out an eval set (Measurement)
Same labels, never seen during training. Usually 10–20% of the dataset. The single measurement that tells you whether the fine-tune is doing anything real. Without it, you're guessing.

05. Train (LoRA · 1–3 epochs)
Provider-hosted via the API (Anthropic, OpenAI, Bedrock) takes minutes to kick off and hours to finish. Self-hosted via Unsloth or Axolotl on a rented A100 runs in under an hour for most datasets, or via MLX on an Apple Silicon Mac overnight if you'd rather not touch cloud. LoRA adapters keep the artifact small and cheap to swap.

06. Compare against baseline (vs prompted frontier)
Run the held-out eval against both the fine-tune and the prompted frontier baseline. Score accuracy, p95 latency, cost per call. If the fine-tune doesn't decisively beat the baseline on at least one of the three, retire it before it ships.

07. Plan for maintenance (Re-train quarterly)
When the base model gets an upgrade, re-train from the same dataset and re-evaluate. If the new vanilla base beats your old fine-tune, retire the fine-tune. That discipline is what separates a useful fine-tune from a frozen artifact.
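
Steps 04 and 06 are the two that are easiest to get subtly wrong, so here is a small sketch of both. Assumptions are mine, not the steps': the names `split_holdout`, `Scorecard`, and `should_ship` are hypothetical, the 15% holdout is just a value inside the 10–20% range, and "decisively" is interpreted as a 10% relative improvement, a threshold you should set for your own product.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_holdout(
    examples: Sequence[T],
    holdout_fraction: float = 0.15,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically, then return (train, holdout) with no overlap."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


@dataclass(frozen=True)
class Scorecard:
    """Held-out results for one candidate. All values must be positive."""
    accuracy: float        # higher is better
    p95_latency_ms: float  # lower is better
    cost_per_call: float   # lower is better


def should_ship(
    tuned: Scorecard,
    baseline: Scorecard,
    min_relative_gain: float = 0.10,  # assumed meaning of "decisively"
) -> bool:
    """True if the fine-tune beats the baseline by the margin on any one metric."""
    gains = [
        (tuned.accuracy - baseline.accuracy) / baseline.accuracy,
        (baseline.p95_latency_ms - tuned.p95_latency_ms) / baseline.p95_latency_ms,
        (baseline.cost_per_call - tuned.cost_per_call) / baseline.cost_per_call,
    ]
    return any(gain >= min_relative_gain for gain in gains)


if __name__ == "__main__":
    train, holdout = split_holdout(list(range(1000)))
    print(len(train), len(holdout))  # 850 150
    # Made-up scorecards: the fine-tune is slightly less accurate but far faster.
    print(should_ship(Scorecard(0.90, 80.0, 0.002), Scorecard(0.91, 900.0, 0.010)))  # True
```

The rule mirrors step 06 literally, so a fine-tune that wins on latency but loses accuracy still passes. Adding a floor on accuracy regression is a reasonable extension if that trade is not acceptable for the task.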

What it costs

Four shapes the bill arrives in. Pick the one that fits where you'll run it.

One project we can share — The Ghost

A small fine-tuned drafter — a ghostwriter for this site. Trained on every builddaily.io post I've published (plus the heaviest daily logs, where the voice is rawest). Takes a seed — topic, angle, target length, kind — and emits a first draft I can edit instead of rewrite. Every edit-diff goes back into the training set; the more I publish, the closer the next draft starts to where I'd have shipped anyway.

A fit note: of all the candidates I considered for this slot, The Ghost is the one where n=1 actually works. Two reach conditions stack cleanly — format compliance (voice is the trickiest format there is; few-shot prompting gets the diction but never quite nails cadence, the opinion-vs-disclaimer balance, the rhythm of section breaks) and labels-falling-out-of-operation (every published post is a positive example, every edit-diff between draft and final is a high-signal label). The volume condition doesn't apply — I write ~one post a week, not millions — but volume isn't what makes this case; the format gap is. The cold start is solid because the archive already has a dozen published posts plus a year of daily logs to seed from.

The path through:

The seed is a few lines — topic: X, angle: Y, target length: Z, kind: editorial/essay. Could be as terse as a chat message; could be a fuller brief if the topic warrants.
The Ghost emits a first draft. Architecturally: a small base model (Haiku 4.5 via the Anthropic fine-tune API, or a local 3B via MLX on the MacBook) plus a LoRA adapter trained on the post archive plus the heaviest daily logs. The base supplies competence; the adapter supplies voice.
The draft comes back. Some passes will be most-of-a-post; others will be a scaffold I rewrite. Either is useful — the rewrite is the label.
I edit and ship. The edit-diff (what I rewrote, deleted, added) lands in the next training row. Voice doesn't come from the published version alone; it comes from the gap between what the model emitted and what I shipped. That gap is the highest-signal label this loop produces.
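
The last step of that path, capturing the gap between draft and final, can be sketched with the standard library's `difflib`. This is an illustration, not The Ghost's actual pipeline: the function name `edit_diff_row`, the row fields, and the line-level `kept_ratio` score are assumptions made for the example.

```python
import difflib


def edit_diff_row(seed: str, draft: str, final: str) -> dict:
    """Build one training row from a seed, the model's draft, and the shipped text.

    kept_ratio is difflib's line-level similarity in [0, 1]: 1.0 means the
    draft shipped untouched, lower values mean heavier rewriting.
    """
    draft_lines = draft.splitlines()
    final_lines = final.splitlines()
    matcher = difflib.SequenceMatcher(a=draft_lines, b=final_lines, autojunk=False)
    diff_lines = difflib.unified_diff(
        draft_lines, final_lines, fromfile="draft", tofile="final", lineterm=""
    )
    return {
        "seed": seed,
        "draft": draft,
        "final": final,  # the positive example
        "kept_ratio": matcher.ratio(),
        "diff": "\n".join(diff_lines),  # empty string when nothing changed
    }


if __name__ == "__main__":
    row = edit_diff_row(
        seed="topic: fine-tuning, kind: essay",
        draft="Fine-tuning is powerful.\nMost teams should try it.\nHere is why.",
        final="Fine-tuning is powerful.\nMost teams should skip it.\nHere is why.",
    )
    print(round(row["kept_ratio"], 2))  # 0.67
    print(row["diff"])
```

Storing both the full final text and the diff keeps the options open: the final text works as a plain supervised target, while the diff and ratio can be used to weight or filter rows.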

Why this project demonstrates the case:

Format compliance is the cleanest fit there is. Voice is the format prompting can't reliably reach. Few-shot examples in context get diction; they don't get the rhythm of how I open a section, the asymmetric weight between opinion and disclaimer, the specific places where I let a sentence run long. A fine-tune on the archive captures that distribution at the weights level.
Labels build themselves. Every published post is a positive example; every edit-diff is a graded example — what was off, what was on. Higher signal than typical accept/dismiss labels.
n=1 is fine here. Unlike a routing or per-user-personalization play, this fine-tune isn't competing against a frontier-prompted version on cost or latency. It's competing on voice fidelity. At any scale where you're writing in your own voice consistently, the fine-tune beats the prompt — because the prompt context window is finite and the archive isn't.
The first ship is the demo. No public repo, no separate test surface. The next builddaily.io post comes through The Ghost. If the first draft reads closer to publishable than rewriting-from-scratch, the loop earns its slot in the stack. If it doesn't, this post is the only one on the topic.

What's next

First slice — The Ghost v0 on either Haiku 4.5 (Anthropic fine-tune API, single-shot ~$25) or a local 3B via MLX on the MacBook (overnight, free) — target by 2026-06-06. The next post on this site comes through it. The proof of work is the post itself.

The trigger matters more than the technique

The interesting question is rarely how to fine-tune — the playbooks are public, the providers do most of the lifting, the gap between a mediocre fine-tune and a good one is mostly dataset quality.

The interesting question is whether you should — and for most workloads the honest answer is no, with the time better spent on the eval set or the retrieval layer. For The Ghost, the answer is yes. The next post on this site is how we'll find out if I was right.
