Build Daily

Tinley Park · May 29, 2026
sdk · unsloth.ai (Daniel & Michael Han) · watching

unsloth

The free, locally runnable fine-tune accelerator. Wraps PEFT (LoRA / QLoRA) + TRL (SFTTrainer) into one library that's 2–5× faster and uses up to ~70% less VRAM than the stock toolchain. The honest free-path answer to "where do I actually run a fine-tune?" Runs on Google Colab T4 (free), Kaggle (free), local CUDA / ROCm GPU, or any spot-priced GPU box. Apache 2.0.

Updated May 24, 2026

unsloth is the library that makes "actually fine-tune an 8B model" achievable on free compute. It's the wrapper that turns Hugging Face's peft + trl toolchain into something you can run from a free Colab notebook in an afternoon. This page covers orientation and setup; the official docs at docs.unsloth.ai own the API contract.

What it is

A Python library wrapping transformers + peft + trl with aggressive optimizations: custom kernels, 4-bit quantization done right, and a clean FastLanguageModel class that loads a base model + LoRA adapter in one call. Apache 2.0. pip install unsloth (or the longer install line per the docs — pulls torch + a specific CUDA toolchain).

What it accelerates:

  • QLoRA fine-tuning — 4-bit base model + trainable LoRA adapters. The canonical "fine-tune a 7-8B on a 24GB GPU" technique, made faster.
  • Continued pretraining — domain-adaptive training on raw text. Same shape, different objective.
  • Inference of LoRA-adapted models — the loaded model serves the adapter on top of the quantized base.

The pitch: PEFT + TRL alone do the job, but slowly and with finicky memory. unsloth makes the same job runnable on free compute (Colab T4) in a sane amount of time. The library is the difference between "fine-tuning is a paid step" and "fine-tuning is something I can do this weekend on a free notebook."

When to use it

Reach for it when:

  • The fine-tune target is a supported model family — Llama 3.x, Mistral, Qwen 2.5/3, Gemma 2, Phi-3. The compatibility list is the gate.
  • You want to stay free — Colab + unsloth is the most established no-spend fine-tune path on the open web today.
  • You have a single GPU (consumer or rented) and need every gigabyte of VRAM. unsloth's memory math runs leaner than stock by 30–70%.
  • You want a clean code shape — load model, define LoRA config, run trainer, save adapter. unsloth's API is ~30 lines for a working fine-tune.

Skip it when:

  • The model family isn't supported — fall back to stock transformers + peft + trl.
  • You need multi-GPU distributed training — unsloth has multi-GPU support but stock accelerate is the better-tested path at multi-node scale.
  • You're doing full-fine-tune (not LoRA / QLoRA) on a huge model — different memory math; reach for transformers + DeepSpeed / FSDP.

At a glance

Core surface

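from unsloth import FastLanguageModel  # import unsloth before trl / transformers
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load base + prepare for LoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
)

# Training data: JSONL with one "text" field per row.
# "train.jsonl" is a placeholder path, not part of the original snippet.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Train (TRL SFTTrainer with unsloth-aware optimizer)
# Newer TRL releases move tokenizer / dataset_text_field / max_seq_length
# into SFTConfig; match the signature to the TRL version the unsloth docs pin.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        output_dir="outputs",  # placeholder path, not in the original
        per_device_train_batch_size=2,
        num_train_epochs=3,
        # ... remaining arguments elided
    ),
)
trainer.train()

# Save adapter + merge to GGUF for Ollama
model.save_pretrained("my-lora-adapter")
model.save_pretrained_gguf("my-merged-gguf", tokenizer, quantization_method="q4_k_m")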
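For a sense of scale, the r=16 attention-only LoRA config above can be sized by hand. The sketch below is back-of-envelope arithmetic, not anything unsloth exposes: lora_param_count and LLAMA_31_8B_ATTN are illustrative names, and the shapes assume Llama 3.1 8B's published dimensions (hidden size 4096, 8 KV heads of dimension 128, 32 layers).

```python
# Attention projection shapes (d_in, d_out) for Llama 3.1 8B.
# Assumption: hidden size 4096, k/v width 8 * 128 = 1024, 32 decoder layers.
LLAMA_31_8B_ATTN = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}


def lora_param_count(r, shapes, n_layers):
    """Each adapted linear layer adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


n_params = lora_param_count(16, LLAMA_31_8B_ATTN, 32)
print(f"{n_params:,} trainable LoRA parameters")  # 13,631,488
print(f"~{n_params * 4 / 1e6:.1f} MB at 32-bit")   # ~54.5 MB
print(f"~{n_params * 2 / 1e6:.1f} MB at 16-bit")   # ~27.3 MB
```

That is well under 1% of the 8B base. Raising r or adding the MLP projections to target_modules grows the adapter proportionally.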

What's bundled

  • FastLanguageModel — the high-level model wrapper. Loads quantized; preps for PEFT; serves at inference.
  • 4-bit kernels — fused matmul kernels for 4-bit weights. The throughput win lives here.
  • get_peft_model — LoRA adapter setup, defaults tuned for the supported families.
  • save_pretrained_gguf — direct export to GGUF for Ollama. The bridge from training to GL serving.
  • save_pretrained_merged — full-precision merged weights for any post-training inference path.

Compute footprint per family

Model family | Free-tier path | VRAM at QLoRA
Llama 3.1 8B | Colab T4 (free), fits | ~7-10 GB
Mistral 7B | Colab T4 (free), fits | ~7-9 GB
Qwen 2.5 7B | Colab T4 (free), fits | ~7-9 GB
Gemma 2 9B | Colab T4 (tight) | ~10-12 GB
Llama 3.1 70B | Not free: rent an A100 / H100 spot | ~40-48 GB
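A rough way to sanity-check these figures is to compute the floor set by the 4-bit weights alone. The sketch below uses nominal parameter counts (8, 7, 9, 70 billion) and a flat 4 bits per weight; both are simplifying assumptions, and quantized_weight_gb is an illustrative helper, not a library function.

```python
def quantized_weight_gb(n_params_billion, bits_per_weight=4.0):
    """Memory for the quantized base weights only, in GB (1e9 bytes).

    Ignores quantization constants, LoRA parameters, optimizer state,
    activations, and the CUDA context, which account for the rest of
    the VRAM reported at training time.
    """
    return n_params_billion * bits_per_weight / 8


# Nominal sizes in billions of parameters (assumed round numbers).
NOMINAL_SIZES_B = {
    "Llama 3.1 8B": 8,
    "Mistral 7B": 7,
    "Qwen 2.5 7B": 7,
    "Gemma 2 9B": 9,
    "Llama 3.1 70B": 70,
}

for name, size_b in NOMINAL_SIZES_B.items():
    print(f"{name}: ~{quantized_weight_gb(size_b):.1f} GB of 4-bit weights")
```

The 8B row comes out at about 4 GB of weights, so the ~7-10 GB in the table leaves a few gigabytes for everything else. At 70B the weights alone are about 35 GB, which is why that row falls outside a T4's 16 GB.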

How to integrate

Default integration for a GL fine-tune slice (slice 4 of the agent-stack post):

  1. Pick the compute target.
    • Colab Free for the first run — fast iteration, kills the spend question, T4 GPU.
    • Kaggle if Colab session limits bite — 9h sessions and 30 GPU-hours/week.
    • Local CUDA box if one's available — fastest iteration cycle.
    • vast.ai / RunPod spot if free becomes the bottleneck — ~$1-2/run for an 8B QLoRA.
  2. Format the dataset. JSONL with one row per training example: {"text": "<formatted prompt + response>"} or (input, output) per the chat template. The (content_seed, retrieved_context) → final_post pairs from the drafter's editing loop go here.
  3. Run the notebook. unsloth ships official Colab notebooks per supported family. Clone, swap the dataset, run.
  4. Save the adapter. Adapters are tiny (~50-200 MB for an 8B base). Commit to the repo (or a Hugging Face private repo); the base model is recoverable from the hash pin.
  5. Convert to GGUF. model.save_pretrained_gguf("...", tokenizer, quantization_method="q4_k_m"). The output is Ollama-loadable.
  6. ollama create. Build a Modelfile that points at the GGUF + sets system prompt + parameters; ollama create my-drafter -f Modelfile. The DSPy module's LM config now points at ollama_chat/my-drafter.
  7. A/B against vanilla. Per the eval plan in the agent-stack post — same DSPy module, two backends. The fine-tune ships only on a margin win.
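Steps 2, 6, and 7 above can be sketched with the standard library alone. Everything named here is hypothetical: the prompt layout in format_example, the helper names, the default temperature, and the margin value are placeholders, and the real prompt should follow the base model's chat template. The Modelfile directives used (FROM, SYSTEM, PARAMETER) are standard Ollama syntax.

```python
import json
import tempfile
from pathlib import Path


def format_example(content_seed, retrieved_context, final_post):
    """Step 2: one training row with a single "text" field.

    The prompt layout is a placeholder, not the stack's actual template.
    """
    prompt = f"Seed:\n{content_seed}\n\nContext:\n{retrieved_context}\n\nPost:\n"
    return {"text": prompt + final_post}


def write_jsonl(rows, path):
    """Write one JSON object per line; returns the number of rows written."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


def render_modelfile(gguf_path, system_prompt, temperature=0.7):
    """Step 6: Modelfile text pointing Ollama at the exported GGUF."""
    lines = [
        f"FROM {gguf_path}",
        f'SYSTEM """{system_prompt}"""',
        f"PARAMETER temperature {temperature}",
    ]
    return "\n".join(lines) + "\n"


def should_ship(finetuned_score, vanilla_score, margin):
    """Step 7: ship the fine-tune only if it beats vanilla by the margin."""
    return (finetuned_score - vanilla_score) >= margin


if __name__ == "__main__":
    rows = [format_example("seed idea", "retrieved notes", "final post text")]
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "train.jsonl"
        print(write_jsonl(rows, out), "row(s) written")
    print(render_modelfile("./my-merged-gguf/model.gguf", "Write in the house voice."))
    print(should_ship(0.82, 0.75, margin=0.05))
```

The rendered text is saved as Modelfile and passed to ollama create my-drafter -f Modelfile. The GGUF filename inside the export directory depends on the unsloth version, so check the directory rather than trusting the placeholder path.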

In the GL stack

builddaily.io

  • Slice 4 fine-tune runtime. The drafter flywheel runs through unsloth on Colab Free (or a spot GPU if iteration speed becomes the bottleneck). Llama 3.1 8B + QLoRA + adapter export + GGUF conversion + Ollama serving — all one pipeline.
  • Voice-tuning iterations. Every ~30 new (draft, edit) pairs trigger a re-tune. The adapter is the unit of versioning; the base model never changes.

paiddaily.io

  • Catalyst-classifier head. If the DSPy classifier outgrows compile-time examples, the same training pipeline produces a small fine-tuned classifier model on a smaller base (Mistral 7B or Qwen 7B).
  • Pendle / Aerodrome voice-tune (later). If a model needs to write in the Trading Almanac voice specifically, same shape — different dataset.

sagedaily.io

  • Sage-voice tune (later). The Oracle voice across the 22+ DSPy modules has thousands of generations behind it. Once Neil's "good vs off-voice" labels accumulate, the same pipeline fine-tunes a Sage-voice base.

Gotchas

  • The Colab session timer. Free Colab kills sessions at 12h (and idle ones earlier). Save checkpoints frequently; design runs to fit in the window.
  • CUDA version pinning. unsloth depends on specific CUDA / PyTorch versions. The install line in the docs is the truth; deviate and the wheels won't link.
  • Model family compatibility is the gate. "It works on Llama" doesn't mean "it works on every model." Check the compatibility list before designing around a base.
  • GGUF export quality. Default q4_k_m quantization is usually fine; for quality-sensitive workloads, try q5_k_m or q8_0. Trade size + speed for fidelity.
  • Merging vs adapter. save_pretrained saves the adapter (small, layered on the base at load time). save_pretrained_merged saves the merged weights (large, standalone). Pick based on whether the serving target wants an adapter or a self-contained model.
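For the session-timer gotcha, one workable pattern is to save checkpoints on a step interval and resume from the newest one when a fresh session starts. The sketch below assumes the Hugging Face Trainer convention of writing checkpoint-<step> directories under output_dir. latest_checkpoint is an illustrative helper written for this page (transformers ships a similar get_last_checkpoint utility).

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_DIR.match(child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return str(best_path) if best_path is not None else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for step in (50, 200, 1000):
            (Path(tmp) / f"checkpoint-{step}").mkdir()
        print(latest_checkpoint(tmp))

# In the training script (assumes a `trainer` built as in the core-surface code):
#   resume = latest_checkpoint("outputs")
#   trainer.train(resume_from_checkpoint=resume) if resume else trainer.train()
```

On Colab, point output_dir at mounted persistent storage, since the session's local disk is wiped when the runtime is reclaimed.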

Risks

  • Two-person maintainer team. unsloth is a small project; the velocity is real but the bus factor is small. Apache 2.0 means the code stays available; the optimization edge would just stop growing if the team moves on.
  • Optimization moves fast. New kernels, new compatibility, new gotchas every release. Pin versions; budget a re-pin per upgrade cycle.
  • Colab's free tier isn't guaranteed. Google could change T4 availability tomorrow, and the free path gives you the least control. Mitigation: the same code runs on Kaggle, RunPod, vast.ai, or local hardware with one config flag change.

Related

  • Hugging Face — the registry unsloth pulls base weights from, and where adapters can be published.
  • Ollama — the serving target after GGUF export.
  • DSPy — the brain layer that wraps the fine-tuned model; one config line swaps vanilla for fine-tuned.