unsloth
The free, locally runnable fine-tune accelerator. Wraps PEFT (LoRA / QLoRA) + TRL (SFTTrainer) into one library that's 2–5× faster and uses up to ~70% less VRAM than the stock toolchain. The honest free-path answer to "where do I actually run a fine-tune?" Runs on Google Colab T4 (free), Kaggle (free), local CUDA / ROCm GPU, or any spot-priced GPU box. Apache 2.0.
unsloth is the library that makes "actually fine-tune an 8B model" achievable on free compute. It's the wrapper that turns Hugging Face's peft + trl toolchain into something you can run from a free Colab notebook in an afternoon. This page is the orient-and-set-up surface — official docs at docs.unsloth.ai own the API contract.
What it is
A Python library wrapping transformers + peft + trl with aggressive optimizations: custom kernels, 4-bit quantization done right, and a clean FastLanguageModel class that loads a base model + LoRA adapter in one call. Apache 2.0. pip install unsloth (or the longer install line per the docs — pulls torch + a specific CUDA toolchain).
What it accelerates:
- QLoRA fine-tuning — 4-bit base model + trainable LoRA adapters. The canonical "fine-tune a 7-8B on a 24GB GPU" technique, made faster.
- Continued pretraining — domain-adaptive training on raw text. Same shape, different objective.
- Inference of LoRA-adapted models — the loaded model serves the adapter on top of the quantized base.
The pitch: PEFT + TRL alone do the job, but slowly and with finicky memory. unsloth makes the same job runnable on free compute (Colab T4) in a sane amount of time. The library is the difference between "fine-tuning is a paid step" and "fine-tuning is something I can do this weekend on a free notebook."
When to use it
Reach for it when:
- The fine-tune target is a supported model family — Llama 3.x, Mistral, Qwen 2.5/3, Gemma 2, Phi-3. The compatibility list is the gate.
- You want to stay free — Colab + unsloth is the most established no-spend fine-tune path on the open web today.
- You have a single GPU (consumer or rented) and need every gigabyte of VRAM. unsloth's memory math runs leaner than stock by 30–70%.
- You want a clean code shape — load model, define LoRA config, run trainer, save adapter. unsloth's API is ~30 lines for a working fine-tune.
Skip it when:
- The model family isn't supported: fall back to stock transformers + peft + trl.
- You need multi-GPU distributed training: unsloth has multi-GPU support but stock accelerate is the better-tested path at multi-node scale.
- You're doing full-fine-tune (not LoRA / QLoRA) on a huge model: different memory math; reach for transformers + DeepSpeed / FSDP.
At a glance
Core surface
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load base + prepare for LoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Train (TRL SFTTrainer with unsloth-aware optimizer)
# `dataset` is assumed to be an already-loaded dataset with a "text" column.
# Newer TRL releases move dataset_text_field / max_seq_length into SFTConfig
# and rename tokenizer= to processing_class=; match your pinned version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=3,
        # ... remaining arguments elided in the source
    ),
)
trainer.train()

# Save adapter + merge to GGUF for Ollama
model.save_pretrained("my-lora-adapter")
model.save_pretrained_gguf("my-merged-gguf", tokenizer, quantization_method="q4_k_m")
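To make the LoRA settings in the snippet above concrete, here is a rough sketch that counts the trainable parameters for r=16 on the four attention projections. The layer shapes are assumptions taken from the published Llama 3.1 8B config (hidden size 4096, 32 layers, 8 key/value heads of dimension 128); `lora_param_count` is a hypothetical helper, not part of unsloth or peft.

```python
# Assumed Llama 3.1 8B shapes (from the published model config).
HIDDEN = 4096
N_LAYERS = 32
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128 (grouped-query attention)

# (in_features, out_features) of each projection targeted in the snippet.
PROJ_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_param_count(r, target_modules, n_layers=N_LAYERS):
    """LoRA adds two matrices per targeted module: A (r x in) and B (out x r)."""
    per_layer = sum(
        r * (PROJ_SHAPES[name][0] + PROJ_SHAPES[name][1])
        for name in target_modules
    )
    return per_layer * n_layers


params = lora_param_count(16, ["q_proj", "k_proj", "v_proj", "o_proj"])
print(f"trainable LoRA parameters: {params:,}")
print(f"adapter size at 16-bit: {params * 2 / 1e6:.1f} MB")
```

This covers only the four attention projections listed in the snippet. Targeting the MLP projections as well, raising r, or saving in 32-bit all push the adapter size up.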
What's bundled
- FastLanguageModel: the high-level model wrapper. Loads quantized; preps for PEFT; serves at inference.
- 4-bit kernels: fused matmul kernels for 4-bit weights. The throughput win lives here.
- get_peft_model: LoRA adapter setup, defaults tuned for the supported families.
- save_pretrained_gguf: direct export to GGUF for Ollama. The bridge from training to GL serving.
- save_pretrained_merged: full-precision merged weights for any post-training inference path.
Compute footprint per family
| Model family | Free-tier path | VRAM at QLoRA |
|---|---|---|
| Llama 3.1 8B | Colab T4 (free) — fits | ~7-10 GB |
| Mistral 7B | Colab T4 (free) — fits | ~7-9 GB |
| Qwen 2.5 7B | Colab T4 (free) — fits | ~7-9 GB |
| Gemma 2 9B | Colab T4 (tight) | ~10-12 GB |
| Llama 3.1 70B | Not free — rent an A100 / H100 spot | ~40-48 GB |
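As a sanity check on the table, the 4-bit weights alone set a floor on VRAM: roughly half a byte per parameter. The sketch below computes that floor using nominal parameter counts (rounded assumptions, not exact figures) and compares it with a T4's 16 GB. It deliberately ignores quantization scales, activations, LoRA gradients, and optimizer state, which is why the table's numbers sit above these floors.

```python
T4_VRAM_GB = 16  # nominal; the usable amount on Colab is somewhat lower

# Nominal parameter counts, rounded for illustration.
NOMINAL_PARAMS = {
    "Llama 3.1 8B": 8e9,
    "Mistral 7B": 7e9,
    "Qwen 2.5 7B": 7e9,
    "Gemma 2 9B": 9e9,
    "Llama 3.1 70B": 70e9,
}


def four_bit_weight_gb(n_params):
    """Lower bound on weight memory at 4 bits (0.5 bytes) per parameter."""
    return n_params * 0.5 / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    floor = four_bit_weight_gb(n_params)
    verdict = "below T4 capacity" if floor < T4_VRAM_GB else "exceeds T4 capacity"
    print(f"{name}: >= {floor:.1f} GB for weights alone ({verdict})")
```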
How to integrate
Default integration for a GL fine-tune slice (slice 4 of the agent-stack post):
- Pick the compute target.
- Colab Free for the first run — fast iteration, kills the spend question, T4 GPU.
- Kaggle if Colab session limits bite — 9h sessions and 30 GPU-hours/week.
- Local CUDA box if one's available — fastest iteration cycle.
- vast.ai / RunPod spot if free becomes the bottleneck — ~$1-2/run for an 8B QLoRA.
- Format the dataset. JSONL with one row per training example: {"text": "<formatted prompt + response>"} or (input, output) per the chat template. The (content_seed, retrieved_context) → final_post pairs from the drafter's editing loop go here.
- Run the notebook. unsloth ships official Colab notebooks per supported family. Clone, swap the dataset, run.
- Save the adapter. Adapters are tiny (~50-200 MB for an 8B base). Commit to the repo (or a Hugging Face private repo); the base model is recoverable from the hash pin.
- Convert to GGUF. save_pretrained_gguf("...", quantization_method="q4_k_m"). The output is Ollama-loadable.
- ollama create. Build a Modelfile that points at the GGUF + sets system prompt + parameters; ollama create my-drafter -f Modelfile. The DSPy module's LM config now points at ollama_chat/my-drafter.
- A/B against vanilla. Per the eval plan in the agent-stack post: same DSPy module, two backends. The fine-tune ships only on a margin win.
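The dataset-formatting and Modelfile steps above can be sketched with the standard library alone. Everything named here is hypothetical: the prompt layout in `to_training_row` is a placeholder (a real run would more likely apply the base model's chat template), and the GGUF path passed to `render_modelfile` depends on what the export actually writes. The Modelfile uses only the `FROM`, `SYSTEM`, and `PARAMETER` instructions.

```python
import json


def to_training_row(content_seed, retrieved_context, final_post):
    """Fold one (content_seed, retrieved_context) -> final_post pair into a text row."""
    prompt = (
        f"### Seed\n{content_seed}\n\n"
        f"### Context\n{retrieved_context}\n\n"
        "### Post\n"
    )
    return {"text": prompt + final_post}


def write_jsonl(rows, path):
    """Write one JSON object per line, the shape the trainer's text field expects."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def render_modelfile(gguf_path, system_prompt, temperature=0.7):
    """Render a minimal Ollama Modelfile pointing at an exported GGUF file."""
    return (
        f"FROM {gguf_path}\n"
        f'SYSTEM """{system_prompt}"""\n'
        f"PARAMETER temperature {temperature}\n"
    )


row = to_training_row("launch note", "three retrieved snippets", "The edited post.")
print(json.dumps(row)[:60])
print(render_modelfile("./my-merged-gguf/model.gguf", "You are the drafter."))
```

Writing the rendered string to a file named Modelfile and running `ollama create my-drafter -f Modelfile` then registers the model, as the step above describes.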
In the GL stack
builddaily.io
- Slice 4 fine-tune runtime. The drafter flywheel runs through unsloth on Colab Free (or a spot GPU if iteration speed becomes the bottleneck). Llama 3.1 8B + QLoRA + adapter export + GGUF conversion + Ollama serving — all one pipeline.
- Voice-tuning iterations. Every ~30 new (draft, edit) pairs trigger a re-tune. The adapter is the unit of versioning; the base model never changes.
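The re-tune trigger described above reduces to a counter comparison. A minimal sketch, assuming the pipeline tracks how many (draft, edit) pairs existed at the last tune; `should_retune` and `next_adapter_tag` are hypothetical names, and the tag format is only an illustration of treating the adapter as the unit of versioning.

```python
def should_retune(total_pairs, pairs_at_last_tune, threshold=30):
    """True once at least `threshold` new (draft, edit) pairs have accumulated."""
    return total_pairs - pairs_at_last_tune >= threshold


def next_adapter_tag(current_version):
    """The base model never changes, so only the adapter version is bumped."""
    return f"drafter-adapter-v{current_version + 1}"


if should_retune(total_pairs=95, pairs_at_last_tune=60):
    print("re-tune:", next_adapter_tag(3))
```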
paiddaily.io
- Catalyst-classifier head. If the DSPy classifier outgrows compile-time examples, the same training pipeline produces a small fine-tuned classifier model on a smaller base (Mistral 7B or Qwen 7B).
- Pendle / Aerodrome voice-tune (later). If a model needs to write in the Trading Almanac voice specifically, same shape — different dataset.
sagedaily.io
- Sage-voice tune (later). The Oracle voice across the 22+ DSPy modules has thousands of generations behind it. Once Neil's "good vs off-voice" labels accumulate, the same pipeline fine-tunes a Sage-voice base.
Gotchas
- The Colab session timer. Free Colab kills sessions at 12h (and idle ones earlier). Save checkpoints frequently; design runs to fit in the window.
- CUDA version pinning. unsloth depends on specific CUDA / PyTorch versions. The install line in the docs is the truth; deviate and the wheels won't link.
- Model family compatibility is the gate. "It works on Llama" doesn't mean "it works on every model." Check the compatibility list before designing around a base.
- GGUF export quality. Default q4_k_m quantization is usually fine; for quality-sensitive workloads, try q5_k_m or q8_0. Trade size + speed for fidelity.
- Merging vs adapter. save_pretrained saves the adapter (small, layered on the base at load time). save_pretrained_merged saves the merged weights (large, standalone). Pick based on whether the serving target wants an adapter or a self-contained model.
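For the session-timer gotcha, a quick budget check before launching can show whether a run is likely to fit the window. This is a back-of-the-envelope sketch: the seconds-per-step figure has to be measured from a short trial run, the 0.8 safety factor is an arbitrary assumption, and the step count only approximates what the trainer reports.

```python
import math


def training_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Approximate optimizer steps for a single-GPU run."""
    steps_per_epoch = math.ceil(n_examples / (per_device_batch * grad_accum))
    return steps_per_epoch * epochs


def fits_session(total_steps, sec_per_step, session_hours=12.0, safety=0.8):
    """True if the estimated wall time fits inside a fraction of the session window."""
    return total_steps * sec_per_step <= session_hours * 3600 * safety


steps = training_steps(n_examples=1000, per_device_batch=2, grad_accum=4, epochs=3)
print(steps, "steps; fits a 12h session:", fits_session(steps, sec_per_step=10))
```

If the check fails, the options are the ones the list already points at: fewer epochs, frequent checkpoints with resume, or a different compute target.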
Risks
- Two-person maintainer team. unsloth is a small project; the velocity is real but the bus factor is small. Apache 2.0 means the code stays available; the optimization edge would just stop growing if the team moves on.
- Optimization moves fast. New kernels, new compatibility, new gotchas every release. Pin versions; budget a re-pin per upgrade cycle.
- Colab's free tier isn't guaranteed. Google could change T4 availability tomorrow. The free path has the lowest control. Mitigation: the same code runs on Kaggle, RunPod, vast.ai, or local hardware with one config flag change.
Related
- Hugging Face — the registry unsloth pulls base weights from, and where adapters can be published.
- Ollama — the serving target after GGUF export.
- DSPy — the brain layer that wraps the fine-tuned model; one config line swaps vanilla for fine-tuned.
