Ollama
The local-LLM substrate — one binary that pulls open-weight models from a registry, serves them over a clean HTTP API, and runs the same call surface as OpenAI's chat / embeddings endpoints. Everything local on builddaily.io runs through it today — chat, embeddings, the model targets behind any future agent. MIT licensed; free; the foundation under every other tool on this stack.
Ollama is the reason a GL build can stay entirely local. One process serves chat, embeddings, and any future fine-tuned model through the same call surface. This page is the orient-and-anchor surface — official docs at docs.ollama.com own the API contract.
What it is
A self-contained Go binary that pulls open-weight LLMs from a model registry and serves them over an HTTP API on localhost:11434. MIT licensed. Designed to feel like the simplest possible "Docker for models" — ollama pull <model> to fetch, ollama run <model> to chat interactively, and a daemon process exposing /api/chat, /api/generate, /api/embeddings to anything that speaks HTTP.
Runs on macOS (Apple Silicon native via Metal), Linux (CUDA + ROCm), Windows. Pulls quantized GGUF builds of common open-weight models — Llama 3.1 / 3.3, Qwen 2.5 / 3, Mistral, Gemma, Phi, Mixtral, plus embedding-only and reranker-only models. Single-file install; no Python; no container needed.
When to use it
Reach for it when:
- You want local inference — no API spend, corpus stays on your machine, latency is round-trip-free.
- You're running on Apple Silicon and want Metal-accelerated inference without compiling llama.cpp from source.
- You need provider-portable code — Ollama's API surface is close enough to the OpenAI shape that drop-in adapters exist for every popular framework (DSPy, LlamaIndex, LangChain, etc.).
- The team has zero appetite for Python ML dependency hell. Ollama is a binary; you pull a model and you're done.
- You want a shared local registry — multiple processes on one box can all hit the same Ollama server.
Skip it when:
- You need production multi-tenant serving with batched throughput —
vLLMandText Generation Inferencewin at that scale. - The model you want isn't quantized or isn't in the Ollama library — Ollama supports GGUF; non-GGUF formats need conversion first.
- You're running on hardware without enough VRAM / unified memory for even a 7-8B model — at that point any local LLM is the wrong abstraction; reach for hosted.
At a glance
Core surface
ollama pull <model>— download a model from the registry. ~4-5 GB for an 8B Q4 model; cached locally.ollama run <model>— interactive chat shell (also starts the server if not running).ollama serve— start the daemon explicitly (auto-starts on first request on most installs)./api/chat— message-list chat completion. Streams if requested./api/generate— single-prompt completion. The lower-level surface./api/embeddings— vector embedding for a text input. Required for retrieval./api/tags— list locally-pulled models.ollama show <model> --modelfile— inspect the Modelfile a model was built from.
Models that matter for the GL stack
| Model | Purpose | Notes |
|---|---|---|
qwen3:8b |
General chat backbone | Already on the builddaily.io chat-bridge ALLOWED_MODELS. Strong instruction-following at the 8B size class. |
qwen3.6:27b-coding-mxfp8 |
Code-aware chat | On the chat-bridge allowlist; reaches for it on engineering questions. |
nomic-embed-text |
Embeddings | On the allowlist but unused for retrieval today — slice 1 of the agent-stack post wires it up. |
llama3.1:8b (or llama3.3:8b) |
Fine-tune target | The natural base for slice 4's fine-tune flywheel. Apache 2.0 + Meta community license. |
Modelfile (the customization layer)
Models can be customized via a Modelfile — a Dockerfile-like declaration that sets system prompt, parameters (temperature, top_p, num_ctx), and base model. ollama create my-model -f ./Modelfile produces a new local model. Useful for shipping a stable system-prompt-baked-in variant of a base model.
How to integrate
Default integration for a new GL local-LLM surface:
- Install. macOS:
brew install ollamaor the official.app. Linux: the one-line install script. Verify daemon:curl http://localhost:11434/api/tagsreturns{"models":[...]}. - Pull the models.
ollama pull qwen3:8b,ollama pull nomic-embed-text. Cached under~/.ollama/models/. - Wire to your framework. DSPy:
dspy.LM("ollama_chat/qwen3:8b"). LlamaIndex:Ollama(model="qwen3:8b")andOllamaEmbedding(model_name="nomic-embed-text"). Anything OpenAI-shaped works withOLLAMA_API_BASE=http://localhost:11434/v1. - Memory check. Larger models on smaller machines silently fall back to CPU and slow to a crawl.
ollama psshows currently-loaded models and where they're running. - Keep-alive tuning. Default unloads a model 5 minutes after last use. Production pipelines want
keep_alive: -1(load forever) or longer windows. - Tunnel for shared access. The builddaily.io chat-bridge exposes Ollama through a Cloudflare Tunnel + a Python proxy that enforces an
ALLOWED_MODELSallowlist and aNEIL_BRIDGE_TOKEN. Mirror that pattern for any other public surface — never expose:11434directly.
In the GL stack
builddaily.io
- Chat-bridge runtime. Already runs on Ollama. Allowed models:
qwen3:8b,qwen3.6:27b-coding-mxfp8,nomic-embed-text. Token-auth proxied through Cloudflare Tunnel. - Slice 1 embedding pipeline. The retrieval upgrade in the agent-stack post uses Ollama's
/api/embeddingswithnomic-embed-text. Zero new infra — model is already on the allowlist. - Slice 4 fine-tune serving. The fine-tuned LoRA-merged 8B model gets
ollama create'd and served the same way; the DSPy module config swaps a model name and nothing else changes.
paiddaily.io
- Local model fallback for the API. Anything classification-shaped (catalyst type, urgency, pool risk band) compiles to a DSPy module that can hit either Claude or Ollama. Ollama is the cost-zero default for development; promote to Claude only where eval shows quality demands.
sagedaily.io
- Offline / cost-bounded operations. Same DSPy-module shape; the same modules that run against Claude in production can flip to Ollama for cost-bounded paths (long retrospectives, batch summarization).
Gotchas
- Default context windows are short. Models in Ollama ship with conservative
num_ctxdefaults (often 2048-4096). For RAG and long-context work, setnum_ctx: 8192(or the model's max) explicitly via Modelfile or per-request options. - Quantization affects quality more than you think. Default pulls are Q4_0 or Q4_K_M — fine for most chat workloads, visible quality drop vs Q8 or fp16 on harder reasoning. Pull a higher quant if quality is the bottleneck.
- Embeddings ≠ chat.
/api/embeddingsis a different endpoint with a different model-load lifecycle; the embedding model loads independently of the chat model. Both can be resident simultaneously — VRAM permitting. - Concurrent requests serialize per-model. Ollama serves one request at a time per loaded model on most setups. High-throughput indexing benefits from running embedding requests in a tight serial loop, not a worker pool.
- macOS install runs as a user-launchd agent. Stops when you log out. For server use, either don't log out or wire it as a system daemon.
Risks
- Single-vendor binary. Ollama is open source but the binary distribution and registry are run by one team. Apache-2.0 models stay yours either way; the convenience layer is the dependency. Mitigation: GGUF format is portable — any model pulled via Ollama can be served by
llama.cppdirectly if Ollama disappears. - Quantization quality cliff. Aggressive quantization on small models can degrade structured-output reliability. Build the eval set against the quant you're shipping, not the unquantized weights.
- Default open daemon = local attack surface.
:11434listening on localhost is fine; exposed on a network it's an unauthenticated chat completion endpoint. Always proxy.
Related
- DSPy — every brain-layer module that runs locally is one config line away from Ollama.
- nomic-embed-text — the embedding model served through Ollama for retrieval.
- bge-reranker — the cross-encoder reranker (runs via
sentence-transformers, not Ollama; complementary).
