toolOllama (Jeffrey Morgan et al.)applied

Ollama

The local-LLM substrate — one binary that pulls open-weight models from a registry, serves them over a clean HTTP API, and runs the same call surface as OpenAI's chat / embeddings endpoints. Everything local on builddaily.io runs through it today — chat, embeddings, the model targets behind any future agent. MIT licensed; free; the foundation under every other tool on this stack.

Updated May 24, 2026

Ollama is the reason a GL build can stay entirely local. One process serves chat, embeddings, and any future fine-tuned model through the same call surface. This page is the orient-and-anchor surface — official docs at docs.ollama.com own the API contract.

What it is

A self-contained Go binary that pulls open-weight LLMs from a model registry and serves them over an HTTP API on localhost:11434. MIT licensed. Designed to feel like the simplest possible "Docker for models" — ollama pull <model> to fetch, ollama run <model> to chat interactively, and a daemon process exposing /api/chat, /api/generate, /api/embeddings to anything that speaks HTTP.

Runs on macOS (Apple Silicon native via Metal), Linux (CUDA + ROCm), Windows. Pulls quantized GGUF builds of common open-weight models — Llama 3.1 / 3.3, Qwen 2.5 / 3, Mistral, Gemma, Phi, Mixtral, plus embedding-only and reranker-only models. Single-file install; no Python; no container needed.

When to use it

Reach for it when:

You want local inference — no API spend, corpus stays on your machine, latency is round-trip-free.
You're running on Apple Silicon and want Metal-accelerated inference without compiling llama.cpp from source.
You need provider-portable code — Ollama's API surface is close enough to the OpenAI shape that drop-in adapters exist for every popular framework (DSPy, LlamaIndex, LangChain, etc.).
The team has zero appetite for Python ML dependency hell. Ollama is a binary; you pull a model and you're done.
You want a shared local registry — multiple processes on one box can all hit the same Ollama server.

Skip it when:

You need production multi-tenant serving with batched throughput — vLLM and Text Generation Inference win at that scale.
The model you want isn't quantized or isn't in the Ollama library — Ollama supports GGUF; non-GGUF formats need conversion first.
You're running on hardware without enough VRAM / unified memory for even a 7-8B model — at that point any local LLM is the wrong abstraction; reach for hosted.

At a glance

Core surface

ollama pull <model> — download a model from the registry. ~4-5 GB for an 8B Q4 model; cached locally.
ollama run <model> — interactive chat shell (also starts the server if not running).
ollama serve — start the daemon explicitly (auto-starts on first request on most installs).
/api/chat — message-list chat completion. Streams if requested.
/api/generate — single-prompt completion. The lower-level surface.
/api/embeddings — vector embedding for a text input. Required for retrieval.
/api/tags — list locally-pulled models.
ollama show <model> --modelfile — inspect the Modelfile a model was built from.

Models that matter for the GL stack

Model	Purpose	Notes
`qwen3:8b`	General chat backbone	Already on the builddaily.io chat-bridge `ALLOWED_MODELS`. Strong instruction-following at the 8B size class.
`qwen3.6:27b-coding-mxfp8`	Code-aware chat	On the chat-bridge allowlist; reaches for it on engineering questions.
`nomic-embed-text`	Embeddings	On the allowlist but unused for retrieval today — slice 1 of the agent-stack post wires it up.
`llama3.1:8b` (or `llama3.3:8b`)	Fine-tune target	The natural base for slice 4's fine-tune flywheel. Apache 2.0 + Meta community license.

Modelfile (the customization layer)

Models can be customized via a Modelfile — a Dockerfile-like declaration that sets system prompt, parameters (temperature, top_p, num_ctx), and base model. ollama create my-model -f ./Modelfile produces a new local model. Useful for shipping a stable system-prompt-baked-in variant of a base model.

How to integrate

Default integration for a new GL local-LLM surface:

Install. macOS: brew install ollama or the official .app. Linux: the one-line install script. Verify daemon: curl http://localhost:11434/api/tags returns {"models":[...]}.
Pull the models. ollama pull qwen3:8b, ollama pull nomic-embed-text. Cached under ~/.ollama/models/.
Wire to your framework. DSPy: dspy.LM("ollama_chat/qwen3:8b"). LlamaIndex: Ollama(model="qwen3:8b") and OllamaEmbedding(model_name="nomic-embed-text"). Anything OpenAI-shaped works with OLLAMA_API_BASE=http://localhost:11434/v1.
Memory check. Larger models on smaller machines silently fall back to CPU and slow to a crawl. ollama ps shows currently-loaded models and where they're running.
Keep-alive tuning. Default unloads a model 5 minutes after last use. Production pipelines want keep_alive: -1 (load forever) or longer windows.
Tunnel for shared access. The builddaily.io chat-bridge exposes Ollama through a Cloudflare Tunnel + a Python proxy that enforces an ALLOWED_MODELS allowlist and a NEIL_BRIDGE_TOKEN. Mirror that pattern for any other public surface — never expose :11434 directly.

In the GL stack

builddaily.io

Chat-bridge runtime. Already runs on Ollama. Allowed models: qwen3:8b, qwen3.6:27b-coding-mxfp8, nomic-embed-text. Token-auth proxied through Cloudflare Tunnel.
Slice 1 embedding pipeline. The retrieval upgrade in the agent-stack post uses Ollama's /api/embeddings with nomic-embed-text. Zero new infra — model is already on the allowlist.
Slice 4 fine-tune serving. The fine-tuned LoRA-merged 8B model gets ollama create'd and served the same way; the DSPy module config swaps a model name and nothing else changes.

paiddaily.io

Local model fallback for the API. Anything classification-shaped (catalyst type, urgency, pool risk band) compiles to a DSPy module that can hit either Claude or Ollama. Ollama is the cost-zero default for development; promote to Claude only where eval shows quality demands.

sagedaily.io

Offline / cost-bounded operations. Same DSPy-module shape; the same modules that run against Claude in production can flip to Ollama for cost-bounded paths (long retrospectives, batch summarization).

Gotchas

Default context windows are short. Models in Ollama ship with conservative num_ctx defaults (often 2048-4096). For RAG and long-context work, set num_ctx: 8192 (or the model's max) explicitly via Modelfile or per-request options.
Quantization affects quality more than you think. Default pulls are Q4_0 or Q4_K_M — fine for most chat workloads, visible quality drop vs Q8 or fp16 on harder reasoning. Pull a higher quant if quality is the bottleneck.
Embeddings ≠ chat. /api/embeddings is a different endpoint with a different model-load lifecycle; the embedding model loads independently of the chat model. Both can be resident simultaneously — VRAM permitting.
Concurrent requests serialize per-model. Ollama serves one request at a time per loaded model on most setups. High-throughput indexing benefits from running embedding requests in a tight serial loop, not a worker pool.
macOS install runs as a user-launchd agent. Stops when you log out. For server use, either don't log out or wire it as a system daemon.

Risks

Single-vendor binary. Ollama is open source but the binary distribution and registry are run by one team. Apache-2.0 models stay yours either way; the convenience layer is the dependency. Mitigation: GGUF format is portable — any model pulled via Ollama can be served by llama.cpp directly if Ollama disappears.
Quantization quality cliff. Aggressive quantization on small models can degrade structured-output reliability. Build the eval set against the quant you're shipping, not the unquantized weights.
Default open daemon = local attack surface. :11434 listening on localhost is fine; exposed on a network it's an unauthenticated chat completion endpoint. Always proxy.

Alternatives · 5 substitutesPick Ollama unless one of these wins on your specific brief.

01
llama.cpp
The C++ runtime Ollama is built on. Compile-it-yourself, no wrapper.
Wins when ▸you want the raw runtime with no daemon, no registry, and no convenience tax. Manual model loading; manual server flag wrangling. The right tool for embedded use cases (a single model in a single binary).
02
vLLM
Production batched-inference engine for GPU clusters.
Wins when ▸concurrent request load is the constraint. PagedAttention + continuous batching crush serial servers at high QPS. Heavyweight Python install; CUDA-first; overkill for single-laptop use.
03
LM Studio
Desktop-app wrapper around llama.cpp with a GUI model browser.
Wins when ▸you want a click-through UX over a CLI. Also exposes an OpenAI-shaped API. Heavier than Ollama; closed-source desktop app.
04
MLX-LM · Apple
Apple Silicon-native ML framework with an LLM serving frontend.
Wins when ▸you're on Apple Silicon and want bare-metal Metal performance + fine-tuning in the same toolkit. Different model format (MLX) — a separate ecosystem from GGUF.
05
Hosted APIs · Anthropic / OpenAI / Groq / Together
Pay-per-token model APIs — no local infra at all.
Wins when ▸you need top-tier model quality (Claude, GPT-5) or you have no local hardware. Trades zero-local-infra for per-call spend. Most GL DSPy modules can flip between hosted and Ollama via one config line.

DSPy — every brain-layer module that runs locally is one config line away from Ollama.
nomic-embed-text — the embedding model served through Ollama for retrieval.
bge-reranker — the cross-encoder reranker (runs via sentence-transformers, not Ollama; complementary).