Build Daily

Tinley Park · May 29, 2026
toolOllama (Jeffrey Morgan et al.)applied

Ollama

The local-LLM substrate — one binary that pulls open-weight models from a registry, serves them over a clean HTTP API, and runs the same call surface as OpenAI's chat / embeddings endpoints. Everything local on builddaily.io runs through it today — chat, embeddings, the model targets behind any future agent. MIT licensed; free; the foundation under every other tool on this stack.

Updated May 24, 2026

Ollama is the reason a GL build can stay entirely local. One process serves chat, embeddings, and any future fine-tuned model through the same call surface. This page is the orient-and-anchor surface — official docs at docs.ollama.com own the API contract.

What it is

A self-contained Go binary that pulls open-weight LLMs from a model registry and serves them over an HTTP API on localhost:11434. MIT licensed. Designed to feel like the simplest possible "Docker for models" — ollama pull <model> to fetch, ollama run <model> to chat interactively, and a daemon process exposing /api/chat, /api/generate, /api/embeddings to anything that speaks HTTP.

Runs on macOS (Apple Silicon native via Metal), Linux (CUDA + ROCm), Windows. Pulls quantized GGUF builds of common open-weight models — Llama 3.1 / 3.3, Qwen 2.5 / 3, Mistral, Gemma, Phi, Mixtral, plus embedding-only and reranker-only models. Single-file install; no Python; no container needed.

When to use it

Reach for it when:

  • You want local inference — no API spend, corpus stays on your machine, latency is round-trip-free.
  • You're running on Apple Silicon and want Metal-accelerated inference without compiling llama.cpp from source.
  • You need provider-portable code — Ollama's API surface is close enough to the OpenAI shape that drop-in adapters exist for every popular framework (DSPy, LlamaIndex, LangChain, etc.).
  • The team has zero appetite for Python ML dependency hell. Ollama is a binary; you pull a model and you're done.
  • You want a shared local registry — multiple processes on one box can all hit the same Ollama server.

Skip it when:

  • You need production multi-tenant serving with batched throughput — vLLM and Text Generation Inference win at that scale.
  • The model you want isn't quantized or isn't in the Ollama library — Ollama supports GGUF; non-GGUF formats need conversion first.
  • You're running on hardware without enough VRAM / unified memory for even a 7-8B model — at that point any local LLM is the wrong abstraction; reach for hosted.

At a glance

Core surface

  • ollama pull <model> — download a model from the registry. ~4-5 GB for an 8B Q4 model; cached locally.
  • ollama run <model> — interactive chat shell (also starts the server if not running).
  • ollama serve — start the daemon explicitly (auto-starts on first request on most installs).
  • /api/chat — message-list chat completion. Streams if requested.
  • /api/generate — single-prompt completion. The lower-level surface.
  • /api/embeddings — vector embedding for a text input. Required for retrieval.
  • /api/tags — list locally-pulled models.
  • ollama show <model> --modelfile — inspect the Modelfile a model was built from.

Models that matter for the GL stack

Model Purpose Notes
qwen3:8b General chat backbone Already on the builddaily.io chat-bridge ALLOWED_MODELS. Strong instruction-following at the 8B size class.
qwen3.6:27b-coding-mxfp8 Code-aware chat On the chat-bridge allowlist; reaches for it on engineering questions.
nomic-embed-text Embeddings On the allowlist but unused for retrieval today — slice 1 of the agent-stack post wires it up.
llama3.1:8b (or llama3.3:8b) Fine-tune target The natural base for slice 4's fine-tune flywheel. Apache 2.0 + Meta community license.

Modelfile (the customization layer)

Models can be customized via a Modelfile — a Dockerfile-like declaration that sets system prompt, parameters (temperature, top_p, num_ctx), and base model. ollama create my-model -f ./Modelfile produces a new local model. Useful for shipping a stable system-prompt-baked-in variant of a base model.

How to integrate

Default integration for a new GL local-LLM surface:

  1. Install. macOS: brew install ollama or the official .app. Linux: the one-line install script. Verify daemon: curl http://localhost:11434/api/tags returns {"models":[...]}.
  2. Pull the models. ollama pull qwen3:8b, ollama pull nomic-embed-text. Cached under ~/.ollama/models/.
  3. Wire to your framework. DSPy: dspy.LM("ollama_chat/qwen3:8b"). LlamaIndex: Ollama(model="qwen3:8b") and OllamaEmbedding(model_name="nomic-embed-text"). Anything OpenAI-shaped works with OLLAMA_API_BASE=http://localhost:11434/v1.
  4. Memory check. Larger models on smaller machines silently fall back to CPU and slow to a crawl. ollama ps shows currently-loaded models and where they're running.
  5. Keep-alive tuning. Default unloads a model 5 minutes after last use. Production pipelines want keep_alive: -1 (load forever) or longer windows.
  6. Tunnel for shared access. The builddaily.io chat-bridge exposes Ollama through a Cloudflare Tunnel + a Python proxy that enforces an ALLOWED_MODELS allowlist and a NEIL_BRIDGE_TOKEN. Mirror that pattern for any other public surface — never expose :11434 directly.

In the GL stack

builddaily.io

  • Chat-bridge runtime. Already runs on Ollama. Allowed models: qwen3:8b, qwen3.6:27b-coding-mxfp8, nomic-embed-text. Token-auth proxied through Cloudflare Tunnel.
  • Slice 1 embedding pipeline. The retrieval upgrade in the agent-stack post uses Ollama's /api/embeddings with nomic-embed-text. Zero new infra — model is already on the allowlist.
  • Slice 4 fine-tune serving. The fine-tuned LoRA-merged 8B model gets ollama create'd and served the same way; the DSPy module config swaps a model name and nothing else changes.

paiddaily.io

  • Local model fallback for the API. Anything classification-shaped (catalyst type, urgency, pool risk band) compiles to a DSPy module that can hit either Claude or Ollama. Ollama is the cost-zero default for development; promote to Claude only where eval shows quality demands.

sagedaily.io

  • Offline / cost-bounded operations. Same DSPy-module shape; the same modules that run against Claude in production can flip to Ollama for cost-bounded paths (long retrospectives, batch summarization).

Gotchas

  • Default context windows are short. Models in Ollama ship with conservative num_ctx defaults (often 2048-4096). For RAG and long-context work, set num_ctx: 8192 (or the model's max) explicitly via Modelfile or per-request options.
  • Quantization affects quality more than you think. Default pulls are Q4_0 or Q4_K_M — fine for most chat workloads, visible quality drop vs Q8 or fp16 on harder reasoning. Pull a higher quant if quality is the bottleneck.
  • Embeddings ≠ chat. /api/embeddings is a different endpoint with a different model-load lifecycle; the embedding model loads independently of the chat model. Both can be resident simultaneously — VRAM permitting.
  • Concurrent requests serialize per-model. Ollama serves one request at a time per loaded model on most setups. High-throughput indexing benefits from running embedding requests in a tight serial loop, not a worker pool.
  • macOS install runs as a user-launchd agent. Stops when you log out. For server use, either don't log out or wire it as a system daemon.

Risks

  • Single-vendor binary. Ollama is open source but the binary distribution and registry are run by one team. Apache-2.0 models stay yours either way; the convenience layer is the dependency. Mitigation: GGUF format is portable — any model pulled via Ollama can be served by llama.cpp directly if Ollama disappears.
  • Quantization quality cliff. Aggressive quantization on small models can degrade structured-output reliability. Build the eval set against the quant you're shipping, not the unquantized weights.
  • Default open daemon = local attack surface. :11434 listening on localhost is fine; exposed on a network it's an unauthenticated chat completion endpoint. Always proxy.

Related

  • DSPy — every brain-layer module that runs locally is one config line away from Ollama.
  • nomic-embed-text — the embedding model served through Ollama for retrieval.
  • bge-reranker — the cross-encoder reranker (runs via sentence-transformers, not Ollama; complementary).