nomic-embed-text
A small, fast, open-weight text-embedding model. 137M parameters, 768-dimensional output, 8192-token context. Runs locally through Ollama — same shape as your existing chat model surface. Free, Apache 2.0-licensed weights. The default embedding choice for any GL corpus that fits on local hardware (i.e., all of them, today).
nomic-embed-text is the embedding model that lets a GL RAG build stay entirely local with no quality trade-off you'd notice. It is already on the chat-bridge's ALLOWED_MODELS list, just not used for retrieval yet. This page is the orient-and-wire-it surface.
What it is
A general-purpose sentence-embedding model from Nomic AI. Open weights (Apache 2.0); 137M parameters; produces 768-dimensional vectors; supports up to 8192-token input — long-document chunks fit cleanly without aggressive splitting. Released as nomic-embed-text-v1.5 (current) with Matryoshka-Representation-Learning: you can truncate the 768-dim vector down to 512 / 256 / 128 dims at retrieval time and trade quality for storage cleanly.
On MTEB (the standard embedding leaderboard), Nomic's published results put it on par with or slightly above OpenAI's text-embedding-3-small, and you run it on your laptop with no API call.
When to use it
Reach for it when:
- The corpus fits on local hardware and the build doesn't have an API-budget mandate to "use OpenAI." For a markdown corpus the size of builddaily.io's, this is always.
- You want zero embedding spend as a hard constraint.
- The context window matters — chunk sizes up to ~3000 tokens fit cleanly into the model's 8192-token input without truncation.
- You're already running Ollama for inference and don't want a separate embedding service.
Skip it when:
- You need a dimension > 768 for some retrieval study. Look at `bge-large` or `e5-mistral` instead.
- The corpus is multilingual-first and you need strong non-English performance. Cohere's `embed-multilingual-v3` or `bge-m3` lead here.
- You need proprietary-stack support contracts for compliance reasons. Open-weights models don't come with an SLA.
At a glance
Specs
- Architecture — encoder-only transformer (BERT-family).
- Parameters — 137M.
- Output dimension — 768 (truncatable to 512 / 256 / 128 via Matryoshka).
- Max input — 8192 tokens.
- License — Apache 2.0 (commercial use OK).
Distribution
- Ollama: `ollama pull nomic-embed-text` → exposed via `/api/embeddings` on the local Ollama server. This is the GL default path.
- `sentence-transformers`: `SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)` for direct Python use.
- `llama.cpp` / GGUF: for non-Ollama local hosting; same weights.
- Hugging Face Inference API: hosted, free tier, but defeats the "stay local" benefit.
Prompt prefixes (important)
Nomic models expect a task prefix on each input:
- `search_document: <text>` for documents you're indexing.
- `search_query: <text>` for the user query you're matching against the index.
- `clustering: <text>` for clustering use cases.
- `classification: <text>` for classifier features.
Forgetting these silently drops retrieval quality by 5–15%. Most LlamaIndex integrations handle it automatically; raw API calls don't.
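For raw API calls, a small helper keeps the prefixes consistent. This is a minimal sketch: the four prefix strings come from the list above, while `TASK_PREFIXES` and `with_task_prefix` are illustrative names, not part of any Nomic or Ollama API.

```python
# Task prefixes named in the list above; the helper itself is hypothetical.
TASK_PREFIXES = {
    "search_document": "search_document: ",
    "search_query": "search_query: ",
    "clustering": "clustering: ",
    "classification": "classification: ",
}


def with_task_prefix(task: str, text: str) -> str:
    """Return `text` with the task prefix prepended exactly once."""
    if task not in TASK_PREFIXES:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}"
        )
    prefix = TASK_PREFIXES[task]
    # Avoid double-prefixing when the caller already added the prefix.
    if text.startswith(prefix):
        return text
    return prefix + text
```

Calling `with_task_prefix("search_document", chunk)` at index time and `with_task_prefix("search_query", question)` at query time keeps the two sides of the index on the same convention.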
How to integrate
Default integration for a GL retrieval build:
- Pull the model. `ollama pull nomic-embed-text` on the host running Ollama. ~270MB.
- Verify the endpoint. `curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"search_document: hello"}'` → should return a 768-element vector.
- Wire to LlamaIndex. `from llama_index.embeddings.ollama import OllamaEmbedding` → `embed_model = OllamaEmbedding(model_name="nomic-embed-text")`. Confirm whether the integration applies the `search_document:` / `search_query:` prefixes; this can vary by version, so verify against a test query rather than assuming.
- Set globally. `Settings.embed_model = embed_model` so every index / retrieval uses it without per-call config.
- Persist the index. First-time embedding of a corpus is the only slow step; persist to disk (`StorageContext.persist(...)`) so subsequent runs are read-only.
- Spot-check. Pull a few documents, embed both query and document with explicit prefixes, and cosine-similarity them by hand. This sanity-checks the wiring before you trust retrieval scores.
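The verify and spot-check steps can be sketched with the standard library alone. This assumes Ollama is listening on its default local port and that `/api/embeddings` returns a JSON object with an `embedding` field; the function names `embed`, `cosine_similarity`, and `spot_check` are illustrative.

```python
import json
import math
import urllib.request

# Endpoint taken from the curl step above; adjust if Ollama runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def embed(text: str, model: str = "nomic-embed-text", url: str = OLLAMA_URL) -> list[float]:
    """POST one already-prefixed string to Ollama and return its vector."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def spot_check(query: str, document: str) -> float:
    """Embed a query/document pair with explicit prefixes and compare them."""
    query_vec = embed("search_query: " + query)
    doc_vec = embed("search_document: " + document)
    return cosine_similarity(query_vec, doc_vec)
```

With Ollama running, a query paired with a document that answers it should score noticeably higher than the same query paired with an unrelated document. If the two scores are close, check the prefixes first.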
In the GL stack
builddaily.io
- Chat-bridge retrieval upgrade: embedding choice. Replace the no-embed concat-all path with: chunked corpus → `nomic-embed-text` via Ollama → file-backed vector store. Zero new dependencies (model is already on the `ALLOWED_MODELS` list); zero API spend; same Ollama process the chat already uses.
- Resources / Posts / Projects unified index. One persisted vector store covering all of `web/content/`; rebuilt nightly or on content PR merge.
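A file-backed index over `web/content/` might look roughly like the sketch below. It assumes the `llama-index-core` and `llama-index-embeddings-ollama` packages are installed and that Ollama is on its default port; `build_or_load_index` and the `./storage` directory are hypothetical names. Prefix handling is left to the integration here, so the spot-check from the integration steps still applies.

```python
from pathlib import Path


def build_or_load_index(content_dir: str = "web/content", persist_dir: str = "./storage"):
    """Build a persisted vector index on first run; reload it afterwards."""
    # Imported lazily so this module still loads where llama-index is absent.
    from llama_index.core import (
        Settings,
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )
    from llama_index.embeddings.ollama import OllamaEmbedding

    # Same Ollama process the chat already uses (default local port assumed).
    Settings.embed_model = OllamaEmbedding(
        model_name="nomic-embed-text",
        base_url="http://localhost:11434",
    )

    # A previous run leaves docstore.json behind; reuse it instead of re-embedding.
    if (Path(persist_dir) / "docstore.json").exists():
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)

    # First run: read every file under content_dir, embed, and persist to disk.
    documents = SimpleDirectoryReader(input_dir=content_dir, recursive=True).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

A nightly or on-merge rebuild would then amount to deleting the persist directory and calling the function again.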
paiddaily.io
- Tickers-as-Resources index. 253 ticker pages embedded once; queried at chat time. The 8192-token context window absorbs the typical ticker page in a single chunk.
- Catalyst archive index. Each Pendle catalyst as a `Document` with `(market, date, outcome)` metadata; query embedding fused with structured filters.
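One way to read "query embedding fused with structured filters" is filter first, then rank by similarity. The sketch below uses plain Python rather than LlamaIndex so the logic is visible; `CatalystDoc`, `search`, and the field values are hypothetical, and the embeddings are assumed to be precomputed.

```python
import math
from dataclasses import dataclass


@dataclass
class CatalystDoc:
    """Hypothetical record: one catalyst plus its precomputed embedding."""
    market: str
    date: str  # ISO date string
    outcome: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, market=None, outcome=None, top_k=5):
    """Apply the structured filters first, then rank survivors by similarity."""
    candidates = [
        d for d in docs
        if (market is None or d.market == market)
        and (outcome is None or d.outcome == outcome)
    ]
    ranked = sorted(
        candidates, key=lambda d: cosine(query_vec, d.embedding), reverse=True
    )
    return ranked[:top_k]
```

Filtering before ranking keeps the similarity pass small and guarantees that every result satisfies the metadata constraints, at the cost of returning fewer than `top_k` results when the filter is narrow.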
sagedaily.io
- Astrology / tarot canon index. Vedic dasha entries, transit interpretations, tarot symbolism — currently inline in module prompts. Embed once, retrieve per reading.
- User reading history index (paired with Neo4j). Neo4j answers when; this index answers what's similar.
Gotchas
- Forgetting the task prefix tanks retrieval quality. Document gets `search_document:`, query gets `search_query:`. Easy to miss; silent failure mode.
- Matryoshka truncation needs L2-normalization after truncating. If you truncate from 768 to 256 dims, normalize the truncated vector, not the full one. Most libraries handle this; verify in a sanity test.
- The Ollama embeddings endpoint is rate-limited differently than chat. High-throughput indexing should batch through the Python `ollama` package rather than the chat-bridge proxy.
- Don't mix prefix-aware and prefix-naive corpora in one index. Re-index from scratch when changing the prefix convention.
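The truncation gotcha can be checked with a few lines of plain Python. This is a minimal sketch of "truncate, then normalize"; `truncate_and_normalize` is an illustrative name. Nomic's model card example also applies a layer norm before slicing, and that step is left out here, so check the card before relying on truncated vectors.

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components, then L2-normalize what remains."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    # Normalize the truncated slice, not the original full-length vector.
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]
```

A quick sanity test is to confirm that the result has unit length for each target size (512, 256, 128). Stored as 32-bit floats, a 768-dim vector takes 3072 bytes and a 256-dim vector 1024 bytes.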
Risks
- Single-vendor research-shop output. Nomic AI is a small company. Apache-licensed weights mean you keep them even if Nomic disappears, but a model upgrade path depends on them continuing to ship.
- Quality ceiling vs frontier embedders. `text-embedding-3-small` and `bge-large` both edge it out on some benchmarks. For a corpus where retrieval quality dominates (e.g., scientific literature), it is worth an A/B test before locking in.
Related
- LlamaIndex: the framework that hosts this model in the retrieval pipeline. `OllamaEmbedding` wires the two together.
- bge-reranker: the natural companion downstream. Embeddings retrieve top-k=20; reranker collapses to top-n=5 before answer synthesis.
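The retrieve-then-rerank handoff can be sketched independently of any particular reranker. Here `rerank_score` is a hypothetical callable standing in for a cross-encoder such as bge-reranker, and `candidates` is assumed to be a list of `(text, embedding_score)` pairs already produced by the index.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    candidates: Sequence[tuple[str, float]],
    rerank_score: Callable[[str, str], float],
    top_k: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Keep the top_k by embedding score, then the top_n by reranker score."""
    # Stage 1: cheap cut using similarity scores the index already computed.
    shortlist = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]
    # Stage 2: rescore only the shortlisted texts against the query.
    rescored = sorted(
        (text for text, _ in shortlist),
        key=lambda text: rerank_score(query, text),
        reverse=True,
    )
    return rescored[:top_n]
```

The reranker only ever sees the shortlist, so a chunk that the embedding stage ranks below `top_k` cannot be recovered downstream. That is the main reason to keep `top_k` comfortably larger than `top_n`.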
