sentence-transformers
The Python library that hosts every common bi-encoder (sentence embeddings) and cross-encoder (rerankers) behind one consistent API. Apache 2.0; pulls weights from Hugging Face; pairs with bge-reranker as the runtime. The lowest-friction way to add embeddings or reranking to a build that isn't on Ollama.
sentence-transformers is the library every GL retrieval upgrade routes through when Ollama isn't the right host. Embeddings, rerankers, cross-encoders — all wrapped in one consistent API. This page is the orient-and-anchor surface; official docs at sbert.net own the API contract.
What it is
A Python framework wrapping transformers with embedding-and-reranking-specific abstractions. Two model classes:
- SentenceTransformer: bi-encoders that produce a vector per input. Cosine-similarity afterward to rank candidates. Fast at retrieval time because the index is pre-computed.
- CrossEncoder: joint encoders that read (query, candidate) together and produce a single relevance score. Slow per pair (no pre-compute possible) but much higher quality. The reranker class.
Same from_pretrained-style instantiation as transformers. Apache 2.0. pip install sentence-transformers pulls the lib + torch + transformers as deps.
The pitch: every common embedding or reranker model on Hugging Face — bge-*, e5-*, mxbai-*, gte-*, the ms-marco-MiniLM-* family — works through this one library with the same code shape. Trade transformers-direct flexibility for a simpler API tuned to the embeddings + reranking workload.
When to use it
Reach for it when:
- You need a reranker (cross-encoder) in your retrieval pipeline. There's no Ollama-native cross-encoder serving today; this is the canonical local path.
- You need embeddings outside of Ollama — running on a GPU, batching at scale, or using a model not in the Ollama registry.
- You want batch embedding with high throughput: SentenceTransformer.encode(batch_size=64, ...) is significantly faster than serial API calls.
- You need fine-tuning of an embedding model on your own corpus (training triplets of (query, positive_doc, negative_doc)).
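The (query, positive_doc, negative_doc) shape from the fine-tuning bullet can be sketched without any library code. The helper below, build_triplets, is a hypothetical name, and drawing the negative at random from other queries' positives is an assumption made for illustration; real training sets often use mined hard negatives instead.

```python
import random
from typing import NamedTuple


class Triplet(NamedTuple):
    query: str
    positive_doc: str
    negative_doc: str


def build_triplets(pairs: list[tuple[str, str]], seed: int = 0) -> list[Triplet]:
    """Turn (query, positive_doc) pairs into triplets with a random negative.

    Assumption: a document that is the positive for a different query is an
    acceptable negative for this one.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    triplets = []
    for i, (query, positive) in enumerate(pairs):
        # Candidate negatives: every other pair's positive document.
        candidates = [
            doc for j, (_, doc) in enumerate(pairs) if j != i and doc != positive
        ]
        if not candidates:
            continue  # nothing to contrast against, so skip this query
        triplets.append(Triplet(query, positive, rng.choice(candidates)))
    return triplets


# Toy data, invented for this example.
pairs = [
    ("how do I reset my password", "Open Settings and choose Reset password."),
    ("what is the refund window", "Refunds are accepted within 30 days."),
    ("where is my invoice", "Invoices are listed under Billing."),
]
triplets = build_triplets(pairs)
```

Each resulting triplet carries one query, its known positive, and one contrasting document, which is the row shape triplet-style training losses expect.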
Skip it when:
- Ollama already hosts the embedding model and the call surface is fine — keep things on one daemon.
- The reranker you want has a cleaner standalone wrapper (e.g., Cohere's hosted API for CohereRerank).
- You're not yet doing any retrieval: install when slice 1 lands, not before.
At a glance
Core surface
from sentence_transformers import SentenceTransformer, CrossEncoder
# Bi-encoder for embeddings
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
vectors = embedder.encode(["passage 1", "passage 2"]) # shape: (2, 768)
# Cross-encoder for reranking
reranker = CrossEncoder("BAAI/bge-reranker-base")
scores = reranker.predict([("query", "passage 1"), ("query", "passage 2")])
What it handles
- Tokenization: auto-loaded with the model.
- Batching: encode(batch_size=...) splits the input into batches of that size, which keeps GPU memory use bounded.
- Pooling: mean / CLS / max pooling for sentence embeddings, abstracted per model.
- Normalization: encode(normalize_embeddings=True) returns unit vectors. The argument defaults to False, though some models ship a normalization layer of their own.
- Device handling: model = model.to("cuda") or "mps" or "cpu". Auto-detected by default.
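To make the normalization bullet concrete, here is a minimal pure-Python sketch of what unit-length vectors buy you: once both vectors are L2-normalized, a plain dot product equals their cosine similarity. The helper names are illustrative and not part of sentence-transformers.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)

# For unit vectors the dot product and the cosine similarity coincide.
similarity_from_units = dot(unit_a, unit_b)
similarity_from_raw = cosine(a, b)
```

This is why a vector store configured for dot-product search behaves like cosine search only when every stored vector has been normalized the same way.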
What it doesn't handle
- Serving as a daemon — sentence-transformers is a library, not a server. Wrap in FastAPI if you want an HTTP surface.
- Quantized GGUF — the GGUF format is Ollama / llama.cpp territory; sentence-transformers loads native safetensors.
- Production batching across requests: single-process, single-batch. For multi-tenant high-throughput, look at text-embeddings-inference (Hugging Face's serving binary) or a custom batched server.
How to integrate
Default integration for a GL retrieval pipeline:
- Install once. pip install sentence-transformers. Pulls ~2GB of torch deps; pin in a dedicated venv per service.
- Load the models you need. Cache hits on subsequent runs are fast; cold cache pulls weights from Hugging Face on first use.
- Wire into LlamaIndex. For embeddings: from llama_index.embeddings.huggingface import HuggingFaceEmbedding → embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5"). For rerankers: from llama_index.postprocessor.sbert_rerank import SentenceTransformerRerank → add to node_postprocessors.
- Batch for indexing. First-time indexing of a corpus is the only slow step. Use embedder.encode(docs, batch_size=64, show_progress_bar=True) to maximize GPU saturation. CPU still works, just slower.
- Persist the index. Embeddings should be computed once and stored. LlamaIndex StorageContext.persist(...) or a direct dump to disk.
- Warm on startup. Loading the model takes several seconds. For request-response paths, load at process start, not on first request.
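The warm-on-startup step can be sketched as a small holder object that loads the model when it is constructed rather than on the first request. WarmModel and load_sentence_transformer are hypothetical names introduced here; the loader is injectable so the pattern can be exercised without torch installed, and the single throwaway encode call is an assumption about what is worth paying for up front.

```python
import time
from typing import Any, Callable, Sequence


def load_sentence_transformer(model_name: str) -> Any:
    # Imported lazily so importing this module does not pull in torch.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


class WarmModel:
    """Holds one embedding model, loaded eagerly at construction time."""

    def __init__(
        self,
        model_name: str,
        loader: Callable[[str], Any] = load_sentence_transformer,
    ) -> None:
        started = time.perf_counter()
        self.model = loader(model_name)
        # One throwaway call so first-use costs are paid here,
        # not inside the first real request.
        self.model.encode(["warm-up"])
        self.load_seconds = time.perf_counter() - started

    def embed(self, texts: Sequence[str], batch_size: int = 64) -> Any:
        return self.model.encode(list(texts), batch_size=batch_size)
```

In a service, one WarmModel would be built at process start (for example in the web framework's startup hook) and shared by request handlers; load_seconds is there only so the startup cost can be logged.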
In the GL stack
builddaily.io
- Reranker runtime (slice 1). bge-reranker runs through sentence-transformers' CrossEncoder class inside the LlamaIndex query engine as the second-stage retrieval upgrade.
- Backup embedder. If nomic-embed-text via Ollama hits a quality ceiling, the SentenceTransformer("BAAI/bge-base-en-v1.5") swap is one line of code, with the same LlamaIndex interface.
paiddaily.io
- Tickers / Catalysts embedding pipeline. For datasets large enough that Ollama serial inference becomes the bottleneck, batched embedding through sentence-transformers + a GPU pulls indexing time down by 10–50×.
- Cross-encoder over precedent matches. "Show me Pendle catalysts that precede a setup like today's" — embedding gets you to top-20 candidates; a cross-encoder over the top-20 picks the actual precedents.
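The two-stage shape described above (embeddings narrow the field, a cross-encoder re-orders the shortlist) can be sketched with the scoring functions injected. Everything here is illustrative: retrieve_then_rerank is a hypothetical helper, the vectors are hand-written toy values, and word_overlap_score is a stand-in for a real pair scorer such as CrossEncoder.predict.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: Sequence[str],
    doc_vecs: Sequence[Sequence[float]],
    score_pair: Callable[[str, str], float],
    top_k: int = 20,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    # Stage 1: cheap vector ranking over the whole corpus.
    order = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    shortlist = order[:top_k]
    # Stage 2: the expensive pairwise scorer sees only the shortlist.
    scored = [(score_pair(query, docs[i]), i) for i in shortlist]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(docs[i], score) for score, i in scored[:top_n]]


def word_overlap_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: counts shared lowercase words.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = ["alpha beta", "beta gamma", "delta"]
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
results = retrieve_then_rerank(
    "beta gamma", [1.0, 0.0], docs, doc_vecs, word_overlap_score, top_k=2, top_n=2
)
```

In this toy run the vector stage ranks "alpha beta" first, and the pair scorer then promotes "beta gamma"; "delta" never reaches the second stage. With a real reranker, score_pair would wrap a batched predict call rather than scoring one pair at a time.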
sagedaily.io
- Astrology canon embedding. Indexing the canonical text once — batched, GPU if available, sentence-transformers is the right shape.
- Per-reading reranker. When the canon retrieval grows past simple cosine match, a CrossEncoder does the re-ordering.
Gotchas
- Different models want different prefixes. Some (bge-*, e5-*, nomic-*) expect task-prefixed input (query: / passage: or similar). In recent versions sentence-transformers adds a prefix only when you pass prompt or prompt_name to encode, or when the model's config sets a default prompt; otherwise prepend it yourself. The model card says which.
- Default pooling differs by model. Most use mean pooling; some use CLS. The library handles this from the model config, but custom training paths need to match.
- PyTorch is the dependency. Heavy install. If your service is small and you only need one embedding model occasionally, the cost might dwarf the value — consider stuffing the call through Ollama instead.
- MPS (PyTorch's Apple Silicon GPU backend) works but is often slower than a Metal-native runtime. On Apple Silicon, Ollama (which uses Metal natively) is often faster on the same model.
- Don't normalize twice. Some models output normalized vectors by default; check normalize_embeddings and the model card.
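For the prefix gotcha at the top of this list, a minimal sketch of prepending prefixes by hand. with_prefix and the PREFIXES table are hypothetical, and the prefix strings reflect my reading of the respective model cards; treat them as assumptions and confirm against the card for the exact checkpoint you load.

```python
# (query_prefix, passage_prefix) per model family.
# Assumed from model cards; verify before relying on them.
PREFIXES: dict[str, tuple[str, str]] = {
    "e5": ("query: ", "passage: "),
    "nomic": ("search_query: ", "search_document: "),
    "bge-en-v1.5": ("Represent this sentence for searching relevant passages: ", ""),
}


def with_prefix(texts: list[str], family: str, role: str) -> list[str]:
    """Prepend the family's query or passage prefix to each text."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    query_prefix, passage_prefix = PREFIXES[family]
    prefix = query_prefix if role == "query" else passage_prefix
    return [prefix + text for text in texts]


queries = with_prefix(["what moved the token price"], "e5", "query")
passages = with_prefix(["The unlock schedule changed."], "e5", "passage")
```

The prefixed strings are what you would hand to encode. Queries and passages must go through the same convention at index time and at query time, otherwise similarity scores degrade quietly.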
Risks
- Single-library concentration in the embedding ecosystem. sentence-transformers has become the standard; most embedding model authors target its API. A pivot would ripple, but the underlying transformers library still works.
- Maintainer transition. The original author (Nils Reimers) moved on, and maintenance has since passed to a Hugging Face-backed maintainer. Major releases since then have changed the training APIs, so upgrades are not always drop-in. Pin versions.
Related
- bge-reranker — the cross-encoder this library hosts in the GL retrieval pipeline.
- Hugging Face — the registry sentence-transformers pulls weights from.
- Ollama — the embedding serving alternative when the model is in the Ollama registry.
