bge-reranker
An open-weight cross-encoder reranker — the second stage of any serious RAG pipeline. Embeddings retrieve top-k=20 candidates; the reranker reads the query and each candidate jointly and re-scores. Runs locally via sentence-transformers; Apache 2.0; the free counterpart to Cohere Rerank API.
bge-reranker is the cheap, local, Apache-2.0 way to add the second-stage reranker that every "production-quality" RAG checklist mentions and most projects skip. This page is the orient-and-wire-it surface; the official model card on Hugging Face owns the inference-time contract.
What it is
A family of cross-encoder models from BAAI for re-ranking retrieved passages. Where an embedding model produces a vector per text and you cosine-similarity them after the fact, a cross-encoder reads (query, passage) together and outputs a single relevance score. The joint read is more accurate; the cost is you can't pre-compute it — every query × every candidate is a fresh forward pass.
The pipeline pattern: embeddings get you to top-k=20 cheaply (vector math over a pre-built index). The reranker reads those 20 jointly with the query and re-scores. You keep top-n=5 for the answer step. The two stages each play to their strengths.
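The two-stage pattern can be sketched in a few lines. This is a minimal illustration, not a production retriever: `cosine`, `retrieve`, and `rerank` are hypothetical helper names, the index is a plain Python list of (text, vector) tuples, and `score_pairs` stands in for whatever cross-encoder call you use (for example `CrossEncoder.predict`).

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             index: List[Tuple[str, Sequence[float]]],
             top_k: int = 20) -> List[str]:
    """Stage 1: rank pre-computed passage vectors by cosine similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def rerank(query: str,
           passages: List[str],
           score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
           top_n: int = 5) -> List[str]:
    """Stage 2: score each (query, passage) pair jointly and keep the best top_n."""
    scores = score_pairs([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```

Passing the scorer in as a callable keeps the sketch independent of any one library; in practice the first stage would be a vector store query rather than a sort over a Python list.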
Three sizes: bge-reranker-base (278M params, English), bge-reranker-large (560M, English), bge-reranker-v2-m3 (568M, multilingual + long context). All Apache 2.0.
When to use it
Reach for it when:
- The corpus is medium-to-large and embedding-only retrieval is letting near-misses through (the cited passage is sometimes the second- or third-best hit, not the top one).
- You need better top-n quality without changing the embedding model or the index.
- You want zero spend on retrieval — Cohere Rerank is the alternative but costs per call.
- The latency budget can absorb 20–100ms per query for the rerank step on CPU (much less on GPU).
Skip it when:
- The retrieval recall is the problem (the right passage never gets to top-20). Fix the embedding model or chunking first; rerankers can only re-order what retrieval surfaced.
- The corpus is tiny (< ~200 chunks). Stuffing top-20 retrieved chunks directly into the LM context is often fine; the rerank step doesn't pay back the latency.
- You're latency-constrained below ~150ms per query end-to-end. Add it later when other knobs are exhausted.
At a glance
Specs
- Architecture: cross-encoder (transformer reads [CLS] query [SEP] passage [SEP] jointly).
- bge-reranker-base: 278M params, ~50ms per pair on CPU, English.
- bge-reranker-large: 560M params, ~100ms per pair on CPU, English.
- bge-reranker-v2-m3: 568M params, multilingual, supports up to 8K-token passages.
- License: Apache 2.0.
Distribution
- sentence-transformers: from sentence_transformers import CrossEncoder → model = CrossEncoder("BAAI/bge-reranker-base") → model.predict([(query, passage)]). GL default path.
- Hugging Face Transformers: direct PyTorch / AutoModelForSequenceClassification for finer control.
- FlagEmbedding (BAAI's own SDK): purpose-built; thin wrapper over Transformers.
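For the direct Transformers path, a hedged sketch follows. It uses the usual sequence-classification pattern (tokenize the pairs together, read one logit per pair). The function names `load_reranker` and `score_pairs` are made up for this example, `max_length=512` is an assumed truncation limit, and `revision` is passed through to `from_pretrained` so a commit hash can be pinned.

```python
from typing import List, Optional, Tuple


def load_reranker(model_name: str = "BAAI/bge-reranker-base",
                  revision: Optional[str] = None):
    """Load tokenizer and model once; reuse them across queries."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, revision=revision)
    model.eval()  # inference only: disables dropout
    return tokenizer, model


def score_pairs(tokenizer, model, pairs: List[Tuple[str, str]],
                max_length: int = 512) -> List[float]:
    """Return one raw relevance logit per (query, passage) pair."""
    import torch

    with torch.no_grad():
        inputs = tokenizer([list(p) for p in pairs], padding=True, truncation=True,
                           max_length=max_length, return_tensors="pt")
        logits = model(**inputs).logits.view(-1).float()
    return logits.tolist()
```

The values returned here are raw logits, so only their order is meaningful for picking top-n.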
Inference shape
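from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-base")
# Illustrative inputs; in practice passages come from the retriever
query = "what does a reranker do?"
passages = ["A reranker re-scores retrieved passages.", "Bananas are yellow."]
# pairs is [(query, passage1), (query, passage2), ...]
pairs = [(query, p) for p in passages]
scores = reranker.predict(pairs)
# scores[i] is the relevance of passage i to the query; sort descending
ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)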
Higher = better. Whether the number is a raw logit or squashed into 0–1 depends on the library and its activation settings, so don't threshold on absolute values; use the score order to pick top-n.
How to integrate
Default integration order:
- Install. pip install sentence-transformers (pulls torch). One-time ~2GB download for the model on first use; cached locally afterwards.
- Pick the size. bge-reranker-base is the default: fast on CPU and "good enough" for almost every English RAG. Go large only if eval shows the base model is the bottleneck.
- Wire to LlamaIndex. from llama_index.postprocessor.sbert_rerank import SentenceTransformerRerank → rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=5) → add to the node_postprocessors list on the query engine.
- Set retrieval to over-fetch. Configure the retriever to return top_k=20 (or 30) so the reranker has candidates to re-order. Embedding retrieval is cheap; over-fetching costs almost nothing.
- Benchmark. Build a small eval set of (query, gold_passage_id) pairs. Measure recall@5 and mrr@5 with and without the reranker. The lift should be visible: 5–15 points on a well-tuned corpus.
- Pin the model version. BAAI/bge-reranker-base resolves to whatever the main branch currently holds on Hugging Face. Pin to a specific commit / revision so results stay reproducible.
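The benchmark step can be sketched as follows, assuming exactly one gold passage per query. `recall_at_k`, `mrr_at_k`, and `evaluate` are hypothetical helper names, and `search` is any function you supply that returns passage ids in ranked order; run it once with the reranker and once without, then compare.

```python
from typing import Callable, Dict, List, Tuple


def recall_at_k(ranked_ids: List[str], gold_id: str, k: int = 5) -> float:
    """1.0 if the gold passage appears in the first k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mrr_at_k(ranked_ids: List[str], gold_id: str, k: int = 5) -> float:
    """Reciprocal rank of the gold passage within the first k results, else 0.0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid == gold_id:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set: List[Tuple[str, str]],
             search: Callable[[str], List[str]],
             k: int = 5) -> Dict[str, float]:
    """Average both metrics over (query, gold_passage_id) pairs."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    recalls, rrs = [], []
    for query, gold_id in eval_set:
        ranked = search(query)
        recalls.append(recall_at_k(ranked, gold_id, k))
        rrs.append(mrr_at_k(ranked, gold_id, k))
    n = len(eval_set)
    return {f"recall@{k}": sum(recalls) / n, f"mrr@{k}": sum(rrs) / n}
```

Because the reranker only re-orders what retrieval surfaced, measure recall at the over-fetch depth (for example k=20) on the embedding stage as well, to confirm the gold passage is reaching the reranker at all.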
In the GL stack
builddaily.io
- Chat-bridge retrieval pipeline. Final stage after nomic-embed-text retrieval. Top-k=20 → rerank → top-n=5 → DSPy answer module. Catches the cases where "Sage" and "Sage Daily" both match a query about the product but only one is the right passage.
- Resources index search. Same shape, smaller corpus. The reranker matters more here because a single-page-quality answer depends on getting the exact right resource page first.
paiddaily.io
- Tickers archive search. "Which tickers have a strong pre-earnings setup" returns 20 plausible candidates from embedding similarity; the reranker reads the query and each ticker page jointly to score actual fit.
- Catalyst archive precedent search. Embedding gets you "Pendle catalysts near this date"; reranker gets you "Pendle catalysts that actually match this market shape."
sagedaily.io
- Astrology canon retrieval per reading. Each card or transit pulls ~20 candidate canonical passages; reranker collapses to the 3–5 most relevant for grounding the DSPy reading module. Quality matters more than throughput here — readings are low-volume, high-stakes per call.
Gotchas
- Reranker only improves precision, not recall. If the right passage isn't in the top-k from embedding retrieval, no amount of reranking surfaces it. Fix recall first.
- Latency adds linearly with k. Reranking top-20 is ~20 forward passes. Don't over-fetch beyond what your latency budget tolerates; top-k=20–30 is the sweet spot.
- Score scales aren't comparable across models. A score of 0.7 from bge-reranker-base doesn't mean the same thing as 0.7 from a Cohere rerank. Don't hard-code thresholds; use rank order.
- Cross-encoder ≠ bi-encoder. Don't try to "embed" with a cross-encoder; the architecture doesn't produce passage embeddings. They are different model classes.
- First inference is slow. Model load is several seconds; subsequent calls are fast. Warm the model on process start if running in a request-response pattern.
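The latency gotchas can be turned into a quick check. This is a rough sketch that assumes rerank time grows strictly linearly with k, which batching can bend in practice. `per_pair_ms` and `max_k_for_budget` are hypothetical helpers, and `score_pairs` is whatever scoring callable you use. The warm-up call keeps model load and first-call overhead out of the measurement.

```python
import time
from typing import Callable, List, Sequence, Tuple


def per_pair_ms(score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
                sample_pairs: List[Tuple[str, str]],
                repeats: int = 3) -> float:
    """Average wall-clock milliseconds per pair, measured after one warm-up call."""
    if not sample_pairs or repeats < 1:
        raise ValueError("need at least one pair and one repeat")
    score_pairs(sample_pairs)  # warm-up, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        score_pairs(sample_pairs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms / (repeats * len(sample_pairs))


def max_k_for_budget(pair_ms: float, budget_ms: float) -> int:
    """Largest k whose estimated rerank time (k * pair_ms) fits inside the budget."""
    if pair_ms <= 0:
        raise ValueError("pair_ms must be positive")
    return int(budget_ms // pair_ms)
```

Measure with passages of realistic length on the hardware you deploy to; per-pair cost depends on both.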
Risks
- GPU helps a lot. CPU is fine for 20-pair reranks at chat-response cadence; for batch indexing or larger top-k, plan around GPU availability.
- Single-research-shop weights. BAAI is a research institute; the v2 line is active but a version cadence change could leave you on an unmaintained checkpoint. Apache-2.0 means you keep the weights either way.
- English-first. bge-reranker-base and bge-reranker-large are English-focused. Use bge-reranker-v2-m3 for multilingual or long-context (8K passages).
Related
- LlamaIndex: the framework that hosts the reranker as a NodePostprocessor.
- nomic-embed-text: the natural first stage. Embeddings retrieve top-k=20; the reranker re-orders.
