Build Daily

Tinley Park · May 29, 2026
tool · BAAI (Beijing Academy of AI) · watching

bge-reranker

An open-weight cross-encoder reranker: the second stage of any serious RAG pipeline. Embeddings retrieve top-k=20 candidates; the reranker reads the query and each candidate jointly and re-scores. Runs locally via sentence-transformers; Apache 2.0; the free counterpart to the Cohere Rerank API.

Updated May 24, 2026

bge-reranker is the cheap, local, Apache-2.0 way to add the second-stage reranker that every "production-quality" RAG checklist mentions and most projects skip. This page is the orient-and-wire-it surface; the official model card on Hugging Face owns the inference-time contract.

What it is

A family of cross-encoder models from BAAI for re-ranking retrieved passages. Where an embedding model produces one vector per text and you compare vectors by cosine similarity after the fact, a cross-encoder reads (query, passage) together and outputs a single relevance score. The joint read is more accurate; the cost is that nothing can be pre-computed, so every query × candidate pair is a fresh forward pass.

The pipeline pattern: embeddings get you to top-k=20 cheaply (vector math over a pre-built index). The reranker reads those 20 jointly with the query and re-scores. You keep top-n=5 for the answer step. The two stages each play to their strengths.
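
The two-stage pattern above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: `retrieve_top_k`, `rerank_top_n`, the toy `overlap_score` scorer, and the 2-d vectors are all hypothetical. In a real setup, `score_pair` would be a cross-encoder call and the vectors would come from an embedding model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 20,
) -> List[str]:
    """Stage 1: cheap vector math over pre-computed passage vectors."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def rerank_top_n(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    n: int = 5,
) -> List[str]:
    """Stage 2: score every (query, passage) pair jointly, keep the best n."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:n]]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


# Hypothetical toy index: 2-d vectors stand in for real embeddings.
index = [
    ("Sage is a herb used in cooking", [0.95, 0.05]),
    ("Sage Daily publishes astrology readings", [0.8, 0.2]),
    ("Pendle catalysts for the quarter", [0.1, 0.9]),
]
query = "astrology readings from Sage Daily"

candidates = retrieve_top_k([1.0, 0.0], index, k=2)   # over-fetch
best = rerank_top_n(query, candidates, overlap_score, n=1)  # keep top-n
```

In this toy data the vector stage ranks the herb passage first, and the pairwise stage promotes the Sage Daily passage, which is the kind of near-miss a reranker is meant to catch.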

Three sizes: bge-reranker-base (278M params, English and Chinese), bge-reranker-large (560M, English and Chinese), bge-reranker-v2-m3 (568M, multilingual + long context). All Apache 2.0.

When to use it

Reach for it when:

  • The corpus is medium-to-large and embedding-only retrieval is letting near-misses through (the cited passage is sometimes the second- or third-best hit, not the top one).
  • You need better top-n quality without changing the embedding model or the index.
  • You want zero spend on retrieval — Cohere Rerank is the alternative but costs per call.
  • The latency budget can absorb 20–100ms per query for the rerank step on CPU (much less on GPU).

Skip it when:

  • The retrieval recall is the problem (the right passage never gets to top-20). Fix the embedding model or chunking first; rerankers can only re-order what retrieval surfaced.
  • The corpus is tiny (< ~200 chunks). Stuffing top-20 retrieved chunks directly into the LM context is often fine; the rerank step doesn't pay back the latency.
  • You're latency-constrained below ~150ms per query end-to-end. Add it later when other knobs are exhausted.

At a glance

Specs

  • Architecture — cross-encoder (transformer reads [CLS] query [SEP] passage [SEP] jointly).
  • bge-reranker-base — 278M params, ~50ms per pair on CPU, English and Chinese.
  • bge-reranker-large — 560M params, ~100ms per pair on CPU, English and Chinese.
  • bge-reranker-v2-m3 — 568M params, multilingual, supports up to 8K-token passages.
  • License — Apache 2.0.

Distribution

  • sentence-transformers. GL default path:
      from sentence_transformers import CrossEncoder
      model = CrossEncoder("BAAI/bge-reranker-base")
      model.predict([(query, passage)])
  • Hugging Face Transformers — direct PyTorch / AutoModelForSequenceClassification for finer control.
  • FlagEmbedding (BAAI's own SDK) — purpose-built; thin wrapper over Transformers.

Inference shape

from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-base")
# pairs is [(query, passage1), (query, passage2), ...]
scores = reranker.predict(pairs)
# scores[i] is the relevance of passage i to the query — sort desc

Higher = better. Whether the value is a raw logit or a sigmoid-squashed number in (0, 1) depends on the library and its settings, so don't threshold on absolute values; use the score order to pick top-n.
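
A small sketch of why rank order is the safer signal: any monotone rescaling, such as a sigmoid, changes the values but leaves the order untouched. The logits below are made up for illustration, and `rank_order` is a hypothetical helper, not part of any library.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_order(scores: List[float]) -> List[int]:
    """Indices of passages sorted from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical raw scores for four candidate passages.
raw_scores = [-2.3, 4.1, 0.4, -7.8]
squashed = [sigmoid(s) for s in raw_scores]

# The two scales disagree on values but agree on order.
order_raw = rank_order(raw_scores)
order_squashed = rank_order(squashed)
top_2 = order_raw[:2]  # indices of the two passages to keep
```

A fixed cutoff such as 0.5 would mean different things on the two scales, while `top_2` is the same either way.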

How to integrate

Default integration order:

  1. Install. pip install sentence-transformers (pulls torch). The model weights download once on first use (about 1 GB for base at fp32, over 2 GB for large) and are cached locally afterwards.
  2. Pick the size. bge-reranker-base is the default — fast on CPU, "good enough" for almost every English RAG. Go large only if eval shows the base model is the bottleneck.
  3. Wire to LlamaIndex. Construct the postprocessor, then add it to the node_postprocessors list on the query engine:
       from llama_index.postprocessor.sbert_rerank import SentenceTransformerRerank
       rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=5)
  4. Set retrieval to over-fetch. Configure the retriever to return top_k=20 (or 30) so the reranker has candidates to re-order. Embedding retrieval is cheap; over-fetching costs almost nothing.
  5. Benchmark. Build a small eval set of (query, gold_passage_id) pairs. Measure recall@5 and MRR@5 with and without the reranker. The lift should be visible, on the order of 5–15 points on a well-tuned corpus.
  6. Pin the model version. BAAI/bge-reranker-base resolves to whatever is at the head of the main branch on Hugging Face. Pin to a specific commit hash (revision) so results stay reproducible.
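
The benchmark in step 5 can be sketched with two small metric functions. This is a minimal sketch under stated assumptions: `evaluate` and the passage ids are hypothetical, each eval item is a (ranked_passage_ids, gold_passage_id) pair, and k=2 is used only because the toy lists are short.

```python
from typing import Dict, List, Tuple

EvalRun = List[Tuple[List[str], str]]  # (ranked passage ids, gold passage id)


def recall_at_k(ranked_ids: List[str], gold_id: str, k: int = 5) -> float:
    """1.0 if the gold passage appears in the top k, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank_at_k(ranked_ids: List[str], gold_id: str, k: int = 5) -> float:
    """1 / rank of the gold passage within the top k, or 0.0 if absent."""
    for rank, passage_id in enumerate(ranked_ids[:k], start=1):
        if passage_id == gold_id:
            return 1.0 / rank
    return 0.0


def evaluate(run: EvalRun, k: int = 5) -> Dict[str, float]:
    """Average recall@k and MRR@k over all queries in the run."""
    if not run:
        raise ValueError("eval set is empty")
    n = len(run)
    return {
        "recall@k": sum(recall_at_k(r, g, k) for r, g in run) / n,
        "mrr@k": sum(reciprocal_rank_at_k(r, g, k) for r, g in run) / n,
    }


# Hypothetical rankings for three queries, before and after reranking.
without_rerank = [
    (["p3", "p1", "p7"], "p1"),
    (["p2", "p9", "p4"], "p4"),
    (["p5", "p6", "p8"], "p0"),  # gold never retrieved
]
with_rerank = [
    (["p1", "p3", "p7"], "p1"),
    (["p4", "p2", "p9"], "p4"),
    (["p5", "p6", "p8"], "p0"),  # reranking cannot fix this one
]

baseline = evaluate(without_rerank, k=2)
reranked = evaluate(with_rerank, k=2)
```

The third query shows the ceiling: when the gold passage is missing from the candidate set, both metrics stay at zero for that query no matter how the candidates are re-ordered.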

In the GL stack

builddaily.io

  • Chat-bridge retrieval pipeline. Final stage after nomic-embed-text retrieval. Top-k=20 → rerank → top-n=5 → DSPy answer module. Catches the cases where "Sage" and "Sage Daily" both match a query about the product but only one is the right passage.
  • Resources index search. Same shape, smaller corpus. The reranker matters more here because a single-page-quality answer depends on getting the exact right resource page first.

paiddaily.io

  • Tickers archive search. "Which tickers have a strong pre-earnings setup" returns 20 plausible candidates from embedding similarity; the reranker reads the query and each ticker page jointly to score actual fit.
  • Catalyst archive precedent search. Embedding gets you "Pendle catalysts near this date"; reranker gets you "Pendle catalysts that actually match this market shape."

sagedaily.io

  • Astrology canon retrieval per reading. Each card or transit pulls ~20 candidate canonical passages; reranker collapses to the 3–5 most relevant for grounding the DSPy reading module. Quality matters more than throughput here — readings are low-volume, high-stakes per call.

Gotchas

  • Reranking improves ordering within the candidate set, not candidate recall. If the right passage isn't in the top-k from embedding retrieval, no amount of reranking surfaces it. Fix recall first.
  • Latency grows linearly with k. Reranking top-20 is ~20 forward passes. Don't over-fetch beyond what your latency budget tolerates; top-k=20–30 is the sweet spot.
  • Score scales aren't comparable across models. A score of 0.7 from bge-reranker-base doesn't mean the same thing as 0.7 from a Cohere rerank. Don't hard-code thresholds; use rank order.
  • Cross-encoder ≠ bi-encoder. Don't try to "embed" with a cross-encoder; the architecture doesn't produce passage embeddings. They are different model classes.
  • First inference is slow. Model load is several seconds; subsequent calls are fast. Warm the model on process start if running in a request-response pattern.
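
The warm-on-start advice can be sketched as a load-once holder. This is one possible shape, not a prescribed pattern: `WarmModel` and `load_reranker` are hypothetical names, and the loader imports sentence-transformers lazily so the module itself loads without it.

```python
import threading
from typing import Callable, Optional


class WarmModel:
    """Load a model once per process and reuse it across requests."""

    def __init__(self, loader: Callable[[], object]):
        self._loader = loader
        self._model: Optional[object] = None
        self._lock = threading.Lock()

    def get(self) -> object:
        # Double-checked locking so concurrent requests trigger one load.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model

    def warm(self) -> None:
        """Call at process start so the first request skips the load."""
        self.get()


def load_reranker() -> object:
    # Imported here so this module loads even if the package is absent.
    from sentence_transformers import CrossEncoder

    return CrossEncoder("BAAI/bge-reranker-base")


# Nothing is loaded yet; the load happens on warm() or the first get().
reranker = WarmModel(load_reranker)
```

Calling `reranker.warm()` from the server's startup hook moves the multi-second load out of the first request. Running one throwaway prediction after the load may also help if the runtime initializes lazily, though that depends on the setup.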

Risks

  • GPU helps a lot. CPU is fine for 20-pair reranks at chat-response cadence; for batch indexing or larger top-k, plan around GPU availability.
  • Single-research-shop weights. BAAI is a research institute; the v2 line is active but a version cadence change could leave you on an unmaintained checkpoint. Apache-2.0 means you keep the weights either way.
  • English-first. bge-reranker-base and bge-reranker-large are listed on their model cards as English and Chinese only. Use bge-reranker-v2-m3 for broader multilingual or long-context (8K passages) work.

Related

  • LlamaIndex — the framework that hosts the reranker as a NodePostprocessor.
  • nomic-embed-text — the natural first stage. Embeddings retrieve top-k=20; the reranker re-orders.