tool · BAAI (Beijing Academy of AI) · watching

bge-reranker

An open-weight cross-encoder reranker: the second stage of any serious RAG pipeline. Embeddings retrieve the top-k=20 candidates; the reranker reads the query and each candidate jointly and re-scores them. Runs locally via sentence-transformers; Apache 2.0; the free counterpart to the Cohere Rerank API.

Updated May 24, 2026

bge-reranker is the cheap, local, Apache-2.0 way to add the second-stage reranker that every "production-quality" RAG checklist mentions and most projects skip. This page is the orient-and-wire-it surface; the official model card on Hugging Face owns the inference-time contract.

What it is

A family of cross-encoder models from BAAI for re-ranking retrieved passages. Where an embedding model produces one vector per text and you compare those vectors by cosine similarity after the fact, a cross-encoder reads (query, passage) together and outputs a single relevance score. The joint read is more accurate; the cost is that you can't pre-compute it: every query × candidate pair is a fresh forward pass.

The pipeline pattern: embeddings get you to top-k=20 cheaply (vector math over a pre-built index). The reranker reads those 20 jointly with the query and re-scores. You keep top-n=5 for the answer step. The two stages each play to their strengths.
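The retrieve-then-rerank pattern above can be sketched as a small orchestration function. This is a minimal sketch, not this page's actual pipeline: `retrieve_fn` and `score_fn` are hypothetical callables standing in for the embedding retriever and the cross-encoder, and the toy word-overlap scorer exists only so the example runs without a model download.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    retrieve_fn: Callable[[str, int], List[str]],
    score_fn: Callable[[Sequence[Tuple[str, str]]], Sequence[float]],
    top_k: int = 20,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1: cheap retrieval of top_k candidates. Stage 2: joint re-scoring, keep top_n."""
    candidates = retrieve_fn(query, top_k)
    if not candidates:
        return []
    pairs = [(query, passage) for passage in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked[:top_n]]


# Toy stand-ins (hypothetical) so the sketch runs without a model.
CORPUS = [
    "rerankers re-score retrieved passages",
    "embeddings map text to vectors",
    "bananas are yellow",
]


def toy_retrieve(query: str, k: int) -> List[str]:
    return CORPUS[:k]  # a real retriever would query a vector index


def toy_score(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    # Word overlap between query and passage; a real scorer would be a cross-encoder.
    return [float(len(set(q.split()) & set(p.split()))) for q, p in pairs]


top = retrieve_then_rerank(
    "how do rerankers re-score passages", toy_retrieve, toy_score, top_k=3, top_n=2
)
```

In a real pipeline, `score_fn` would be the reranker's `predict` method and `retrieve_fn` a thin wrapper over the vector index; keeping both as plain callables makes either stage swappable without touching the orchestration.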

Three sizes: bge-reranker-base (~278M params, Chinese + English), bge-reranker-large (~560M, Chinese + English), bge-reranker-v2-m3 (568M, multilingual + long context). All Apache 2.0.

When to use it

Reach for it when:

The corpus is medium-to-large and embedding-only retrieval is letting near-misses through (the cited passage is sometimes the second- or third-best hit, not the top one).
You need better top-n quality without changing the embedding model or the index.
You want zero spend on retrieval — Cohere Rerank is the alternative but costs per call.
The latency budget can absorb the rerank step: at the ~50–100ms per pair on CPU listed in the specs, a 20-candidate rerank is on the order of 1–2s unbatched (much less on GPU).

Skip it when:

The retrieval recall is the problem (the right passage never gets to top-20). Fix the embedding model or chunking first; rerankers can only re-order what retrieval surfaced.
The corpus is tiny (< ~200 chunks). Stuffing top-20 retrieved chunks directly into the LM context is often fine; the rerank step doesn't pay back the latency.
You're latency-constrained below ~150ms per query end-to-end. Add it later when other knobs are exhausted.

At a glance

Specs

Architecture: cross-encoder (transformer reads [CLS] query [SEP] passage [SEP] jointly).
bge-reranker-base: 278M params, ~50ms per pair on CPU, Chinese + English.
bge-reranker-large: 560M params, ~100ms per pair on CPU, Chinese + English.
bge-reranker-v2-m3: 568M params, multilingual, supports up to 8K-token passages.
License: Apache 2.0.
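Because a cross-encoder cannot pre-compute anything, a rough latency budget follows directly from the per-pair figures above. This is a back-of-envelope sketch that assumes pairs are scored one at a time with no batching speedup; the helper name is hypothetical, and real numbers should be measured on your own hardware.

```python
def estimate_rerank_latency_ms(top_k: int, per_pair_ms: float) -> float:
    """Sequential estimate: one forward pass per (query, candidate) pair."""
    if top_k < 0 or per_pair_ms < 0:
        raise ValueError("top_k and per_pair_ms must be non-negative")
    return top_k * per_pair_ms


# Per-pair CPU figures quoted in the specs above.
base_ms = estimate_rerank_latency_ms(top_k=20, per_pair_ms=50)    # bge-reranker-base
large_ms = estimate_rerank_latency_ms(top_k=20, per_pair_ms=100)  # bge-reranker-large
print(f"base: ~{base_ms:.0f} ms, large: ~{large_ms:.0f} ms for 20 candidates")
```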

Distribution

sentence-transformers — from sentence_transformers import CrossEncoder → model = CrossEncoder("BAAI/bge-reranker-base") → model.predict([(query, passage)]). GL default path.
Hugging Face Transformers — direct PyTorch / AutoModelForSequenceClassification for finer control.
FlagEmbedding (BAAI's own SDK) — purpose-built; thin wrapper over Transformers.
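For the Transformers route, scoring pairs directly follows the usual sequence-classification pattern. The sketch below assumes `torch` and `transformers` are installed and that the checkpoint exposes a single relevance logit per pair; the function names and the `max_length=512` default are illustrative choices, not part of any library.

```python
from typing import List, Sequence, Tuple


def split_pairs(pairs: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Separate (query, passage) pairs into the two parallel lists the tokenizer takes."""
    queries = [q for q, _ in pairs]
    passages = [p for _, p in pairs]
    return queries, passages


def score_pairs(
    pairs: Sequence[Tuple[str, str]],
    model_name: str = "BAAI/bge-reranker-base",
    max_length: int = 512,
) -> List[float]:
    """Return one raw relevance logit per pair (higher = more relevant)."""
    # Heavy imports stay inside the function so importing this module is cheap.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    queries, passages = split_pairs(pairs)
    inputs = tokenizer(
        queries,
        passages,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits.view(-1)
    return logits.float().tolist()
```

Loading the tokenizer and model inside the function keeps the sketch short; a long-running service would load them once and reuse them across calls.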

Inference shape

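from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "what does a reranker do?"  # illustrative query
passages = ["A reranker re-scores retrieved passages.", "Bananas are yellow."]  # illustrative candidates
pairs = [(query, p) for p in passages]  # [(query, passage1), (query, passage2), ...]
scores = reranker.predict(pairs)  # scores[i] is the relevance of passage i to the query
ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)  # sort desc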

The score is a relative relevance signal, not a calibrated probability; higher = better. Don't threshold on absolute values; use the score order to pick top-n.
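Why rank order is the safe thing to rely on: any monotonic rescaling of the scores, such as a sigmoid, changes the absolute values but not the ordering. A small sketch with made-up logits (the numbers are illustrative, not model output):

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rank_order(scores: Sequence[float]) -> List[int]:
    """Indices of the passages, best score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


raw_logits = [2.3, -1.1, 0.4, 5.0]           # illustrative values only
squashed = [sigmoid(s) for s in raw_logits]  # now in (0, 1), same ordering

order_raw = rank_order(raw_logits)
order_squashed = rank_order(squashed)
top_n = order_raw[:2]                        # keep the two best candidates
```

A fixed cutoff such as "keep everything above 0.5" would select different passages depending on which scale the scores happen to be on, while `top_n` is identical on both.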

How to integrate

Default integration order:

Install. pip install sentence-transformers (pulls torch). One-time ~2GB download for the model on first use; cached locally afterwards.
Pick the size. bge-reranker-base is the default — fast on CPU, "good enough" for almost every English RAG. Go large only if eval shows the base model is the bottleneck.
Wire to LlamaIndex. from llama_index.postprocessor.sbert_rerank import SentenceTransformerRerank → rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=5) → add to the node_postprocessors list on the query engine.
Set retrieval to over-fetch. Configure the retriever to return top_k=20 (or 30) so the reranker has candidates to re-order. Embedding retrieval is cheap; over-fetching costs almost nothing.
Benchmark. Build a small eval set of (query, gold_passage_id) pairs. Measure recall@5 and MRR@5 with and without the reranker. The lift should be visible: 5–15 points on a well-tuned corpus.
Pin the model version. BAAI/bge-reranker-base resolves to whatever is at the head of the main branch on Hugging Face. Pin to a specific commit hash (revision) so results stay reproducible.
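The benchmark step can be sketched with two small metric functions. This assumes each query has exactly one gold passage ID, matching the (query, gold_passage_id) shape described above; the function names and the toy rankings are hypothetical.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose gold passage appears in the top k."""
    if not rankings:
        return 0.0
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)


def mrr_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 5) -> float:
    """Mean reciprocal rank of the gold passage; counts 0 when it is outside the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for q, ranked in rankings.items():
        top = ranked[:k]
        if gold[q] in top:
            total += 1.0 / (top.index(gold[q]) + 1)
    return total / len(rankings)


# Toy data: passage IDs per query, before and after reranking.
gold = {"q1": "p3", "q2": "p7"}
before = {"q1": ["p1", "p3", "p2"], "q2": ["p4", "p5", "p7"]}  # embedding order
after = {"q1": ["p3", "p1", "p2"], "q2": ["p4", "p7", "p5"]}   # reranked order

print("MRR@5 before:", round(mrr_at_k(before, gold), 3))
print("MRR@5 after: ", round(mrr_at_k(after, gold), 3))
```

In this toy data recall@5 is unchanged because the gold passage was already in the candidate set both times; only MRR moves, which is the expected signature of a reranker that re-orders without adding candidates.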

In the GL stack

builddaily.io

Chat-bridge retrieval pipeline. Final stage after nomic-embed-text retrieval. Top-k=20 → rerank → top-n=5 → DSPy answer module. Catches the cases where "Sage" and "Sage Daily" both match a query about the product but only one is the right passage.
Resources index search. Same shape, smaller corpus. The reranker matters more here because a single-page-quality answer depends on getting the exact right resource page first.

paiddaily.io

Tickers archive search. "Which tickers have a strong pre-earnings setup" returns 20 plausible candidates from embedding similarity; the reranker reads the query and each ticker page jointly to score actual fit.
Catalyst archive precedent search. Embedding gets you "Pendle catalysts near this date"; reranker gets you "Pendle catalysts that actually match this market shape."

sagedaily.io

Astrology canon retrieval per reading. Each card or transit pulls ~20 candidate canonical passages; reranker collapses to the 3–5 most relevant for grounding the DSPy reading module. Quality matters more than throughput here — readings are low-volume, high-stakes per call.

Gotchas

Reranker only improves precision, not recall. If the right passage isn't in the top-k from embedding retrieval, no amount of reranking surfaces it. Fix recall first.
Latency grows linearly with k. Reranking top-20 is ~20 forward passes. Don't over-fetch beyond what your latency budget tolerates; top-k=20–30 is the sweet spot.
Score scales aren't comparable across models. A score of 0.7 from bge-reranker-base doesn't mean the same thing as 0.7 from a Cohere rerank. Don't hard-code thresholds; use rank order.
Cross-encoder ≠ bi-encoder. Don't try to "embed" with a cross-encoder; the architecture doesn't produce passage embeddings. They are different model classes.
First inference is slow. Model load is several seconds; subsequent calls are fast. Warm the model on process start if running in a request-response pattern.
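One way to keep the model-load cost off the request path is to load once per process and run a throwaway prediction at startup. The sketch assumes sentence-transformers' `CrossEncoder`; the `loader` parameter is a hypothetical injection point so the caching logic can be exercised without downloading weights, and the `revision` argument (accepted by recent sentence-transformers releases; worth verifying against your installed version) is where a pinned commit hash would go.

```python
from typing import Any, Callable, Dict, Optional, Tuple

_MODELS: Dict[Tuple[str, Optional[str]], Any] = {}


def _default_loader(name: str, revision: Optional[str]) -> Any:
    # Imported lazily so this module loads fast and can be tested without the library.
    from sentence_transformers import CrossEncoder

    return CrossEncoder(name, revision=revision)


def get_reranker(
    name: str = "BAAI/bge-reranker-base",
    revision: Optional[str] = None,
    loader: Callable[[str, Optional[str]], Any] = _default_loader,
) -> Any:
    """Load the model once per (name, revision) and warm it with a dummy pair."""
    key = (name, revision)
    if key not in _MODELS:
        model = loader(name, revision)
        # Pay the first-inference cost now rather than on the first user request.
        model.predict([("warm-up query", "warm-up passage")])
        _MODELS[key] = model
    return _MODELS[key]
```

Calling `get_reranker()` once during process start means request handlers only ever hit the cached instance. The plain dict cache is not thread-safe; a multi-threaded server would want a lock around the load.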

Risks

GPU helps a lot. CPU is fine for 20-pair reranks at chat-response cadence; for batch indexing or larger top-k, plan around GPU availability.
Single-research-shop weights. BAAI is a research institute; the v2 line is active but a version cadence change could leave you on an unmaintained checkpoint. Apache-2.0 means you keep the weights either way.
Limited language coverage. bge-reranker-base and bge-reranker-large target Chinese and English. Use bge-reranker-v2-m3 for other languages or long context (8K passages).

Alternatives · 5 substitutes

Pick bge-reranker unless one of these wins on your specific brief.

01
Cohere Rerank API
Hosted, single-call rerank; ~$1 per 1k searches.
Wins when: budget supports it and the team doesn't want to host the model. Top-of-leaderboard quality, dead-simple API, multilingual. Pay-per-call surface; defeats "stay free / stay local."
02
ms-marco-MiniLM-L-12-v2
Tiny (~33M params) cross-encoder, trained on MS-MARCO.
Wins when: extreme latency / footprint constraints (CPU-only edge boxes, mobile, embedded). ~10× faster than bge-base; meaningfully lower quality on out-of-distribution queries.
03
mxbai-rerank-large · Mixedbread AI
Open-weight, claims SOTA on BEIR; same deploy story as bge.
Wins when: you want a peer alternative to bge from a different research lab. Quality is close; the call between them is usually noise unless you A/B on your own corpus.
04
ColBERT / ColBERTv2
Late-interaction retriever, sitting between embedding and cross-encoder.
Wins when: you want one-stage retrieval that's higher quality than vector cosine but cheaper than full rerank. Different paradigm; trades query-time cost for index-time storage cost. Not a drop-in for bge.
05
LLM-as-reranker
Ask a generative LM to rank passages by relevance.
Wins when: the relevance signal is contextual in ways a small cross-encoder misses ("which of these passages best fits a Build Daily voice answer"). 10–50× more expensive than bge; reach for it when the rerank quality is provably the bottleneck.

LlamaIndex — the framework that hosts the reranker as a NodePostprocessor.
nomic-embed-text — the natural first stage. Embeddings retrieve top-k=20; the reranker re-orders.