RERANKER · TOOL COMPARISON
Rerankers compared: Cohere, BGE, Jina, Voyage, ColBERT, mxbai, Mistral, sentence-transformers, RankGPT, FlashRank
Ten reranker options, four selection axes, +15-30% recall lift for RAG pipelines. As of May 2026.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is a reranker?
Reranking is a two-stage search pattern, and the reranker is the model that runs the second stage. Stage 1 quickly pulls, say, 50 candidates from a vector database. Stage 2 scores each of those 50 as a (question, candidate) pair with a cross-encoder model: it reads question and candidate together and produces a more precise relevance score. The top 3 or 5 then go to the language model. The trick: stage 1 is fast but rough; stage 2 is slower but markedly more accurate. Combined, almost every case study reports 15-30% higher Recall@5 than pure vector search.
In typical Swiss fiduciary RAG setups (May 2026) the reranker is one of the largest levers for answer quality – bigger than switching the embedding model, smaller only than clean chunking. Still, it is often skipped, because it adds 100-500 ms of latency and requires either an API integration or your own model on a GPU.
In May 2026 ten serious options exist. Three are pure APIs (Cohere, Voyage, Mistral; the mini-models hosted on Together/Bedrock are not counted separately), five are self-host-capable (BGE-Reranker-v2, Jina, mxbai, ColBERT, sentence-transformers), one is LLM-based (RankGPT/RankZephyr), and one is built for latency (FlashRank).
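The two-stage pattern can be sketched in a few lines. This is a minimal illustration, not a production setup: `stage1_retrieve` uses plain word overlap as a stand-in for vector search, and `toy_pair_score` (Jaccard similarity) is a stand-in for a real cross-encoder. Both functions and all names are assumptions made for this example.

```python
from typing import Callable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenisation; enough for a toy example."""
    return set(text.lower().split())


def stage1_retrieve(query: str, corpus: List[str], k: int = 50) -> List[str]:
    """Fast, rough first pass: rank by raw word overlap (stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: scores query and document together (Jaccard)."""
    q, d = tokenize(query), tokenize(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Slower, more precise second pass: score every (query, candidate) pair."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


corpus = [
    "vat rate switzerland 2026 and many other unrelated words about payroll",
    "vat rate switzerland",
    "holiday schedule",
]
candidates = stage1_retrieve("vat rate switzerland", corpus, k=2)
top = rerank("vat rate switzerland", candidates, toy_pair_score, top_n=1)
```

In a real pipeline, `score_pair` would call a cross-encoder or a rerank API; the surrounding control flow stays the same.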
Why reranking matters
Four axes decide suitability: quality gain, latency budget, hosting model, and cost. Whoever adds a reranker buys accuracy with time and money – and must decide how much of each they are willing to spend.
Quality gain: On BEIR (as of 2026), Cohere Rerank-3 lifts nDCG@10 by roughly 30% over plain BM25 and by 12-18% over dense retrieval alone. For legal mandates, where the correct precedent may sit at position 4 or 5 rather than at the top, this is the difference between usable and unusable.
Latency: Reranking a 50-candidate set typically takes 100-400 ms on a GPU or via a fast API. FlashRank does this on CPU in under 100 ms. RankGPT (LLM-based) needs 1-3 seconds – often too long for interactive use.
Hosting: As with embeddings, the reranker processes both the question and the document passages, so both leave your infrastructure when an API is used. For professional-secrecy mandates an API reranker (Cohere, Voyage) therefore implies data transfer to a third country. Self-hosting (BGE-Reranker-v2, Jina, mxbai-rerank) avoids this. Cohere via AWS Bedrock eu-central-1 is a middle ground.
Cost: In May 2026 Cohere Rerank costs USD 2 per 1,000 queries. Voyage rerank-2 is much cheaper at USD 0.05 per 1,000 queries, and Mistral Rerank costs EUR 0.40 per 1,000 queries. For a fiduciary office at 200 queries/month, that is USD 0.40 for Cohere, USD 0.01 for Voyage, and EUR 0.08 for Mistral – all negligible. At 100,000 queries/month the figures are USD 200, USD 5, and EUR 40 – there the choice matters.
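The cost arithmetic above is easy to reproduce for your own volume. The sketch below uses only the per-1,000-query list prices quoted in this article; the dictionary layout and the `monthly_cost` helper are illustrative names, and currencies are deliberately not converted.

```python
# Prices per 1,000 queries as quoted in this article (May 2026).
PRICE_PER_1K = {
    "Cohere Rerank": ("USD", 2.00),
    "Voyage rerank-2": ("USD", 0.05),
    "Mistral Rerank": ("EUR", 0.40),
}


def monthly_cost(price_per_1k: float, queries_per_month: int) -> float:
    """Linear cost model: price per 1,000 queries times monthly volume."""
    return price_per_1k * queries_per_month / 1000


for volume in (200, 100_000):
    for name, (currency, price) in PRICE_PER_1K.items():
        cost = monthly_cost(price, volume)
        print(f"{volume:>7} queries/month  {name:<16} {currency} {cost:.2f}")
```

The model assumes flat pricing with no free tier, minimum fee, or volume discount; check the vendor's current price sheet before budgeting.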
The ten options in detail
Cohere Rerank (proprietary API, USD 2/1k queries, via AWS Bedrock also EU): rerank-multilingual-v3.0. Industry standard in May 2026, wins almost every benchmark, very good German and French quality. Bedrock hosting in eu-central-1 allows EU residency. Default pick when API is OK.
BGE-Reranker-v2 (BAAI, Apache 2.0, self-host): bge-reranker-v2-m3, multilingual, quality very close to Cohere. Runs comfortably on GPU, on CPU with patience via ONNX. Free open-source standard for on-prem RAG.
Jina Reranker (jina-reranker-v2, Apache 2.0 + cloud, Berlin): multilingual, EU cloud Frankfurt, also self-host. Berlin vendor – EU data protection native. Attractive for clients who want a European API provider.
Voyage Rerank (rerank-2, proprietary API, USD 0.05/1k queries): very strong in benchmarks and by far the cheapest. US hosting by default, EU via AWS Bedrock. For English-heavy setups with cost focus.
ColBERT / ColBERTv2 (Stanford, MIT licence, self-host): late-interaction model that takes a different route: it stores a small vector per token and matches query tokens against document tokens pairwise. Very accurate but storage-heavy (10-100x the storage of dense embeddings). Niche, but top quality.
mxbai-rerank-large-v1 (MixedBread AI, Apache 2.0, self-host): 1024-dim cross-encoder, ONNX-capable. Quality decent, compact enough for mid-tier hardware. Solid mid-tier self-host.
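Late interaction can be illustrated with the MaxSim scoring rule: for each query token vector, take its best match among the document's token vectors, then sum those maxima. The sketch below uses toy 2-dimensional vectors and plain dot products; real token embeddings are learned and much larger, and the function names are assumptions for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum over query tokens of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Two query tokens, each a toy unit vector.
query = [[1.0, 0.0], [0.0, 1.0]]

# doc_a has a good match for both query tokens; doc_b matches only the first.
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[1.0, 0.0], [1.0, 0.0]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because one vector is kept per document token instead of one per passage, the index grows with document length, which is where the storage overhead comes from.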
Mistral Rerank (proprietary EU API, La Plateforme Paris, EUR 0.40/1k queries): EU-native reranking, DPA easy. Younger than Cohere, quality climbing. Friendly in the EU AI Act context.
sentence-transformers cross-encoder (Apache 2.0, self-host): the classic ms-marco cross-encoder family (ms-marco-MiniLM-L-6-v2 and relatives). Now dated, but if your stack already uses sentence-transformers it is 5 lines of code. English-dominant, only passable on German.
RankGPT / RankZephyr (self-host, LLM-based): uses a full language model, rather than a cross-encoder, to re-sort the candidate list. Very accurate but slow (1-3 s) and expensive (LLM token cost). Useful for offline reranking of large corpora, not for live queries.
FlashRank (MIT licence, self-host): ultra-fast reranker via ONNX runtime, several small cross-encoder variants. Under 100 ms on CPU. Quality slightly below BGE-Reranker-v2, but latency beats everything. First choice when sub-100ms is mandatory.
Selection workflow in 6 steps
- 01 Measure baseline: Recall@5 and nDCG@10 without reranker on an eval suite (30-50 Q/A pairs).
- 02 Define latency budget: how many ms may the reranker cost at most? Under 100 ms: FlashRank. Under 500 ms: BGE/Cohere/Voyage. RankGPT only if more than 1 s is acceptable.
- 03 Check hosting constraint: professional secrecy → self-host (BGE-Reranker-v2, mxbai-rerank, Jina-self-host). EU OK → Cohere Bedrock, Mistral, Jina-cloud.
- 04 Language profile: German/French dominant → BGE-Reranker-v2-m3, Cohere Rerank-multilingual or Jina-v2. English dominant → Voyage rerank-2.
- 05 Test top 3: A/B test 3 reranker candidates on the same eval suite. Document the Recall@5 gain.
- 06 Calibrate top-K: stage 1 typically fetches 30-50 candidates, stage 2 ranks them, top 3 or 5 go to the LLM. Find the optimum experimentally.
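The two metrics named in step 01 can be computed with a few lines of standard-library Python. This is a sketch under simplifying assumptions: relevance is binary (a document is either relevant or not), ranked lists contain no duplicate IDs, and the function names are illustrative.

```python
import math
from typing import List


def recall_at_k(ranked_ids: List[str], relevant_ids: List[str], k: int = 5) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: List[str], relevant_ids: List[str], k: int = 10) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Averaging both values over the 30-50 Q/A pairs of the eval suite gives the baseline numbers that later steps compare against.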
Recommendation by use-case
Swiss law/fiduciary, on-prem mandatory, standard latency OK: BGE-Reranker-v2-m3 self-hosted on the same GPU as BGE-M3 embedding. Free, multilingual, best open-source quality.
Swiss SME, EU hosting OK, top quality: Cohere Rerank via AWS Bedrock eu-central-1. DPA via AWS, industry standard, very good on German.
EU vendor wanted, API: Jina Reranker (Berlin vendor) or Mistral Rerank (Paris). Both DPA-friendly, EU hosting native.
Massive queries, cost focus, English-heavy: Voyage rerank-2. 40x cheaper than Cohere, quality very close.
Sub-100ms latency required, self-host, mid quality OK: FlashRank on CPU. Live-chat setups where the user will not wait.
Maximum accuracy, latency irrelevant (offline reranking): RankGPT with GPT-4o or Claude Sonnet. For batch pipelines (daily indexing, not live).
Tier-plus setup (dense + sparse + reranker): BGE-M3 embedding + BM25 + BGE-Reranker-v2 as a hybrid. Often the top combination on BEIR benchmarks.
When reranking can be skipped
If your retrieval Recall@5 without reranker is already above 85% – typical for small, cleanly structured corpora with few similar documents – the latency overhead is not worth it. Measure first, add second.
If your latency requirement is below 200 ms end-to-end (voice agent, live chat with typing indicator), API-based reranking is risky, because the network round-trip alone can consume much of the budget. Local FlashRank can manage; everything else is not reliable.
If your questions are really keywords (invoice number, client name, document date), semantic reranking is overkill. BM25 plus exact filters works better there.
If you cannot A/B test, that is, measure the effect, do not add a reranker. Without comparison numbers you do not know whether it helps. A small eval suite with 30-50 Q/A pairs is enough: measure without reranker, then with, and document the delta.
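A with/without comparison can be as small as the sketch below. It assumes the eval suite is a list of (question, relevant document IDs) pairs and that each pipeline is a function returning ranked document IDs for a question; `compare`, `baseline_rank`, and `reranked_rank` are hypothetical names for this example.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

EvalSuite = List[Tuple[str, List[str]]]
Ranker = Callable[[str], List[str]]


def recall_at_5(ranked_ids: List[str], relevant_ids: List[str]) -> float:
    """Share of relevant documents found in the top 5 results."""
    return len(set(ranked_ids[:5]) & set(relevant_ids)) / len(relevant_ids)


def compare(suite: EvalSuite, baseline_rank: Ranker, reranked_rank: Ranker) -> Dict[str, float]:
    """Mean Recall@5 without and with the reranker, plus the difference."""
    base = mean(recall_at_5(baseline_rank(q), rel) for q, rel in suite)
    new = mean(recall_at_5(reranked_rank(q), rel) for q, rel in suite)
    return {"baseline": base, "reranked": new, "delta": new - base}


# Tiny illustrative suite; a real one would hold 30-50 pairs.
suite = [("q1", ["a"]), ("q2", ["b"])]
result = compare(
    suite,
    baseline_rank=lambda q: ["x", "y", "z", "w", "v", "a", "b"],
    reranked_rank=lambda q: ["a", "b", "x"],
)
```

With only 30-50 pairs a small delta can be noise, so it is worth re-running the comparison whenever the corpus or chunking changes.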
Trade-offs
STRENGTHS
- Cohere Rerank: industry standard, best quality on German and French
- BGE-Reranker-v2: best open-source model, Apache 2.0, multilingual
- Voyage Rerank: 40x cheaper than Cohere at very good quality
- FlashRank: only reranker with reliable sub-100ms CPU latency
WEAKNESSES
- Cohere/Voyage API: third-country data transfer (unless via AWS Bedrock EU)
- Self-host BGE/Jina: GPU maintenance, model updates, version pinning required
- RankGPT: 1-3 s latency, LLM token cost – only for offline batches
- ColBERT: 10-100x storage overhead compared to dense embeddings
FAQ
How much better do answers really get with reranking?
On BEIR benchmarks 2025/2026, Cohere Rerank-3 lifts nDCG@10 over dense-only retrieval by 12-18%, over BM25-only by 25-30%. BGE-Reranker-v2-m3 sits 2-4 points behind Cohere. In a concrete fiduciary setup with 5,000 documents we measured +18% Recall@5 – correct answers without hallucination went from 78% to 89%.
Do I need GPU for self-host reranking?
For acceptable latency: yes. BGE-Reranker-v2-m3 on an RTX 3060 ranks 50 candidates in ~150 ms. On pure CPU it is 2-5 seconds. FlashRank is the CPU-friendly alternative – smaller models, ONNX-optimised, sub-100 ms on a modern Xeon/EPYC.
How do I combine reranking with hybrid (dense + sparse) search?
Default pattern May 2026: stage 1a (dense BGE-M3, k=30) + stage 1b (BM25 via Tantivy/Elasticsearch, k=30) → reciprocal rank fusion to k=50 → stage 2 BGE-Reranker-v2 ranks to top 5. Often the top configuration on BEIR. Cost: higher because of two indexes; quality: clearly better, especially for queries with proper names.
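The fusion step in this pattern can be sketched as follows. Reciprocal rank fusion scores each document as the sum of 1 / (c + rank) over the lists it appears in; the constant c = 60 used here is a commonly cited default and an assumption of this example, not a value from the setup described above. The function and parameter names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    rankings: List[List[str]], k_out: int = 50, rrf_c: int = 60
) -> List[str]:
    """Fuse several ranked ID lists; documents ranked high in many lists win."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_c + rank)
    fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return fused[:k_out]


dense_hits = ["a", "b", "c"]   # stand-in for the dense (stage 1a) result list
sparse_hits = ["b", "d", "a"]  # stand-in for the BM25 (stage 1b) result list
fused = reciprocal_rank_fusion([dense_hits, sparse_hits], k_out=50)
```

The fused list is what stage 2 receives; because RRF works on ranks only, the dense and BM25 scores never need to be put on a common scale.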
Does reranking work on French and Italian too?
Yes, but only with multilingual models. BGE-Reranker-v2-m3, Cohere rerank-multilingual-v3, Mistral Rerank, and Jina-v2 are tested multilingual. ms-marco-MiniLM and FlashRank defaults are English-centric – too weak for DE/FR/IT corpora. Always test rerankers on an eval set in the same language as your corpus.