fairlane.systems

Cohere Rerank: industry standard for RAG re-ranking

Cohere rerank-multilingual-v3.0 is the established API reranker for RAG pipelines in May 2026. USD 2 per 1k queries, EU hosting via AWS Bedrock Frankfurt.

Researched & fact-checked by: · As of: 2026-05

What is Cohere Rerank?

Cohere Rerank is the reranker API of the Canadian-American AI company Cohere Inc., founded in 2019 in Toronto by a group of former Google Brain researchers around Aidan Gomez. Cohere has for years positioned itself explicitly as an enterprise-focused LLM provider; the Rerank product was one of the first commercially available cross-encoder rerankers on the market and, as of May 2026, remains the industry standard for API-based re-ranking in RAG pipelines.

The current model is rerank-multilingual-v3.0 (alternatively rerank-english-v3.0 for English-only corpora). It is a cross-encoder that sees question and document passage together and emits a relevance score between 0 and 1. Rerank is therefore fundamentally different from an embedding model: no vector per text but a score per question-document pair. This architecture is markedly more accurate than pure vector similarity – on BEIR benchmarks Cohere Rerank lifts nDCG@10 over pure dense retrieval by 12-18 percent, over BM25 by 25-30 percent.

The multilingual variant covers over 100 languages. On MTEB-DE reranking the model sits in the top 3 in May 2026, just ahead of BGE-Reranker-v2-m3. On English, Cohere Rerank has been in the top tier since release and is often number one on public benchmarks.

Price as of May 2026: USD 2 per 1000 search calls. A search call in Cohere's definition ranks up to 1000 documents in one request, which is relatively generous. A fiduciary with 200 calls per month pays USD 0.40 per month. Even at 10,000 calls per month that is USD 20, usually money well spent for the quality gain.
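The arithmetic behind these figures is simple enough to script. The following is a minimal sketch that assumes the flat per-call pricing described above; the function name `monthly_rerank_cost_usd` is made up for illustration and is not part of any Cohere SDK.

```python
def monthly_rerank_cost_usd(calls_per_month: int, usd_per_1k_calls: float = 2.0) -> float:
    """Monthly cost when every rerank request is billed as one search call.

    The default of USD 2 per 1000 calls is the price quoted in the text.
    """
    if calls_per_month < 0:
        raise ValueError("calls_per_month must be non-negative")
    return calls_per_month / 1000 * usd_per_1k_calls


# Volumes taken from the text: a small fiduciary and a busier setup.
for calls in (200, 10_000):
    print(f"{calls:>7,} calls/month -> USD {monthly_rerank_cost_usd(calls):,.2f}")
```

Passing a different `usd_per_1k_calls` lets the same function price other vendors or a renegotiated rate.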

Why it matters for Switzerland

Three arguments make Cohere Rerank attractive for Swiss mandates. First, quality in a multilingual context. Swiss fiduciaries and law firms work with a mix of DE/FR/IT/EN. On MTEB reranking tasks for these four languages rerank-multilingual-v3.0 reliably leads – the nearest competitors, BGE-Reranker-v2 and Jina-Rerank-v2, are 2-4 points behind. In a concrete fiduciary setup with 5000 documents we measured +18 percent Recall@5 with Cohere Rerank over pure dense search, and the share of hallucination-free answers rose from 78 to 89 percent.

Second, EU hosting via AWS Bedrock. Cohere Rerank is available as a Bedrock foundation model in region eu-central-1 (Frankfurt) as of May 2026. Question and document then stay in the EU and the AWS standard DPA applies. Using the Cohere-owned endpoint (cohere.com) means US or Canada hosting, which is problematic for Swiss mandates under the strict requirements of the nFADP. The Bedrock variant is the pragmatic EU path.

Third, maturity and standardisation. Cohere Rerank has been on the market since 2023, has very stable API semantics, and is a first-class integration in almost every RAG framework (LangChain, LlamaIndex, Haystack, Cohere Connect). The probability of a migration or update causing pain is lower than with younger providers. For banks, insurers, and fiduciaries with long procurement cycles this stability is itself an asset.

The weak point: mandates bound by professional secrecy under SCC Art. 321 may still find Cohere via Bedrock problematic, because the vendor is subject to US law. BGE-Reranker-v2-m3 as a self-host solution remains the clean answer.

How it works

Cohere Rerank is a classic cross-encoder. Unlike a bi-encoder embedding model (which produces a vector per text), the reranker sees question and document passage together in the input. The internal transformer network scores word-level relationships between question and document pairwise – hence the term cross-encoder. Output is a scalar between 0 (irrelevant) and 1 (highly relevant).

The typical two-stage pipeline: stage 1 pulls 50 candidates from Qdrant via vector similarity. Stage 2 sends all 50 candidates pairwise with the question through Cohere Rerank and sorts by returned score. Top 3 or top 5 go to the language model. Stage 2 latency is about 200-400 ms per 50 candidates – typically a third of total RAG response time.
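The control flow of this two-stage pipeline can be sketched independently of any vendor. The following is an illustrative sketch, not the API of Qdrant or Cohere: `retrieve_and_rerank`, `retrieve` and `score_pairs` are hypothetical names, and the toy retriever and word-overlap scorer only stand in for the vector search and the cross-encoder.

```python
from typing import Callable, Sequence


def retrieve_and_rerank(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],
    score_pairs: Callable[[str, Sequence[str]], Sequence[float]],
    stage1_k: int = 50,
    top_n: int = 5,
) -> list[tuple[float, str]]:
    """Stage 1 fetches candidates; stage 2 scores each (query, passage) pair and sorts."""
    candidates = list(retrieve(query, stage1_k))
    scores = list(score_pairs(query, candidates))
    if len(scores) != len(candidates):
        raise ValueError("scorer must return exactly one score per candidate")
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]


# Toy stand-ins so the sketch runs without a vector database or a reranker API.
corpus = ["vat filing deadline rules", "office opening hours", "vat rates overview"]


def toy_retrieve(query: str, k: int) -> list[str]:
    return corpus[:k]  # a real stage 1 would query the vector store


def toy_overlap_score(query: str, passages: Sequence[str]) -> list[float]:
    query_words = set(query.lower().split())
    return [len(query_words & set(p.lower().split())) / len(query_words) for p in passages]


top = retrieve_and_rerank("vat filing deadline", toy_retrieve, toy_overlap_score, top_n=2)
for score, passage in top:
    print(f"{score:.2f}  {passage}")
```

Passing the retriever and scorer in as callables keeps the pipeline testable offline and makes it easy to swap the reranker later.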

Integration sample via the Cohere-owned endpoint:

```python
import cohere

co = cohere.Client(api_key="cohere-xxx")

resp = co.rerank(
    model="rerank-multilingual-v3.0",
    query="Which deadline applies to the VAT filing?",
    documents=candidates,  # list of 50 strings from Qdrant
    top_n=5,
)

for result in resp.results:
    print(result.relevance_score, candidates[result.index])
```

The API returns the top-N hits with score and original index. The per-call limit is 1000 documents, which is irrelevant for practical apps with 30-100 candidates.
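Because each hit carries only a score and the original index, the caller has to map hits back to the candidate list. A minimal sketch of that step follows. `RerankHit` is a stand-in dataclass that mirrors just the two fields used in the sample above, and the `min_score` cut-off is an assumption added for illustration, not a documented API feature.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RerankHit:
    """Stand-in for one result entry: original position plus relevance score."""
    index: int
    relevance_score: float


def select_passages(
    hits: Sequence[RerankHit],
    candidates: Sequence[str],
    min_score: float = 0.0,
) -> list[tuple[float, str]]:
    """Resolve each hit to its candidate text and drop hits below min_score."""
    selected = []
    for hit in hits:
        if hit.relevance_score >= min_score:
            selected.append((hit.relevance_score, candidates[hit.index]))
    return selected


# Made-up scores purely for demonstration.
demo_candidates = ["passage A", "passage B", "passage C"]
demo_hits = [RerankHit(2, 0.91), RerankHit(0, 0.40), RerankHit(1, 0.05)]
print(select_passages(demo_hits, demo_candidates, min_score=0.3))
```

A suitable cut-off depends on the corpus and should be chosen from the observed score distribution rather than copied from this example.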

Via AWS Bedrock the call goes through the boto3 SDK:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

resp = bedrock.invoke_model(
    modelId="cohere.rerank-multilingual-v3.0",
    body=json.dumps({
        "query": "Which deadline applies to the VAT filing?",
        "documents": candidates,
        "top_n": 5,
    }),
)
results = json.loads(resp["body"].read())["results"]
```

Question and document then physically reside in eu-central-1. The Bedrock model catalogue can shift; as of May 2026 Cohere Rerank is available in eu-central-1, us-east-1, and ap-northeast-1.

Calibrating top_n: we recommend pulling 30-50 candidates in stage 1 and reducing to 3-5 in stage 2. More candidates in stage 1 raise recall but add reranker latency. Fewer than 30 candidates in stage 1 risks top-K loss: a relevant document that sits at position 40 never reaches the reranker.
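The effect of the stage-1 cut-off can be checked on an eval set before touching the reranker. The sketch below assumes you have recorded, per query, the 1-based stage-1 rank of the relevant document (or `None` if it was not retrieved at all). The function name and the sample ranks are invented for illustration.

```python
from typing import Optional, Sequence


def stage1_recall(relevant_ranks: Sequence[Optional[int]], stage1_k: int) -> float:
    """Share of queries whose relevant document is among the first stage1_k candidates."""
    if not relevant_ranks:
        raise ValueError("need at least one query")
    reached = sum(1 for rank in relevant_ranks if rank is not None and rank <= stage1_k)
    return reached / len(relevant_ranks)


# Hypothetical ranks for six eval queries; None means stage 1 never found the document.
ranks = [1, 3, 12, 40, 75, None]
for k in (30, 50, 100):
    print(f"stage1_k={k:>3}: recall ceiling {stage1_recall(ranks, k):.2f}")
```

The value returned is an upper bound for the whole pipeline: the reranker can reorder candidates, but it cannot recover a document that stage 1 cut off.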

Cohere Rerank to production in 5 steps

  1. Pick the hosting path: Cohere endpoint (cohere.com, US/CA) or AWS Bedrock eu-central-1 – CH mandates almost always Bedrock.
  2. Pick the model: rerank-multilingual-v3.0 for DE/FR/IT/EN mix, rerank-english-v3.0 for pure EN corpora (1-2 points better).
  3. Build the pipeline: stage 1 Qdrant pulls 30-50 candidates, stage 2 Cohere Rerank ranks, top 3-5 go to LLM. Measure latency profile.
  4. Baseline and comparison: measure Recall@5 and nDCG@10 on eval suite before reranker, then with, document the delta. Expect +12-18 points.
  5. Monitoring: Cohere call latency, token usage, score distribution in Grafana/Loki. Anomalies (very low top scores) as an alarm for corpus drift.
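The baseline comparison in step 4 needs only two small metric functions. The following sketch assumes binary relevance labels and ranked lists without duplicate IDs; the function names and the sample document IDs are illustrative, not taken from an existing eval suite.

```python
import math
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 10) -> float:
    """nDCG with binary gains: a relevant hit at 0-based position p contributes 1 / log2(p + 2)."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Invented rankings for one query, before and after adding the reranker.
relevant = {"d1", "d3"}
before = ["d7", "d2", "d9", "d1", "d4", "d3"]
after = ["d1", "d3", "d7", "d2", "d9", "d4"]

for label, ranking in (("before", before), ("after", after)):
    print(label, round(recall_at_k(ranking, relevant), 3), round(ndcg_at_k(ranking, relevant), 3))
```

Averaging both metrics over the full eval suite, once without and once with the reranker, gives the delta to document.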

When to use Cohere Rerank

Cohere Rerank is the right choice when (a) maximum reranking quality in multilingual settings is wanted, (b) EU hosting via AWS Bedrock is acceptable, (c) an established and stable API is preferred, or (d) the team already lives in the Cohere world (Embed v3, Cohere Command R+).

Concrete cases: a fiduciary with DE/FR/IT clients wanting maximum RAG recall without running their own GPU setup. A law firm with German and English practice wanting BEIR-benchmark-level ranking. An insurer with a long-running AWS stack sourcing everything from one hand via Bedrock.

A very sensible combination is Cohere Embed v3 plus Cohere Rerank-3: both from the same vendor, both via Bedrock Frankfurt, with well-coordinated semantic alignment. This Cohere stack is very common in the Swiss enterprise segment in May 2026.

For migrations from OpenAI-only RAG, Cohere Rerank is a simple first step toward a hybrid stack. OpenAI text-embedding-3-small in stage 1, Cohere Rerank in stage 2 – one more API key for 15-20 percent more recall. This is the easiest quality improvement in many production RAG setups.

When not to use

If you work under strict SCC Art. 321 professional secrecy and accept no US vendor binding, Cohere Rerank remains problematic despite Bedrock EU hosting. Cohere Inc. is subject to US/Canada law; storing the data physically in Frankfurt does not fully resolve that. BGE-Reranker-v2-m3 as an Apache 2.0 self-host remains the clean answer.

If your latency budget is below 200 ms end-to-end, Cohere Rerank is too slow, because the reranking step alone takes 200-400 ms. The alternatives are FlashRank on CPU or no reranker at all. Live chat with typing indicators does not tolerate the Cohere latency.
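Whether the reranker fits a given budget is best decided from measured latencies rather than from the typical range. The sketch below is one way to do that with the standard library only. `p95_latency_ms` and `reranker_fits_budget` are hypothetical helper names, and the lambda in the usage line is a placeholder for the real rerank call.

```python
import math
import time
from typing import Callable


def p95_latency_ms(call: Callable[[], object], repeats: int = 20) -> float:
    """Run `call` repeatedly and return the 95th-percentile wall-clock latency in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank percentile: smallest sample with at least 95 % of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(samples)))
    return samples[rank - 1]


def reranker_fits_budget(rerank_p95_ms: float, other_stages_ms: float, budget_ms: float) -> bool:
    """True if the rerank step plus all other pipeline stages stay within the budget."""
    return rerank_p95_ms + other_stages_ms <= budget_ms


# Placeholder workload; in practice wrap the actual rerank request here.
measured = p95_latency_ms(lambda: sum(range(1000)), repeats=5)
print(f"p95 of placeholder call: {measured:.3f} ms")
print(reranker_fits_budget(rerank_p95_ms=300, other_stages_ms=50, budget_ms=200))
```

With a 300 ms rerank step the 200 ms budget is exceeded no matter how fast the other stages are, which matches the reasoning above.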

If you have very many queries per month (above 1 million), Cohere Rerank becomes expensive (USD 2000 per month). Voyage rerank-2 at USD 0.05 per 1000 queries is dramatically cheaper there; alternatively self-host BGE-Reranker-v2-m3 on a mid GPU.
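The point at which a fixed-cost setup beats per-call pricing can be estimated with a one-line break-even formula. This is a rough sketch: the two API prices are the ones quoted in the text, while the monthly GPU cost in the usage example is a purely hypothetical figure that you would replace with your own quote.

```python
COHERE_USD_PER_1K = 2.00  # price quoted in the text
VOYAGE_USD_PER_1K = 0.05  # price quoted in the text


def break_even_calls_per_month(fixed_monthly_usd: float, usd_per_1k_calls: float) -> float:
    """Monthly call volume above which a fixed monthly cost is cheaper than per-call pricing."""
    if usd_per_1k_calls <= 0:
        raise ValueError("usd_per_1k_calls must be positive")
    return fixed_monthly_usd / usd_per_1k_calls * 1000


# HYPOTHETICAL self-host cost (GPU plus operations); not a figure from the text.
assumed_gpu_usd_per_month = 600.0

print(f"price ratio Cohere/Voyage: {COHERE_USD_PER_1K / VOYAGE_USD_PER_1K:.0f}x")
for name, price in (("Cohere", COHERE_USD_PER_1K), ("Voyage", VOYAGE_USD_PER_1K)):
    calls = break_even_calls_per_month(assumed_gpu_usd_per_month, price)
    print(f"self-host beats {name} above {calls:,.0f} calls/month")
```

The estimate ignores engineering time and quality differences between the rerankers, so it should be read as a lower bound for when self-hosting starts to make sense.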

If you already run another vendor setup – Mistral plus Voyage, OpenAI plus self-host – avoid an additional Cohere key. Cohere Rerank is top-tier, but not so far ahead that it justifies a three-vendor setup.

Trade-offs

STRENGTHS

  • Established industry standard as of May 2026, with the most stable API
  • EU hosting via AWS Bedrock eu-central-1 available
  • Best or second-best reranker quality on DE/FR/IT/EN
  • Standard integration in LangChain, LlamaIndex, Haystack

WEAKNESSES

  • US/Canada vendor – professional-secrecy-strict mandates need self-host
  • USD 2 per 1k calls – expensive at high volume (Voyage is 40x cheaper)
  • 200-400 ms latency – too slow for live voice agents
  • USD pricing, EUR billing via Bedrock additionally tied to exchange rate

FAQ

How is the search call billed?

Each call counts as one search regardless of whether 5 or 1000 documents are ranked. Price is USD 2 per 1000 calls. A fiduciary with 200 RAG calls per month pays USD 0.40, which is negligible. A platform with 100,000 calls per day pays USD 200 per day; at that volume Voyage or self-hosting pays off.

How much latency does the reranker add?

Per 50 candidates typically 200-400 ms over the standard endpoint, slightly higher via Bedrock eu-central-1 (250-450 ms). That is acceptable for interactive apps and irrelevant for background jobs such as mail triage or batch reports. Live voice agents need FlashRank instead.

How does Cohere Rerank compare to BGE-Reranker-v2-m3?

Cohere leads MTEB reranking by 2-4 points over BGE-Reranker-v2-m3 in May 2026. In exchange, BGE is Apache 2.0 licensed and self-hostable – no data transfer, no USD cost. Rule of thumb: API with top quality and EU hosting = Cohere. Self-host for professional secrecy = BGE. Both are good standards.

Can Cohere Rerank rank on Swiss German?

Only with difficulty. Swiss German is barely represented in the training data. Standard German text ranks reliably; dialect mail or voice transcripts with dialect content lose quality. Workaround: convert the text to Standard German with an LLM first, then rerank.
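The workaround can be sketched as a thin wrapper around the rerank call. Everything below is illustrative: `rerank_with_normalisation` is a made-up name, the lookup table merely stands in for the LLM conversion step, and the word-overlap scorer stands in for the reranker.

```python
from typing import Callable, Sequence


def rerank_with_normalisation(
    query: str,
    passages: Sequence[str],
    to_standard_german: Callable[[str], str],
    score_pairs: Callable[[str, Sequence[str]], Sequence[float]],
    top_n: int = 5,
) -> list[tuple[float, str]]:
    """Score normalised text, but return the ORIGINAL passages in ranked order."""
    normalised = [to_standard_german(p) for p in passages]
    scores = list(score_pairs(to_standard_german(query), normalised))
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(scores[i], passages[i]) for i in order[:top_n]]


# Toy stand-in for the LLM conversion: a tiny word-level lookup table.
DIALECT_MAP = {"isch": "ist", "rächnig": "rechnung"}


def toy_normalise(text: str) -> str:
    return " ".join(DIALECT_MAP.get(word, word) for word in text.lower().split())


def toy_overlap_score(query: str, passages: Sequence[str]) -> list[float]:
    query_words = set(query.split())
    return [len(query_words & set(p.split())) / len(query_words) for p in passages]


ranked = rerank_with_normalisation(
    "rechnung bezahlt",
    ["s büro isch zue", "d rächnig isch bezahlt"],
    toy_normalise,
    toy_overlap_score,
)
print(ranked)
```

Returning the original passages rather than the converted ones keeps the language model's context faithful to the source text; the conversion is used for ranking only.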

Related topics

  • EMBEDDINGS · AI CONCEPT – Embeddings and vectors: how language becomes mathematics
  • EMBEDDINGS · TOOL COMPARISON – Embedding models compared: BGE-M3, E5, OpenAI, Cohere, Voyage, Jina, Mistral, Nomic, mxbai, Gecko
  • RERANKER · TOOL COMPARISON – Rerankers compared: Cohere, BGE, Jina, Voyage, ColBERT, mxbai, Mistral, sentence-transformers, RankGPT, FlashRank
  • RAG · AI CONCEPT – Retrieval-Augmented Generation (RAG): how AI answers from your own documents
  • HYBRID SEARCH · AI CONCEPT – Hybrid search: BM25 plus vectors with reciprocal rank fusion in Elasticsearch, Qdrant, OpenSearch
  • QDRANT · TECH – Qdrant: production vector database for RAG and semantic search

Sources

  1. Cohere documentation – Rerank API and pricing · 2026-05
  2. AWS Bedrock – Cohere Rerank foundation model availability · 2026-05
  3. BEIR benchmark – reranker performance reference · 2026-04
  4. MTEB Leaderboard – reranking sub-track · 2026-05
