fairlane.systems

FlashRank: ultra-fast reranker on CPU via ONNX runtime

FlashRank is an MIT-licensed Python library with small cross-encoder models, ONNX-optimised, under 100 ms on CPU.

As of: 2026-05

What is FlashRank?

FlashRank is an open-source Python library by Prithiviraj Damodaran, hosted at github.com/PrithivirajDamodaran/FlashRank under the MIT licence. The library specialises in one task: cross-encoder reranking with very low latency on CPU. Unlike BGE-Reranker or Cohere Rerank, which rely on heavy models with 500M+ parameters, FlashRank ships a selection of small models (roughly 4 to 60 million parameters) that run on CPU via ONNX Runtime with Optimum-based optimisation. The smaller models rerank 50 candidates in under 100 ms on a standard CPU.

The library bundles several pre-converted models. As of May 2026 the following are available: ms-marco-TinyBERT-L-2-v2 (4M parameters, fastest, basic quality), ms-marco-MiniLM-L-12-v2 (33M parameters, good compromise), ms-marco-MultiBERT-L-12 (multilingual, 35M parameters), and rank-T5-flan (60M parameters, strongest quality in the FlashRank family). All are cross-encoders in classic ms-marco style, trained on the MS MARCO passage ranking task.

Quality-wise FlashRank models sit clearly below BGE-Reranker-v2-m3 or Cohere Rerank – typically 5-10 nDCG@10 points less on MTEB reranking. The gap is significant, but in exchange you get three advantages decisive in certain setups. First: extremely low latency on CPU without GPU requirement. Second: small models (4-60 MB instead of 2 GB) that fit in a container image layer. Third: simple API with three lines of code from install to first rerank.

For live chat, voice agents, or edge setups where 200 ms reranking latency is too much, FlashRank is the obvious choice in May 2026. Anyone building a standard RAG stack with a sub-second latency budget is better served by Cohere Rerank or BGE-Reranker.

Why it matters for Switzerland

Three reasons make FlashRank interesting in the Swiss context. First, the latency demands of live applications. Swiss fiduciary and law offices increasingly build voice interfaces – dictation, phone triage, voice-driven case search. In these setups end-to-end latency from mic to answer must stay under 800 ms, otherwise the system feels slow. A 300 ms reranker uses up too much of that budget. For self-hosted RAG, FlashRank at 50-100 ms is the most practical option here.

Second, MIT licence and small footprint. Anyone building a RAG pipeline small enough to live in a 1 GB container image and run on a 4 GB RAM VM gets further with FlashRank than with BGE-Reranker (2.3 GB model, 8 GB VRAM recommended). This lean architecture is in demand in the Swiss SME market as of May 2026 – many mandates want RAG functionality but not the full enterprise infrastructure.

Third, and this is underrated: simple API and maintenance. FlashRank installs via pip install flashrank, the model loads on first call, and further configuration is minimal. A single backend engineer can integrate the reranker into an existing RAG pipeline in an hour. Compared to BGE-Reranker (FlagEmbedding library, GPU setup, container maintenance), that is a clear difference in effort.

The weak point: quality. FlashRank models are 5-10 nDCG@10 points behind Cohere or BGE-Reranker-v2-m3. In a high-precision legal setup where it matters whether the correct ruling lands at position 4 or 5, this is noticeable. For standard fiduciary cases or for a voice agent guiding clients, quality is enough: a recall gain of roughly 8-12 points over stage 1 still holds.

How it works

FlashRank is a Python library that bundles small cross-encoder models in ONNX format and executes them with ONNX Runtime (Microsoft), using Hugging Face Optimum tooling for quantisation and operator fusion. The models were originally trained in PyTorch on MS MARCO passage ranking, then converted to ONNX and optimised for CPU via dynamic int8 quantisation. The result: a 33 million parameter model that handles about 500 reranks (query-passage pairs) per second on a standard x86 CPU, in line with the 80-100 ms for 50 candidates quoted in this article.
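To make the int8 step more concrete, here is a minimal pure-Python sketch of symmetric per-tensor weight quantisation. It is a simplification for illustration and not the code path FlashRank or ONNX Runtime actually uses: in dynamic quantisation the weights are quantised ahead of time while activations are quantised on the fly at inference. The helper names are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to int8 values in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 0.31]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight then takes one byte instead of four, which is where most of the size reduction comes from; the price is the small rounding error captured in max_error.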

Install and basic use:

```bash
pip install flashrank
```

```python
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")

request = RerankRequest(
    query="Which deadline applies to the VAT filing?",
    passages=[
        {"id": 1, "text": "The VAT filing must occur within 60 days."},
        {"id": 2, "text": "Late payment triggers a default fee."},
        # 30-50 more candidates from stage 1
    ],
)

results = ranker.rerank(request)
# results is sorted by score, take top 5
```
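The glue code around the rerank call can be sketched as follows. This assumes stage-1 hits arrive as (id, text) tuples and that each result dict carries a "score" key; the helpers to_passages and top_n are illustrative names, not part of FlashRank, and the scores are made up so the sketch runs without a model.

```python
def to_passages(hits):
    """Convert stage-1 hits, assumed to be (id, text) tuples, into passage dicts."""
    return [{"id": hit_id, "text": text} for hit_id, text in hits]


def top_n(results, n=5, min_score=None):
    """Keep the n best results; optionally drop those below min_score."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    if min_score is not None:
        ranked = [r for r in ranked if r["score"] >= min_score]
    return ranked[:n]


hits = [
    (1, "The VAT filing must occur within 60 days."),
    (2, "Late payment triggers a default fee."),
]
passages = to_passages(hits)

# Stand-in for ranker.rerank(request); the scores are invented for illustration.
fake_results = [{**passages[1], "score": 0.12}, {**passages[0], "score": 0.97}]
best = top_n(fake_results, n=1)
```

A score threshold is optional, but it keeps clearly irrelevant passages out of the LLM prompt when stage 1 returns few good candidates.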

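On first call the model is downloaded from a CDN mirror and stored in the model cache (~/.cache/flashrank in this setup; pass cache_dir to set the location explicitly, since the default may differ between versions). You can bypass the download by placing the model files in the cache directory manually, which matters in air-gap setups where no internet access is allowed.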
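For air-gap deployments it helps to fail fast at startup if the model files are missing, rather than letting the library attempt a download. A minimal sketch, assuming the files live in a directory named after the model inside the cache directory; that layout is an assumption to verify against your installed FlashRank version, and both helper names are hypothetical.

```python
from pathlib import Path


def model_is_cached(cache_dir, model_name):
    """Return True if a non-empty directory for the model exists in the cache.

    Assumption: models are stored as <cache_dir>/<model_name>/.
    """
    model_dir = Path(cache_dir) / model_name
    return model_dir.is_dir() and any(model_dir.iterdir())


def require_cached_model(cache_dir, model_name):
    """Raise early instead of attempting a download without internet access."""
    if not model_is_cached(cache_dir, model_name):
        raise FileNotFoundError(
            f"{model_name} not found in {cache_dir}; copy the model files there first"
        )
    return Path(cache_dir) / model_name
```

Calling require_cached_model before constructing the Ranker turns a hanging network request into a clear error message in the container logs.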

For production we recommend the wrapper in a FastAPI endpoint:

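```python
from fastapi import FastAPI
from flashrank import Ranker, RerankRequest
from pydantic import BaseModel

app = FastAPI()
# Load the model once at startup and reuse it across requests.
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/app/cache")


class RerankBody(BaseModel):
    """JSON body: query, candidate passages ({"id", "text"} dicts), result count."""
    query: str
    documents: list[dict]
    top_n: int = 5


@app.post("/v1/rerank")
def rerank(body: RerankBody):
    # Plain def: FastAPI runs it in a threadpool, so CPU-bound inference
    # does not block the event loop.
    request = RerankRequest(query=body.query, passages=body.documents)
    results = ranker.rerank(request)
    # Scores may be NumPy floats; cast them so the response serialises to JSON.
    top = [{**r, "score": float(r["score"])} for r in results[:body.top_n]]
    return {"results": top}
```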

Model selection in the FlashRank family:

  • ms-marco-TinyBERT-L-2-v2 (4M, ~50 ms for 50 docs, basic quality): only for the tightest latency budgets.
  • ms-marco-MiniLM-L-12-v2 (33M, ~80 ms for 50 docs, good quality): recommended standard for English setups.
  • ms-marco-MultiBERT-L-12 (35M, ~90 ms for 50 docs, multilingual): the right pick for DE/FR corpora.
  • rank-T5-flan (60M, ~150 ms for 50 docs, best FlashRank quality): for setups with slightly more latency budget.
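The selection logic above can be written down as a small lookup. This is a sketch that takes the latency figures quoted in this article at face value and assumes the slower model is the stronger one within each language group; pick_model is a hypothetical helper, and the numbers should be re-measured on your own hardware.

```python
# (name, approx. ms for 50 documents, multilingual), ordered fastest to slowest.
MODELS = [
    ("ms-marco-TinyBERT-L-2-v2", 50, False),
    ("ms-marco-MiniLM-L-12-v2", 80, False),
    ("ms-marco-MultiBERT-L-12", 90, True),
    ("rank-T5-flan", 150, False),
]


def pick_model(budget_ms, multilingual=False):
    """Return the slowest (assumed strongest) model that fits the latency budget.

    Returns None when no model in the requested language group fits.
    """
    candidates = [
        name
        for name, ms, multi in MODELS
        if ms <= budget_ms and multi == multilingual
    ]
    return candidates[-1] if candidates else None
```

For example, pick_model(100) returns the MiniLM model, while pick_model(100, multilingual=True) returns the MultiBERT model.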

Important: all FlashRank models have a 512-token context limit, and passages longer than that are truncated. For legal text with long clauses, chunk to about 400 tokens before reranking, which leaves room for the query in the same input.
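A minimal chunking sketch for that step. It splits on whitespace, so word counts only approximate model tokens; subword tokenizers usually produce more tokens than words, which is why the default here is a conservative 300 words rather than 400. The function name and defaults are assumptions for illustration; for exact limits, count tokens with the model's own tokenizer.

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps a clause that straddles a chunk boundary fully visible in at least one chunk, at the cost of a few extra passages to rerank.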

FlashRank to production in 5 steps

  1. Pick the model: ms-marco-MiniLM-L-12-v2 as default for English, ms-marco-MultiBERT-L-12 for DE/FR, rank-T5-flan when quality matters more than latency.
  2. Install via pip install flashrank, set the model cache directory explicitly (cache_dir parameter), and include the model in the container image as a layer.
  3. Build the FastAPI wrapper: POST /v1/rerank with query, passages, top_n. Load the model once at app init and reuse it afterwards.
  4. Pipeline integration: stage 1 Qdrant pulls 30-50 candidates, stage 2 calls the FlashRank endpoint, the top 3-5 go to stage 3 (LLM). Measure the latency profile.
  5. Eval suite against baseline: measure Recall@5 without FlashRank, then with it, and document the delta. Expect +8-12 points over pure vector search. If more is needed: two-stage rerank with FlashRank + Cohere/BGE-Reranker.
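The Recall@5 comparison in step 5 can be sketched in a few lines. The function names and the toy data are illustrative only; in a real eval suite the ranked ids come from your pipeline and the relevant ids from a labelled query set.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top = set(ranked_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy data: the same two queries before and after reranking.
baseline = [([3, 9, 4, 8, 7, 1], [1, 3]), ([5, 6, 2, 0, 8, 4], [4])]
reranked = [([1, 3, 9, 4, 8, 7], [1, 3]), ([4, 5, 6, 2, 0, 8], [4])]

delta = mean_recall_at_k(reranked) - mean_recall_at_k(baseline)
```

Reporting the delta on a fixed query set, rather than a single absolute number, makes it easy to compare FlashRank models against each other and against a heavier reranker later.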

When to use FlashRank

FlashRank is the right choice when (a) end-to-end latency must stay below 800 ms, (b) no GPU is available or wanted, (c) container image should stay minimal, or (d) an air-gap or edge setup without internet access is required.

Concrete cases: a voice agent for client telephony matching incoming questions against the knowledge base in under 1 second. A live chat widget on a fiduciary website with typing indicator needing a fast RAG backend. An edge deployment in a client office where a local RAG stack must run on mini-PC hardware (Intel NUC, ARM Mac mini).

FlashRank is also attractive for batch setups with large volumes. Anyone processing one million queries per day without GPU clusters can get through that volume with FlashRank on a handful of standard CPUs. Quality is lower than Cohere or BGE-Reranker, but on standard RAG tasks over client files even FlashRank-level quality brings a clear recall gain over pure dense search.

A smart combo strategy: FlashRank as stage 1.5 before a heavier reranker. Stage 1: Qdrant pulls 200 candidates. Stage 1.5: FlashRank reduces to 30 in 100 ms. Stage 2: Cohere Rerank or BGE-Reranker ranks the 30 in 150-300 ms to 5. Overall faster than one heavy reranker on 200 candidates – and qualitatively better because the second reranker still operates in its sweet spot.
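The cascade described above can be sketched with two pluggable scoring functions. In a real pipeline fast_score would wrap a FlashRank call and heavy_score a Cohere or BGE-Reranker call, typically batched rather than per passage; here both are toy stand-ins so the sketch runs without any model, and all names are hypothetical.

```python
def cascade_rerank(query, candidates, fast_score, heavy_score,
                   keep_fast=30, keep_final=5):
    """Two-stage rerank: a cheap scorer prunes, an expensive scorer orders the rest.

    fast_score and heavy_score are callables (query, text) -> float.
    """
    pruned = sorted(
        candidates, key=lambda c: fast_score(query, c["text"]), reverse=True
    )[:keep_fast]
    final = sorted(
        pruned, key=lambda c: heavy_score(query, c["text"]), reverse=True
    )
    return final[:keep_final]


# Toy scorers: plain word overlap, and overlap plus an exact-phrase bonus.
def overlap_score(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))


def phrase_score(query, text):
    return overlap_score(query, text) + (10 if query.lower() in text.lower() else 0)


docs = [
    {"id": 1, "text": "vat filing deadline is 60 days"},
    {"id": 2, "text": "deadline for the vat filing"},
    {"id": 3, "text": "unrelated note about payroll"},
]
top = cascade_rerank("vat filing deadline", docs, overlap_score, phrase_score,
                     keep_fast=2, keep_final=1)
```

The expensive scorer only ever sees keep_fast passages, so its cost stays constant no matter how many candidates stage 1 returns.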

When not to use

When reranking quality is the top goal, FlashRank is clearly inferior. BGE-Reranker-v2-m3 or Cohere Rerank deliver 5-10 more nDCG@10 points. In high-precision setups (legal research, medical diagnosis support, regulatory compliance) this difference is significant. FlashRank is the wrong tool here.

If your corpus has many long documents – passages above 512 tokens – FlashRank truncates and loses context. BGE-Reranker at 8192 tokens or Cohere Rerank with similarly long context is better, often the right choice for legal setups.

If you already have a GPU in the setup (say for embedding with BGE-M3 or a local LLM), the FlashRank advantage is wasted. BGE-Reranker on the same GPU runs at better quality at comparable latency. FlashRank primarily makes sense when no GPU is in play.

If your application demands multilinguality with a strong DE/FR/IT share, the FlashRank choice is limited – only ms-marco-MultiBERT-L-12 is really multilingual, and even that is 3-5 points behind BGE-Reranker on MTEB reranking DE. For multilingual high quality BGE-Reranker-v2-m3 remains the better choice, even if it demands a GPU setup.

Trade-offs

STRENGTHS

  • MIT licence, completely free, no commercial restriction
  • Extremely low latency on CPU – under 100 ms for 50 candidates
  • Small models (4-60 MB), fit in container image layers
  • Simple API – three lines of code from install to first rerank

WEAKNESSES

  • 5-10 nDCG@10 points behind BGE-Reranker and Cohere Rerank
  • Context limit 512 tokens – long legal text gets truncated
  • Multilinguality only via MultiBERT variant, not all models
  • Models are older (ms-marco-trained), no 2026 state-of-the-art quality

FAQ

How much worse is FlashRank than BGE-Reranker?

On MTEB reranking FlashRank (MiniLM-L-12) sits about 5-10 nDCG@10 points behind BGE-Reranker-v2-m3. In a concrete fiduciary setup we measured +8% Recall@5 with FlashRank vs. +18% with BGE-Reranker – both over pure vector search. FlashRank is not as good, but better than no reranker.

Does FlashRank work on German?

Limited. Only the ms-marco-MultiBERT-L-12 model is explicitly multilingual and understands German decently. ms-marco-MiniLM and TinyBERT are English-centric; on German they lose 10-15 recall points. For CH setups with DE/FR/IT mix always pick the MultiBERT model.

Can FlashRank run in an air-gap setup?

Yes. Download model files once with internet access, place them in the model cache directory (~/.cache/flashrank or via cache_dir), then FlashRank runs offline. This makes it the first pick for notaries, air-gap mandates, and edge deployments without connectivity.

How do I combine FlashRank with a heavier reranker?

Two-stage reranking: stage 1 Qdrant pulls 200 candidates. Stage 1.5 FlashRank reduces to 30 in 100 ms. Stage 2 BGE-Reranker or Cohere ranks the 30 to 5 in 200 ms. Overall faster than a single reranker on 200 docs, similar quality. The pattern pays off from 100+ candidates in stage 1.

Related topics

  • EMBEDDINGS · AI CONCEPT – Embeddings and vectors: how language becomes mathematics
  • EMBEDDINGS · TOOL COMPARISON – Embedding models compared: BGE-M3, E5, OpenAI, Cohere, Voyage, Jina, Mistral, Nomic, mxbai, Gecko
  • RERANKER · TOOL COMPARISON – Rerankers compared: Cohere, BGE, Jina, Voyage, ColBERT, mxbai, Mistral, sentence-transformers, RankGPT, FlashRank
  • RAG · AI CONCEPT – Retrieval-Augmented Generation (RAG): how AI answers from your own documents
  • HYBRID SEARCH · AI CONCEPT – Hybrid search: BM25 plus vectors with reciprocal rank fusion in Elasticsearch, Qdrant, OpenSearch
  • QDRANT · TECH – Qdrant: production vector database for RAG and semantic search

Sources

  1. FlashRank GitHub – PrithivirajDamodaran/FlashRank repository · 2026-05
  2. ms-marco passage ranking – base task for FlashRank models · 2026-04
  3. Hugging Face Optimum – ONNX Runtime tuning for transformers · 2026-05
  4. MTEB Leaderboard – reranking sub-track · 2026-05
