
Multi-LLM routing: which model when, for how much

Routing rules by sensitivity, cost, latency, and quality. Fallback chain, semantic caching, cost observability. May 2026 pricing.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is multi-LLM routing?

Multi-LLM routing is the practice of picking the most fitting language model for each individual request instead of pushing everything through one model. The idea: a simple email classifier does not need a top-tier model such as Claude Opus 4 at USD 15 per million input tokens. Mistral 7B or GPT-4o-mini handle the same classification at a small fraction of that price with comparable accuracy. Complex legal reasoning, by contrast, benefits from a top-tier model.

As of May 2026 the model landscape is highly differentiated. OpenAI, Anthropic, Google, Mistral, Cohere, Meta, and DeepSeek each offer several tiers, from tiny edge models (Llama 3.2 1B, Phi-4-mini) to reasoning specialists (o3, Claude Opus 4, Gemini 2.5 Pro). Routing every request to the same top-tier model can cost several times more than necessary.

In practice, routing is implemented with an LLM router. Proven solution: LiteLLM (open-source, Python, OpenAI-compatible API, built-in router with cost tracking, caching, fallback). Alternative: OpenRouter (hosted, less control, faster setup). Both wrap provider-specific APIs behind a single interface.

Why it matters

Three reasons: cost, data residency, availability.

Cost (as of May 2026, USD per 1M tokens input/output, source: respective provider pricing pages):

- GPT-4o: 2.50 / 10.00
- GPT-4o-mini: 0.15 / 0.60
- Claude Opus 4: 15.00 / 75.00
- Claude Sonnet 4: 3.00 / 15.00
- Claude Haiku 4: 0.25 / 1.25
- Gemini 2.5 Pro: 1.25 / 5.00
- Mistral Large 2: 2.00 / 6.00
- Mistral Small 3: 0.20 / 0.60
- DeepSeek V3: 0.27 / 1.10

For classification and extraction (the bulk of volume in a Swiss fiduciary SME), GPT-4o-mini or Mistral Small at USD 0.15–0.20 per 1M input tokens is the right choice. For complex client questions, GPT-4o or Claude Sonnet 4 at USD 2.50–3.00 per 1M input tokens. Only for legal reasoning without latency pressure: Claude Opus 4 or o3. A routing setup typically saves 60–80% versus "everything through the top model".
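The saving can be checked with simple arithmetic. The sketch below uses the GPT-4o and GPT-4o-mini prices listed above; the monthly token volume and the 80/20 split between the two tiers are assumptions for illustration, and the baseline is "everything through GPT-4o".

```python
# USD per 1M tokens (input, output), copied from the price list above.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def model_cost(model, input_tokens, output_tokens):
    """Cost in USD for a token volume processed by one model."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def blended_cost(mix, total_in, total_out):
    """Cost in USD when traffic is split across models by the given shares."""
    return sum(
        model_cost(model, total_in * share, total_out * share)
        for model, share in mix.items()
    )


# Assumed month: 100M input and 20M output tokens (hypothetical volume).
TOTAL_IN, TOTAL_OUT = 100_000_000, 20_000_000
# Assumed split: 80% simple tasks on the mini tier, 20% on the mid tier.
MIX = {"gpt-4o-mini": 0.80, "gpt-4o": 0.20}

single = model_cost("gpt-4o", TOTAL_IN, TOTAL_OUT)  # everything on GPT-4o
routed = blended_cost(MIX, TOTAL_IN, TOTAL_OUT)     # routed by tier
saving = 1 - routed / single

print(f"single: {single:.2f} USD, routed: {routed:.2f} USD, saving: {saving:.0%}")
```

Under these assumptions the routed setup costs roughly a quarter of the single-model baseline. Adding a top-tier model for a share of the traffic changes the picture quickly, which is why the mix should come from measured traffic.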

Data residency: not every request may be processed everywhere. Client data covered by professional secrecy under Art. 321 of the Swiss Criminal Code (SCC) must not go to providers that host in the US without an adequacy assessment. Mistral (EU-hosted, Paris) and local models (Ollama with Llama 3.1 70B) are mandatory options for sensitive data. A router enables the rule "PII → Mistral EU; non-sensitive → GPT-4o-mini".

Availability: every provider API fails occasionally. Anthropic had several multi-hour outages in 2024. A fallback chain (primary: Claude Sonnet 4, secondary: GPT-4o, tertiary: local) makes the AI system production-stable.

Routing rules in practice

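A production-ready router decides on three axes (sensitivity, complexity, latency) and adds three supporting mechanisms (fallback chain, semantic caching, cost observability):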

Sensitivity routing: a classifier (or a tag in the API call) marks the sensitivity. Three tiers usually suffice: public, internal, confidential. Rule: confidential → EU or local (Mistral, Ollama). Internal → EU preferred. Public → cost-optimal provider.
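A minimal sketch of the three-tier rule, assuming the sensitivity arrives as a plain string tag. The deployment names (`mistral-eu`, `ollama-local`, `gpt-4o-mini`) are hypothetical labels, and treating unknown tags as confidential is a design assumption, not something the rule above prescribes.

```python
# Ordered list of permitted deployments per sensitivity tier.
# Names are hypothetical labels for router deployments.
SENSITIVITY_RULES = {
    "confidential": ["mistral-eu", "ollama-local"],  # EU or local only
    "internal": ["mistral-eu", "gpt-4o-mini"],       # EU preferred
    "public": ["gpt-4o-mini", "mistral-eu"],         # cost-optimal first
}


def permitted_deployments(sensitivity):
    """Return the deployments a request may use, in order of preference.

    Missing or unknown tags fall back to the strictest tier (fail closed),
    so a tagging bug cannot leak data to a non-permitted provider.
    """
    return SENSITIVITY_RULES.get(sensitivity, SENSITIVITY_RULES["confidential"])


print(permitted_deployments("public"))
print(permitted_deployments("not-a-real-tag"))
```

Failing closed costs a little money when tags are missing, but it keeps the residency rule intact without relying on every caller tagging correctly.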

Complexity routing: the task selects the model tier. Classification/extraction → mini tier (GPT-4o-mini, Haiku, Mistral Small). Standard Q&A → mid tier (GPT-4o, Sonnet, Mistral Large). Reasoning, legal analysis, code reviews → top tier (Opus, o3). LiteLLM supports classification via a tag in the API call or via heuristic (prompt length, keywords).
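The tag-or-heuristic decision can be sketched in plain Python. This is an illustration of the idea, not LiteLLM's own logic: the task names, the keyword list, and the 2,000-character threshold are all assumptions to be tuned against real traffic.

```python
from typing import Optional

# Explicit task tags map directly to a tier (task names are hypothetical).
TASK_TO_TIER = {
    "classify": "mini",
    "extract": "mini",
    "qa": "mid",
    "legal": "top",
    "code_review": "top",
}

# Keywords that hint at top-tier work (assumed list, tune per business).
TOP_KEYWORDS = ("contract", "clause", "liability")
LONG_PROMPT_CHARS = 2000  # assumed threshold for "not a trivial request"


def pick_tier(prompt: str, task: Optional[str] = None) -> str:
    """Return 'mini', 'mid', or 'top' for a request."""
    if task in TASK_TO_TIER:          # an explicit tag always wins
        return TASK_TO_TIER[task]
    text = prompt.lower()
    if any(keyword in text for keyword in TOP_KEYWORDS):
        return "top"
    if len(prompt) > LONG_PROMPT_CHARS:
        return "mid"
    return "mini"


print(pick_tier("Invoice or receipt?", task="classify"))
print(pick_tier("Please review the liability clause in this agreement."))
```

Tags are the more reliable signal; the heuristic is only a fallback for untagged calls and is the part most likely to misroute.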

Latency routing: real-time requests (chatbot) → fast models (Haiku, GPT-4o-mini, Gemini Flash). Batch jobs (overnight) → slower, cheaper models (DeepSeek V3, local Llama 70B). Reasoning models like o3 often need 10–30 seconds – unfit for user-facing chat, perfect for async reports.

Fallback chain: three to four models in a sequence. LiteLLM tries the primary and falls back to the secondary, then the tertiary, on a timeout, an HTTP 500, or a rate limit. Important: the chain must be compatible – same output format, comparable context window. A sensible fiduciary chain: Claude Sonnet 4 (primary) → GPT-4o (secondary) → Mistral Large 2 (tertiary, EU-hosted, emergency).
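The mechanism behind a fallback chain is small enough to sketch by hand. The code below is not LiteLLM's implementation; it assumes a caller-supplied `call(model, prompt)` function and a hypothetical `ProviderError` standing in for HTTP 500 and rate-limit responses. The model names are illustrative labels.

```python
class ProviderError(Exception):
    """Hypothetical stand-in for a provider-side failure (HTTP 500, rate limit)."""


def call_with_fallback(chain, prompt, call):
    """Try each model in order and return (model_used, answer).

    `call(model, prompt)` is injected by the caller, so the chain logic
    stays independent of any provider SDK.
    """
    errors = []
    for model in chain:
        try:
            return model, call(model, prompt)
        except (TimeoutError, ProviderError) as exc:
            errors.append((model, repr(exc)))  # keep the trail for logging
    raise RuntimeError(f"all models in the chain failed: {errors}")


CHAIN = ["claude-sonnet-4", "gpt-4o", "mistral-large-2"]


def fake_call(model, prompt):
    """Simulated provider call: the primary times out, the others answer."""
    if model == "claude-sonnet-4":
        raise TimeoutError("simulated outage")
    return f"answer from {model}"


used, answer = call_with_fallback(CHAIN, "What is the VAT rate?", fake_call)
print(used, "->", answer)
```

Only transient failures trigger the next model here. Errors such as an invalid request would fail on every model in the chain, so they are deliberately not caught.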

Semantic caching: identical or semantically similar queries are cached. Instead of asking twice "what does a GmbH formation in Zug cost?", the cache delivers the second answer instantly. Tools: Redis backend with embedding comparison (threshold e.g. 0.95 cosine), or GPTCache by Zilliz. In typical setups, LLM call volume drops 20–40%. Caveat: only cache where permitted – no client-specific answers without a client key in the cache key.
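A minimal in-memory sketch of the lookup, assuming embeddings are computed elsewhere and passed in as plain lists of floats. A production setup would use a Redis backend as described above; the class and method names here are hypothetical. The client ID partitions the cache so one client's answer can never be served to another.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy semantic cache, partitioned by client ID."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = {}  # client_id -> list of (embedding, answer)

    def put(self, client_id, embedding, answer):
        self._entries.setdefault(client_id, []).append((embedding, answer))

    def get(self, client_id, embedding):
        """Return the best cached answer at or above the threshold, else None."""
        best_score, best_answer = 0.0, None
        for stored, answer in self._entries.get(client_id, []):
            score = cosine(stored, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put("client-a", [1.0, 0.0], "cached answer")
print(cache.get("client-a", [0.99, 0.05]))  # near-identical query, same client
print(cache.get("client-b", [1.0, 0.0]))    # same query, different client
```

The linear scan is fine for a sketch; a vector index replaces it once the cache holds more than a few thousand entries.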

Cost observability: without tracking, routing is blind. Langfuse (open-source, EU hosting possible), Helicone (US-hosted), or LiteLLM internal logging records per request: model, tokens, latency, cost, success/failure. Monthly reports show whether routing rules hold or whether 80% of traffic ends up on the expensive model because classification fails.
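What such a report boils down to can be sketched without any tracking tool. The record fields mirror the list above (model, tokens, latency, success); the `RequestLog` type, the sample records, and the helper names are assumptions for illustration, and the prices are the GPT-4o and GPT-4o-mini figures from this article.

```python
from collections import defaultdict
from dataclasses import dataclass

# USD per 1M tokens (input, output), taken from the price list in this article.
PRICES = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (2.50, 10.00)}


@dataclass
class RequestLog:
    use_case: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    success: bool


def cost_usd(log):
    price_in, price_out = PRICES[log.model]
    return (log.input_tokens * price_in + log.output_tokens * price_out) / 1_000_000


def monthly_report(logs):
    """Aggregate requests, cost, and failures per (use_case, model)."""
    report = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0, "failures": 0})
    for log in logs:
        row = report[(log.use_case, log.model)]
        row["requests"] += 1
        row["cost_usd"] += cost_usd(log)
        row["failures"] += 0 if log.success else 1
    return dict(report)


def share_on_model(logs, use_case, model):
    """Fraction of a use-case's requests that landed on the given model."""
    relevant = [log for log in logs if log.use_case == use_case]
    if not relevant:
        return 0.0
    return sum(1 for log in relevant if log.model == model) / len(relevant)


# Hypothetical sample records.
LOGS = [
    RequestLog("classify", "gpt-4o-mini", 1000, 100, 0.4, True),
    RequestLog("classify", "gpt-4o-mini", 1200, 100, 0.5, True),
    RequestLog("classify", "gpt-4o", 1000, 100, 1.1, True),  # misrouted
    RequestLog("qa", "gpt-4o", 3000, 500, 1.8, False),
]

report = monthly_report(LOGS)
misrouted_share = share_on_model(LOGS, "classify", "gpt-4o")
print(f"classification requests on gpt-4o: {misrouted_share:.0%}")
```

The share of a cheap use-case that lands on an expensive model is the single number that shows whether classification is failing.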

Routing setup in 6 steps

01 Measure current traffic: one month of tokens per use-case, answer quality, latency, data sensitivity. Without data, routing design is superstition.
02 Build a model matrix: per use-case set the right tier (mini/mid/top) and data residency (US/EU/local). 3–5 models suffice – more makes the setup brittle.
03 Set up LiteLLM (Docker container, OpenAI-compatible API). Provider keys as environment variables, never in code.
04 Define routing rules: by tag (`metadata.sensitivity = confidential` → mistral-eu), by model tier (`metadata.task = classify` → gpt-4o-mini), fallback chain per tier.
05 Enable semantic caching with a Redis backend, threshold 0.95 cosine, TTL by sensitivity (public 30 days, internal 7 days, confidential no cache).
06 Observability via Langfuse or LiteLLM logs: monthly cost dashboard per use-case and model. On drift (> 20% deviation), adjust routing rules.
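Steps 05 and 06 contain two small rules that are easy to encode. The sketch below assumes that "drift" means the relative deviation of actual monthly cost from the expected cost for a use-case; the function names are hypothetical, and the TTL values are the ones given in step 05.

```python
from typing import Optional

DAY_SECONDS = 24 * 60 * 60

# Cache TTL per sensitivity tier; None means "never cache".
CACHE_TTL_SECONDS = {
    "public": 30 * DAY_SECONDS,
    "internal": 7 * DAY_SECONDS,
    "confidential": None,
}


def cache_ttl(sensitivity: str) -> Optional[int]:
    """TTL in seconds, or None if the response must not be cached.

    Unknown tiers are not cached (assumed fail-closed behaviour).
    """
    return CACHE_TTL_SECONDS.get(sensitivity)


def has_drifted(expected_cost: float, actual_cost: float, tolerance: float = 0.20) -> bool:
    """True if actual cost deviates from expected by more than the tolerance."""
    if expected_cost == 0:
        return actual_cost != 0
    return abs(actual_cost - expected_cost) / expected_cost > tolerance


print(cache_ttl("internal"))
print(has_drifted(expected_cost=100.0, actual_cost=125.0))
```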

When to use multi-LLM routing

Once an AI system burns more than roughly CHF 200/month in LLM cost or processes client data, routing pays. At CHF 200 monthly token cost, routing setup (1–3 days) amortises in about 6 months; above CHF 500, in 2 months. With sensitive data, it is mandatory, not optional.
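The payback period follows from one division. The sketch below assumes a one-off setup cost of CHF 600 and a 60% saving rate; both figures are illustrative assumptions, not values stated in this article, and should be replaced with your own. The two inputs are the CHF 200 threshold and the CHF 500 case named above.

```python
def payback_months(setup_cost_chf, monthly_llm_cost_chf, saving_rate):
    """Months until the one-off setup cost is recovered by monthly savings."""
    monthly_saving = monthly_llm_cost_chf * saving_rate
    if monthly_saving <= 0:
        raise ValueError("no saving, setup never pays back")
    return setup_cost_chf / monthly_saving


SETUP_COST_CHF = 600.0  # assumption: internal effort for 1-3 days of setup
SAVING_RATE = 0.60      # assumption: conservative routing saving

for monthly_cost in (200.0, 500.0):
    months = payback_months(SETUP_COST_CHF, monthly_cost, SAVING_RATE)
    print(f"CHF {monthly_cost:.0f}/month -> payback in {months:.1f} months")
```

With these assumptions the setup pays back in about 5 months at CHF 200/month and in 2 months at CHF 500/month.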

Concrete Swiss SME cases: fiduciary with client chatbot (GPT-4o-mini for FAQ, Sonnet 4 for complex questions, Mistral EU for tax-file answers) – typical monthly saving CHF 200–600 versus "all GPT-4o". Law firm with contract analysis (Opus 4 for clause review, Sonnet for standard contracts, local Llama 70B for client correspondence under professional secrecy). Insurance broker with claim classification (Mistral Small for 90% of tickets, GPT-4o for escalated cases).

When routing is overkill

On very small setups (under 50,000 tokens per month) savings stay below CHF 5 – setup effort is not justified. If the AI system has only one well-defined use-case (e.g. only receipt extraction), a single fitting model is right – routing would be overengineering.

Also problematic: routing between models with clearly different behaviour (e.g. Claude vs. Llama). Answers vary in tone, refusal behaviour, format. To keep the user impression consistent, stay within one family for routing (e.g. Haiku/Sonnet/Opus from Anthropic) or insert an output normalisation layer. Otherwise the client sees different "personalities" depending on the question.

Trade-offs

STRENGTHS

Typically 60–80% cost reduction versus "everything through the top model"
Per-request data residency (PII to EU, public to US)
Failure resilience via fallback chain – system survives provider outages
Semantic caching cuts call volume by 20–40% on recurring questions

WEAKNESSES

Setup effort 1–3 days; pays only from roughly CHF 200/month LLM cost
Multiple providers = multiple contracts, multiple invoices, multiple SLAs
Inconsistent behaviour across model families (tone, refusal, format)
The routing classifier can itself misjudge and send simple, cheap traffic to the expensive model

FAQ

How much does routing really save?

For a typical Swiss fiduciary SME with mixed traffic (60% classification/extraction, 30% Q&A, 10% reasoning), LLM costs drop from roughly CHF 800/month to CHF 200–300/month at unchanged quality. Semantic caching adds another 20–40% volume reduction. Conservative estimate: 60% saving after 2–3 months of routing-rule tuning.

Which router is standard in 2026?

LiteLLM is the most used open-source solution – OpenAI-compatible API, routing, caching, cost tracking, fallback built in. Alternatives: OpenRouter (hosted, less control), Portkey (hosted, more enterprise features), Helicone (mainly observability). For Swiss SMEs with revDSG requirements, self-hosted LiteLLM on Hetzner is the default: fully under own control, no data to third parties except the providers the router calls anyway.

Is semantic caching GDPR-compliant?

Conditionally. The cache holds queries and answers – if personal data ends up there, the cache must be treated like any other data store: purpose limitation, retention, access control, documented processing. Practically: per-client cache (client ID in cache key), short TTL (max. 7 days for internal data, no cache for professional-secrecy answers), audit log of all cache hits. With these measures, caching is clean under revDSG/GDPR.

What happens during a provider outage?

The fallback chain steps in. LiteLLM detects 500s, timeouts, and rate limits and automatically falls to the next model. Precondition: the models in the chain are compatible (same output format, compatible context window). Recommended configuration for Swiss fiduciaries: primary Claude Sonnet 4 (US-hosted), secondary GPT-4o (US-hosted), tertiary Mistral Large 2 (EU-hosted). The system then survives even a full US-cloud outage.

Sources

LiteLLM – Router & Proxy Documentation · 2026-04
Hu et al., RouteLLM: Learning to Route LLMs with Preference Data (arXiv) · 2024-06
Langfuse – Cost & Usage Tracking Docs · 2026-03
OpenAI – Models & Pricing · 2026-05
Anthropic – Claude Pricing · 2026-05
