MULTI-LLM GATEWAY · SERVICE

Multi-LLM Gateway: eight providers, one entry point, compliance routing

LiteLLM gateway with auth, routing by cost/speed/privacy, usage dashboard. Module from CHF 1,000, project with observability CHF 4,500.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is a Multi-LLM gateway?

A multi-LLM gateway is a central entry point for all AI requests in your business. Instead of spreading eight different API keys across eight different applications, there is one address: your gateway. Applications speak to it in a single protocol (OpenAI-compatible). The gateway decides which model actually handles the request – by rules you define.

We build this on LiteLLM, the established open-source gateway (over 17,000 GitHub stars as of May 2026, weekly releases). LiteLLM supports more than 100 LLM providers and translates their native APIs into the OpenAI format. You write your application against one protocol and can swap models without touching code.

Our standard build connects eight providers: OpenAI (the current top GPT model, GPT-4o), Anthropic (the current top Claude model and 3.5), Mistral (Large, Codestral) as an EU-hosted vendor, Cohere (Command R+) also EU-friendly, Google (Gemini 2.5 Pro), Meta models via Together or Groq, DeepSeek (V3 and Coder) as a cost option, and a local model via Ollama (Llama 3.3 70B or Qwen 2.5) for data that must not leave the building.

Variants: Module (CHF 1,000) as a pure LiteLLM setup with authentication and routing rules. Project with observability and SSO (CHF 4,500) additionally with Langfuse to trace every request, SSO into your identity provider, and a per-department usage dashboard with a monthly cost report.

Why it matters

The gateway solves three problems at once.

Vendor lock-in. If your 30 internal apps speak directly to OpenAI and OpenAI doubles the price tomorrow or pushes a new tier, you are stuck. With a gateway you change one routing rule in one place – the model redirects to Mistral or Claude, and the apps notice nothing.

Compliance routing. The real data-protection argument. You define per application or per data class which models are allowed. Client data goes only to "EU-only" (Mistral, Cohere, local Ollama). Generic code suggestions can go to the cheapest provider. Marketing copy can go to the strongest. One central rule instead of eight app configs.

Cost control. The gateway measures every request: which department, which model, how many tokens, what latency. That prevents the classic SME phenomenon where a marketing intern builds a loop into GPT-4 and the month-end bill is CHF 8,000. With per-key and per-team limits – a standard LiteLLM feature – you can set department budgets and block in advance.

LiteLLM has been in production since 2023 at companies such as Netflix, Spotify and many others – the software is stable, the pattern is proven.

How it works

Architecturally the gateway is a single container that sits in front of your applications.

Ingress. Applications talk to the gateway over HTTPS in the OpenAI format (POST /v1/chat/completions, /v1/embeddings, /v1/audio/speech). Each application receives a virtual API key (sk-...) bound to a team or person with a budget limit.

Authentication. The gateway verifies the key against its internal Postgres DB. In the project variant SSO comes in: applications authenticate via your identity provider (Azure AD, Google Workspace, Authentik), and the gateway recognises the person and department from the OIDC token.

Routing. Now the router decides. For every "model label" (e.g. `gpt-4`, `claude`, `cheap`, `eu-only`) it knows a list of backend deployments. The default strategy is simple-shuffle with weights – on failure it falls back to the next deployment. You can tag-route ("`tag=client-data`" forces the request to `eu-only`), weight (70 percent to the cheaper model), or chain fallbacks (Mistral Large first, on rate limit the current top Claude model).

Observability. With Langfuse or the built-in OpenTelemetry export every request is logged: prompt hash, chosen model, token counts, cost, latency, errors. That feeds your Postgres DB plus a Grafana dashboard. For sensitive data the logging can be redacted – prompts are hashed, not stored in clear text.

Provider call. The gateway translates the OpenAI request into each provider dialect (Anthropic Messages API, Google Vertex, Cohere v2, etc.), retrieves the response, normalises back to OpenAI format. The application sees the same response regardless of which model actually answered.

Gateway setup in 6 steps

01Define the provider list: which 4 to 8 models cover your use cases?
02Define data classes: public / internal / confidential / professional secrecy – and which model may see which class?
03Deploy the LiteLLM container, attach Postgres DB, create virtual keys per team.
04Write routing rules: tag-based, fallback chains, weighted load balancing.
05Wire up observability: Langfuse for request tracing or OpenTelemetry export to Grafana.
06Set per-key budget limits, alert at 80 percent consumption – and roll out to applications.

When a gateway is worth it

A gateway is worth it when (a) you use or plan to use more than one AI model, (b) you handle multiple data classes (public, internal, confidential, professional secrecy), (c) several departments or teams use AI and you need per-area budgets, or (d) you need a traceable vendor strategy because compliance or audit demands it.

Concrete triggers we have seen: a law firm uses Claude for client correspondence (EU-friendly via Anthropic Bedrock in Frankfurt) and DeepSeek for general research – one key per lawyer with a monthly budget. A fiduciary mixes Mistral Large for text and local Ollama for client data that must not leave the building. An SME with three locations routes per IP range to different providers – the marketing site can use everything, accounting only EU models.

Break-even sits around CHF 200 in model spend per month. Below that the setup complexity rarely pays off – a direct provider connection with a single key is simpler.

When not

A gateway is the wrong choice when (a) you use only a single model and never want to switch, (b) your application talks only via one well-observable service (e.g. only ChatGPT Team), or (c) you have less than 50 AI requests per day today.

At the same time: if you do not want your own hosting and only want a curated cloud solution, there are managed gateway providers (e.g. OpenRouter, Portkey, Helicone). That works – but you partly give up the central control benefit, because all requests run through a US cloud service. For the Swiss market with professional-secrecy data, self-hosted LiteLLM is the more honest model.

And: a gateway does not solve the hallucination problem. It routes, measures, controls – but does not make the model smarter. Anyone who needs correct answers from their own knowledge also needs RAG (see "RAG on your own knowledge").

Trade-offs

STRENGTHS

Vendor swap in one place without changing app code
Compliance routing by data class – sensitive data only to EU models
Cost control per key and team with hard budgets
Full observability: every request logged, token counts, latency, errors
OpenAI-compatible – no app needs code changes

WEAKNESSES

An additional component to maintain
Single point of failure in single-instance setups – failover needed for high availability
Setup effort rarely pays off below about CHF 200/month in model spend
Does not solve hallucination – that needs RAG in addition

FAQ

What happens if LiteLLM itself goes down?

In the standard configuration the gateway runs as a systemd service with auto-restart and a health check. If needed we run two instances behind nginx with failover – for 99.95 percent availability. With a single instance and machine failure recovery is in minutes provided the DB replica exists.

Does this work with streaming?

Yes. LiteLLM passes Server-Sent Events through transparently – applications that need token-by-token streaming (chat UIs, code assistants) work without changes. Latency overhead through the gateway: a few milliseconds.

How does it integrate with OpenAI SDK or LangChain?

Directly. Both libraries only need a `base_url` parameter – you point it at your gateway instead of api.openai.com. The code stays identical. Embedding libraries (LlamaIndex, Haystack, LangChain) also work via the embeddings endpoint.

What does running the gateway itself cost?

The container needs 256 MB RAM and negligible CPU. On an existing server it is irrelevant – you pay the model costs of each provider plus a small Postgres DB. LiteLLM itself is open source and free.

Sources

LiteLLM – Official documentation · 2026-05
LiteLLM Proxy – Routing & Load Balancing docs · 2026-04
BerriAI/litellm – GitHub repository · 2026-05
Langfuse – LLM observability platform · 2026-04
Nerd Level Tech – LiteLLM Proxy Production Tutorial 2026 · 2026-03

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call