LLM GATEWAY · AI CONCEPT

What is an LLM gateway? Purpose, components, market status May 2026

An LLM gateway is a central proxy for language-model calls. It bundles routing, auth, rate limits, fallback, observability and cost tracking.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is an LLM gateway?

An LLM gateway is a software layer between your application and language-model APIs (OpenAI, Anthropic, Google, Mistral, Cohere, local Ollama). Instead of every application talking directly to every provider, all requests go to one central gateway. The gateway decides which model receives the request, checks permissions, limits volume, documents cost, and catches outages.

The term has been established since 2024 and is a standard building block for enterprise AI as of May 2026. The best-known projects are LiteLLM (open source, BerriAI, Apache 2.0), OpenRouter (hosted, marketplace with over 200 models), Portkey (hosted, with caching and guardrails), Cloudflare AI Gateway (hosted, on the Cloudflare Workers platform), AWS Bedrock (hosted, with its own model marketplace) and Azure OpenAI Service (hosted, primarily for OpenAI models). Each project covers a similar functional core but differs in hosting model, supported models and compliance properties.

The principle comes from the classical API-gateway pattern: one central point for auth, routing, telemetry. What HTTP endpoints are to a classical API gateway, model calls are to an LLM gateway – semantically the same, technically with quirks (streaming, token counting, long response times, model versions).

Why it matters

Without a gateway, a growing AI application quickly turns into a thicket: every application has its own API keys, its own error handling, its own cost logic. There is no central view of what is actually being consumed – and no central control in an emergency. Five concrete problems recur, all of which a gateway solves.

First: key sprawl. Without a gateway every app has its own API key. When an employee leaves, you have to know which keys are linked to which app – almost always patchy. A gateway offers virtual keys: each application gets an internal key validated against the master key, with its own budget and rights.

Second: no cost control. Model costs are usage-based and can explode (endless loop in an agent, faulty batch job). A gateway enforces hard limits – per key, per user, per day, per model. Exceeding triggers an alert or a block, not a five-figure monthly invoice.

Third: no multi-provider capability. As soon as you need two providers for regulatory reasons (e.g. Mistral-EU for client data + Claude for special tasks), the app must decide per call which provider fits. That is business logic that should not live in every app – but in the gateway.

Fourth: no resilience. OpenAI outages sometimes last hours. Without a gateway the application stops. With one, a fallback plan runs: try Claude, if that also fails Gemini, otherwise local Ollama as emergency mode. On the application level this is invisible.

Fifth: no audit trail. For compliance (revFADP, EU AI Act Art. 12, professional secrecy SCC 321) you need the answer: who called which model when, with which prompt, what came back, how long it took. On the application level this log is usually incomplete. In the gateway it is mandatory.

Functions in detail

A full LLM gateway implements seven function groups.

1. Multi-provider routing. A unified API (typically OpenAI chat-completions format) maps to different providers. LiteLLM in May 2026 supports more than 100 providers – from OpenAI, Anthropic, Google, Mistral to open-source models via Ollama, vLLM, TGI. The app writes once in OpenAI syntax; the gateway translates at runtime.

2. Auth and virtual keys. Instead of apps knowing the real provider keys, they get an internal key from the gateway. Per key: budget (e.g. CHF 200/month), allowed models, allowed endpoints, rate limits. If the key is compromised, deactivate it in the gateway – provider keys stay intact.

3. Rate limiting. Per key or per user: requests per minute, tokens per minute. Prevents accidental or malicious load spikes.

4. Fallback chain. Per logical model name, an ordered list of real models. Example: "smart-de" -> primary: claude-4.7-sonnet, fallback: gpt-4.1, last-resort: mistral-large-eu. On 429 errors or timeout the gateway transparently switches. The application only sees "smart-de", never the individual model names.

5. Observability. Per request, a structured log entry: timestamp, key, model, input tokens, output tokens, latency, cost (USD/CHF), success/failure. In May 2026 all gateways export Prometheus metrics and ship via OpenTelemetry into Grafana, Datadog or custom stacks.

6. Cost tracking. Per key, per model, per period – a bill. Important: providers publish prices per model; the gateway keeps the price table current. When prices change (e.g. OpenAI cut GPT-4o by 25% in May 2026), the gateway adapts – the application changes nothing.

7. Caching, guardrails, more. Optional but in May 2026 often included: answer caching for repeated identical prompts (LiteLLM cache, Portkey, Cloudflare AI Gateway), PII filter, prompt-injection filter, policy engine for content. This layer is growing fast and becomes a central compliance tool in 2026/27.

Gateway selection in 5 steps

01Clarify requirements: how many providers, how many apps, which compliance constraints (revFADP, EU AI Act, FINMA), which hosting model?
02Decide self-hosted vs managed: self-hosted (LiteLLM) for data sovereignty and cost control; managed (OpenRouter, Portkey, Cloudflare AI Gateway) for fast bootstrap.
03Check function coverage: virtual keys, rate limits, fallback chain, audit log, cost tracking, optionally caching and guardrails. In May 2026 all main products cover the core.
04Pilot with two providers: one for standard, one for fallback. Migrate one application. Measure latency, logs, cost.
05Roll out: migrate all applications to the gateway, issue virtual keys per app, integrate monitoring in Grafana, alerts to Telegram/Slack.

When a gateway pays off

Rule of thumb: from two models or two production apps onward, a gateway pays off. In May 2026 it is the standard for any AI solution beyond a single prototype.

Concrete trigger points: (a) you use several models in production (e.g. GPT-4 for standard, Claude for long-context, Mistral-EU for client data). (b) You run two or more AI apps (internal chat, RAG assistant for clients, automation pipeline). (c) Compliance requirements (revFADP DPIA, EU AI Act Art. 12 logging, professional secrecy audit) demand a central audit trail. (d) You want cost control per team, per client, per project. (e) You plan a provider strategy with fallback (resilience architecture).

A gateway is especially valuable when the risk of a provider outage is high. The 2024-2026 OpenAI incident history shows several multi-hour outages. An application with gateway + fallback ran through every one of them; one without stood still.

Self-hosted vs. managed: LiteLLM runs on Hetzner EU or a Cloudron server. Whoever does not want to operate another piece picks OpenRouter, Portkey or Cloudflare AI Gateway. Compliance preference: self-hosted (LiteLLM) is the more revFADP-friendly option since no data flow to a third party – but managed gateways increasingly offer EU options.

When NOT a gateway

Three clear cases where a gateway is overkill.

First: single prototype, single user, single model. A Streamlit demo, a Jupyter notebook for a research project – here the gateway adds complexity without benefit. Direct provider SDK calls suffice.

Second: strict on-premise with only one local model. Whoever runs only Ollama or vLLM on own hardware and no cloud models can talk to the model server directly. A gateway in front only pays once multiple models are swapped or audit requirements demand it.

Third: critical sub-millisecond latency. A gateway adds 5-20 ms (local) or 50-150 ms (managed). For typical chat apps this is irrelevant (generation time dominates). For inline code completion in an IDE or a voice agent with real-time demands, this latency can be felt – then direct SDK calls or a very thin proxy without caching/guardrails.

Watch out: a self-hosted gateway is an additional component that can fail. Whoever runs LiteLLM also operates LiteLLM – with health checks, updates, backups. For single-person setups with little ops capacity, a managed gateway (OpenRouter, Portkey) is often the calmer choice.

Trade-offs

STRENGTHS

One API for many providers – application code stays provider-neutral
Central cost control and audit log – compliance and accounting bonus
Resilience via fallback chain – provider outages get absorbed
Virtual keys enable team/client separation without multiple contracts

WEAKNESSES

Extra component – for self-hosted, ops overhead and SPOF risk
5-150 ms added latency per request – relevant for real-time use cases
Provider-specific features (e.g. Anthropic cache control) need gateway updates
For managed gateway: another data processor with its own data-processing agreement

FAQ

What does a gateway setup cost?

Self-hosted LiteLLM on Hetzner EU: CHF 15-30/month hosting plus setup work (4-12 consulting hours, CHF 800-2400). Managed: OpenRouter 5.5% surcharge on model costs (no fixed fee), Portkey CHF 0-200/month per tier, Cloudflare AI Gateway included in the Workers platform (typically CHF 5-20/month). For an SME with CHF 200/month of AI consumption, self-hosted is cheaper; for a test run, managed is faster.

Does a gateway work with Anthropic Claude and OpenAI at the same time?

Yes, that is the standard use case. As of May 2026 LiteLLM, OpenRouter, Portkey and Cloudflare AI Gateway support both providers in parallel – plus 50+ more. The application sends a call in OpenAI format; the gateway translates at runtime into Anthropic format and back. Streaming, tool use, JSON mode, vision – all supported with small provider-specific differences.

Do I see my users' prompts in cleartext in the gateway?

By default yes – that is the point of the audit trail. For sensitive data you configure masking: prompts are replaced by placeholders in the log, only metadata (model, token count, cost, success) remain. For revFADP DPIA and professional secrecy (SCC 321) the log policy in the gateway must match the application's data-protection policy – see ai-audit-trail-design.

Sources

LiteLLM – Documentation (BerriAI, Apache 2.0) · 2026-05
OpenRouter – Model Marketplace and Gateway · 2026-05
Cloudflare AI Gateway – Documentation · 2026-04
Portkey – AI Gateway with Guardrails · 2026-03

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call