fairlane.systems

LITELLM · TECH

LiteLLM: one gateway for 100+ LLM providers behind a single API

LiteLLM is an open-source proxy that bundles OpenAI, Anthropic, Mistral, local models and more behind a single OpenAI-compatible API.

Researched & fact-checked by: · As of: 2026-05

What is LiteLLM?

LiteLLM is an open-source project by BerriAI (Apache-2.0 license) that places a unified entry point in front of more than 100 LLM providers. Whether OpenAI, Anthropic, Mistral, Cohere, Google Vertex, Azure, AWS Bedrock, DeepSeek, Groq, Perplexity or a local Ollama instance – every request goes through the same OpenAI-compatible REST interface (POST /v1/chat/completions). The application code does not change when a provider is swapped.

LiteLLM runs either as a Python library inside the application process or – in production setups – as a standalone proxy server (Docker image ghcr.io/berriai/litellm-stable). Proxy mode adds team-level virtual API keys, per-team budget limits, fallback chains, rate-limiting, caching, and observability hooks for Langfuse, Helicone, Datadog and PostgreSQL logging. Version 1.50+ is stable as of May 2026; the project has been active since 2023 and holds over 14,000 GitHub stars.

On the fairlane.systems infrastructure, LiteLLM runs on port 4100 and bundles 24 models from 8 providers. Every production application – from client chat to n8n workflows to RAG queries – goes only through this proxy. That is the basis for routing by data-protection tier, for cost control, and for swapping providers without code changes.

Why it matters

Anyone writing LLM applications directly against provider SDKs builds technical debt into every module. Three problems show up in every larger setup, and LiteLLM solves them in one place.

First: lock-in. OpenAI, Anthropic and Mistral have different SDKs, different token limits, different error codes. A codebase that talks to all three directly has three failure surfaces instead of one. When the preferred provider gets more expensive or goes down, switching is a refactor, not a configuration entry. With LiteLLM, the provider is a model-config entry in a YAML – the application stays untouched.

Second: data-protection routing. A Swiss fiduciary must not send client PII to US providers, but can send anonymised research queries to GPT-4o without issue. Without a gateway, that rule is scattered across the code; with LiteLLM, it lives in one place. Model names like mistral-eu-secure, claude-haiku-eu, or local-llama mark the tier, and the proxy decides the route.

Third: cost and observability. Each provider has its own dashboards, its own billing cycle, its own model prices. A cross-provider view – who used how much, which workflow costs what – cannot be built without an aggregation layer. LiteLLM writes every request with model, token count, cost, and latency to Postgres and exports metrics to Prometheus, so Grafana can show a single per-tenant cost view.

How it works

The LiteLLM proxy is a Python FastAPI application in a Docker container. Its config lives in a config.yaml with three core sections: model_list, litellm_settings, general_settings.

In the model_list block, each model gets a logical name and a provider mapping. Example: an entry with model_name: mistral-eu-secure points via litellm_params: model: mistral/mistral-large-latest at Mistral La Plateforme, and a second entry called mistral-eu-fallback at a Mistral instance on Azure EU. Both share the same logical name as a fallback group – if the first returns 5xx, the second takes over.

Virtual keys are the second core feature. The proxy keeps provider master API keys internally; outwards, it issues virtual keys (sk-litellm-...) to teams, applications, or individual clients. Each virtual key has a monthly budget (e.g. CHF 50 for a pilot client), a model whitelist (e.g. only mistral-eu-*), and a rate limit. When a key exceeds its budget, the proxy blocks further requests – the provider master key stays untouched.

Observability runs through callback hooks. On every response, the proxy fires an event to configured sinks: Langfuse for prompt/response tracing, PostgreSQL for an audit log (audit-grade for Art. 957a CO), Prometheus for metrics. The audit log holds timestamp, virtual key, model, token count, cost, prompt hash (not plaintext, for confidentiality cases), and response latency.

Fallback and retry are defined per model group. A typical config: three attempts with exponential backoff, then fallback to the next model in the group. Latency-based routing (least-busy or weighted round-robin) spreads load across multiple model instances.

LiteLLM setup in 6 steps

  1. 01Inventory providers: which LLMs are in use, which API keys, which models per use case?
  2. 02Create config.yaml with model_list: each logical model name with provider mapping, fallback groups, rate limits.
  3. 03Start the Docker-Compose stack: LiteLLM container on port 4100, plus PostgreSQL for audit log and virtual keys.
  4. 04Generate the master admin key and issue virtual keys per team/client with budget and model whitelist.
  5. 05Wire up observability: Langfuse for tracing, Prometheus scraper for metrics, Grafana dashboard for per-client cost.
  6. 06Switch applications: base_url=http://litellm:4100/v1, api_key=sk-litellm-<team>, model names from the YAML. Test, then cut over.

When to use LiteLLM

LiteLLM pays off as soon as (a) more than one LLM provider is in play, (b) multiple applications or clients share LLM access, or (c) data-protection tiers must be routed.

Concrete triggers: a Swiss fiduciary wants Mistral EU for client data and GPT-4o for open research in the same workflow. An SME pilots RAG and a voice bot in parallel – both applications should have separate budgets and model whitelists. An office plans n8n automation plus a chat assistant – both need audit logging and per-client cost reporting.

The proxy also makes sense for a single application when a fallback is desired. Example: the client chat should automatically switch from Anthropic to Mistral on Anthropic outages without the code noticing. That takes 5 lines of YAML in LiteLLM; without a gateway it is a custom piece of code in every service.

When not to use

For a single, small application with exactly one provider and no client separation, LiteLLM is overhead with no upside. A weekend project against the OpenAI library does not need a proxy.

Equally unsuited: latency-critical applications where every millisecond matters. The proxy adds 10-30 ms; for voice streaming or real-time bots that can be material. Direct provider integration with dedicated fallback code may be the better fit.

Anyone living in Azure-only or AWS-Bedrock-only environments already has gateway features there (Azure OpenAI Service with content filtering, AWS Bedrock with Guardrails). If the requirement is fully covered, an additional proxy layer is redundant.

Trade-offs

STRENGTHS

  • One API surface for all providers – application code stays stable when switching providers
  • Virtual keys with budget, whitelist, and rate limit per team or client
  • Central audit log and cost reporting across all providers
  • Fallback chains and latency routing via YAML config

WEAKNESSES

  • Adds 10-30 ms latency per request
  • One more stack component – needs monitoring and updates
  • YAML configuration can get hard to read with many models – version control matters
  • Overhead without benefit in single-provider or single-application projects

FAQ

What overhead does the proxy add?

Typically 10-30 ms of extra latency per request versus a direct provider call, plus minimal CPU overhead. At production load (several hundred requests per hour), a 2-vCPU container is enough; at thousands of requests per minute, you scale horizontally behind a load balancer.

Can LiteLLM route embeddings and image models too?

Yes. The proxy supports /v1/embeddings (OpenAI, Cohere, Voyage, local models), /v1/audio/transcriptions (Whisper, Deepgram), /v1/images/generations (DALL-E, Stable Diffusion), and /v1/chat/completions with vision. Re-ranking endpoints are also covered.

Is LiteLLM audit-ready for Art. 957a CO?

The proxy writes every request with timestamp, model, tokens, cost, and response hash to PostgreSQL. If that table is WORM-secured (e.g. append-only tablespace plus a regular hash anchor in external storage), it meets CO requirements for audit-grade business correspondence. We provide the hash anchor in the managed service.

Related topics

MULTI-LLM GATEWAY · SERVICEMulti-LLM Gateway: eight providers, one entry point, compliance routingROUTING · AI CONCEPTMulti-LLM routing: which model when, for how muchSELF-HOSTED VS. CLOUD · AI CONCEPTSelf-hosted vs. cloud LLM: a decision framework for SMEs and fiduciariesOLLAMA · TECHOllama: local LLMs on your own hardware – where it works and where it does notGRAFANA · TECH STACKGrafana, Prometheus, Loki: monitoring stack for container apps and LLM workflows

Sources

  1. BerriAI/litellm – GitHub repository and changelog · 2026-05
  2. LiteLLM Proxy Server documentation (config.yaml, virtual keys, fallbacks) · 2026-05
  3. LiteLLM Observability hooks – Langfuse, Helicone, Datadog integration · 2026-04
  4. BerriAI blog – Multi-tenant cost tracking with virtual keys · 2026-03

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call