Install the LiteLLM gateway: Docker, config.yaml, virtual keys, cost tracking and Langfuse (May 2026)

Guide from empty server to a production LiteLLM proxy with 5 providers (OpenAI, Anthropic, Mistral, Gemini, Ollama), virtual keys with budget limits, PostgreSQL audit log and Langfuse tracing.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is this about?

This guide builds a production LiteLLM proxy on a Linux server (Hetzner, AWS, Render or any other cloud). You install LiteLLM via Docker Compose with PostgreSQL as the audit backend, store API keys for 3-5 LLM providers (OpenAI, Anthropic, Mistral, Gemini, Ollama), configure virtual keys with spend limits per user or team, set up health checks and Langfuse tracing, and place the whole stack behind an nginx reverse proxy with auth middleware.

The guide targets three audiences. First: SMEs that want to use multiple LLM providers in parallel (e.g. Mistral for client data, GPT-4o for open research, Claude Haiku for code classification). Second: fiduciary or law firms needing a central audit log for all AI requests (Art. 957a CO). Third: dev teams needing budgets per application and per client (pilot client CHF 50/month, production client CHF 500/month).

Prerequisites: a Linux server with Docker (4 GB RAM, 2 vCPU is enough), a domain on Cloudflare or own DNS, API keys from 3-5 providers (at least OpenAI or Anthropic, plus at least one EU provider like Mistral). Setup time: 3-5 hours net. Cost: server CHF 5-10/month, no LiteLLM licence cost (Apache 2.0 open source).

Why a gateway instead of direct provider calls

Writing applications directly against provider SDKs builds technical debt into every module. Four problems show up in every larger setup, and LiteLLM solves them in one place.

First: lock-in. OpenAI, Anthropic and Mistral have different SDKs, different token limits and different error codes. A codebase talking to all three has three failure surfaces instead of one. When the preferred provider gets more expensive or fails, switching is a refactor, not a config change. With LiteLLM the provider is a model entry in a YAML file, and the application stays untouched.

Second: data-protection routing. A Swiss fiduciary must not send client PII to US providers but can send anonymised research queries to GPT-4o. Without a gateway that rule is scattered in code; with LiteLLM it lives in one place. Model names like mistral-eu-secure, claude-haiku-eu or local-llama mark the tier, the proxy routes accordingly. An application may only talk to model names allowed for its class.
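The tier rule can be made concrete with a small sketch. The tier names `client-pii` and `open-research` and the helper `is_allowed` are assumptions for illustration; the model names are the examples from the paragraph above. In the proxy itself the same rule would be expressed through the model whitelist of each virtual key, not through application code.

```python
# Hypothetical tier policy: which logical model names each data class may call.
# Tier names are assumptions; model names are the examples used in the text.
ALLOWED_MODELS = {
    "client-pii": {"mistral-eu-secure", "claude-haiku-eu", "local-llama"},
    "open-research": {"mistral-eu-secure", "claude-haiku-eu", "local-llama", "gpt-4o"},
}


def is_allowed(tier: str, model_name: str) -> bool:
    """Return True if an application of the given tier may call model_name."""
    # Unknown tiers get an empty whitelist, so they are denied by default.
    return model_name in ALLOWED_MODELS.get(tier, set())
```

With this policy, `is_allowed("client-pii", "gpt-4o")` is false while `is_allowed("open-research", "gpt-4o")` is true, which is exactly the separation the fiduciary example asks for.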

Third: cost and observability. Each provider has its own dashboards, billing cycle, model prices. A cross-provider view – who used how much, which workflow costs what – cannot be built without an aggregation layer. LiteLLM writes every request with model, tokens, cost and latency to Postgres and exports metrics to Prometheus.

Fourth: spend limits per user. A pilot client gets a budget of CHF 50/month, a production client CHF 500, an internal research app CHF 200. When a budget is exceeded, the proxy automatically blocks further requests on that virtual key, and the provider's own API key stays untouched. This is the cleanest way to run multiple clients or applications without uncontrolled cost explosions.

How the stack hangs together

The architecture has five layers: nginx reverse proxy with TLS, LiteLLM proxy container, PostgreSQL for audit log, Langfuse for tracing, Prometheus + Grafana for metrics.

nginx reverse proxy: terminates TLS via Let's Encrypt certificate, forwards llm.your-domain.ch to the LiteLLM container (port 4000). Optionally basic auth or Cloudflare Access tokens for an extra authentication layer.

LiteLLM proxy: the image ghcr.io/berriai/litellm, pinned to the main-stable tag or a specific stable release (this guide assumes 1.50+, May 2026). The configuration lives in a config.yaml with three core sections. model_list gives each model a logical name and a provider mapping. Example: `model_name: mistral-eu-secure` points via `litellm_params: model: mistral/mistral-large-latest, api_key: os.environ/MISTRAL_API_KEY` at Mistral La Plateforme. A second entry with the same model_name can point at a Mistral deployment in Azure EU; the two entries then form a fallback group. litellm_settings holds global defaults such as set_verbose, drop_params and success_callback. general_settings holds master_key (the admin token), database_url for Postgres, and alerting.

Virtual keys: via the admin UI or the /key/generate API, virtual keys (sk-litellm-xyz) are issued. Each key has: max_budget (e.g. 50.00 USD/month), models (whitelist of allowed models), tpm_limit (tokens per minute), rpm_limit (requests per minute), team_id (grouping). Applications get a virtual key instead of the master key.
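As a sketch of the API route, the following builds (but does not send) a `/key/generate` request with the fields listed above. The base URL, the concrete limit values and the team name `team-pilot` are assumptions; the field names are the ones named in this section plus `budget_duration` for the monthly reset.

```python
import json
import urllib.request


def build_key_request(base_url: str, master_key: str) -> urllib.request.Request:
    """Build a POST /key/generate request; sending it is left to the caller."""
    payload = {
        "max_budget": 50.0,               # budget in the proxy's accounting currency
        "budget_duration": "30d",         # budget resets every 30 days
        "models": ["mistral-eu-secure"],  # whitelist of logical model names
        "tpm_limit": 20000,               # tokens per minute (assumed value)
        "rpm_limit": 60,                  # requests per minute (assumed value)
        "team_id": "team-pilot",          # hypothetical team name
    }
    return urllib.request.Request(
        url=f"{base_url}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` against a running proxy should return a JSON body that contains the new key. Only this admin script ever needs the master key; the application receives the generated virtual key.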

PostgreSQL: three core tables. LiteLLM_SpendLogs (one row per request with model, tokens, cost, latency, user, team). LiteLLM_UserTable (user master data with spend total). LiteLLM_VerificationToken (virtual key definitions). On this basis Grafana can show per client how much was used, on which model and at what time.
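The kind of per-client report this enables can be sketched with an in-memory SQLite table. The table `spend_logs` and its four columns are a deliberately simplified stand-in, not the real LiteLLM schema, and the rows are made-up sample data; against the production database the same GROUP BY would run on the actual spend log table with its own column names.

```python
import sqlite3

# Simplified stand-in for the spend log table (assumed columns, sample rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spend_logs (team_id TEXT, model TEXT, total_tokens INTEGER, spend REAL)"
)
conn.executemany(
    "INSERT INTO spend_logs VALUES (?, ?, ?, ?)",
    [
        ("team-pilot", "mistral-eu", 1200, 0.25),
        ("team-pilot", "claude-haiku", 800, 0.5),
        ("team-prod", "mistral-eu", 5000, 1.0),
    ],
)


def spend_per_team(connection: sqlite3.Connection) -> list:
    """Aggregate tokens and cost per team, sorted by team id."""
    return connection.execute(
        "SELECT team_id, SUM(total_tokens), SUM(spend) "
        "FROM spend_logs GROUP BY team_id ORDER BY team_id"
    ).fetchall()


print(spend_per_team(conn))
# [('team-pilot', 2000, 0.75), ('team-prod', 5000, 1.0)]
```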

Langfuse: open-source observability tool for LLM apps. LiteLLM exports prompt/response via callback to Langfuse – there you see per request the full prompt, the answer, token count and cost. Important for debugging and eval. Langfuse itself is self-hostable (Docker-Compose) or available as a cloud service.

Prometheus + Grafana: LiteLLM exposes a /metrics endpoint with token counters, cost counters and latency histograms per model and user. Prometheus scrapes it every 30 s; Grafana dashboards show "top-10 spending models", "latency per provider" and "error rate per model". Alerting on spikes: a Telegram notification when a user exceeds 80% of the monthly budget.
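The two budget thresholds in play (warn above 80%, block at 100%) can be written down as a tiny sketch. The function `budget_state` and its return labels are hypothetical; in the real stack the warning would be an Alertmanager rule and the block is enforced by the proxy itself.

```python
def budget_state(spend: float, max_budget: float, warn_ratio: float = 0.8) -> str:
    """Classify a key's spend as 'ok', 'warn' (above warn_ratio) or 'blocked'."""
    if max_budget <= 0:
        raise ValueError("max_budget must be positive")
    ratio = spend / max_budget
    if ratio >= 1.0:
        return "blocked"  # budget used up: the proxy rejects further requests
    if ratio > warn_ratio:
        return "warn"     # above the alert threshold: send a notification
    return "ok"
```

For a pilot budget of 50, a spend of 45 falls into the warning band and a spend of 50 is blocked.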

LiteLLM setup in 11 steps

Step 1 – prepare server and domain: Hetzner CX22 (CHF 5/month), `apt update && apt install docker.io docker-compose-plugin`. Create the subdomain llm.your-domain.ch in Cloudflare as an A record to the server IP, NOT proxied (orange cloud off; the LiteLLM proxy does not need it).
Step 2 – collect provider API keys: OpenAI (platform.openai.com/api-keys), Anthropic (console.anthropic.com/settings/keys), Mistral (console.mistral.ai/api-keys), Gemini (aistudio.google.com/app/apikey), Ollama (local endpoint http://ollama:11434, no key needed). At least 3 keys recommended.
Step 3 – create docker-compose.yml: `/opt/litellm/docker-compose.yml` with three services: a reverse proxy (traefik on ports 80/443; omit it if you use the host nginx from step 10), postgres (image postgres:17-alpine, POSTGRES_DB=litellm), litellm (image ghcr.io/berriai/litellm:main-stable, depends_on postgres, env_file .env, config.yaml mounted at /app/config.yaml, port 4000 published on 127.0.0.1, command `--config /app/config.yaml`).
Step 4 – .env with secrets: `/opt/litellm/.env`: `OPENAI_API_KEY=sk-...`, `ANTHROPIC_API_KEY=sk-ant-...`, `MISTRAL_API_KEY=...`, `GEMINI_API_KEY=AIza...`, `LITELLM_MASTER_KEY=sk-1234-master-...` (an `sk-` prefix followed by a random part generated with `openssl rand -hex 32`), `DATABASE_URL=postgres://litellm:pw@postgres:5432/litellm`, `LITELLM_SALT_KEY=...`. Set `chmod 600` on the file.
Step 5 – write config.yaml: `/opt/litellm/config.yaml` with model_list, general_settings and litellm_settings:
```yaml
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-3-5-haiku-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: mistral-eu
    litellm_params:
      model: mistral/mistral-large-latest
      api_key: os.environ/MISTRAL_API_KEY
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-2.0-flash
      api_key: os.environ/GEMINI_API_KEY
  # Needs an Ollama service reachable as "ollama" on the compose network,
  # with the model already pulled.
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1:8b
      api_base: http://ollama:11434

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL

litellm_settings:
  success_callback: ["langfuse"]   # Langfuse keys are added in step 9
  drop_params: true                # drop parameters a provider does not support
```
Step 6 – bring the stack up: `cd /opt/litellm && docker compose up -d`. Follow the logs with `docker compose logs -f litellm`. Expected: the proxy reports that it is listening on port 4000, that the database is connected and that the models are loaded. Check: `curl http://localhost:4000/health/liveliness` → 200 OK.
Step 7 – test request with master key: load the key into the shell first (`set -a; . /opt/litellm/.env; set +a`), then `curl http://localhost:4000/v1/chat/completions -H "Authorization: Bearer $LITELLM_MASTER_KEY" -H "Content-Type: application/json" -d '{"model":"claude-haiku","messages":[{"role":"user","content":"Hello"}]}'`. Expect an answer in OpenAI format with model, token usage and choices.
Step 8 – generate virtual key: `curl http://localhost:4000/key/generate -H "Authorization: Bearer $LITELLM_MASTER_KEY" -H "Content-Type: application/json" -d '{"max_budget":50.0,"budget_duration":"30d","models":["mistral-eu","claude-haiku"],"team_id":"team-fairlane","user_id":"app-rag-pilot"}'`. The response contains a key like `sk-litellm-abc...`. Use this key as the API key in the application.
Step 9 – set up Langfuse tracing: launch Langfuse via Docker Compose (https://github.com/langfuse/langfuse) on the same network. Generate a public and a secret key in the Langfuse UI. In the LiteLLM .env: `LANGFUSE_PUBLIC_KEY=pk-...`, `LANGFUSE_SECRET_KEY=sk-...`, `LANGFUSE_HOST=http://langfuse:3000`. Restart LiteLLM. Each request then lands in Langfuse with prompt, response, tokens and cost.
Step 10 – nginx reverse proxy with TLS: obtain the certificate first with `certbot certonly --nginx -d llm.your-domain.ch`, because nginx refuses to load a config whose certificate files do not exist yet. Then put the nginx config in `/etc/nginx/sites-available/litellm` and link it into sites-enabled: `server { listen 443 ssl http2; server_name llm.your-domain.ch; ssl_certificate /etc/letsencrypt/live/llm.your-domain.ch/fullchain.pem; ssl_certificate_key /etc/letsencrypt/live/llm.your-domain.ch/privkey.pem; location / { proxy_pass http://localhost:4000; proxy_set_header Host $host; proxy_buffering off; proxy_read_timeout 300s; } }`. `proxy_buffering off` keeps streamed responses from being held back. Finish with `nginx -t && systemctl reload nginx`.
Step 11 – monitoring and spend alarms: Prometheus scrapes `http://litellm:4000/metrics`; depending on the LiteLLM version, the endpoint may first have to be enabled via the prometheus callback in litellm_settings. Grafana dashboard with panels "spend per user", "latency per model", "error rate per provider". Alertmanager rules, all routed to Telegram: `litellm_spend_user > 0.8 * litellm_budget_user`; `litellm_error_rate_provider > 5%` for 10 minutes; `litellm_p99_latency > 5s`. The metric names here are shorthand; take the exact names from the /metrics output of your version.
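Once the steps above are done, an application talks to the proxy like to any OpenAI-compatible endpoint. The following is a minimal sketch using only the Python standard library; the function names `build_chat_request` and `send` are illustrative, and the base URL and key are placeholders you would replace with your own values.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST /v1/chat/completions request in OpenAI format."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # virtual key, not the master key
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (raises on HTTP errors)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage against a running proxy: `send(build_chat_request("https://llm.your-domain.ch", "<virtual key>", "claude-haiku", "Hello"))["choices"][0]["message"]["content"]`. Switching providers then means changing only the `model` argument to another logical name from config.yaml.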

When LiteLLM pays off

LiteLLM pays off as soon as (a) more than one LLM provider is in play, (b) multiple applications or clients share LLM access, or (c) data-protection tiers must be routed.

Typical triggers: a Swiss fiduciary wants Mistral EU for client data and GPT-4o for open research in the same workflow. An SME pilots RAG and a voice bot in parallel – both applications should have separate budgets and model whitelists. An office plans n8n automation plus a chat assistant – both need audit logging and per-client cost reporting.

The proxy also makes sense for a single application when a fallback is desired. Example: the client chat should automatically switch from Anthropic to Mistral on Anthropic outages without the code noticing. That takes 5 lines of YAML in LiteLLM; without a gateway it is a custom piece of code in every service.
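For comparison, this is roughly the custom code each service would carry without a gateway. It is a conceptual sketch of a fallback chain, not LiteLLM's implementation; `call_with_fallback`, `AllProvidersFailed` and the provider callables are hypothetical names.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, answer) of the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Each service would also need its own retry limits, timeouts and logging around this loop, which is the duplication the proxy removes.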

When LiteLLM is not the right tool

For a single small application with exactly one provider and no client separation, LiteLLM is overhead without upside. A weekend project against the OpenAI library does not need a proxy.

LiteLLM is also a poor fit for extremely latency-critical applications where every millisecond matters. The proxy typically adds 10-30 ms per request, which can matter for voice streaming or real-time bots. Direct provider integration with dedicated fallback code is the better fit there.

Anyone in Azure-only or AWS-Bedrock-only environments has gateway features there (Azure OpenAI Service with content filtering, AWS Bedrock with Guardrails). If those cover the requirement fully, an additional proxy layer is redundant.

Another pitfall: running LiteLLM without PostgreSQL. Audit log and virtual keys are then lost on container restart. PostgreSQL is not optional but mandatory for production. Skipping it leaves you with a convenient router but no audit trail and no compliance.

Trade-offs

STRENGTHS

One API surface for all providers – application code stays stable on provider switch
Virtual keys with budget, whitelist and rate limit per team or client
Central audit log and cost reporting across all providers
Fallback chains and latency routing via YAML config

WEAKNESSES

Adds 10-30 ms latency per request
One more stack component – needs monitoring and updates
YAML config can get hard to read with many models – version control matters
Overhead without benefit in single-provider or single-app projects

FAQ

What overhead does the proxy add?

Typically 10-30 ms of extra latency per request versus a direct provider call, plus minimal CPU overhead. At production load (several hundred requests per hour), a 2-vCPU container is enough; at thousands of requests per minute, you scale horizontally behind a load balancer.

How do I protect the master key?

Three layers. (1) Never commit the master key: keep it only in .env with chmod 600, and keep .env out of git. (2) Never hand the master key to applications; applications get virtual keys with budget and whitelist. (3) Store the master key in a password manager (1Password, Bitwarden) plus a printed copy in a safe. On a suspected leak: generate a new master key, invalidate the old one and reissue all virtual keys.
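For rotation, a new key can be generated in Python instead of with `openssl rand -hex 32`. This is a minimal sketch; the helper name `new_master_key` is hypothetical, and the `sk-` prefix simply follows the key format used in this guide.

```python
import secrets


def new_master_key() -> str:
    """Return 'sk-' plus 64 hex characters (32 random bytes from the OS CSPRNG)."""
    return "sk-" + secrets.token_hex(32)
```

The result goes into `LITELLM_MASTER_KEY` in .env, followed by a restart of the proxy.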

Can LiteLLM route embeddings and image models too?

Yes. The proxy supports /v1/embeddings (OpenAI, Cohere, Voyage, local models), /v1/audio/transcriptions (Whisper, Deepgram), /v1/images/generations (DALL-E, Stable Diffusion) and /v1/chat/completions with vision. Re-ranking endpoints are covered too. Config via model_list with the matching endpoint prefix, same as chat models.

Is LiteLLM audit-ready for Art. 957a CO?

The proxy writes every request with timestamp, model, tokens, cost and response hash to PostgreSQL. If that table is WORM-secured (e.g. append-only tablespace plus a regular hash anchor in external storage like Backblaze B2 with object lock), it meets the CO requirements for audit-grade business correspondence. Add a daily audit-trail backup to B2 with 7-year retention.
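The hash anchor mentioned above could be computed as a simple hash chain over the exported log rows. This is a sketch under assumptions: `chain_digest` is a hypothetical helper, the rows are plain dictionaries exported from the database, and what exactly counts as sufficient for an audit should be confirmed with your auditor.

```python
import hashlib
import json


def chain_digest(rows, previous_anchor: str = "") -> str:
    """Fold rows into one SHA-256 hex digest, starting from the previous anchor.

    Each step hashes the running digest together with a canonical JSON form of
    the row, so changing, removing or reordering any row changes the result.
    """
    digest = previous_anchor
    for row in rows:
        canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256((digest + canonical).encode("utf-8")).hexdigest()
    return digest
```

Run daily with yesterday's anchor as `previous_anchor` and store the new digest in the object-locked bucket. To verify later, recompute the chain from the database and compare it with the stored anchor.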

Sources

BerriAI/litellm – GitHub repository and changelog · 2026-05
LiteLLM Proxy Server documentation (config.yaml, virtual keys, fallbacks) · 2026-05
LiteLLM Observability hooks – Langfuse, Helicone, Datadog integration · 2026-04
Langfuse documentation – self-host and integration · 2026-04
BerriAI blog – Multi-tenant cost tracking with virtual keys · 2026-03
