VLLM · TECH
vLLM: production serving for open-weight LLMs with high throughput and PagedAttention
vLLM is an Apache 2.0 inference server for Linux with GPU. PagedAttention and continuous batching deliver up to 20x more throughput than hobby runtimes.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is vLLM?
vLLM is an open-source inference server for large language models, originally developed at UC Berkeley and freely available under Apache 2.0 as of May 2026. The project lives at github.com/vllm-project/vllm with over 30,000 GitHub stars, more than 800 contributors and its own foundation structure (vLLM is now part of the Linux Foundation AI & Data).
The core: vLLM is specialised in GPU-based production serving. Unlike Ollama, llama.cpp or LM Studio, which primarily serve single users and hobby setups, vLLM was designed from day one for multi-user load. Two techniques make the difference: PagedAttention manages the transformer KV cache like memory pages in an operating system and avoids the usual fragmentation loss of 30-60 percent. Continuous batching merges incoming requests into a single GPU batch every second, so the GPU runs at 80-95 percent utilisation instead of 30 percent.
Version state May 2026: vLLM v0.6+ is the productive line. The engine supports more than 50 model architectures, including Llama 3.3 / Llama 4 Scout and Maverick, Mistral Large 2 and Small 3.1, Qwen 2.5 / Qwen 3, DeepSeek V3 / V4, Gemma 3, Phi-4, Apertus 8B and 70B, plus vision-language models such as LLaVA and Qwen2-VL. Quantisation formats: AWQ, GPTQ, INT8, FP8, limited GGUF, Marlin kernel. Hardware support: NVIDIA CUDA (Ampere, Hopper, Blackwell), AMD ROCm (MI250, MI300), AWS Inferentia, Google TPU, Intel Gaudi.
The interface is OpenAI-compatible: POST /v1/chat/completions and /v1/completions, plus native endpoints under /generate. So existing OpenAI SDKs, LiteLLM, n8n and LangChain work without adaptation. A vLLM server can be launched with a single Docker command or installed via pip and started with "python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-4-Scout-17B".
Why vLLM matters for Swiss data
Three reasons make vLLM the first choice as soon as a Swiss law firm or fiduciary office moves beyond pilot status.
First: GPU efficiency becomes the cost lever. An H100 80GB costs around CHF 3,500-5,500 per month at Hetzner GEX44 or Infomaniak. On this GPU, Ollama with Llama 3.3 70B in 4-bit quantisation serves about 3-5 parallel users with acceptable latency. vLLM on the same hardware serves 30-50 parallel users with the same model. Anyone wanting to serve fifteen lawyers or thirty fiduciary staff on own hardware cannot avoid vLLM – or has to budget four times the GPU spend.
Second: data sovereignty stays complete. vLLM runs in your own rack or on a dedicated GPU instance at Infomaniak in Geneva or at a German host like Hetzner. No request leaves the defined processing perimeter. Relevant for clients under professional secrecy per Art. 321 SCC, for FINMA-supervised entities with SN 08/2024 exposure, and for EU AI Act Art. 10 data-governance requirements.
Third: routing capability via LiteLLM. vLLM is OpenAI-compatible – so a vLLM server can be entered as a provider in LiteLLM and routed alongside Mistral Cloud, the Apertus Swisscom API and Claude. A typical Swiss multi-provider strategy in May 2026 looks like this: highly sensitive requests to on-premises vLLM with Apertus 70B, medium-sensitive to Mistral La Plateforme in the EU, top reasoning cases to the current top Claude model under an FADP-compliant contract. The switch is a routing rule, not a code refactor.
Fourth, often overlooked: vLLM documents performance well. Logs show time-to-first-token, inter-token latency, per-request GPU utilisation. These metrics land via Prometheus in Grafana and thus in audit-ready form – important for FINMA SN 08/2024 Pillar 3 (robustness) and EU AI Act Art. 15 logging duties.
How vLLM works technically
vLLM is a Python package with C++/CUDA kernels. Server mode starts a worker process per GPU or per tensor-parallel group. Requests arrive via FastAPI, land in a request queue, and the scheduler loop groups them into micro-batches.
Setup example. A productive setup on a single H100 80GB looks like this:
``` docker run --gpus all -p 8000:8000 \ -v /models:/models \ vllm/vllm-openai:v0.6.3 \ --model meta-llama/Llama-4-Scout-17B-Instruct \ --max-model-len 131072 \ --gpu-memory-utilization 0.92 \ --enable-prefix-caching \ --api-key sk-fairlane-prod-key ```
This command loads Llama 4 Scout with 128k context, reserves 92 percent of GPU memory for model and KV cache, and enables prefix caching (answers for repeated system prompts come from cache).
PagedAttention in detail. The transformer KV cache grows linearly with input length. Classical implementations reserve the cache as one contiguous block – variable-length requests create holes (fragmentation). PagedAttention splits the cache into fixed blocks (typically 16 tokens) and manages them via a table, similar to OS virtual memory. Result: usable GPU memory rises from 60-70 percent to over 95 percent.
Continuous batching. A classical server processes requests FIFO. While request A generates 200 tokens, request B waits. vLLM mixes incoming requests token by token: if the GPU is at token 150 of request A in the same forward pass, request B joins on the next step. This raises utilisation drastically.
Tensor parallelism and pipeline parallelism. For models that do not fit on a single GPU (Llama 4 Maverick 400B, Apertus 70B in fp16), vLLM distributes layers across GPUs. "--tensor-parallel-size 2" on two H100s starts Maverick in 4-bit AWQ quantisation at around 40-60 tokens/sec aggregated.
Monitoring. vLLM exports Prometheus metrics on /metrics: vllm:num_requests_running, vllm:gpu_cache_usage_perc, vllm:time_to_first_token_seconds, vllm:e2e_request_latency_seconds. These land in Grafana with alerts on p95 latency and queue waiting time. Logs come via stdout into the Docker stack and onward to Loki.
vLLM to production in 5 steps
- 01Hardware check: GPU with at least 24 GB VRAM, CUDA 12.4+, Linux (Ubuntu 22.04 or Debian 12). Verify driver version with "nvidia-smi".
- 02Model choice and quantisation: Apertus 70B in 4-bit AWQ for Swiss sovereignty, Llama 4 Scout for long context, Mistral Small 3.1 for EU DE/FR/IT, Phi-4 for minimal VRAM. Load quantised variants from Hugging Face.
- 03Start the Docker container: vllm/vllm-openai:v0.6.3 with model, max-model-len, gpu-memory-utilization 0.90-0.92 and a defined API key.
- 04LiteLLM wiring: enter vLLM as an OpenAI-compatible provider in the LiteLLM config.yaml, logical name (e.g. local-apertus-70b) and routing rules defined.
- 05Monitoring: Prometheus scraper on /metrics, Grafana dashboard with p95 latency, GPU cache usage and queue depth. Alerts from 1500ms p95 and 80 percent GPU cache.
When to use vLLM
vLLM is the right choice when three conditions meet: (a) a GPU with at least 24 GB VRAM is available, (b) ten or more parallel requests per second are expected, (c) Linux is the standard operating system.
Concrete cases: law firm with 20+ staff and an internal chat front-end for client correspondence – vLLM on 2x H100 with Apertus 70B or Llama 4 Maverick. Fiduciary firm with high batch volume (10,000+ receipts per day with OCR and classification) – vLLM on an L40S 48GB with Phi-4 or Apertus 8B. Insurer with claims triage and high throughput needs – vLLM on RTX A6000 with Mistral Small 3.1.
For production use, vLLM also pays off when latency requirements are strict. Time-to-first-token below 300ms is comfortably achievable on an H100 with Llama 4 Scout – Ollama under comparable load sits at 800-1500ms.
When not to use
Without a GPU, vLLM is not the right path. A CPU-only installation runs technically but is slower than llama.cpp or Ollama on the same hardware – vLLM is optimised for GPU kernels.
If the load is below five parallel requests per second, the setup effort for vLLM is not justified. Ollama with Llama 3.3 70B on the same H100 is up in an hour; vLLM needs one to two days for a productive configuration with monitoring and alerts.
For Mac notebooks and Apple Silicon setups, vLLM is not relevant – the CUDA dependency rules out Apple Metal. Here Ollama, LM Studio and llama.cpp are the right tools.
For hobby exploration or weekend projects without a clear scaling goal, Ollama is simply more pleasant. vLLM pays off from the moment throughput, stability and observability become hard requirements.
Trade-offs
STRENGTHS
- Up to 20x more throughput than hobby runtimes on the same GPU thanks to PagedAttention and continuous batching
- Apache 2.0 licence, fully self-hostable, no provider lock-in
- OpenAI-compatible API integrates cleanly with LiteLLM, n8n, LangChain
- Production metrics via Prometheus for FINMA SN 08/2024 and EU AI Act audit trails
WEAKNESSES
- GPU is mandatory, no productive CPU mode, no Apple Silicon support
- Setup effort of one to two days for a productive configuration with monitoring
- Model updates and quantisation choice must be actively maintained
- Driver and CUDA version compatibility can cause friction with new GPU generations
FAQ
How much faster is vLLM really than Ollama?
Per single request, about 1.5x to 2x. With many parallel requests, however, dramatically more – typical measurement May 2026 on an H100 with Llama 3.3 70B in 4-bit: Ollama serves 4-5 users with p95 latency under 6 seconds; vLLM serves 35-45 users at the same p95 mark. The lever is continuous batching, not single-stream speed.
Can I use vLLM with AMD GPUs?
Yes, ROCm 6.0+ is productively supported as of May 2026. Tested hardware: MI250, MI300X. Performance on MI300X is comparable to H100, but driver maturity is still lower. For standard Swiss setups, NVIDIA remains the default – Hetzner and Infomaniak both offer H100 and L40S.
Which models run best under vLLM?
Llama 3.3, Llama 4 Scout and Maverick, Mistral Large 2 and Small 3.1, Qwen 2.5 and Qwen 3, DeepSeek V3 and V4, Apertus 8B and 70B, Phi-4, Gemma 3. Vision-language: LLaVA and Qwen2-VL. Weakly supported are very small, exotic architectures – these traditionally ran better under Hugging Face Text Generation Inference.
Do I need a tensor-parallel setup for 70B models?
Not necessarily. Apertus 70B in 4-bit AWQ fits one H100 80GB with 45-50 GB VRAM. Tensor parallel across two GPUs pays off with full FP16 precision (140 GB needed) or for higher throughput. As of May 2026 the most common Swiss configuration is a single H100 with 4-bit quantisation – covers most fiduciary and law-firm workloads.