fairlane.systems

OLLAMA vs vLLM vs LLAMA.CPP - DUEL

Ollama vs vLLM vs llama.cpp - which local LLM server?

Three open-source runtimes for local language models. Ollama for onboarding, vLLM for production throughput, llama.cpp as the portable foundation - decision matrix as of May 2026.

Researched & fact-checked by: · As of: 2026-05

What is the duel about?

Three names appear in every conversation about local language models: Ollama, vLLM and llama.cpp. At first glance all three look like competitors - in fact they occupy different layers of the same stack. llama.cpp is the fundamental C++ library that loads a model into memory and runs inference on CPU, GPU, Metal or Vulkan. Ollama is a Go-based comfort layer on top of llama.cpp that adds model management, a REST API and a repository. vLLM is a standalone Python runtime from UC Berkeley designed for high-throughput GPU production serving - different codebase, different optimisations.

The choice answers three hard questions. First: how much setup effort do you invest? Ollama starts with one command, vLLM needs one to two days of configuration, raw llama.cpp requires manual model conversion. Second: how many parallel requests must the server handle? Ollama hits its limits around five concurrent requests, whereas vLLM - thanks to PagedAttention and continuous batching - keeps scaling to thirty or more. Third: what hardware is available? llama.cpp is the only option for pure CPU setups, for Apple Silicon with Metal and for edge devices. vLLM requires Linux with a CUDA GPU. Ollama covers CPU+GPU and runs on Mac, Windows and Linux.

Why the choice matters

The runtime decides three cost axes that a fiduciary or law office feels directly. First: hardware cost. A Llama 3.3 70B in 4-bit format runs on an RTX 4090 (24 GB) with Ollama at around 18 tokens per second for a single request. On the same card vLLM with continuous batching aggregates roughly 120-180 tokens per second across all concurrent requests. With fifteen active users you need four servers under Ollama and one under vLLM.

Second: setup and maintenance cost. Ollama installs in thirty minutes, an 8B model is live an hour later. vLLM is not hard but unforgiving - wrong CUDA version, wrong PyTorch build, wrong tensor-parallel value and the server crashes at boot. Realistic budget: one to two person-days of setup plus ongoing attention for updates. Raw llama.cpp is deeper still - every model swap requires a GGUF quantisation run, every API endpoint its own implementation.

Third: integration cost. As of May 2026 all three runtimes ship an OpenAI-compatible API - Ollama on port 11434, vLLM on 8000, llama.cpp via its built-in server. That makes LiteLLM, n8n and LangChain work without adaptation. But anyone who wants to host several models per server (Llama-70B for client chat, Phi-4 for fast triage) hits Ollama on its one-loaded-model-per-process limit, while vLLM and llama.cpp allow multi-model setups - with a little more configuration discipline.

The three runtimes in detail

Ollama (Go, MIT, de-facto standard as of May 2026): a single command - ollama run llama3.3 - loads the model, starts the daemon and opens the API. Configuration lives in a simple Modelfile; the repository at ollama.com/library covers Llama, Mistral, Qwen, Gemma, Phi and over 100 more models. Quantisation happens automatically in the background. Mac with Metal acceleration, Windows via WSL2 or native, Linux with CUDA or ROCm. Important for SME setups: Ollama keeps a model in memory during idle time and unloads it after a configurable wait - that conserves VRAM in multi-model use.

vLLM (Python+CUDA, Apache 2.0, Linux only): the production answer to throughput problems. Two core optimisations lift vLLM above the rest. PagedAttention manages the KV cache the way an operating system manages virtual memory pages - no waste, long contexts are feasible. Continuous batching appends new requests to an in-flight generation instead of serialising them. Result: 80-90 percent GPU utilisation instead of the 30-40 percent typical for Ollama. Price: vLLM requires a compatible CUDA version, a GPU with at least 16 GB VRAM, and Linux. No Mac, no easy Windows path.

llama.cpp (C++, MIT, all platforms): the library underneath. Written by Georgi Gerganov, originally for Apple Silicon inference. GGUF format, which has become the de-facto standard for quantised models - 4-bit, 5-bit, 8-bit quantisation with measurable quality trade-offs. Backends for CPU with AVX2/AVX-512, CUDA, Metal, Vulkan and ROCm. A built-in HTTP server delivers OpenAI-compatible endpoints. Anyone chasing maximum hardware efficiency on unusual hardware (Apple M3 Max, old Xeon servers, ARM servers) builds on llama.cpp directly - and accepts that every convenience has to be built by hand.

Selection in 5 steps

  1. 01Hardware inventory: which box is available? CPU only, Apple Silicon, NVIDIA GPU with how much VRAM?
  2. 02Estimate concurrent load: 1-5 parallel users = Ollama; 10-30 = vLLM; edge / Apple = raw llama.cpp.
  3. 03Model requirement: 7B-13B for standard triage = any runtime; 70B for serious tasks = GPU with 24+ GB VRAM or 4-bit quantisation.
  4. 04PoC with Ollama: install in two days, load an 8B model, test five typical client queries. Learn before any production migration.
  5. 05Production migration only on proven load problems: vLLM pays off only beyond sustained eight concurrent requests, not on spikes.

Recommendation by scenario

Fiduciary 5-15 people, mixed load, one server: Ollama. Setup effort is one hour, the OpenAI API works from day one, LiteLLM integration without a code change. On a workstation with an RTX 4090 or a Hetzner GEX44 (RTX 6000 Ada), Ollama serves up to five parallel users with Llama 3.3 70B in 4-bit without visible latency issues. The typical May-2026 default in Swiss SMEs.

Law firm or hosting provider with 30+ active users: vLLM. The switch pays off from about ten parallel requests per second. On an A100 80 GB, vLLM with Llama 3.3 70B in FP8 quantisation delivers about 400-500 tokens per second aggregated - four to five times more than Ollama on the same card. Setup one to two days, then stable for the long haul. Requires Linux, CUDA 12.x, GPU with 24+ GB VRAM.

Apple Silicon, edge hardware, unusual architectures: llama.cpp. On a MacBook Pro M3 Max with 64 GB unified memory, Llama 3.3 70B in 4-bit quantisation runs at roughly 8-12 tokens per second - enough for solo use without a GPU server. Anyone fitting an ARM box, industrial PC or Raspberry Pi 5 with a local LLM has no serious option besides llama.cpp.

Multi-model on one server: vLLM with a model tower or several llama.cpp instances on different ports. Ollama does not support that cleanly - it loads multiple models, but only one stays active in VRAM.

Maximum reasoning throughput on long contexts: vLLM with PagedAttention. Anyone who regularly uses 128k-token contexts (long contracts, several case files in parallel) feels the difference to Ollama clearly here.

When none of these runtimes fits

If the cloud variant is enough and no compliance argument speaks against Anthropic-EU, OpenAI-EU or Mistral La Plateforme, a local runtime is the wrong lever. Rule of thumb: below about 5 million tokens per month the GPU investment does not amortise. An RTX 4090 costs around CHF 2200, an A100 80 GB over CHF 18000 - many tokens at Anthropic API rates.

If the use-case profile is mostly short, isolated requests with long idle gaps between them, GPU utilisation is poor and the setup effort is not justified. vLLM shines only under sustained high concurrent load, and Ollama has no throughput edge over a cloud API for single requests.

If the reasoning level of current proprietary models (the current top Claude model, the current top GPT model) is strictly required - certain legal research tasks, complex tax decisions - local open-weight models in May 2026 do not reach it. Llama 3.3 70B and Qwen 2.5 72B are excellent for standard tasks but still trail the frontier models on multi-step reasoning. Anyone needing the maximum runs a hybrid strategy: local runtime for 80 percent of requests, cloud API for the other 20 percent.

Trade-offs

STRENGTHS

  • Ollama: easiest entry on the market - one hour from download to production
  • vLLM: up to 20x more throughput on the same GPU thanks to PagedAttention and continuous batching
  • llama.cpp: only option for Apple Silicon, ARM, edge hardware and CPU-only setups
  • All three: open source, OpenAI-compatible API, no token cost after the hardware investment

WEAKNESSES

  • Ollama: throughput ceiling around 5 concurrent requests, no real multi-model hosting
  • vLLM: Linux+CUDA only, one to two days of production setup, less out-of-the-box model variety
  • Raw llama.cpp: no comfort layer, every quantisation and every API endpoint built by hand
  • All three: as of May 2026, local models still trail the current top Claude model / the current top GPT model on complex reasoning

FAQ

Is llama.cpp really under Ollama?

Yes. Ollama is a Go wrapper that loads llama.cpp as a shared library and adds a model manager, REST API and configuration tools around it. Switching from Ollama to raw llama.cpp leaves model performance essentially unchanged - what disappears are convenience features such as automatic model updates and the ollama-pull repository.

Can I switch from Ollama to vLLM without code changes?

If your code addresses the OpenAI API via LiteLLM, LangChain or your own SDK: yes, largely. You change the base URL and possibly the model name. Watch model-specific parameters (tool calling, JSON mode) - vLLM and Ollama implement them differently. As of May 2026 compatibility is good but not one hundred percent. One day of integration testing is realistic.

How many tokens per second are realistic?

Example figures from May 2026 for Llama 3.3 70B in 4-bit quantisation on an RTX 4090: Ollama about 15-22 tokens per second for a single request. vLLM with continuous batching on the same card about 100-140 tokens per second aggregated across all parallel requests. On a MacBook Pro M3 Max via llama.cpp about 8-12 tokens per second. On an A100 80 GB via vLLM in FP8: 400-500 tokens per second aggregated.

What about Llama 4 MoE?

As of May 2026 all three runtimes support Llama 4 Scout (17B active parameters) in production. Llama 4 Maverick (109B active, over 400B total) runs on a single H100 80 GB with Ollama in 4-bit, and across two H100s with vLLM via tensor parallelism. llama.cpp has had stable MoE support since early 2026.

Related topics

LOCAL LLM RUNTIMES - COMPARISONLocal LLM runtimes compared: Ollama, vLLM, llama.cpp, LM Studio, LocalAI, TGI, GPT4All, KoboldCpp, Jan, OpenLLMOLLAMA · TECHOllama: local LLMs on your own hardware – where it works and where it does notSELF-HOSTED OLLAMA · LLM PROVIDERSelf-hosted Ollama as an LLM provider: when does it replace OpenAI, Anthropic or Gemini?SELF-HOSTED VS. CLOUD · AI CONCEPTSelf-hosted vs. cloud LLM: a decision framework for SMEs and fiduciariesQUANTISATION · AI CONCEPTWhat is quantisation? Compressing model weights without quality loss

Sources

  1. Ollama - official documentation & model library · 2026-05
  2. vLLM - high-throughput LLM serving (docs) · 2026-05
  3. llama.cpp - GitHub repository · 2026-05
  4. Kwon et al. - PagedAttention paper (vLLM) · 2023-09

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call