LOCAL LLM RUNTIMES - COMPARISON
Local LLM runtimes compared: Ollama, vLLM, llama.cpp, LM Studio, LocalAI, TGI, GPT4All, KoboldCpp, Jan, OpenLLM
Ten serious runtimes for locally operated language models, from hobby desktop to production GPU serving. Decision matrix as of May 2026.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is a local LLM runtime?
A local LLM runtime is the software layer that loads an open-weight language model (Llama, Mistral, Qwen, DeepSeek) on your own hardware, keeps it in memory and answers requests. Without that layer, the model is just several gigabytes of weight files on disk. The runtime turns those files into a productive service with HTTP API, token streaming, batch processing and multi-user routing.
As of May 2026, about ten serious options exist with clearly different profiles. Some target hobby desktops and Mac notebooks (LM Studio, GPT4All, Jan). Some are production-grade server solutions with throughput optimisation (vLLM, Text Generation Inference, OpenLLM). Some sit in between and cover both worlds (Ollama, LocalAI, KoboldCpp). And one - llama.cpp - is the fundamental C library that most of the others build on.
For Swiss fiduciary and law offices, runtime choice is not a matter of taste. It decides three hard factors: first, whether the setup runs on existing servers (CPU-only Hetzner AX41) at all, or whether a GPU is required. Second, whether OpenAI API compatibility allows existing tools such as LiteLLM, n8n and LangChain to plug in directly. Third, whether the solution scales when two test users grow to thirty productive users - or whether a runtime swap is required as soon as the first load peaks hit.
Why the choice matters
Three hard axes decide suitability: hardware profile, throughput need and integration depth. Pick the wrong runtime and you pay either with unused GPU, or with overloaded CPU servers, or with a code migration six months later.
Hardware profile: a 7B model such as Llama 3.3 8B or Phi-4 runs with 4-bit quantisation on a modern CPU with 16 GB RAM. That is exactly what llama.cpp, Ollama and LM Studio target. As soon as 70B models (Llama 3.3 70B, Qwen 2.5 72B) or production throughput are required, no path avoids a GPU with at least 24 GB VRAM - and therefore vLLM or Text Generation Inference.
Throughput: a hobby runtime such as LM Studio processes one request after another. A production runtime such as vLLM handles dozens of parallel requests per second on the same GPU via continuous batching and PagedAttention. The difference is not 2x but more like 10x to 20x. For a firm with fifteen active staff, this factor decides whether one server is enough or four are needed.
Integration depth: those who deploy Ollama, vLLM, LocalAI or OpenLLM get an OpenAI-compatible API from day one. That makes LiteLLM routing, n8n nodes and LangChain work without adaptation. A hobby runtime with its own API forces adapter coding - three to five extra days every time a new tool is added.
The ten runtimes in detail
Ollama (Go, MIT, all OS): the simplest start in the market. One command, the model launches, an OpenAI-compatible API is on port 11434. GGUF format, automatic model management, well-documented library with Llama, Mistral, Qwen, Gemma, Phi. As of May 2026 Ollama is the de-facto standard for local LLMs in fiduciary setups.
vLLM (Python/CUDA, Apache 2.0, Linux with GPU): the production answer to high throughput. PagedAttention avoids the classic memory waste with long contexts, continuous batching lifts GPU utilisation from 30% to over 80%. For a server with thirty concurrent users, vLLM is up to 20x more efficient than Ollama. Linux only, GPU only, no easy Mac support.
llama.cpp (C/C++, MIT, all platforms): the original library. Ollama, LM Studio and KoboldCpp all sit on llama.cpp. Very portable, runs on CPU, CUDA, Metal, ROCm, Vulkan. Those who want maximum control and minimal footprint build directly on it - price: less comfort, no model manager, manual command line.
LM Studio (Electron, proprietary free, desktop): hobby tool with a graphical UI. Mac, Windows, Linux. Models from Hugging Face with one click, chat interface built in, local API server for development. Good for exploration, demos, personal use. Not built for multi-user production.
LocalAI (Go, MIT, Docker): OpenAI-API-compatible all-rounder. Not only LLMs but also TTS, STT, vision and embeddings under a single API. Whoever needs several modalities locally (speech-to-text for lawyer dictations plus LLM summary) is well served by LocalAI. A bit more configuration effort than Ollama, but broader.
Text Generation Inference (TGI, Rust+Python, Apache 2.0, Hugging Face): production serving straight from the Hugging Face universe. Tensor parallelism, FlashAttention, quantisation. Very good performance, clean docs, compatible with all HF models. Slightly less community momentum than vLLM in 2026, but stable on rare architectures.
GPT4All (C++, MIT, Nomic AI, desktop): hobby tool focused on beginners. Built-in model list, chat UI, optional API mode. Similar in spirit to LM Studio, slightly smaller model selection but leaner.
KoboldCpp (C++, AGPLv3, self-host): llama.cpp fork with its own web UI, aimed at roleplay and story generation. Of little relevance for fiduciary work but with a strong community for creative use cases. Note the AGPL licence - stricter than MIT.
Jan (Electron, AGPLv3, desktop): open alternative to LM Studio. Mac, Windows, Linux. Model browser, chat UI, local API server. As of May 2026 still younger than LM Studio, but gaining ground with users who want to avoid proprietary tools.
OpenLLM (Python, Apache 2.0, BentoML): production self-host with BentoML integration. OpenAI-compatible, good in multi-model setups (several models on one server). A bit more complex than Ollama, but with production features such as health checks, metrics and batch APIs.
Runtime selection in 6 steps
- 01Hardware inventory: GPU present? If yes, VRAM size? If no, how much RAM in the CPU box?
- 02Estimate concurrent load: how many parallel requests expected? Up to 5, Ollama is fine; from 10 on, vLLM or TGI pays off.
- 03Model requirement: 7B model enough (Phi-4, Llama 3.3 8B) or 70B? 70B needs a GPU with at least 48 GB, or 4-bit quantisation + 24 GB.
- 04API compatibility: do you need OpenAI API format for LiteLLM, n8n, LangChain? If yes: Ollama, vLLM, LocalAI, TGI, OpenLLM. Hobby tools only for solo use.
- 05Check multi-modality: do you also need transcription, TTS, vision on the same box? If yes: LocalAI.
- 06Run a PoC: two weekends, install Ollama, load an 8B model, walk through five typical client questions. Only then go productive.
Recommendation by use case
Fiduciary office, 5-15 people, one server, mixed load: Ollama. Quickly set up, OpenAI API available, runs on a workstation with a 24 GB GPU or even CPU-only for smaller models. Default choice in Switzerland in May 2026.
Law firm 30+ people, dedicated GPU server, high concurrent load: vLLM. Pays off from about ten parallel requests per second. Setup effort one to two days, then long-term stable. Requires Linux and CUDA.
Multi-modal local (LLM + transcription + TTS) on one box: LocalAI. Transcribe lawyer dictations via Whisper, summarise with Llama 3, TTS for client callbacks - all in one Docker container, one API.
Personal exploration, Mac notebook, no server: LM Studio or Jan. Click-and-go, models from Hugging Face, ideal to answer "is local even worth it" in one hour.
Maximum hardware leverage, embedded or edge: llama.cpp directly. Whoever pushes a Wi-Fi router, industrial PC or Apple Silicon to peak performance builds on llama.cpp and skips the comfort layer.
Hugging Face-centric team, many exotic models: TGI. Whoever constantly tries new architectures (MoE, multi-vision, long contexts) benefits from the Hugging Face ecosystem.
Production platform with several models per server: OpenLLM. If one server should host Llama-70B for client chat and Phi-4 for fast triage at the same time, the BentoML model from OpenLLM fits better than Ollama.
When a local runtime is wrong
If you have no GPU, no server admin and only occasional LLM use by two or three staff, a cloud API from Anthropic, OpenAI or Mistral is more economical than buying a CHF 3000 GPU plus power. Rule of thumb: local pays off only from about 5 million tokens per month - below that, the effort does not amortise.
If the use case needs same-day currency (market research, live news, latest case law) and no internal documents, a cloud API with a search tool or the Perplexity API is the right path - no local model solves this without RAG.
And: if the compliance argument is the only reason for local, look at EU hosting (Mistral La Plateforme, a Hetzner-deployed Ollama instance). Whoever only needs "data in EU/CH" and not "data in our own rack" saves themselves the operations overhead.
Trade-offs
STRENGTHS
- Full data control - no external API calls for client data
- No token costs after the hardware investment
- Offline-capable, no internet risk
- OpenAI-compatible with Ollama/vLLM/LocalAI/TGI/OpenLLM - existing tools just work
WEAKNESSES
- GPU investment or capable CPU required - from CHF 2000 to over CHF 20000
- Operations overhead: updates, backups, monitoring, model swaps
- Local models still trail the current top Claude model / the current top GPT model on reasoning (as of May 2026)
- Learning curve: two to five days of setup plus ongoing maintenance
FAQ
Is Ollama enough for a 10-person fiduciary?
Yes, in most cases. On a workstation with an RTX 4090 (24 GB VRAM) or a Hetzner GEX44 GPU box, Ollama runs Llama 3.3 70B in 4-bit quantisation. Up to about 5 parallel requests latency is acceptable (3-8 seconds). Beyond that, a switch to vLLM pays off - on the same hardware vLLM reaches 4-5x more throughput.
How do Ollama and llama.cpp differ?
Ollama is a comfort layer over llama.cpp. It supplies a model manager, OpenAI API server, automatic quantisation and a repository (ollama.com/library) out of the box. Pure llama.cpp is the C library underneath - faster and leaner, but every model download, every quantisation, every API server has to be managed by hand. For production: Ollama. For embedded or maximum hardware leverage: llama.cpp.
How many tokens per second are realistic?
Heavily dependent on model size and hardware. Example figures from May 2026 for Llama 3.3 8B in 4-bit quantisation: Apple M3 Max (Metal) via llama.cpp about 60-80 tokens/s for a single request. RTX 4090 via Ollama about 100-130 tokens/s. RTX 4090 via vLLM with batching about 400-600 tokens/s aggregated across all parallel requests. For Llama 3.3 70B in 4-bit quantisation: RTX 4090 via Ollama about 15-22 tokens/s; vLLM on two A100s with tensor-parallel about 90-130 tokens/s aggregated.
Which runtime supports Llama 4 MoE?
As of May 2026, vLLM, Text Generation Inference and Ollama support Llama 4 Scout (17B active) in production. Llama 4 Maverick (109B active) runs on a single H100 (80 GB) with 4-bit Ollama, and across two H100s with vLLM. llama.cpp has had MoE support since early 2026. LM Studio and Jan are catching up. KoboldCpp and GPT4All are not there yet.