LLAMA.CPP · TECH

llama.cpp: the portable C/C++ inference library under Ollama, LM Studio and KoboldCpp

llama.cpp is the MIT-licensed base library for local language models. Runs on every platform – CPU, CUDA, Metal, ROCm, Vulkan. GGUF format standard.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is llama.cpp?

llama.cpp is an inference library for large language models, written in pure C/C++ without external runtime dependencies. Started in March 2023 by Georgi Gerganov, the project is published under the MIT licence at github.com/ggerganov/llama.cpp. As of May 2026 it has over 70,000 GitHub stars, more than 1,000 contributors, and counts as the reference implementation for local LLM inference on heterogeneous hardware.

The key point about llama.cpp: it is the library under the library. Ollama uses llama.cpp as its inference engine. LM Studio is built on it. KoboldCpp is a direct fork. GPT4All uses an adapted variant. Whoever installs a comfort layer like Ollama gets llama.cpp implicitly. Whoever wants maximum control over quantisation, compilation and hardware leverage goes directly to llama.cpp.

The supported hardware range is unrivalled in the market. CPU inference on x86-64 (AVX2, AVX512, AMX), ARM (NEON, SVE, Apple Silicon with AMX kernel), POWER and even RISC-V. GPU acceleration via CUDA (NVIDIA), Metal (Apple), ROCm (AMD), Vulkan (vendor-neutral, runs on Intel Arc and older AMD cards too), SYCL (Intel), MUSA (Moore Threads). As of May 2026, llama.cpp also experimentally supports Qualcomm Hexagon NPU for smartphone inference.

The GGUF format (GPT-Generated Unified Format) was defined in 2023 by the llama.cpp project and is by May 2026 the de-facto standard for quantised open-weight models on Hugging Face. Every model intended for local use – Llama, Mistral, Qwen, DeepSeek, Apertus, Phi-4 – exists in GGUF quantisations Q2_K, Q3_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0 and FP16. Q4_K_M is the established sweet spot with at most 2-3 percent quality loss at a file size of roughly 60 percent of FP16.

Version May 2026: b3500+ (the project uses rolling build numbers rather than semver). Active development with several releases per week.

Why llama.cpp matters for Swiss data

Four reasons make llama.cpp important for Swiss data processing – even when most offices end up using Ollama.

First: hardware independence. llama.cpp runs on truly any platform. A fiduciary office that wants to run LLM-based receipt classification on an old Hetzner server without GPU is better served by llama.cpp than by any GPU-centric solution. A lawyer with a Mac Mini M2 as personal workstation gets productive speeds via Metal acceleration (50-80 tokens/s on Llama 3.3 8B in Q4_K_M).

Second: predictable behaviour. llama.cpp is compiled by you – the binary is a single self-contained executable without Python runtime, without Docker, without external libraries. Relevant for compliance audits: what is in the code is what is on the machine. No surprise updates, no hidden telemetry connections. For clients under professional secrecy per Art. 321 SCC, this transparency is worth more than convenient auto-update logic.

Third: quantisation control. Using the llama.cpp tool quantize, an original model from Hugging Face (such as Apertus-70B in FP16) can be converted into any quantisation – Q2_K for minimal RAM, Q4_K_M for the standard trade-off, Q8_0 for maximum precision on less RAM-efficient hardware. Ollama serves only a few standard variants; with pure llama.cpp everything is possible.

Fourth: embedded and edge. Anyone needing LLM-based claims triage in a Swiss pharma device (without internet connectivity), setting up a FINMA-compliant local reasoning system in a bank branch, or building a forensic hospital server with local diagnostic support without cloud goes directly to llama.cpp. Footprint requirements here are so strict that any extra layer (Python, Go daemon, Docker) is excluded.

Fifth, often overlooked: GGUF format knowledge is durable. Whoever works with llama.cpp learns the format and the quantisation methodology. This knowledge outlives any comfort layer – if Ollama becomes outdated in two years or LM Studio changes its licence, llama.cpp and GGUF remain usable.

How llama.cpp works technically

llama.cpp has three layers: a C library (libllama), a set of CLI tools and an HTTP server (llama-server).

Build example. On a Linux server with NVIDIA GPU the build looks like this:

``` git clone https://github.com/ggerganov/llama.cpp cd llama.cpp cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release -j 16 ```

This produces the tools in build/bin/: llama-cli (interactive chat), llama-server (OpenAI-compatible HTTP server), llama-quantize (quantisation), llama-bench (performance measurement), llama-perplexity (quality scoring).

Server launch. A productive server with Apertus 8B in Q4_K_M:

``` ./build/bin/llama-server \ -m models/apertus-8b-q4_k_m.gguf \ --host 0.0.0.0 --port 8080 \ --ctx-size 8192 \ --n-gpu-layers 32 \ --api-key sk-fairlane-local ```

The interface is OpenAI-compatible on POST /v1/chat/completions plus a native endpoint /completion. This makes llama-server a direct alternative to Ollama, with the difference that no model manager runs alongside.

Quantisation in detail. The tool llama-quantize converts a model from FP16 into any target quantisation. Q4_K_M uses 4 bits per weight plus an extra 5-6 bits for scaling factors in K-quants – result: an 8B model shrinks from 16 GB FP16 to around 4.8 GB Q4_K_M. For larger models (Apertus 70B) the absolute saving is dramatic: 140 GB FP16 to 42 GB Q4_K_M.

Hardware-specific builds. Compile flags decide performance. GGML_CUDA=ON for NVIDIA, GGML_METAL=ON on Apple Silicon, GGML_HIPBLAS=ON for AMD ROCm, GGML_VULKAN=ON as a vendor-neutral GPU option, GGML_SYCL=ON for Intel Arc. On x86 CPUs every SIMD instruction counts: GGML_AVX512=ON, GGML_AMX_INT8=ON on the newest Intel Xeon generations.

Tool use and function calls. llama-server supports native tool calls as of May 2026 for models with corresponding training (Llama 3.3+, Mistral Small 3.1+, Qwen 3, Apertus 70B-Instruct). The interface matches the OpenAI specification.

Speculative decoding. A feature not directly available in Ollama: llama-server can run a small "draft model" in parallel that proposes tokens, which the large model only verifies. With a fitting model combination (such as Apertus 8B as draft for Apertus 70B), throughput rises 30-60 percent with no quality loss.

llama.cpp to production in 5 steps

01Prepare the build environment: cmake, C++ compiler (gcc 12+ or clang 16+), and per-hardware CUDA toolkit / ROCm / Metal / Vulkan SDK.
02Clone the repository and compile hardware-specifically: cmake -B build -DGGML_CUDA=ON or -DGGML_METAL=ON, then cmake --build build -j.
03Get the model: load a GGUF variant from Hugging Face (TheBloke, bartowski or a direct model repository like swiss-ai/Apertus-8B-GGUF), Q4_K_M as default.
04Launch llama-server with -m model, --host 0.0.0.0, --ctx-size matching the use case (4096 for chat, 32768 for document analysis), --n-gpu-layers as high as feasible.
05Set up a systemd service for auto-restart, run llama-bench for a performance baseline, point a Prometheus scraper at the /metrics endpoint.

When to use llama.cpp

llama.cpp is the right choice when (a) maximum hardware control is desired, (b) embedded or edge setups without Docker are needed, (c) a specific quantisation is required, or (d) Mac notebook with Apple Silicon is the target system.

Concrete cases: fiduciary office with an old CPU-only Hetzner server that should run LLM classification without GPU upgrades – llama.cpp with AVX2 optimisation and a 7B model in Q4_K_M handles 100-500 classifications per hour. Law firm with Mac Studio M3 Ultra as power-user workstation – llama.cpp with Metal build reaches 90-130 tokens/s on 8B models and 30-45 tokens/s on 70B models in Q4. Hospital diagnostic support without internet – llama.cpp as single binary on an embedded Linux device.

For performance tuning, llama.cpp is also important: anyone wanting to improve Ollama performance learns the levers here – context size, GPU layer split, batch size, thread count, speculative decoding. These insights transfer to every comfort layer above.

When not to use

For the classical productive multi-user server in a fiduciary with 10-30 staff, Ollama is the more convenient path. Model management, auto-update, persistent hot-loaded models – all included. Pure llama.cpp demands more discipline.

For high throughput requirements on GPU (more than 10 parallel requests per second), vLLM or Text Generation Inference is the right answer. llama-server handles parallel requests but nowhere near as efficiently as vLLM with PagedAttention.

For Windows server setups in regulated Swiss banks – where corporate OS is the standard and Linux is not approved – llama.cpp can be built but the maintenance effort on Windows is high. A cloud API with FADP contract or a Windows-certified commercial solution is cleaner here.

For hobby exploration without compile experience, llama.cpp is the steeper learning curve. LM Studio or Ollama offer the same llama.cpp core with a graphical interface and model browser.

Trade-offs

STRENGTHS

Runs on any hardware – CPU, NVIDIA, AMD, Intel, Apple, RISC-V, even smartphones
Single-binary deployment with no external runtime dependencies
Full quantisation control from Q2 to FP16
GGUF is the de-facto standard for local model distribution as of May 2026

WEAKNESSES

No integrated model manager – models must be managed manually
Multi-user throughput below vLLM level, no continuous batching
Compile step demands a small amount of DevOps discipline
High release frequency requires an update strategy

FAQ

Why use llama.cpp instead of Ollama?

Three reasons: maximum hardware control, custom quantisations, and single-binary deployment without Docker. Whoever wants maximum performance on an Apple Silicon Mac compiles llama.cpp with -DGGML_METAL=ON and gains 15-25 percent over Ollama defaults. Whoever wants to run an LLM on an embedded Linux device has a single file with no runtime dependencies with llama.cpp.

What is the difference between Q4_K_M and Q5_K_M?

Q4_K_M uses 4-bit quantisation with the K-quants method – file size around 60 percent of FP16, quality loss under 3 percent in typical benchmarks. Q5_K_M uses 5-bit quantisation – file size around 70 percent of FP16, quality loss under 1 percent. For 7B models, Q4_K_M is the usual sweet spot; for 70B models where every percent counts, Q5_K_M or Q6_K is worth it if the extra RAM is available.

Does llama.cpp run on Raspberry Pi?

Yes. On a Raspberry Pi 5 with 8 GB RAM, Phi-3 mini (3.8B) in Q4_K_M runs at around 4-7 tokens per second. Apertus 8B is possible with active swap but not productively fast. Realistic use cases on Raspberry: small classification jobs, local smart-home assistance, embedded demo setups. For serious fiduciary or law-firm use, a Pi is not enough.

Does llama.cpp support Llama 4 and Apertus?

Yes, both. Llama 4 Scout and Maverick have been productively supported since build b3200 (mid-April 2026), including the MoE architecture. Apertus 8B and 70B have been supported since September 2025, with specific optimisations for the Romansh vocabulary parts. GGUF quantisations of both models are available on Hugging Face.

Sources

llama.cpp – official GitHub repository and documentation · 2026-05
llama.cpp releases and build matrix · 2026-05
GGUF format specification (Hugging Face) · 2026-04
GGUF quantisation overview and benchmarks · 2026-03

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call