
What is quantisation? Compressing model weights with minimal quality loss

Quantisation stores model weights in fewer bits. Q4_K_M shrinks Llama-70B from 140 GB to 42 GB at under 2% quality loss.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is quantisation?

Quantisation is a technique that stores neural network weights with fewer bits per number. A typical large language model is trained in FP32 (32-bit floats) or FP16/BF16 (16-bit floats). Quantisation converts these weights after training into smaller data types: INT8 (8 bits), INT4 (4 bits), or mixed schemes like Q4_K_M that store different parts of the model at different bit depths.

The idea: a language model is robust to small perturbations of its weights. A weight that training set to 0.347291 delivers an almost identical result when stored as 0.35 or even 0.3. So systematically reducing precision shrinks the model at a small quality cost.

In May 2026 quantisation is standard for inference, not a special trick. Llama 3.1 70B in FP16 occupies 140 GB (70 bn params × 2 bytes). In Q4_K_M it is 42 GB, so it runs on a single A100 80GB GPU instead of a cluster. On consumer hardware, Llama 3.1 8B in Q4_K_M is 4-5 GB and runs on an M2 MacBook with 16 GB RAM without issue. Llama 3.1 70B in Q4_K_M can run on 2x RTX 4090 (24 GB each, which leaves little room for context) or a Mac Studio with 64 GB+ RAM.
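The sizes above follow from simple arithmetic: parameters × bits per weight ÷ 8. A minimal sketch, assuming an effective 4.8 bits per weight for Q4_K_M (the value implied by 42 GB for 70 bn parameters) and ignoring KV cache and runtime overhead; the function name is illustrative:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# FP16 uses 16 bits per weight; 4.8 bits for Q4_K_M is an assumption
# derived from the article's figures, not an official constant.
fp16_70b = model_size_gb(70e9, 16)    # about 140 GB
q4km_70b = model_size_gb(70e9, 4.8)   # about 42 GB
q4km_8b = model_size_gb(8e9, 4.8)     # about 4.8 GB

for name, size in [("70B FP16", fp16_70b), ("70B Q4_K_M", q4km_70b), ("8B Q4_K_M", q4km_8b)]:
    print(f"{name}: {size:.1f} GB")
```

Actual memory use is higher than these numbers because the context (KV cache) and the runtime need room as well.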

The main quantisation formats and methods in May 2026: GGUF (the file format of llama.cpp and Ollama, very widely supported, with Q levels from Q2_K to Q8_0), AWQ (Activation-aware Weight Quantization, Lin et al. 2023, good for vLLM inference), GPTQ (Frantar et al. 2022, older but still used in vLLM and text-generation-inference), and BitsAndBytes (NF4 for training, INT8 for inference, well integrated into Hugging Face Transformers).

Why it matters now

Quantisation solves the hardware problem for self-hosted models. Three concrete effects.

First: self-hosting becomes affordable. Without quantisation a 70B-parameter model needs several A100/H100 GPUs, an investment of CHF 100,000+. With Q4_K_M one A100 80GB or two consumer GPUs (2x RTX 4090, total CHF 4000-5000) suffice. For an SME that must process client data exclusively on-prem (professional secrecy SCC 321, revFADP sensitivity, or simply a Swiss hosting preference), quantisation is the lever that makes running its own model economical.

Second: cloud inference gets cheaper. Cloud providers also use quantisation. Together.ai, Replicate, Groq and others offer Llama 3.1 70B in a Q4 variant at CHF 0.0005-0.002 per 1000 tokens in May 2026, 5-10x cheaper than the FP16 variant. For applications where slight quality loss is acceptable (classification, triage, routine answers), quantised cloud inference is a concrete cost optimisation.

Third: edge capability. Quantised models run on mobile devices and embedded setups. As of May 2026 Llama 3.2 3B in Q4 runs on an iPhone 15 Pro or a current Android flagship (15-30 tokens/second generation). This enables offline AI in apps, data-sovereign mobile solutions, or local preprocessing before a cloud call.

Quality trade-off is small but not zero. Standard benchmarks May 2026: Q8_0 loses under 0.5% accuracy vs FP16. Q4_K_M loses 1-2%. Q4_0 (older simple 4-bit scheme) loses 3-5%. Q3_K_M loses 5-10% and is sensible only for very restrictive hardware. Q2_K is mostly over-quantised – model quality drops noticeably.

Important distinction: inference quantisation (described here) is post-training and does not change the original model. Training quantisation (QLoRA, NF4 adapters) is a different discipline and concerns fine-tuning (see was-ist-fine-tuning-vs-rag).

Technology in detail

Quantisation follows a simple core principle with various refinements.

Core principle – linear quantisation. Each weight block is mapped to a range [min, max]. The range is divided into 2^bits evenly spaced levels, and every weight is rounded to the nearest level and stored as a small integer code. At inference the codes are mapped back to approximate floating-point values (dequantisation); the rounding error itself is not recovered, and the multiplication happens in higher precision. This works because neighbouring weight values produce similar effects and small rounding errors partly cancel.
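The core principle can be sketched in a few lines. This is a minimal illustration of min/max linear quantisation in plain Python, not the code of any particular library; the function names and the sample weights are made up for the example:

```python
def quantize_linear(weights, bits):
    """Map floats to integer codes 0 .. 2**bits - 1 over the range [min, max]."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1                 # number of intervals between levels
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantize_linear(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


weights = [0.347291, -0.12, 0.9, -0.75, 0.05]
codes, lo, scale = quantize_linear(weights, bits=4)
restored = dequantize_linear(codes, lo, scale)

# Each restored value is off by at most half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_error:.4f}, step {scale:.4f}")
```

Only the integer codes plus `lo` and `scale` need to be stored, which is where the memory saving comes from.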

Block-wise quantisation. Instead of one min/max for the whole model, each block (e.g. 32 or 64 weights) stores its own min/max. This reduces quantisation error on outliers at the cost of a little memory for block scales. Q4_K_M uses this.
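Why per-block ranges help is easiest to see with a single outlier. The following sketch compares one global range against blocks of 32 weights; it is a simplified illustration (the real Q4_K_M layout is more involved), and the random weights and the outlier value are invented for the demonstration:

```python
import random


def roundtrip(block, bits):
    """Quantise a block over its own [min, max] and map it straight back."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((w - lo) / scale) * scale for w in block]


def blockwise_roundtrip(weights, bits, block_size):
    """Apply roundtrip() independently to consecutive blocks."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(roundtrip(weights[i:i + block_size], bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.uniform(-0.1, 0.1) for _ in range(64)]
weights[5] = 4.0  # one outlier stretches the global range

global_err = mean_abs_error(weights, roundtrip(weights, 4))
block_err = mean_abs_error(weights, blockwise_roundtrip(weights, 4, 32))
print(f"global range: {global_err:.4f}, block-wise: {block_err:.4f}")
```

With a global range the outlier coarsens the grid for every weight; with blocks it only affects the block it sits in.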

GGUF Q-levels. Q2_K (2 bit, most aggressive, quality loss 10%+), Q3_K_S/M/L (3 bit, 5-10% loss), Q4_0/Q4_1 (4 bit, old, 3-5% loss), Q4_K_S/M (4 bit with K-blocks, 1-2% loss – the sweet spot), Q5_K_M (5 bit, < 1% loss), Q6_K (6 bit, < 0.5% loss), Q8_0 (8 bit, < 0.3% loss). Rule of thumb May 2026: Q4_K_M is the right answer in 80% of cases. Q5_K_M when a touch more quality matters. Q8_0 only for very high quality demands.

AWQ (Lin et al. 2023). Activation-aware: considers which weights actually produce important activations and protects them from quantisation. Often slightly better quality than GGUF at equal bit depth but more complex to generate. Used in vLLM and text-generation-inference.

GPTQ (Frantar et al. 2022). Older, one of the first production quantisation schemes. Iterative optimisation of quantisation values. Still in use in May 2026 but increasingly displaced by AWQ and GGUF.

BitsAndBytes. Hugging Face integration. NF4 (4-bit normal float) primarily for QLoRA training, INT8 for simple inference quantisation. Convenient in Python inference pipelines.
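In a Python pipeline, loading a model with NF4 weights might look like the sketch below. It assumes `torch`, `transformers`, `accelerate` and `bitsandbytes` are installed and a CUDA GPU is available; `load_nf4` is an illustrative helper name, and the model ID is whatever checkpoint you have access to:

```python
def load_nf4(model_id: str):
    """Load a causal LM with 4-bit NF4 weights via bitsandbytes (sketch)."""
    # Imports live inside the function so the module loads without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4 bits
        bnb_4bit_quant_type="nf4",              # 4-bit normal float
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",
    )
```

Calling `load_nf4(...)` with a Hugging Face model ID downloads the original weights and quantises them at load time, so no separate quantised file is needed.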

Practical note. On the Ollama Hub and Hugging Face nearly every open-weight model has GGUF quantisations from a maintainer (e.g. TheBloke, Bartowski). Self-quantising is possible with llama.cpp (the `llama-quantize` tool, called `quantize` in older builds) and takes 5-30 minutes per model. In 99% of SME cases: download quantisations, do not make them yourself.

Quantisation choice in 5 steps

01 Clarify hardware budget: how much RAM/VRAM is available? Determines which model size and which quantisation level fits at all.
02 Pick model size: 8B for simple tasks and edge, 30-70B for production RAG, 70B+ for highest quality (with matching hardware).
03 Quantisation level: Q4_K_M as default. Q5_K_M when hardware has headroom. Q8_0 only with special quality demand. Q3 only when really needed.
04 Eval suite against FP16 reference: 50-200 real tasks with expected results, test both variants, quantify quality loss.
05 Production setup with monitoring: watch latency, quality score, memory footprint; on model updates rerun the eval suite.
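Steps 01 to 03 can be turned into a rough first-pass check. The sketch below is one possible reading of the rule of thumb, not a definitive selector: the bits-per-weight table holds approximate assumed values, the 90% headroom factor is an assumption to leave room for context and runtime, and `pick_level` is a hypothetical helper:

```python
# Approximate effective bits per weight; treat these as assumptions
# and check the real file sizes before deciding.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}


def pick_level(n_params, memory_gb, headroom=0.9, high_quality=False):
    """Return the highest-quality level whose weights fit, or None."""
    budget_gb = memory_gb * headroom
    # Q5_K_M if it fits, else the Q4_K_M default, else Q3_K_M as last resort.
    order = ["Q5_K_M", "Q4_K_M", "Q3_K_M"]
    if high_quality:
        order.insert(0, "Q8_0")  # only considered on explicit quality demand
    for level in order:
        size_gb = n_params * APPROX_BITS_PER_WEIGHT[level] / 8 / 1e9
        if size_gb <= budget_gb:
            return level
    return None


print(pick_level(70e9, 48))   # 2x RTX 4090
print(pick_level(70e9, 80))   # A100 80GB
print(pick_level(70e9, 24))   # single RTX 4090: nothing fits
```

The result is only a starting point for step 04; the eval suite decides whether the chosen level is good enough.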

When quantisation pays off

Four concrete application scenarios May 2026.

1. Self-hosted model on own hardware. Anyone running Ollama on their own server or hosting a model via vLLM/TGI almost always picks a quantised variant. Llama 3.1 8B in Q4_K_M (5 GB) instead of FP16 (16 GB), Llama 3.1 70B in Q4_K_M (42 GB) instead of FP16 (140 GB). Hardware cost drops 3-4x. See self-hosted-ollama-evaluation.

2. Edge / local application. A law firm wanting a local RAG assistant on every employee workstation (instead of centrally) needs a model that runs on 16-32 GB RAM. Q4-quantised 7B models are ideal. iPhone/iPad applications with a local model use Q4 or even stronger quantisation.

3. Optimising cloud inference cost. For high-volume applications (e.g. triage of 10,000 emails per day) a quantised open-source model on Together.ai or Replicate is often 5-10x cheaper than GPT-4 or Claude 4 for similar task quality. For classification and routine answers this suffices.

4. Special hardware setup. Apple Silicon (M2/M3/M4 Pro/Max/Ultra) with unified memory runs quantised models very efficiently: the GPU reads the weights directly from a large, high-bandwidth shared memory pool, and llama.cpp and Ollama use it via Metal. A Mac Studio M2 Ultra (64+ GB unified memory) is in May 2026 an attractive inference server for SMEs, delivering Llama 3.1 70B Q4 at 8-15 tokens/second.
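A back-of-the-envelope ceiling for generation speed follows from memory bandwidth. The sketch assumes that decoding is memory-bandwidth bound and that every generated token reads all weights once; the 800 GB/s figure is the published memory bandwidth of the M2 Ultra, and the function name is illustrative:

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_gb_s / model_gb


# Llama 3.1 70B on an M2 Ultra (800 GB/s unified memory bandwidth).
q4_ceiling = max_tokens_per_second(800, 42)     # Q4_K_M, 42 GB
fp16_ceiling = max_tokens_per_second(800, 140)  # FP16, 140 GB (would not fit in 64 GB)

print(f"Q4_K_M ceiling: {q4_ceiling:.1f} tokens/s")
print(f"FP16 ceiling:   {fp16_ceiling:.1f} tokens/s")
```

The 8-15 tokens/second quoted above sits below the Q4 ceiling of roughly 19 tokens/second, as expected once compute, context handling and runtime overhead are included.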

Quantisation in cloud models. OpenAI, Anthropic and Google do not publicly state whether their commercial models are served quantised – most providers probably do so partially (BF16 or FP8). As a customer you do not notice; the relevant quality factors are published benchmarks and your own eval suite, not precision specs.

When NOT to quantise

Three constellations where quantisation can cause harm.

First: full fine-tuning. Training a model with all weights needs high precision (FP16 or BF16), so apply quantisation only AFTER training. QLoRA is a different approach that does work: it combines a frozen 4-bit quantised base model with FP16 adapters.

Second: very narrow tasks with high precision needs. Mathematical computation, exact logical reasoning, code with subtle bug risks. May 2026 studies show that aggressive quantisation (Q3 and lower) reduces the accuracy of these special capabilities more than that of general language. For math- and code-critical applications Q5/Q6 or FP16 is the safer choice. Running your own eval suite before deployment is mandatory.

Third: small models (under 3B parameters). Quantisation works best on larger models – they have more redundancy and tolerate bit reduction better. Below 3B Q4 can bring noticeable quality loss; for 1B models Q4 is often already problematic. Rule of thumb: from 7B Q4_K_M unproblematic; from 30B Q3 still tolerable; below 3B use at least Q5/Q6 or FP16.

Trap: blindly trusting benchmark numbers. A quantisation with "2% quality loss" on MMLU or HellaSwag may cost 5-10% on your specific application, because benchmarks do not cover every language and task type. Running your own eval suite before choosing the quantisation level is mandatory.
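An own eval suite does not have to be elaborate. A minimal sketch, assuming exact-match scoring and two callables that wrap the FP16 reference and the quantised model; the stub predictors and the four tasks below are stand-ins for real model calls and 50-200 real tasks:

```python
def accuracy(predict, tasks):
    """Share of tasks where the model output equals the expected result."""
    correct = sum(1 for prompt, expected in tasks if predict(prompt) == expected)
    return correct / len(tasks)


def relative_quality_loss(reference_predict, quantised_predict, tasks):
    """Quality loss of the quantised model relative to the reference."""
    ref = accuracy(reference_predict, tasks)
    quant = accuracy(quantised_predict, tasks)
    return (ref - quant) / ref if ref else 0.0


# Hypothetical triage tasks: (prompt, expected label).
tasks = [
    ("invoice overdue", "billing"),
    ("reset my password", "support"),
    ("contract draft attached", "legal"),
    ("meeting on Friday?", "scheduling"),
]

# Stubs standing in for real model calls.
reference_answers = dict(tasks)
quantised_answers = {**reference_answers, "contract draft attached": "support"}

loss = relative_quality_loss(reference_answers.get, quantised_answers.get, tasks)
print(f"relative quality loss: {loss:.0%}")
```

Run the same task list against each candidate file, since two files with the same Q-level label can score differently.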

Trap: model-variant mismatch. Quantised versions are created by the community and differ. "Llama 3.1 70B Q4_K_M by TheBloke" and "Llama 3.1 70B Q4_K_M by Bartowski" can differ in performance. For production: eval on the specific file, not just on the label.

Trade-offs

STRENGTHS

60-90% memory saved – self-hosted becomes economical
Inference speed often 1.5-3x faster – smaller weights mean less data to move through memory per token
Edge deployment possible (mobile, embedded)
Cloud cost for quantised open-source models 5-10x cheaper than FP16 equivalent

WEAKNESSES

1-2% quality loss at Q4_K_M, more at more aggressive levels
Math and code tasks more sensitive to quantisation – own eval suite mandatory
Quantised model files from community sources need quality vetting
Quantisation does not replace model choice – bad 70B Q4 < good 30B Q5

FAQ

Q4_K_M or Q5_K_M – which to take?

Q4_K_M is the standard sweet spot in May 2026 – 1-2% quality loss vs FP16, 60-70% memory saved, compatible with Ollama, llama.cpp, LM Studio, all GGUF consumers. Q5_K_M costs 25% more memory (5 bits vs 4) and gives under 1% extra quality. If memory is tight: Q4_K_M. If memory is fine and quality matters: Q5_K_M. With very strong GPUs (80GB+) and medium models, Q6_K or directly FP16 pay off.

GGUF vs AWQ – when which?

GGUF if you use Ollama, llama.cpp, LM Studio, or a Mac/CPU – universally supported, easy to deploy. AWQ if you use vLLM or text-generation-inference on NVIDIA GPUs – faster throughput in server mode, better quality per bit. Rule of thumb May 2026: GGUF for single-person/SME setups, AWQ for multi-user servers in vLLM/TGI with GPU. GPTQ is older and gradually being replaced by AWQ.

Quantised embedding models – sensible?

In May 2026 yes, but differently: do not quantise the embedding model, quantise the computed vectors. The terms are binary quantisation and scalar quantisation in vector databases (see was-ist-vektor-index). Reduces RAM footprint of Qdrant/Weaviate indexes by 40-90% with minimal recall loss. Embedding MODELS themselves are rarely quantised because they are already small (typically 100-500 MB) – the memory gain would be marginal.
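What scalar and binary quantisation do to a stored vector can be sketched as follows. This is a simplified illustration in plain Python, not the implementation used by Qdrant or Weaviate; the 768-dimensional random vector and all function names are assumptions for the example:

```python
import random


def scalar_quantize(vec):
    """Map floats to int8-range codes in [-127, 127] with one scale per vector."""
    peak = max(abs(x) for x in vec) or 1.0
    scale = peak / 127
    return [round(x / scale) for x in vec], scale


def binary_quantize(vec):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def hamming(a, b):
    """Number of differing bits, used as a cheap distance on binary codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


rng = random.Random(0)
vec = [rng.gauss(0, 1) for _ in range(768)]

codes, scale = scalar_quantize(vec)
bits = binary_quantize(vec)

float32_bytes = len(vec) * 4   # 3072 bytes per vector
int8_bytes = len(codes)        # 768 bytes, one byte per dimension
binary_bytes = len(bits)       # 96 bytes, one bit per dimension
print(float32_bytes, int8_bytes, binary_bytes)
```

These byte counts cover the raw vectors only; the index structure and payloads are not modelled, so the saving for a whole index is smaller than the per-vector ratio suggests.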

Sources

GGUF – Format Specification and llama.cpp Quantize Documentation · 2026-04
Lin et al. – AWQ: Activation-aware Weight Quantization for LLMs · 2023-06
Frantar et al. – GPTQ: Accurate Post-Training Quantization · 2022-10
Hugging Face – Quantisation Guide (BitsAndBytes, GPTQ, AWQ) · 2026-05
