OLLAMA · HOW-TO
Install Ollama: step-by-step guide for Mac, Linux and Windows (May 2026)
Practical guide to installing Ollama 0.5+ on macOS, Linux and Windows including model download, REST API test, Q4_K_M quantisation, systemd setup and GPU acceleration.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is this about?
This guide takes you from a blank machine to a working local LLM server. You install Ollama in the current 0.5.x release, pull a suitable open-weight model (Llama 3.3 or Qwen3) and test the REST API on port 11434. We then cover quantisation (Q4_K_M as the sweet spot), systemd hardening for servers and the three GPU acceleration paths: CUDA for NVIDIA, ROCm for AMD, Metal for Apple Silicon.
The guide targets three audiences. First: fiduciary and law firms needing a local model for client classification or embedding generation without touching a US cloud. Second: developers wanting a reproducible LLM endpoint for tests in CI/CD. Third: SME IT leads building a proof of concept before deciding on cloud providers.
Prerequisites: a machine with at least 16 GB RAM for 7B models, 32 GB for 13B models or 64 GB for 30B models. Disk: 30-100 GB for 3-5 models. An internet connection for the initial download. Optionally an NVIDIA, AMD or Apple Silicon GPU for 5-20x faster inference. All commands are ASCII-only and were verified in May 2026.
Why this guide now?
As of May 2026 the picture has settled: Ollama, with over 90,000 GitHub stars, is the de-facto standard tool for local LLMs on single machines. The alternatives serve clearly different niches: LM Studio (GUI-only), vLLM (too complex for single boxes) or llama.cpp used directly (too close to the metal). Anyone wanting a local LLM server without building multi-GPU clusters picks Ollama.
The model landscape is mature. Llama 3.3 (Meta, December 2024) delivers GPT-4-level quality at 70B and, in Q4 quantisation, fits in 48 GB RAM. Qwen3 (Alibaba, April 2025) is the best open-weight model for multilingual cases (DE/FR/IT/EN/ZH) at 14-32B size. DeepSeek-R1 (January 2025, still current) delivers o1-level reasoning in its 7B-32B distilled variants. Phi-4 (Microsoft) is the best 14B model for mathematical tasks.
For Swiss SMEs this means: there is no technical reason left to use cloud LLMs for pure classification and extraction tasks. Hardware is a one-time cost of CHF 2,000-5,000 (Mac Studio M2 Ultra with 64 GB, or a Linux server with 64 GB RAM and an RTX 4090); after that, the only running cost is electricity. Compared with a cloud provider, the investment pays for itself in 4-9 months at medium volume.
How the setup works
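Ollama has two components: a daemon (ollama serve) that loads models into memory and answers REST requests on port 11434, and a CLI (ollama) acting as a thin client. The model directory lives at ~/.ollama/models (Linux/Mac) or %USERPROFILE%\.ollama\models (Windows). On Linux, when Ollama runs as the systemd service created by the install script, ~ refers to the home of the ollama service user, typically /usr/share/ollama, so models end up in /usr/share/ollama/.ollama/models.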
The architecture is straightforward: install a binary, pull a model with ollama pull, then talk to the local API via http://localhost:11434/api/generate or /api/chat. An OpenAI-compatible layer lives at /v1/chat/completions, so OpenAI SDKs work once their base URL points at http://localhost:11434/v1.
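To make the request flow concrete, here is a minimal Python sketch of a non-streaming call to /api/generate using only the standard library. The helper names `build_generate_request` and `generate` are illustrative assumptions, not part of Ollama; the endpoint, the port and the `response` field are the ones described in this guide.

```python
import json
import urllib.request

# Default address of the local Ollama daemon, as described in the text.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the text of the "response" field."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)["response"]
```

With the daemon running and the model already pulled, `print(generate("llama3.2:3b", "Say hello"))` should print the model's answer. Building the request separately from sending it keeps the payload easy to inspect in tests.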
Quantisation is the decisive lever for speed and RAM usage. The label Q4_K_M means 4-bit weights using the K-quants scheme in its M (medium) variant. At just under 5 bits per weight on average, it cuts weight memory to roughly a third of FP16 and accelerates CPU inference by 2-3x at a quality loss of about 1-3% versus FP16. For 7B-13B models, Q4_K_M is the sweet spot. Q5_K_M costs 15% more RAM for marginal quality gains. Q8_0 sits near FP16 quality and only matters for critical generation tasks.
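A back-of-the-envelope calculation shows how bits per weight translate into memory. The sketch below is an approximation under stated assumptions: the average bits-per-weight figures are rough values for GGUF-style quantisations, they cover the weights only (no KV cache or runtime overhead), and `weight_memory_gb` and `compare_formats` are hypothetical helpers.

```python
# Assumed average bits per weight; real model files vary by architecture.
APPROX_BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * bits_per_weight / 8


def compare_formats(params_billion: float) -> dict[str, float]:
    """Estimated weight size per format for one model size, rounded to 0.1 GB."""
    return {
        name: round(weight_memory_gb(params_billion, bits), 1)
        for name, bits in APPROX_BITS_PER_WEIGHT.items()
    }
```

For a 70B model, `compare_formats(70)` estimates about 140 GB at FP16 versus about 42 GB at Q4_K_M, which is in line with the download size quoted for the 70B Q4_K_M tag in this guide. Plan extra headroom for context and the operating system on top of these figures.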
GPU acceleration is auto-detected. NVIDIA: Ollama uses CUDA 12.x via cuBLAS – the GPU needs at least Compute Capability 5.0 (Maxwell generation and newer; a GTX 1080, for comparison, is Compute Capability 6.1). AMD: ROCm 6.x is supported on RX 6800 XT and newer; unsupported older cards typically fall back to CPU inference (slow). Apple Silicon: M1/M2/M3/M4 use Metal automatically – Apple Silicon is unusually strong for LLMs because CPU and GPU share the same unified memory.
Server operation needs systemd hardening: ollama as a dedicated Unix user, directory restrictions via ProtectSystem, network binding only to 127.0.0.1 (behind reverse proxy) or to internal IPs (reachable from other containers).
Ollama installation in 10 steps
- Step 1 – Install Mac: download the app from https://ollama.com/download and drag it to Applications. Alternative: `brew install ollama` via Homebrew. The `install.sh` script is Linux-only and does not work on macOS. Verify with `ollama --version` (expect 0.5.x or newer).
- Step 2 – Install Linux: `curl -fsSL https://ollama.com/install.sh | sh` places the binary at /usr/local/bin/ollama and starts a systemd service ollama.service under the user ollama. Verify with `systemctl status ollama` and `ollama --version`.
- Step 3 – Install Windows: download OllamaSetup.exe from https://ollama.com/download/windows and run it. The installer starts Ollama as a background app (tray icon) rather than a classic Windows service. Open PowerShell and run `ollama --version` – if the command is unknown, restart the terminal or extend PATH manually.
- Step 4 – Pull Llama 3.3: `ollama pull llama3.3:70b-instruct-q4_K_M` downloads ~42 GB. For smaller machines: `ollama pull llama3.2:3b` (2 GB) or `ollama pull llama3.1:8b` (4.7 GB). The download runs over HTTPS from the Ollama registry.
- Step 5 – Pull Qwen3: `ollama pull qwen3:14b` (8.5 GB) is the best pick for DE/FR/IT applications. For reasoning: `ollama pull deepseek-r1:14b` (9 GB). For embeddings: `ollama pull nomic-embed-text` (274 MB).
- Step 6 – Test REST API: in Terminal run `curl http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Say hello in Swiss German","stream":false}'` – the answer comes back as a single JSON object with a response field. With `"stream":true` (the default) the answer arrives as a sequence of JSON objects, one line per chunk.
- Step 7 – OpenAI-compatible endpoint: `curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hello"}]}'` – the answer comes back in OpenAI format. OpenAI SDKs work once their base URL is set to http://localhost:11434/v1; the API key can be any placeholder string.
- Step 8 – Systemd hardening (Linux server): run `sudo systemctl edit ollama` to create a drop-in override (direct edits to `/etc/systemd/system/ollama.service` can be overwritten by a reinstall) and add to the [Service] block: `Environment="OLLAMA_HOST=127.0.0.1:11434"`, `Environment="OLLAMA_KEEP_ALIVE=24h"`, `ProtectSystem=strict`, `NoNewPrivileges=true`. Because `ProtectSystem=strict` mounts the file system read-only, also add `ReadWritePaths=/usr/share/ollama` (the usual home of the ollama user; check the path on your system) so model pulls keep working, and `PrivateTmp=true` for a writable private /tmp. Then `sudo systemctl daemon-reload && sudo systemctl restart ollama`.
- Step 9 – Enable GPU acceleration: NVIDIA: `nvidia-smi` must see the GPU, which requires a current NVIDIA driver; Ollama ships its own CUDA runtime libraries, so a separate CUDA toolkit is normally not needed. AMD: install ROCm 6.0+, reboot, check with `rocm-smi`. Mac: nothing to do, Metal is automatic. To verify: `journalctl -u ollama -f` shows GPU detection at start-up, and `ollama ps` reports whether a loaded model runs on GPU or CPU.
- Step 10 – Troubleshooting: too slow: pick a smaller quantisation (`ollama pull llama3.1:8b-instruct-q4_0` instead of q5_K_M). OOM (out of memory): smaller model or Q3_K_M. Wrong model answering: `ollama list` shows downloaded models, `ollama ps` shows the models currently loaded, `ollama rm <name>` removes one. API not responding: check `systemctl status ollama`, verify port 11434 with `ss -tlnp | grep 11434`.
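The checks from steps 6 and 10 can also be scripted. The following is a minimal sketch, assuming the default address 127.0.0.1:11434 and the /api/tags endpoint, which returns the locally available models as JSON; the function names are hypothetical.

```python
import json
import socket
import urllib.request


def port_is_open(host: str = "127.0.0.1", port: int = 11434,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def model_names(tags_reply: dict) -> list[str]:
    """Extract the model names from a parsed /api/tags reply."""
    return [entry["name"] for entry in tags_reply.get("models", [])]


def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the running daemon which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as reply:
        return model_names(json.load(reply))
```

A typical use is to call `port_is_open()` first and only then `installed_models()`, so that a stopped service and a missing model produce different error messages in a CI job.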
When this guide fits
This guide is the right choice when (a) you have a single machine for LLM inference (Mac, Linux server, Windows workstation), (b) you want to keep one to five models in parallel and (c) use cases are classification, extraction, embeddings or medium-complex generation.
Typical cases: a law firm tests local client-correspondence classification with Llama 3.1 8B. A fiduciary office builds a receipt-recognition pilot with Qwen3 14B. A developer builds a RAG system and needs a local embedding endpoint with nomic-embed-text. An office wants a private chat assistant for internal research without queries leaving the building.
For multi-GPU clusters with several hundred requests per second, Ollama is the wrong stack – that calls for vLLM or TGI. For single machines and small teams with up to 50 concurrent requests, Ollama is the pragmatic pick.
When this guide is not the fit
This guide does not fit when you need GPT-4o or Claude Sonnet quality without GPU hardware. A local Llama-3.3-70B on CPU runs at 1-3 tokens per second – too slow for chat. Anyone without a GPU budget stays on cloud models or routes via LiteLLM to Mistral EU.
Ollama is also unsuited for voice bots or other real-time streaming applications where time-to-first-token must be below 200 ms. Cloud providers with dedicated fast-inference models (Groq, Cerebras) are the better path there.
Another pitfall: Ollama on a machine with less than 8 GB RAM. Even a 3B model gets tight there – swapping kills throughput. For machines with 4-8 GB RAM, Mistral La Plateforme free tier or a Cohere trial is the more pragmatic start.
Trade-offs
STRENGTHS
- Install in under 5 minutes on all three operating systems
- OpenAI-compatible REST API lets existing SDKs run without code changes
- Automatic GPU detection for NVIDIA, AMD, Apple Silicon
- Q4_K_M quantisation cuts RAM to roughly a third of FP16 with minimal quality loss
WEAKNESSES
- 70B models without GPU are too slow for interactive chat (1-3 tokens/sec)
- Default binding only to localhost – network exposure requires extra work
- Model updates are manual – there are no automatic security patches for models
- Disk: 30-100 GB for 3-5 production models
FAQ
Which model should I try first?
On a Mac M2/M3 with 16 GB RAM: llama3.1:8b or qwen3:8b – both run at 30-50 tokens/sec and are good enough for DE/EN. On a Linux server with 64 GB RAM without GPU: qwen3:14b or llama3.1:8b. With GPU: up to a quantised 70B model, with quality near GPT-4. For embeddings always nomic-embed-text – 100-200 texts per second even on CPU.
Why Q4_K_M and not the original format?
Q4_K_M cuts weight memory to roughly a third of FP16 and speeds CPU inference 2-3x at a 1-3% benchmark quality drop. For classification, extraction and medium-complex generation the difference is rarely noticeable in practice. FP16 or Q8_0 only pay off for critical generation with hard quality requirements – and then a cloud LLM via LiteLLM routing is the better call.
How do I protect the Ollama endpoint on the network?
By default Ollama binds only to 127.0.0.1, i.e. localhost. To reach it from other containers, set OLLAMA_HOST=0.0.0.0:11434 in the systemd unit and place a firewall rule in front (ufw, nftables or iptables) allowing only internal IPs. For internet exposure always use a reverse proxy with auth (nginx + basic auth or better LiteLLM with virtual keys). Never bind ollama serve directly to the internet.
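A deployment script can flag risky bindings before the service starts. This is a minimal sketch, assuming OLLAMA_HOST is given in the plain IPv4 `host:port` form used in this guide (URL or IPv6 forms are not handled); `is_loopback_bind` is a hypothetical helper, not an Ollama feature.

```python
import ipaddress


def is_loopback_bind(ollama_host: str) -> bool:
    """Return True if an IPv4 "host:port" value binds to loopback only."""
    # Drop the port, if present, and keep the host part.
    host = ollama_host.rsplit(":", 1)[0]
    if host == "localhost":
        return True
    return ipaddress.ip_address(host).is_loopback
```

For example, `is_loopback_bind("127.0.0.1:11434")` is true, while `is_loopback_bind("0.0.0.0:11434")` is false and should only pass a review together with a firewall rule or an authenticating reverse proxy.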
What to do on out-of-memory errors?
Three levers in this order: (1) smaller model: 8B instead of 14B, 14B instead of 70B. (2) lower quantisation: Q3_K_M instead of Q4_K_M (costs 5-10% quality, saves 25% RAM). (3) shorten num_ctx: default 2048, classification only needs 512 – pass `"options":{"num_ctx":512}` in the API request. If that is not enough: upgrade hardware or use a cloud LLM.
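The third lever can be sketched as a request to /api/chat with a reduced context window. This is an illustrative example using only the standard library; `build_chat_request` and `classify` are hypothetical names, and the value 512 is the one suggested above for classification.

```python
import json
import urllib.request


def build_chat_request(model: str, user_text: str, num_ctx: int = 512,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming /api/chat request with a reduced context window."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
        # A smaller num_ctx shrinks the KV cache and therefore memory use.
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(model: str, user_text: str, num_ctx: int = 512) -> str:
    """Send the request and return the assistant message text."""
    request = build_chat_request(model, user_text, num_ctx)
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.load(reply)["message"]["content"]
```

Keep in mind that prompt and answer together must fit into num_ctx tokens, so very short windows only suit short inputs with short labels as output.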