OLLAMA · TECH

Ollama: local LLMs on your own hardware – where it works and where it does not

Ollama is a local runtime for open-source LLMs. Strong for privacy demos and CPU classification, slow for 70B models without GPU.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Ollama?

Ollama is an open-source runtime (MIT license) that runs open-weight language models on a single machine – Linux, macOS, and Windows. Started in 2023 as a CLI wrapper around llama.cpp, by May 2026 Ollama has matured into a polished tool with its own model registry, REST API on port 11434, and a library of over 200 quantisation variants.

Usage is simple: ollama pull llama3.2 fetches a model in a compact GGUF quantisation, ollama run llama3.2 starts a chat session. In production, ollama serve runs as a background daemon and other programs talk to the OpenAI-compatible HTTP interface. Current model families as of May 2026: Llama 3.1 and 3.2, Gemma 2 (Google), Mistral and Mixtral, DeepSeek-r1 (reasoning), Qwen 2.5 (Alibaba), Phi 4 (Microsoft), plus specialised embedding models like nomic-embed-text.

Version 0.5+ is current with active development. Ollama uses GPU acceleration when available (NVIDIA via CUDA, Apple Silicon via Metal, AMD via ROCm) and cleanly falls back to CPU otherwise. Quantisation is the key lever: a Llama-3.1-8B model in Q4_K_M quantisation fits in 5 GB of RAM and runs on a normal server at 15-30 tokens per second. The same model in 70B variant needs 40 GB of RAM and, without a GPU, delivers only 1-3 tokens per second – usable for batch jobs, not for chat.

On our Hetzner server (125 GB RAM, no GPU), 9 local models run: llama3.2 at 3B and 8B, gemma2:9b, mistral:7b, deepseek-r1:7b, qwen2.5:7b, phi4, plus nomic-embed-text for local embeddings. Use: privacy demos and CPU-capable classification tasks.

Why it matters

Locally running LLMs solve one concrete problem: professional secrecy. A law firm must not send client correspondence through US cloud endpoints, neither must a notary, nor a doctor under Art. 321 SCC. Cloud LLMs in EU regions (Mistral La Plateforme, Azure OpenAI with EU data residency) cover many cases – but not all. Where absolute data isolation is required, Ollama is an option.

The second use case is high-volume batch work with relaxed latency. Anyone classifying 10,000 emails per day (spam, client request, supplier dunning) pays between CHF 50 and CHF 300 per day with a cloud provider. With Ollama on a dedicated server, that runs for the cost of electricity once the hardware is in place. Response time is 1-3 seconds per mail instead of 300 ms – no issue for batch processing.

The third lever is embeddings. nomic-embed-text and bge-large run on CPU at good speed (50-100 texts per second). A RAG pipeline can keep embedding generation entirely local – no external provider call for the most frequent step in the pipeline. That saves money and simplifies the compliance story.

The reality check is important, though: local 70B models without a GPU are slow. Anyone insisting on the same quality as GPT-4o or Claude Sonnet needs GPU hardware (two A100s for full-precision 70B are not cheap) or a smaller task. The honest recommendation: Ollama for classification, extraction, embeddings, and sensitive demos – for complex generation, cloud LLMs under routing control.

How it works

Ollama has two parts: the daemon (ollama serve) and the CLI (ollama). The daemon keeps models in memory, accepts HTTP requests, and invokes the llama.cpp inference engine in the background. The CLI is a thin client to the daemon.

Model management: ollama pull <model> downloads a model from the Ollama registry. The model file ships as GGUF (GPT-Generated Unified Format), a quantisation optimised for llama.cpp. Each model usually has multiple tags: latest, q4_K_M, q5_K_S, q8_0 – the number is the bit-depth of the weights. Q4_K_M is the sweet spot for 7B-13B models (smaller file, hardly any quality loss); Q8_0 is closer to full precision but uses twice the RAM.

API: POST /api/generate for simple completion, POST /api/chat for chat with history, POST /api/embeddings for embedding generation. An OpenAI-compatible layer runs at /v1/chat/completions, so any OpenAI SDK works – base_url=http://ollama:11434/v1 and a dummy key are enough.

Deployment: in our configuration, Ollama runs in a Docker container with the model directory mounted as a volume. Behind the LiteLLM proxy: Ollama models are registered in the gateway under logical names (e.g. local-llama, local-mistral). Applications do not need to know about Ollama directly – they talk to LiteLLM, which routes to Ollama.

Memory management: Ollama keeps the most recently used model in RAM (default 5 minutes) and unloads it afterwards. To keep models permanently hot, set OLLAMA_KEEP_ALIVE=24h or hit them regularly with a health check. Holding multiple models in parallel needs RAM accordingly – nine models on a 125 GB server are fine as long as not all of them are loaded at once.

Ollama to production in 6 steps

01Size the hardware: at least 16 GB RAM for 7B models, 32 GB for 13B, 64 GB for 30B, GPU for 70B realtime chat.
02Install Ollama via Docker, mount the model directory as a volume, configure OLLAMA_KEEP_ALIVE and concurrency.
03Pull models and pick the quantisation: Q4_K_M for the 7B/13B sweet spot, Q8_0 when precision matters and RAM allows.
04Wire behind LiteLLM: register the Ollama endpoint as a provider in the gateway config.yaml under logical names like local-llama-8b.
05Run use-case tests: measure classification, extraction, and embeddings on real client data (or solid synthetic data).
06Monitoring: Prometheus scraper on Ollama metrics (model load time, tokens/sec), Grafana dashboard with quality KPIs from the LiteLLM audit log.

When to use Ollama

Ollama is the right choice when (a) absolute data isolation is a hard requirement (professional secrecy, client data with no third-country exposure), (b) high volumes with relaxed latency need to be processed, or (c) embeddings should be generated locally.

In practice: a law firm with client-correspondence classification (inbox triage), fiduciary office with local VAT receipt pre-classification, notary practice with local embeddings for document search. Also good for reproducible demos and training environments: the model runs transparently, no provider cost during learning.

For pilot projects with unclear budget, Ollama is an honest variant: first test on your own server with Llama-3.1-8B at Q4 quantisation and see if the quality is enough. If yes, keep it local. If not, switch to the cloud with clear data – the LiteLLM-based architecture allows the switch without code changes.

When not to use

Anyone needing complex generation or reasoning at GPT-4o / Claude Sonnet / Mistral Large quality without a GPU stays with cloud models. A local 70B model on CPU is possible but unusable for chat at 1-3 tokens per second.

Ollama is also unsuited for applications with very low, guaranteed latency – voice bots, live translation, interactive chat UI with streaming. Here cloud providers with specialised fast-inference models (Groq, Cerebras) or a GPU instance are more honest.

For pure pilot projects without data-protection needs, Ollama is overhead. A weekend chatbot is faster with a cloud model. Ollama enters the picture when data isolation or cost optimisation is required.

Trade-offs

STRENGTHS

Full data isolation – model and data do not leave your own hardware
No variable provider cost, only hardware and electricity
OpenAI-compatible API – existing applications can switch without changes
GGUF quantisation lets large models run on modest hardware

WEAKNESSES

70B models without a GPU are too slow for chat (1-3 tokens/sec)
Quality lags top cloud models (GPT-4o, Claude Sonnet, Mistral Large) for complex generation
Model updates and quantisation choices are a discipline of their own – not zero maintenance
Storing and mounting large model files (5-40 GB) requires hardware planning

FAQ

How fast is a 7B model on CPU?

On a modern AMD EPYC or Intel Xeon server with DDR4-3200 RAM and Llama-3.1-8B at Q4_K_M, Ollama delivers 15-30 tokens per second. An average 200-token answer therefore takes 7-13 seconds. Too slow for interactive chat, fine for background classification jobs. On Apple Silicon (M2/M3) it is markedly faster (50-80 tokens/sec) thanks to Metal acceleration.

Can I use Ollama in a RAG stack with Qdrant?

Yes, that is one of the most common setups. The embedding model (e.g. nomic-embed-text or bge-large) runs in Ollama, vectors land in Qdrant, search and rerank stay local, and the final answer comes from the language model – either also local (Llama 3.1 8B for simple cases) or via LiteLLM routing to an EU cloud (Mistral Large for complex cases). The whole pipeline stays under control.

Do I need an NVIDIA GPU?

No, models up to 13B run fine on CPU with enough RAM. A GPU only pays off from 30B models upward or when latency is critical. If GPU, then an NVIDIA card with as much VRAM as possible (24 GB for quantised 30B, 48 GB for quantised 70B) for Ollama. AMD GPUs work via ROCm but are less mature. For most Swiss SME setups, a GPU-free configuration is the first step.

Sources

Ollama documentation – installation, models, API · 2026-05
ollama/ollama – GitHub releases · 2026-05
llama.cpp project (underlying inference engine) · 2026-04
GGUF format and quantisation guide (Hugging Face) · 2026-03

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call