SELF-HOSTED OLLAMA · LLM PROVIDER

Self-hosted Ollama as an LLM provider: when does it replace OpenAI, Anthropic or Gemini?

Self-hosted Ollama on a Hetzner GPU or office server: pays off from 2-5M tokens/month, replaces cloud LLMs for revDSG-sensitive workloads, has clear quality limits.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is self-hosted Ollama in the provider sense?

Ollama is an open-source inference runtime that runs open LLM weights (Llama, Mistral, Qwen, DeepSeek distillations, Gemma) on a single machine and exposes an OpenAI-compatible HTTP endpoint. From a provider perspective self-hosted Ollama is a fourth category alongside US cloud (OpenAI/Anthropic/Gemini), EU cloud (Mistral, Cohere-EU) and open-weight providers (Groq, Together, DeepInfra).

The difference: with Ollama there is no provider. The machine sits in your own office or rack slot. No API logs in foreign hands, no third-country question, no usage billing. In return there is: hardware cost, electricity cost, maintenance effort, model quality at the level of the best open-weight models (Llama 3.3 70B, Llama 4 Scout, DeepSeek-V3.2, Mistral-Small-3, Qwen 2.5) – not the current top GPT model or Claude Opus.

Important: technically Ollama is the same as the "Ollama" tech-stack topic – but this page looks at Ollama from a provider perspective. When does your own Ollama instance replace a cloud provider? Which workloads fit? How does it pencil out against USD 0.10 for Gemini Flash-Lite or USD 0.30 for Llama 4 via Together?

Why it matters

Self-hosted Ollama solves three problems no cloud provider can.

First: revDSG / Art. 321 SCC without a third-country debate. When inference runs physically in your own office or in a Swiss datacentre, there is no data flow for a DPA to argue about. For client correspondence at a law firm, for payroll triage at a fiduciary, for KYC data at a wealth manager, that is often the only clean variant.

Second: cost predictability above a volume threshold. Whoever generates 5M output tokens/month pays around USD 50 with OpenAI the current top GPT model (USD 10/1M) – around USD 75 with Claude Sonnet (USD 15/1M). Whoever needs 50M output tokens (realistic for high-volume classification, document recognition, client chat) pays USD 500-750/month. Your own Hetzner GPU box with an A100 80GB costs roughly CHF 600/month fixed, whether 5 or 500M tokens. The break-even line is roughly 2-5M output tokens/month – below: cloud, above: self-host.

Third: control over model behaviour. A cloud model gets updated without warning; your prompts break overnight. With Ollama the model stays exactly the model you picked – until you update it yourself. For audit-ready workflows (Art. 957a CO) that is a non-trivial point: under an audit question in two years you can spin up exactly the model that gave the original answer.

How it works

Hardware sizing May 2026 by model:

Small models (CPU-capable): Llama 3.3 8B, Mistral-Small-3 (8B), Qwen 2.5 7B. Q4 quantised ~5-6 GB. Run on a strong CPU server (32-core Intel Xeon or AMD EPYC, 64 GB RAM) – throughput 5-10 tokens/second. Enough for async workloads (batch document recognition, classification, triage). Too slow for interactive chat.

Medium models (small GPU): Llama 3.3 8B + RAG, Mistral-Small-3 with reasoning, Qwen 2.5 14B. RTX 3060 12GB or small cloud GPU (NVIDIA L4, A10). Throughput 30-50 tokens/second. Enough for interactive applications with 1-2 users.

Large models (production GPU): Llama 3.3 70B, Llama 4 Scout (109B MoE), DeepSeek-R1-Distill-32B, Mistral-Large-3. A100 80GB, H100, or Mac Studio M3 Ultra 192GB. Throughput 50-100 tokens/second for 70B Q4. Enough for 5-20 concurrent users.

Enterprise models: Llama 4 Maverick 400B, DeepSeek-V3 671B. 8x H200 or 16x H100. Not an SME setup. Realistically only via cloud GPU rental or provider.

Cost model Hetzner GEX44 (A100 80GB): about CHF 600-700/month. At a Swiss provider (Infomaniak Public Cloud GPU, Exoscale GPU, Swiss Cloud Computing): CHF 1200-2000/month – but data physically in Switzerland. Office-local with an owned H100: roughly CHF 30-40k one-off plus electricity (250 W idle, 500 W under load = around CHF 30/month).

Deployment stack: install Ollama (curl script), pull the model (ollama pull llama3.3:70b-instruct-q4_K_M), start the server (ollama serve), reverse proxy in front (Nginx with TLS + auth token), LiteLLM gateway for routing and logging.

CIO decision: is self-host economical?

01Measure or estimate monthly token volume: input and output separately, over 90 days.
02Compare cloud cost: at current volume, what do Gemini Flash, Claude Sonnet, the current DeepSeek-V generation cost per month?
03Quality test: run a sample task (10-30 typical queries) against Llama 3.3 70B and Mistral-Small-3. Is it enough?
04Hardware variant: Hetzner GEX44 (CHF 600-700/month, DE), Infomaniak/Exoscale GPU (CHF 1200-2000/month, CH), own H100 (CHF 30-40k one-off + electricity).
05Estimate MLOps load: who runs the GPU? Who handles updates? Who reacts to an outage at 02:00?
06Pilot: 4 weeks of Llama 3.3 8B/70B or Mistral-Small-3 on a test machine, in parallel to cloud. Compare quality and latency.
07Routing decision via LiteLLM: which workloads go to self-host, which to cloud? Document the tier model.

When to self-host

Self-hosted Ollama is the right choice when (a) volume exceeds 2-5M output tokens/month AND 70B-class quality is sufficient, (b) data is confidential enough that every cloud call needs a TIA, or (c) budget planning matters more than peak quality.

Concrete uses: batch document recognition for a fiduciary with 500 clients (Llama 3.3 8B + Tesseract on a CPU server, nightly, zero cloud cost), client FAQ at a law firm (Llama 3.3 70B with RAG on an in-house A100, data in Switzerland), collections triage at an SME (Mistral-Small-3 on RTX 3060, in-office, no API quota worries), code review bot for a developer team (DeepSeek-R1-Distill-32B on a Hetzner GPU, no code in foreign clouds).

In a mix with cloud: self-host Ollama as the default route for Tier-1 confidential workloads, cloud (Gemini Flash, Claude Sonnet) as escalation for workloads that need peak quality AND lack hard data-protection requirements. LiteLLM makes routing transparent – application code only knows one endpoint.

When not to use

Self-hosted Ollama is the wrong choice when (a) volume is low (under 1M tokens/month) – Gemini Flash-Lite or another cheap cloud provider is economically better, (b) the workload needs peak quality (top-end legal argument, mathematical research, top-tier creative writing) – open-weight models lag Claude Opus and the current top GPT model by 5-15%, (c) the team has no MLOps routine.

MLOps routine means: GPU driver updates without downtime, VRAM-utilisation monitoring, A/B tests for model versions, quantisation review (Q4 vs Q5 vs Q8 – quality vs memory), tokenizer consistency on migration, inference-server updates (vLLM/Ollama versions). Without that, self-host buys you a full-time job.

Further cases: live voice agents with sub-300ms latency are hard without dedicated optimisation. Streaming for 5+ concurrent users on a 70B model needs serious GPU investment or staggered requests. Multimodal (image, audio, video) is limited in Ollama – cloud providers lead here.

In-office server: be careful. Climate control (an H100 emits 500-700 W of heat), power redundancy, theft protection, backup strategy. For an SME often the worse variant compared to a rented Swiss cloud GPU.

Trade-offs

STRENGTHS

revDSG / Art. 321 SCC without third-country debate when hosted in Switzerland
Fixed cost: economical vs cloud from 2-5M tokens/month
Model stability: no unannounced updates as with cloud providers
OpenAI-compatible endpoint – code identical to cloud calls
Audit-ready: the same model version remains runnable years later

WEAKNESSES

Open-weight quality lags Claude Opus and the current top GPT model by 5-15%
MLOps overhead: GPU drivers, quantisation, inference-server updates
Multimodal limited vs Gemini 2.5 Pro or the current top GPT model
70B-class needs a real GPU (A100 80GB+ or Mac Studio Ultra)
Hardware failure risk without cloud SLA – failover must be deliberately engineered

FAQ

Which model should I run on an office GPU?

On a single RTX 3060 12GB: Llama 3.3 8B or Mistral-Small-3, Q4. On an A100 80GB: Llama 3.3 70B as default for interactive workloads, DeepSeek-R1-Distill-32B for reasoning, Mistral-Large-3 for EU compliance focus. Three models in parallel on one A100 is unrealistic; one active plus a second quickly loadable is the working pattern.

Is a Swiss host worth it vs Hetzner DE?

If the compliance requirement is "data does not leave Switzerland": yes. Premium is a factor of 2-3 (CHF 1200-2000 vs CHF 600-700) but it gives the clean revDSG answer. If the requirement is "data does not leave the EU", Hetzner Falkenstein or Helsinki is enough. Decide per client group, not blanket.

How do I measure whether self-host pays off?

Rule of thumb: self-host (CHF 600/month Hetzner GPU) pays off from the point where cloud cost exceeds CHF 600/month. With a tier-1 model like Claude Sonnet (USD 15/1M output) that is around 40M output tokens. With Gemini Flash (USD 2.50/1M output) it is 240M. With Flash-Lite (USD 0.40/1M output) it is 1.5 billion – self-host never becomes economical there unless data protection forces it.

What happens on hardware failure?

At Hetzner: SLA 99.9%, recovery typically 4-12h. At Infomaniak/Exoscale: comparable. With your own office hardware: no SLA, depends on your supplier and spare-part stock. Recommendation: configure a second cloud route as failover in LiteLLM – on self-host failure traffic falls over to a paid cloud route, automatic recovery in seconds.

Sources

Ollama Documentation – Self-Hosted LLM Runtime · 2026-05
Llama 3.1 Hardware Requirements: 8B, 70B, 405B · 2026-04
Running LLMs Locally with Ollama and llama.cpp (2026 guide) · 2026-03
GPU Requirements 2026 (Spheron) – Llama 4 / DeepSeek V3 / Qwen 3 · 2026-04
Ollama VRAM Requirements: Complete 2026 Guide · 2026-02

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call