
Self-hosted vs. cloud LLM: a decision framework for SMEs and fiduciaries

When does running your own language model on your own hardware pay off, and when is the cloud the right choice? Total cost of ownership, latency, data protection.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is this about?

The choice between a self-hosted and a cloud LLM is not ideological; it is a commercial and regulatory question. Cloud LLM means: you send your prompts over HTTPS to OpenAI, Anthropic, Mistral, or Google and pay per token consumed. Self-hosted means: you run an open-weight model (Llama 3.1, Mistral, Qwen, DeepSeek) on your own GPU hardware or with a Swiss GPU provider such as Exoscale or Infomaniak.

The range of open-weight models in May 2026 is considerably broader than two years ago. Meta-Llama 3.1 70B, Mistral Large 2, Qwen 3.5 32B, and DeepSeek V3.2 deliver scores on many standard benchmarks that are close to GPT-4o or Claude Sonnet. For legal factuality and long contexts, the closed cloud models still hold a lead. So the decision is not "better or worse" but "which mix fits your load profile, data protection level, and budget".

In practice, hybrid is the rule: standard queries without personal data go to the cloud, sensitive client passages stay on a local model. A router (see LiteLLM, multi-LLM routing strategies) decides per request.

Why it matters

A wrong architecture decision costs money in both directions. A reflexive "local only" quickly ties up CHF 25,000 to 40,000 of capital in a GPU that sits idle 80% of the time. A reflexive "cloud only" accepts that all client documents pass through a US server farm, which is not straightforwardly permitted under attorney professional secrecy (Art. 321 SCC) or in fiduciary practice under revDSG.

Three drivers shape the choice. First, the data protection level: personal data in especially protected categories (health, legal proceedings, social welfare) counts as particularly sensitive personal data under Art. 5 lit. c revDSG, requires heightened care, and usually calls for a data protection impact assessment (Art. 22 revDSG). Cloud LLM with a US provider works here only with standard contractual clauses, adequacy checks, and ideally an EU region. Second, the volume: under 5 million tokens per month, cloud is almost always cheaper; above 50 million tokens with steady load, an in-house GPU starts to pay off. Third, latency: cloud LLM needs 400 to 1500 milliseconds to first token; a local Llama 3.1 8B on an RTX 4090 delivers 80 to 200 milliseconds. That difference only matters if you build interactive chat front-ends or live voice pipelines.

For most Swiss fiduciary and law firms, the crux is not price but provability of data processing. Cloud-side logs sit outside your control: the provider decides how long they are retained, and they can be subject to a US subpoena. Local logs live in your audit trail (see AI audit trail design) and under Swiss law.

How the comparison works

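Total cost of ownership (TCO) is the only honest comparison metric. Setting "cents per token" against "GPU purchase price" is self-deception, because each figure leaves out most of the real cost. A clean TCO has four components on the cloud side and six on the self-hosted side.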

Cloud TCO = token cost + networking + logging/observability + compliance overhead (DPIA, contracts). Sample calculation May 2026: Claude Sonnet costs roughly USD 3 per million input tokens and USD 15 per million output tokens. A fiduciary pipeline with 30 clients, 50 queries/month each, averaging 8,000 input and 1,500 output tokens per query, means 1,500 queries with 12M input and 2.25M output tokens, around USD 70/month in token cost (USD 36 for input plus USD 34 for output) before embedding calls and audit overhead. At 10x that load it is around USD 700. At 100x around USD 7,000.
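The token part of the sample calculation can be reproduced in a few lines. This is a minimal sketch that only covers token cost; networking, logging, and compliance overhead are left out, and the function and constant names are illustrative rather than part of any provider SDK.

```python
# Prices from the sample calculation above (USD per million tokens).
INPUT_PRICE_PER_M = 3.0
OUTPUT_PRICE_PER_M = 15.0


def monthly_token_cost(clients, queries_per_client, input_tokens, output_tokens,
                       load_factor=1):
    """Return (total input tokens, total output tokens, USD cost) for one month."""
    queries = clients * queries_per_client * load_factor
    total_in = queries * input_tokens
    total_out = queries * output_tokens
    cost = (total_in / 1_000_000 * INPUT_PRICE_PER_M
            + total_out / 1_000_000 * OUTPUT_PRICE_PER_M)
    return total_in, total_out, cost


if __name__ == "__main__":
    # Baseline pipeline, then the 10x and 100x load scenarios from the text.
    for factor in (1, 10, 100):
        total_in, total_out, cost = monthly_token_cost(30, 50, 8_000, 1_500, factor)
        print(f"{factor:>3}x load: {total_in / 1e6:.2f}M in, "
              f"{total_out / 1e6:.2f}M out, about USD {cost:,.2f}")
```

Swap in your own provider's prices and query profile; the output share of the bill is worth watching, because output tokens cost five times as much as input tokens in this example.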

Self-hosted TCO = hardware amortisation + power + DevOps time + model updates + outage risk + GPU idle time. Llama 3.1 70B at 16-bit needs about 140 GB VRAM; at 4-bit quantisation about 42 GB – the latter runs on a single Nvidia A100 80GB or two RTX 4090s. An A100 80GB costs CHF 17,000 to 22,000 to buy, or roughly USD 1.07/hour on-demand at a hyperscaler, which is around CHF 700/month at 24/7 operation. Power plus cooling add CHF 80 to 150/month. DevOps time for updates, monitoring, and model swaps: expect 4 to 8 hours/month at internal CHF 120/hour, so CHF 500 to 1,000.
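The figures in this paragraph can be checked with a short sketch. It computes the memory for the model weights alone and sums the recurring monthly items for a rented A100. Two caveats: the roughly 42 GB quoted for 4-bit presumably includes headroom for KV cache and runtime on top of the raw weights, which is our reading and not a measured value, and outage risk and GPU idle time are not priced here. All function names are illustrative.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def monthly_self_hosted_chf(gpu_chf, power_chf, devops_hours, hourly_rate_chf=120):
    """Sum of the recurring monthly items: GPU rental or amortisation, power, DevOps."""
    return gpu_chf + power_chf + devops_hours * hourly_rate_chf


if __name__ == "__main__":
    # Llama 3.1 70B: weights only, before KV cache and runtime overhead.
    print("16-bit weights:", weight_memory_gb(70, 16), "GB")
    print(" 4-bit weights:", weight_memory_gb(70, 4), "GB")

    # Rented A100 at about CHF 700/month, low and high end of the other items.
    low = monthly_self_hosted_chf(gpu_chf=700, power_chf=80, devops_hours=4)
    high = monthly_self_hosted_chf(gpu_chf=700, power_chf=150, devops_hours=8)
    print(f"Recurring cost: CHF {low:,.0f} to {high:,.0f} per month")
```

For purchased hardware, replace `gpu_chf` with the purchase price divided by the amortisation period in months.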

This yields the rule of thumb: under 5M tokens/month cloud is almost always cheaper. Between 5 and 50M tokens the answer is "depends": data protection, peak load, and availability decide. Above 50M tokens/month with steady load and sensitive content, an in-house server justifies itself. A hybrid with local Llama 3.1 8B for PII filtering plus cloud for the rest costs in the example above about CHF 200/month for the local node and delivers the data protection benefits without all load running locally.

Decision framework in 7 steps

01 Data inventory: which content goes to the LLM? Classify under revDSG (public / internal / confidential / especially protected).
02 Volume estimate: how many tokens per month? Multiply queries x tokens per query x 1.3 safety margin.
03 Check latency requirement: do end users accept 1-2 seconds to first token, or must it stay below 200ms?
04 Check professional secrecy: does the processing fall under Art. 321 SCC or an industry confidentiality duty? Is client consent in place?
05 Calculate TCO: 12 months cloud vs. (hardware/36 months + power + DevOps + outage buffer). Hybrid as a third scenario.
06 Start a PoC: run the cloud option for 2 weeks and measure the real load profile; in parallel, spend 1 day setting up a local Llama on test hardware for a quality comparison.
07 Build routing logic: if hybrid, then LiteLLM or your own router decides per request (data classification -> model choice).
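Steps 02 and 07 lend themselves to a small sketch: a volume estimate with the 1.3 safety margin, and a router that maps the data classification from step 01 to a model target. The routing policy shown here (especially protected data always stays local, confidential data goes to the cloud only with client consent) is an assumption for illustration, as are the target names; it does not use the LiteLLM API, and your own policy has to come out of steps 01 and 04.

```python
from enum import IntEnum


class DataClass(IntEnum):
    """Classification levels from step 01, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    ESPECIALLY_PROTECTED = 3


# Hypothetical target names; replace with the identifiers your router uses.
LOCAL_MODEL = "local/llama-3.1-70b"
CLOUD_MODEL = "cloud/eu-region-model"


def estimate_monthly_tokens(queries_per_month, tokens_per_query, safety_margin=1.3):
    """Step 02: queries x tokens per query x safety margin."""
    return round(queries_per_month * tokens_per_query * safety_margin)


def choose_model(data_class, client_consented_to_cloud=False):
    """Step 07: pick a model target per request (assumed example policy)."""
    if data_class == DataClass.ESPECIALLY_PROTECTED:
        return LOCAL_MODEL
    if data_class == DataClass.CONFIDENTIAL and not client_consented_to_cloud:
        return LOCAL_MODEL
    return CLOUD_MODEL


if __name__ == "__main__":
    # 1,500 queries with 8,000 input + 1,500 output tokens each.
    print(estimate_monthly_tokens(1_500, 9_500))
    print(choose_model(DataClass.INTERNAL))
    print(choose_model(DataClass.CONFIDENTIAL))
    print(choose_model(DataClass.CONFIDENTIAL, client_consented_to_cloud=True))
```

Keeping the policy in one small function makes it easy to log the classification and the chosen target for every request, which is what an audit trail needs.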

When to self-host

Self-hosting is the right choice when at least two of these conditions apply: (a) regular processing of especially protected personal data, (b) processing under professional secrecy (Art. 321 SCC) without explicit client consent to cloud LLM, (c) steady token volume above 30M/month, (d) latency requirements below 200ms, (e) explicit contractual clause with a customer forbidding any cross-border transfer.
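The "at least two of five" rule can be written down as a simple check. The sketch below is one possible encoding; the `Workload` fields are hypothetical names for conditions (a) to (e), and the thresholds are the ones stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    especially_protected_data: bool        # (a)
    secrecy_without_cloud_consent: bool    # (b)
    steady_tokens_per_month: int           # (c)
    required_latency_ms: Optional[float]   # (d), None means no requirement
    cross_border_forbidden: bool           # (e)


def self_hosting_conditions(w):
    """Return the letters of the conditions that apply to this workload."""
    met = []
    if w.especially_protected_data:
        met.append("a")
    if w.secrecy_without_cloud_consent:
        met.append("b")
    if w.steady_tokens_per_month > 30_000_000:
        met.append("c")
    if w.required_latency_ms is not None and w.required_latency_ms < 200:
        met.append("d")
    if w.cross_border_forbidden:
        met.append("e")
    return met


def should_self_host(w):
    """Self-hosting is indicated when at least two conditions apply."""
    return len(self_hosting_conditions(w)) >= 2


if __name__ == "__main__":
    law_firm = Workload(True, True, 2_000_000, None, False)
    print(self_hosting_conditions(law_firm), should_self_host(law_firm))
```

Returning the list of matched conditions, not just a yes or no, gives you the reasoning to put into the architecture decision record.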

Concrete real-world setups: a law firm with 12 attorneys running 200 weekly case research queries on files operates a local server with Llama 3.1 70B for the research tool and uses Claude Opus only when the client explicitly permits it. A fiduciary office with 80 clients and integrated document recognition runs Llama 3.1 8B for classification and uses Mistral Large 2 (in EU region) for booking decisions. A health-sector SME (medical practice software) runs Qwen 3.5 32B locally and sends nothing to cloud providers.

When cloud is enough

Cloud LLM is the right choice when (a) you are under 5M tokens/month, (b) the content holds no especially protected personal data, (c) you have no latency requirement under 300ms, and (d) you lack the in-house DevOps capacity to run a GPU server reliably.

In concrete terms: marketing copy, code generation, general research, language translation, FAQ bot with public content, accounting classification of anonymised receipts – all of that belongs in the cloud, ideally in EU region and with a written assurance from the provider that the data is not used for model training (standard with OpenAI Enterprise, Anthropic API, Mistral La Plateforme).

A common mistake: literal adherence to a "Switzerland only" doctrine leads to setups where you run Llama 3.1 8B on an RTX 2080 Ti, the answers are 30% worse than Mistral Large 2 in an EU cloud region, and maintenance eats the entire budget. If the content is not strictly confidential, a good cloud provider costs less and delivers better quality.

Trade-offs

STRENGTHS

Self-hosted: data never leaves your infrastructure – compliant with professional secrecy
Self-hosted: latency below 200ms possible
Self-hosted: no variable token bill, predictable fixed costs
Cloud: no capex, no DevOps overhead, available immediately
Cloud: access to the strongest models (Claude Opus, the current top GPT model)
Cloud: automatic model updates without own tuning

WEAKNESSES

Self-hosted: capex CHF 5,000 to 35,000 plus ongoing DevOps time
Self-hosted: open-weight models lag the top cloud models on specialty-domain factuality
Self-hosted: GPU idle time under fluctuating load eats the TCO calculation
Cloud: variable costs scale with success – uncomfortable at load peaks
Cloud: data transfer to the US needs DPIA and standard contractual clauses
Cloud: vendor lock-in; model changes can shift answer behaviour overnight

FAQ

At what volume does your own A100 pay off?

Rule of thumb: if your steady 24/7 load is between 30 and 50M tokens per month, an A100 80GB with Llama 3.1 70B Q4 begins to undercut cloud costs, provided you have in-house DevOps time. Below that threshold, renting GPUs from a Swiss provider (Exoscale, Infomaniak) or using a cloud LLM is cheaper. Above 100M tokens/month your own GPU is almost always cheaper.

Can I process client data with US cloud LLM?

With caution and only under certain conditions. Under revDSG, transfer to the US is in principle allowed if the provider signs standard contractual clauses (all major providers do), the client data is not used for model training (contractual clause), and you document a data protection impact assessment. For data under professional secrecy (lawyers, doctors, clergy), you additionally need either client consent or processing without identifying features.

Which open-weight model do you recommend today?

May 2026: for German and standard office tasks, Llama 3.1 70B Instruct or Mistral Large 2 – both run well on common hardware with 4-bit quantisation. For heavy coding tasks, DeepSeek V3.2. For resource-constrained setups, Qwen 3.5 14B or Llama 3.1 8B. For legal factuality we currently do not recommend open-weight alone – combine it with RAG and additionally use a cloud model for cross-validation.

What does a local server cost to buy?

Entry setup for Llama 3.1 8B or Qwen 3.5 14B: workstation with 1x RTX 4090 24GB, 64 GB RAM, 2 TB NVMe, CHF 5,000 to 7,000. Professional setup for Llama 3.1 70B: server with 1x A100 80GB or 2x A6000 48GB, 128 GB RAM, redundant power, CHF 25,000 to 35,000. Alternative: GPU rental at Exoscale or Infomaniak from CHF 1,200/month for an A100, no capex.

Sources

a16z – The Economics of Self-Hosted LLM Inference (TCO model) · 2026-02
Hugging Face – LLM Inference Benchmarks (Llama 3.1, Mistral, Qwen) · 2026-04
Ollama – Hardware Requirements and Model Sizing · 2026-05
EDÖB – Datenschutz-Folgenabschätzung bei Cloud-Diensten · 2026-03
