fairlane.systems

What does your own LLM cost? Total cost of ownership in May 2026

Hardware, power, DevOps, maintenance: every TCO building block for a self-hosted language model, with real CHF and USD figures for May 2026.

As of: 2026-05

What is this about?

Running your own LLM means operating an open-weight model (Llama 3.1, Mistral, Qwen, DeepSeek, Gemma) on owned or rented GPU hardware instead of paying per token to OpenAI, Anthropic, or Google. What it really costs can only be answered cleanly when every line item is on the table: hardware purchase or rental, power consumption, cooling, DevOps time, model updates, backup, monitoring, downtime risk, insurance. Anyone who names only the GPU price and compares it to cloud token prices is fooling themselves.

This page runs a total-cost-of-ownership calculation over 36 months. Numbers come from May 2026 published prices at Hetzner, Vast.ai, Together AI, Lambda Labs, RunPod, plus the a16z TCO model and the EleutherAI compute study. Goal: by the end you know when an own LLM pays off and when it does not, and you can reproduce the math for your own load.

The short version: for a typical Swiss SME or a 10-person fiduciary office with under 5 million tokens per month, an own LLM does not pay off financially. It does pay off above 50 million tokens per month with persistently sensitive content; in between it is a data-protection question, not a cost question.

Why the TCO question matters

Wrong cost calculations hit small offices hardest. We regularly see two failure patterns. First: a managing director reads a blog post about self-hosted Llama models, buys a CHF 22,000 server, and six months later finds utilisation at 3% – the capital is idle. Second: a fiduciary owner skips cloud LLM on data-protection grounds without checking whether 4 million tokens/month actually justifies an in-house server. Both errors cost between CHF 15,000 and CHF 60,000 per year depending on payroll structure.

Second point: an own LLM is an investment, not a subscription. Buying a server ties up capital for 36 to 60 months. During that time the hardware ages: an H100 bought in 2025 is no longer state of the art by mid-2026 (H200 and B100 are out). Without amortisation, the balance sheet carries the wrong value; with it, the TCO must honestly include depreciation, typically 30 to 40% per year in the first generation.

Third point: personnel. A self-hosted model needs someone to update GPU drivers, swap model versions, tune inference servers (vLLM, TGI, llama.cpp, Ollama), test backups, and read the monitoring. Without that person, the hardware is not "cheap" but "not production-ready". External managed services (see managed service monitoring) cost between CHF 800 and CHF 2,500 per month.

TCO components in detail

An honest TCO calculation for an own LLM has seven line items.

1. Hardware purchase or rental. An Nvidia A100 80GB SXM costs CHF 17,000 to CHF 22,000 in May 2026 (refurbished from CHF 12,000). An H100 80GB is CHF 35,000 to CHF 40,000, an H200 141GB CHF 45,000 to CHF 55,000. Cloud rental: A100 80GB on-demand at AWS/GCP/Azure is roughly USD 4 to USD 5 per hour; at specialist providers (Lambda Labs, RunPod, Vast.ai) USD 1.07 to USD 2.50 per hour. A 1-year reservation reduces the rate by 30 to 50%. Hetzner GPU servers (RTX 6000 Ada, A100) sit at EUR 600 to EUR 1,400 per month as a fixed price, with no hourly model.

2. Power. An A100 draws 300 to 400 watts under load, an H100 up to 700 watts. 24/7 operation means 2,600 to 6,100 kWh per year. Swiss industrial electricity May 2026 is CHF 0.18 to CHF 0.28/kWh. That is CHF 470 to CHF 1,700 per year for power alone, plus 30 to 50% for cooling – if the server sits in the office, air-conditioning adds on top.

3. Personnel/DevOps. Conservative estimate for production operation: 4 to 12 hours per month for updates, monitoring, model swaps, patch day. At internal rates of CHF 120 to CHF 180/hour that is CHF 480 to CHF 2,160 per month or CHF 5,760 to CHF 25,920 per year. External managed services at CHF 800 to CHF 2,500 per month (CHF 9,600 to CHF 30,000 per year) fall in a similar range.

4. Model licence costs. Meta Llama 3.1, the open Mistral models, Qwen, DeepSeek, and Gemma ship under open-weight licences (for example Apache 2.0 or the Llama 3.1 Community License) that allow commercial use free of charge. Note: the Llama 3.1 licence has a clause for services above 700M monthly active users, irrelevant for SMEs. Mistral Large 2 has a research-only licence; commercial use goes through Mistral (commercial licence or La Plateforme).

5. Software stack. Inference engines (vLLM, TGI, llama.cpp, Ollama) are open source. Observability (Grafana, Prometheus, Loki) too. Vector DB Qdrant (see Qdrant) has a free self-host mode. With multi-LLM routing, you also run LiteLLM (open source) as gateway.

6. Backup, redundancy, downtime risk. A second GPU server for failover doubles hardware cost. Alternative: failover to cloud LLM via LiteLLM routing (common pattern: 90% local, 10% cloud for peaks or outage).

7. Depreciation. Linear amortisation over 36 months is standard: 33% per year. On an H100 at CHF 38,000 that is about CHF 12,670 per year of pure balance-sheet depreciation.
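The line items above can be rolled up into one yearly figure. The following is a minimal sketch, not a definitive model: the `Scenario` class and its field names are illustrative, the input values are picked from the ranges quoted above, and only purchase depreciation, power with cooling, and DevOps time are included. Licences and software are treated as free, and a failover server or cloud fallback is left out.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365  # assumes uninterrupted 24/7 operation


@dataclass
class Scenario:
    """Illustrative container for the cost drivers of one purchased GPU server."""
    hardware_chf: float            # purchase price
    watts: float                   # average draw under load
    chf_per_kwh: float             # electricity tariff
    cooling_overhead: float        # 0.3 means +30% on top of the power bill
    devops_hours_per_month: float
    chf_per_hour: float            # internal hourly rate
    amortisation_months: int = 36

    def depreciation_per_year(self) -> float:
        # Straight-line depreciation over the amortisation period
        return self.hardware_chf / self.amortisation_months * 12

    def power_per_year(self) -> float:
        kwh = self.watts / 1000 * HOURS_PER_YEAR
        return kwh * self.chf_per_kwh * (1 + self.cooling_overhead)

    def devops_per_year(self) -> float:
        return self.devops_hours_per_month * self.chf_per_hour * 12

    def total_per_year(self) -> float:
        return (self.depreciation_per_year()
                + self.power_per_year()
                + self.devops_per_year())


# Example values taken from the ranges above; adjust to your own situation.
h100 = Scenario(hardware_chf=38_000, watts=700, chf_per_kwh=0.22,
                cooling_overhead=0.30, devops_hours_per_month=6, chf_per_hour=120)

print(f"depreciation: CHF {h100.depreciation_per_year():,.0f}")
print(f"power + cooling: CHF {h100.power_per_year():,.0f}")
print(f"DevOps: CHF {h100.devops_per_year():,.0f}")
print(f"total per year: CHF {h100.total_per_year():,.0f}")
```

Keeping each line item as its own method makes it easy to see which one dominates: in this example depreciation and personnel together far outweigh the power bill.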

Sample calculation: a 10-person fiduciary runs 200 queries per month at 8,000 input + 1,500 output tokens each. That is 1.6M input and 0.3M output tokens per month, 1.9M in total. Cloud (Claude Sonnet at USD 3/15 per 1M input/output tokens): 1.6 × 3 + 0.3 × 15 = USD 9.30, so about USD 10 per month or CHF 110 per year. Self-host (Hetzner GPU server with RTX 6000 Ada, EUR 750/month) plus 6h/month DevOps (CHF 720/month): about CHF 18,400 per year. Difference: roughly a factor of 165 in favour of cloud. Self-host here is justified only on data-protection grounds, not on cost.
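The sample calculation can be reproduced in a few lines. This is a sketch under stated assumptions: the two exchange rates are placeholders that do not come from this page, so the yearly totals land near the figures above rather than exactly on them. Replace them with current rates before relying on the result.

```python
# Workload of the 10-person fiduciary
QUERIES_PER_MONTH = 200
INPUT_TOKENS, OUTPUT_TOKENS = 8_000, 1_500

# Cloud list prices in USD per 1M tokens (Claude Sonnet, as used on this page)
USD_PER_M_INPUT, USD_PER_M_OUTPUT = 3.0, 15.0

# ASSUMPTION: placeholder exchange rates, not taken from the page
USD_TO_CHF = 0.90
EUR_TO_CHF = 0.95

input_m = QUERIES_PER_MONTH * INPUT_TOKENS / 1e6     # millions of input tokens
output_m = QUERIES_PER_MONTH * OUTPUT_TOKENS / 1e6   # millions of output tokens

cloud_usd_month = input_m * USD_PER_M_INPUT + output_m * USD_PER_M_OUTPUT
cloud_chf_year = cloud_usd_month * 12 * USD_TO_CHF

server_eur_month = 750          # Hetzner RTX 6000 Ada server
devops_chf_month = 6 * 120      # 6 h/month at CHF 120/h
self_host_chf_year = (server_eur_month * EUR_TO_CHF + devops_chf_month) * 12

print(f"tokens per month: {input_m + output_m:.1f}M")
print(f"cloud: USD {cloud_usd_month:.2f}/month, CHF {cloud_chf_year:.0f}/year")
print(f"self-host: CHF {self_host_chf_year:.0f}/year")
print(f"ratio: {self_host_chf_year / cloud_chf_year:.0f}x in favour of cloud")
```

Whatever exchange rates are plugged in, the ratio stays above two orders of magnitude, which is the point of the example.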

TCO calculation in 6 steps

  1. Measure token volume: log one week in a test pipeline (LiteLLM, OpenAI logging, Langfuse). Extrapolate to 12 months.
  2. Calculate cloud baseline: volume times provider price (Claude 3/15, GPT-4o 2.50/10, Mistral Large 2/6 USD per 1M tokens input/output). Add 20% for embeddings.
  3. Define hardware scenarios: (a) Hetzner GPU server EUR 600-1400/month, (b) buy A100 80GB CHF 17-22k, (c) buy H100 80GB CHF 35-40k. Each with 36-month amortisation.
  4. Estimate DevOps effort: 4-12h/month at CHF 120-180. External managed service: CHF 800-2500/month.
  5. Add power cost: 300-700W times 24/7 times CHF 0.18-0.28/kWh, plus 30-50% cooling.
  6. Compute break-even: at what monthly token volume do cloud and self-host curves cross? Rule of thumb: 5M = cloud, 50M = check, 100M+ = self-host.
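Steps 1, 2 and 6 can be sketched as three small functions. This is a simplified model under stated assumptions: the function names are illustrative, self-host is treated as a fixed monthly cost with an optional per-token cost, currencies are not converted, and GPU throughput limits are ignored. The monthly cost of 1,500 in the example is a placeholder roughly in line with a rented server plus DevOps time.

```python
def monthly_volume_from_week(tokens_in_week: int) -> float:
    """Step 1: scale one logged week to an average month (52 weeks / 12 months)."""
    return tokens_in_week * 52 / 12


def blended_price_per_m(input_share: float, price_in: float, price_out: float) -> float:
    """Step 2: average cloud price per 1M tokens for a given share of input tokens."""
    if not 0 <= input_share <= 1:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * price_in + (1 - input_share) * price_out


def break_even_tokens_m(self_host_fixed_per_month: float,
                        cloud_price_per_m: float,
                        self_host_variable_per_m: float = 0.0) -> float:
    """Step 6: monthly volume in millions of tokens where both cost curves cross."""
    margin = cloud_price_per_m - self_host_variable_per_m
    if margin <= 0:
        raise ValueError("cloud is never more expensive per token, so there is no break-even")
    return self_host_fixed_per_month / margin


# One logged week of 450,000 tokens, extrapolated to a month
print(f"{monthly_volume_from_week(450_000) / 1e6:.2f}M tokens/month")

# Input-heavy mix (8,000 in : 1,500 out) at 3/15 per 1M tokens
mix_price = blended_price_per_m(8_000 / 9_500, price_in=3.0, price_out=15.0)
print(f"input-heavy break-even: {break_even_tokens_m(1_500, mix_price):.0f}M tokens/month")

# Output-only workload at 15 per 1M tokens
print(f"output-only break-even: {break_even_tokens_m(1_500, 15.0):.0f}M tokens/month")
```

The crossing point is very sensitive to the input/output mix and the cloud model chosen, so the rule of thumb in step 6 should be checked against your own measured mix.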

When an own LLM pays off

An own LLM becomes financially interesting when monthly token volume exceeds 50 million and load is steady, but the calculation only tips at higher volumes. Example: an 80-attorney law firm with continuous research, clause review, and document comparison at 80 queries/day times an average of 12,000 tokens gets to roughly 30M tokens/month. In cloud that is USD 200 to USD 400/month, still cheaper than self-host. Only from 100M tokens with local hardware at 80% utilisation does the calculation tip.
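The volume estimate for the law firm is a simple product, but the number of days assumed per month matters. A minimal sketch, where the 22 working days are an assumption added here for comparison and not a figure from this page:

```python
def monthly_tokens_m(queries_per_day: float, tokens_per_query: float,
                     days_per_month: float) -> float:
    """Monthly token volume in millions."""
    return queries_per_day * tokens_per_query * days_per_month / 1e6


calendar_days = monthly_tokens_m(80, 12_000, 30)  # every day of the month
working_days = monthly_tokens_m(80, 12_000, 22)   # ASSUMPTION: 22 working days

print(f"30 days: {calendar_days:.1f}M tokens/month")
print(f"22 days: {working_days:.1f}M tokens/month")
```

Either way the firm stays well below the 100M mark mentioned above.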

Non-financial reasons often dominate. Self-host pays off when (a) sensitive personal data (Art. 5 lit. c revDSG) is regularly processed, (b) clients contractually exclude cloud processing, (c) professional secrecy under Art. 321 SCC applies and clients have not consented to external processing, (d) latency under 200ms is required (e.g. real-time voice agent).

Hybrid setups are most common in practice: a small local server with Llama 3.1 8B for PII filtering and sensitive classification (hardware budget CHF 8,000 to CHF 15,000) plus cloud LLM (Claude, Mistral EU) for the rest. Master data stays local, load stays in cloud, and costs stay reasonable.

When it does not pay off

An own LLM does not pay off when (a) monthly token volume is below 5 million, (b) load is irregular (e.g. 3 days full load, 27 days idle per month), (c) no in-house DevOps capacity exists, (d) content is not strictly confidential.

Concretely: a 4-person law firm doing only occasional AI research turns an own GPU into a sunk investment. The same goes for an 8-person fiduciary with 50 clients and no continuous AI load. A small healthcare practice classifying 200 documents per month is better served by a cloud LLM in an EU region (Mistral, Anthropic EU) plus a DPIA and standard contractual clauses.

Most common wrong decision: the reflex "data protection means own hardware". Data protection means "no personal data uncontrolled in the cloud". It does not mean "no cloud at all". An EU-hosted API with contract clauses, DPIA, and data minimisation meets revDSG requirements for most content without buying hardware. If 2% of your queries really contain high-sensitivity data, filter that 2% with a small local model – the rest goes to EU cloud.
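The split between a local path for the sensitive share and an EU cloud path for everything else can be sketched as a small router. This is an illustration only: the regular expression and keyword list are a crude stand-in for the small local model mentioned above, the two backend functions are hypothetical placeholders, and a real filter would need to cover far more than one identifier format.

```python
import re
from typing import Callable

# Swiss AHV/OASI numbers are written like 756.1234.5678.97
AHV_PATTERN = re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b")
# Illustrative terms only, not an exhaustive list
SENSITIVE_TERMS = ("diagnosis", "medical report", "criminal record")


def is_sensitive(text: str) -> bool:
    """Crude stand-in for a local classification model."""
    lowered = text.lower()
    return bool(AHV_PATTERN.search(text)) or any(term in lowered for term in SENSITIVE_TERMS)


def route(text: str, local: Callable[[str], str], cloud: Callable[[str], str]) -> str:
    """Send sensitive prompts to the local backend, everything else to the cloud."""
    backend = local if is_sensitive(text) else cloud
    return backend(text)


# Hypothetical backends; in practice these would call the local model and the EU API.
def local_backend(prompt: str) -> str:
    return f"[local] {len(prompt)} chars processed on-premises"


def eu_cloud_backend(prompt: str) -> str:
    return f"[cloud] {len(prompt)} chars sent to EU API"


print(route("Client 756.1234.5678.97 requests a pension statement", local_backend, eu_cloud_backend))
print(route("Summarise the 2026 VAT rate changes", local_backend, eu_cloud_backend))
```

The design choice is that the decision runs entirely on local infrastructure, so a prompt is inspected before anything leaves the building.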

Trade-offs

STRENGTHS

  • Token costs approach zero at high load (>50M tokens/month), only power + personnel remain
  • Data never leaves your infrastructure – clean revDSG and professional secrecy argument
  • Latency under 200ms possible without depending on external API availability
  • Capacity is plannable and reservable; no rate limits or surprise provider price hikes

WEAKNESSES

  • Capital lock-up CHF 8,000-55,000 plus 33%/yr depreciation – the GPU ages faster than it amortises
  • DevOps effort 4-12h/month or managed service CHF 800-2,500/month – personnel cost does not vanish
  • Below 10M tokens/month cloud is always cheaper; self-host only pays off from 50M up
  • Model quality: open-weight models lag 5-15% behind Claude/GPT-4o on legal factuality

FAQ

What does a GPU cost in 2026?

A100 80GB SXM: CHF 17,000-22,000 new, from CHF 12,000 refurbished. H100 80GB: CHF 35,000-40,000. H200 141GB: CHF 45,000-55,000. RTX 6000 Ada (48GB) as solid mid-range: CHF 7,500-9,500. Cloud rental: A100 USD 1.07-5/hr depending on provider (Vast.ai cheapest, AWS priciest). Hetzner GPU servers fixed: EUR 600-1,400/month.

When does buying beat renting?

Rule of thumb: if the GPU runs 18+ months at minimum 60% utilisation, buying is cheaper. At lower utilisation or shorter commitment, rental wins. Concretely: an A100 at CHF 19,000 amortised over 36 months means CHF 528/month – Hetzner rental of the same class is EUR 600-800/month. So buying only pays off clearly when you provide power, location, and maintenance yourself and use it 100%.
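The effect of utilisation on the buy-versus-rent comparison can be made explicit by converting the monthly cost of an owned GPU into a price per hour of actual use. A minimal sketch: the function names are illustrative, the CHF 80 per month for power and cooling is an example value within the range this page quotes for an A100, and maintenance time is left out.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def owned_cost_per_month(purchase: float, months: int = 36,
                         power: float = 0.0, upkeep: float = 0.0) -> float:
    """Amortised purchase price plus running costs, per month."""
    return purchase / months + power + upkeep


def cost_per_used_hour(monthly_cost: float, utilisation: float) -> float:
    """Effective price of each hour in which the GPU actually does work."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return monthly_cost / (HOURS_PER_MONTH * utilisation)


a100_bare = owned_cost_per_month(19_000)               # hardware only
a100_running = owned_cost_per_month(19_000, power=80)  # plus power and cooling (example value)

print(f"hardware only: CHF {a100_bare:.0f}/month")
for utilisation in (1.0, 0.6, 0.2):
    hourly = cost_per_used_hour(a100_running, utilisation)
    print(f"{utilisation:>4.0%} utilisation: CHF {hourly:.2f} per used hour")
```

The resulting hourly figures can be held against the on-demand rental prices quoted earlier, bearing in mind that those are in USD and include hosting.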

Do I really need a DevOps person?

For production use, yes, unless you buy a managed service. GPU drivers, CUDA versions, model updates (a new Llama or Mistral appears every 2-4 months), inference server updates, monitoring alerts, backup validation: that is 4-12 hours per month. Without that care the setup stops being production-ready within 6 months. An external managed service costs CHF 800-2,500/month and is often the most economical solution for SMEs.

What do power plus cooling cost per month?

A100 80GB at 350W under load, 24/7 yields 252 kWh/month. At CHF 0.22/kWh Swiss industrial rate that is CHF 55/month. Plus 30-50% cooling (when the server sits in a climate-controlled room): CHF 70-85/month. H100 at 700W is double: CHF 140-170/month. In an office without a server room estimate three times that due to inefficient cooling.

Related topics

  • Self-hosted vs. cloud LLM: a decision framework for SMEs and fiduciaries (SELF-HOSTED VS. CLOUD · AI CONCEPT)
  • Self-hosted Ollama as an LLM provider: when does it replace OpenAI, Anthropic or Gemini? (SELF-HOSTED OLLAMA · LLM PROVIDER)
  • Hetzner as EU hosting for Swiss fiduciaries and SMEs: data centres, contracts, cost (HETZNER · TECH)
  • Multi-LLM routing: which model when, for how much (ROUTING · AI CONCEPT)
  • Cloud API vs. self-host: at what token volume does which pay off? (BREAK-EVEN · COSTS)

Sources

  1. a16z – Navigating the High Cost of AI Compute (TCO model for GPU inference) · 2026-03
  2. Vast.ai – On-Demand GPU Pricing (A100/H100/H200) · 2026-05
  3. Hetzner – Dedicated GPU Server Pricing (RTX 6000 Ada, A100) · 2026-05
  4. Together AI – Inference Pricing (Llama 3.1, Mistral, Qwen) · 2026-05
  5. Lambda Labs – GPU Cloud Pricing & Reserved Instances · 2026-05
