fairlane.systems

BREAK-EVEN · COSTS

Cloud API vs. self-host: at what token volume does which pay off?

Break-even analysis with May 2026 numbers. 1M / 10M / 100M / 1B tokens per month: where do cloud and self-host curves cross? Plus hidden costs.

As of: 2026-05

What is this about?

The "cloud API or self-host" question is mathematical, not ideological. Both models have defined cost functions, and exactly one intersection between them decides which makes sense for you. This page shows the curve at four points – 1M, 10M, 100M, 1B tokens per month – and names the hidden costs missing from naive calculations.

Cloud cost function: linear with token volume. Per 1M input tokens you pay USD 3 for Claude Sonnet, USD 2.50 for GPT-4o, USD 2 for Mistral Large, USD 0.30 for the current DeepSeek-V generation. Output tokens are billed on top, typically at 3-5x the input price. Storage and caching fees may apply. No fixed cost, no hardware commitment.
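A minimal sketch of the linear cloud cost function, assuming the 85% input / 15% output split this page uses in its tier calculations. The function name and parameters are illustrative, and storage or caching fees are left out.

```python
def cloud_monthly_cost_usd(tokens_m, input_price, output_price, input_share=0.85):
    """Monthly API cost in USD for `tokens_m` million tokens.

    Prices are USD per 1M tokens. `input_share` is an assumption
    (85% input / 15% output); measure your own ratio before relying on it.
    """
    input_m = tokens_m * input_share
    output_m = tokens_m * (1 - input_share)
    return input_m * input_price + output_m * output_price


# Claude Sonnet list prices quoted on this page: USD 3 input / USD 15 output.
for volume_m in (1, 10, 100, 1000):
    cost = cloud_monthly_cost_usd(volume_m, input_price=3.0, output_price=15.0)
    print(f"{volume_m:>5}M tokens/month -> USD {cost:,.2f}")
```

The cost scales in a straight line: ten times the volume means ten times the bill.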

Self-host cost function: heavily fixed-cost loaded, marginal token costs near zero. An A100-80 on Hetzner costs EUR 1,100/month fixed. DevOps 6h at CHF 150 = CHF 900/month. Power + cooling CHF 100/month. Together about CHF 2,100/month – regardless of 1k or 100M tokens. Marginal cost per token is essentially zero as long as you stay under GPU capacity.

The intersection sits where token volume times cloud price equals the self-host fixed cost. Rule of thumb with CHF 2,100 fixed cost and the 85/15 input/output split used in the tier calculations: for Claude Sonnet around 490M tokens per month, for Claude Opus already around 95M, for the current DeepSeek-V generation only at several billion. Model choice shifts break-even dramatically.
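The intersection can be computed directly. The sketch below assumes the 85/15 input/output split and the USD-to-CHF conversion implied by this page's own figures (USD 4.80 = CHF 4.30). The names are hypothetical, and the fixed-cost components are the monthly CHF figures from the page's assumptions.

```python
# Conversion implied by the page's own figures (USD 4.80 = CHF 4.30); an assumption.
USD_TO_CHF = 4.30 / 4.80

# Monthly self-host components in CHF, taken from the assumptions on this page.
SELF_HOST_FIXED_CHF = {
    "gpu_server": 1050,
    "devops": 900,
    "power": 85,
    "backup_monitoring_failover": 65,
}


def break_even_tokens_m(fixed_chf, input_price_usd, output_price_usd, input_share=0.85):
    """Monthly volume (millions of tokens) at which cloud spend equals fixed cost."""
    blended_usd = input_share * input_price_usd + (1 - input_share) * output_price_usd
    return fixed_chf / (blended_usd * USD_TO_CHF)


fixed = sum(SELF_HOST_FIXED_CHF.values())
sonnet = break_even_tokens_m(fixed, 3.0, 15.0)
print(f"Fixed cost CHF {fixed}/month, Sonnet break-even ~{sonnet:.0f}M tokens/month")
```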

Why the analysis matters

Without break-even calculation you make two typical errors.

Error 1: reflex "own hardware is cheaper". An 8-person office with 2M tokens/month reads in a blog that an A100-80 costs "only" EUR 1,100/month and orders a server. Cloud cost on Claude Sonnet would have been about USD 10/month (2M x USD 4.80 blended). Self-host costs CHF 2,100, a factor of roughly 240 more. After 12 months the loss is about CHF 25,000.

Error 2: reflex "cloud scales better". A 60-person company with 200M tokens/month pays around USD 960/month for Claude Sonnet (200M x USD 4.80 blended), about USD 11,500/year or CHF 10,300. A single A100-80 server costs CHF 25,200/year fixed, and 200M tokens exceed the 50-80M tokens/month one such server delivers, so on list price cloud is still cheaper at this volume. But: cloud contracts need renegotiation at every provider price jump, availability hangs on an external API, and the bill keeps growing linearly. At 1B tokens, self-host reaches rough cost parity with Sonnet and is more predictable.

Third point: hidden costs. Cloud pricing looks clear but has hidden items: egress fees if you send many tokens (rarely relevant), logging storage, rate-limit handling at peaks (often architecture effort), vendor lock-in. Self-host has hidden costs in DevOps time, depreciation, downtime risk without SLA.

Fourth point: the curve shifts. GPU prices drop 25-35%/year. Cloud token prices drop 10-30%/year (on older models) or stay stable (top models). A 2025 break-even is wrong today. As of May 2026 intersections tend higher – i.e. more pro-cloud – than they were in 2025.

Four volume tiers calculated

Assumptions: typical ratio 85% input / 15% output. Cloud model: Claude Sonnet (USD 3/15) as middle reference. Self-host: Llama 3.1 70B on Hetzner A100-80 GPU server EUR 1,100/month = CHF 1,050. DevOps 6h x CHF 150 = CHF 900. Power EUR 90 = CHF 85. Backup, monitoring, failover cloud bridge CHF 65. Total self-host: CHF 2,100/month. Self-host capacity: 50-80M tok/month at 70% utilisation with vLLM.

Tier 1: 1 million tokens/month (small fiduciary) Cloud Sonnet: 0.85M x 3 + 0.15M x 15 = USD 4.80/month = CHF 4.30. Plus one-time DPIA effort CHF 800. Self-host: CHF 2,100/month. Ratio: cloud 488x cheaper. Cloud, no debate.

Tier 2: 10 million tokens/month (mid law firm) Cloud Sonnet: 8.5 x 3 + 1.5 x 15 = USD 48/month = CHF 43. Self-host: CHF 2,100/month. Ratio: cloud 49x cheaper. Self-host only under data-protection compulsion.

Tier 3: 100 million tokens/month (mid-cap / voice-agent provider) Cloud Sonnet: 85 x 3 + 15 x 15 = USD 480/month = CHF 430. Self-host: CHF 2,100/month. But: 100M tokens sit at or above the upper edge of what one A100-80 delivers (50-80M tokens/month at 70% utilisation), so it fits only at higher utilisation and with little headroom. At peak: second GPU or cloud bridge needed. Ratio: self-host 5x more expensive than cloud. Self-host only under data-protection compulsion, latency requirement, or expected load doubling.

Tier 4: 1 billion tokens/month (AI platform provider / SaaS with AI feature) Cloud Sonnet: 850 x 3 + 150 x 15 = USD 4,800/month = CHF 4,300. Self-host: one A100-80 is not enough. Three to four A100-80 plus tensor parallelism: CHF 4,500/month hardware (Hetzner) + CHF 1,500 DevOps + CHF 300 power = CHF 6,300. Alternatively: one H100-80 plus cloud bridge for peaks, CHF 4,500/month. Ratio against Sonnet: self-host lands between rough parity (CHF 4,500 vs. 4,300) and about 45% more expensive (CHF 6,300); against pricier cloud models such as Opus it is clearly cheaper. Data protection and latency further favour self-host. At this tier self-host becomes the serious option.
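The four tiers can be reproduced in a few lines. This is a sketch under the page's stated assumptions (85/15 split, Sonnet at USD 3/15, the self-host totals quoted per tier) plus the conversion rate implied by USD 4.80 = CHF 4.30. The function and variable names are illustrative. A ratio above 1 means self-host costs more than Sonnet at list price.

```python
USD_TO_CHF = 4.30 / 4.80  # implied by the page's conversions; an assumption

# Self-host monthly cost per tier in CHF as stated above; Tier 4 has two variants.
SELF_HOST_CHF = {1: [2100], 10: [2100], 100: [2100], 1000: [4500, 6300]}


def sonnet_cost_chf(tokens_m, input_share=0.85):
    """Claude Sonnet monthly cost in CHF at USD 3 input / USD 15 output per 1M."""
    usd = tokens_m * (input_share * 3.0 + (1 - input_share) * 15.0)
    return usd * USD_TO_CHF


def tier_rows():
    """Return (tokens_m, cloud_chf, self_host_chf, self_host / cloud) per variant."""
    rows = []
    for tokens_m, variants in SELF_HOST_CHF.items():
        cloud = sonnet_cost_chf(tokens_m)
        for self_host in variants:
            rows.append((tokens_m, cloud, self_host, self_host / cloud))
    return rows


for tokens_m, cloud, self_host, ratio in tier_rows():
    print(f"{tokens_m:>5}M  cloud CHF {cloud:>8.2f}  self-host CHF {self_host:>5}  ratio {ratio:6.2f}x")
```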

Model variation: what if you use a different model than Sonnet? Break-even figures use the same 85/15 split and CHF 2,100 fixed cost.
  • Claude Opus (USD 15/75): at 10M tokens Opus costs USD 240/mo against CHF 2,100 self-host; break-even shifts down to around 95M tokens.
  • GPT-4o (USD 2.50/10): somewhat cheaper than Sonnet, around USD 360/mo at 100M; break-even around 650M.
  • Mistral Large 2 (USD 2/6): at 100M Mistral costs USD 260/mo; break-even around 900M.
  • The current DeepSeek-V generation (USD 0.30/0.50): at 100M DeepSeek costs USD 33/mo, so self-host is about 65x more expensive here; break-even only above 6B tokens.
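A short sketch of how the comparison across models can be tabulated, using the list prices quoted on this page. The 85/15 split, the CHF 2,100 single-server fixed cost and the conversion rate are assumptions carried over from the tier calculation, and all names are illustrative.

```python
USD_TO_CHF = 4.30 / 4.80      # implied by the page's conversions; an assumption
SELF_HOST_FIXED_CHF = 2100    # single-server total from the assumptions above
INPUT_SHARE = 0.85

# List prices in USD per 1M tokens (input, output) as quoted on this page.
MODELS = {
    "Claude Opus": (15.0, 75.0),
    "Claude Sonnet": (3.0, 15.0),
    "GPT-4o": (2.50, 10.0),
    "Mistral Large 2": (2.0, 6.0),
    "DeepSeek-V": (0.30, 0.50),
}


def blended_usd(input_price, output_price):
    """Weighted price per 1M tokens for the assumed input/output mix."""
    return INPUT_SHARE * input_price + (1 - INPUT_SHARE) * output_price


def compare(models=MODELS):
    """Per model: blended USD/1M, USD cost at 100M tokens, break-even in M tokens."""
    table = {}
    for name, (inp, out) in models.items():
        price = blended_usd(inp, out)
        table[name] = (price, 100 * price, SELF_HOST_FIXED_CHF / (price * USD_TO_CHF))
    return table


for name, (price, at_100m, break_even) in compare().items():
    print(f"{name:<16} USD {price:5.2f}/1M  USD {at_100m:7.2f} at 100M  break-even {break_even:7.0f}M")
```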

Lesson: anyone justifying self-host should first check whether a cheaper cloud model (the current DeepSeek-V generation or Mistral Small) also solves the task. In many cases "self-host vs. top cloud" is the wrong question – the right question is "top cloud vs. small cloud".

Break-even in 6 steps

  1. Measure token volume for one week (LiteLLM log, Anthropic console, OpenAI usage). Input and output separately. Extrapolate to 12 months.
  2. Determine model mix: what share top, standard, budget? Derive a weighted price per 1M tokens.
  3. Calculate cloud annual cost: ((input-M x input-price) + (output-M x output-price)) x 12, with input-M and output-M as monthly volumes in millions of tokens.
  4. Calculate self-host annual cost: 12 x (hardware rental or amortisation + DevOps + power + failover + monitoring).
  5. Add hidden costs: cloud (rate-limit handling, vendor risk, compliance overhead). Self-host (depreciation 30-40%/yr, downtime risk, DevOps recruiting).
  6. Decide: if cloud annual cost is below 0.7 x self-host annual cost, choose cloud. If self-host annual cost is below 0.8 x cloud annual cost, choose self-host. In between, check hybrid.
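The steps above can be condensed into a small calculation. This is a sketch, not a finished tool: the function names are hypothetical, all amounts must be in one currency, and the hidden costs from step 5 are passed in as a flat annual amount that you estimate yourself.

```python
def annual_cloud_cost(input_m, output_m, input_price, output_price, hidden=0.0):
    """Steps 3 and 5: monthly volumes in millions of tokens, prices per 1M tokens."""
    return (input_m * input_price + output_m * output_price) * 12 + hidden


def annual_self_host_cost(hardware, devops, power, failover_and_monitoring, hidden=0.0):
    """Steps 4 and 5: monthly amounts, except `hidden`, which is annual."""
    return 12 * (hardware + devops + power + failover_and_monitoring) + hidden


def decide(cloud_annual, self_host_annual):
    """Step 6: apply the 0.7 / 0.8 thresholds."""
    if cloud_annual < 0.7 * self_host_annual:
        return "cloud"
    if self_host_annual < 0.8 * cloud_annual:
        return "self-host"
    return "check hybrid"


# Example: 100M tokens/month on Sonnet (85M input, 15M output), prices
# converted to CHF with the rate implied by this page (an assumption).
rate = 4.30 / 4.80
cloud = annual_cloud_cost(85, 15, 3.0 * rate, 15.0 * rate)
self_host = annual_self_host_cost(1050, 900, 85, 65)
print(decide(cloud, self_host))
```

The two thresholds leave a deliberate grey zone: when neither option is at least 20-30% cheaper, the estimate is too uncertain to justify a pure choice.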

When self-host actually pays off

On purely financial grounds, with the single-server fixed cost of CHF 2,100/month and an 85/15 input/output split, self-host pays off above these thresholds:
  • For top cloud models (Claude Opus, GPT-4 Turbo, o1): around 100M tokens/month (Opus pricing)
  • For standard cloud models (Claude Sonnet, GPT-4o, Mistral Large): 500-900M tokens/month
  • For budget cloud models (the current DeepSeek-V generation, Mistral Small, GPT-4o-mini, Haiku): several billion tokens/month

Non-financial reasons that justify self-host even when the math says otherwise: (a) regular processing of especially protected personal data (Art. 9 revDSG) – a DPIA for cloud LLM is feasible, but for especially protected data it may lead to a self-host recommendation; (b) client-contractual clauses excluding cross-border transfer; (c) latency requirements under 200ms where cloud API is too slow (real-time voice agent, trading setups); (d) regulatory requirements (FINMA, strict ISO 42001 interpretation, industry-specific rules).

Hybrid setups are most common in practice and often the most economical solution: 70-90% of load runs on cloud (cost-optimal), 10-30% on a small local model for PII filter and sensitive classification. Hardware budget for the local node: CHF 8,000-15,000 purchase or EUR 500-700/month Hetzner GPU server. LiteLLM or a custom router (see multi-LLM routing strategies) decides per request.
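The per-request decision can be as simple as a sensitivity check in front of the router. The sketch below is plain Python and does not use the LiteLLM API. The two patterns are illustrative only, and the function names are hypothetical; a production PII filter needs far broader coverage.

```python
import re

# Illustrative patterns only; a real PII filter needs far broader coverage.
PII_PATTERNS = [
    re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b"),  # Swiss AHV number layout
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),        # e-mail address
]


def route(prompt, force_local=False):
    """Return "local" for prompts that look sensitive, otherwise "cloud"."""
    if force_local or any(p.search(prompt) for p in PII_PATTERNS):
        return "local"
    return "cloud"


def local_share(prompts):
    """Fraction of prompts routed to the local node."""
    routes = [route(p) for p in prompts]
    return routes.count("local") / len(routes) if routes else 0.0
```

Running `local_share` over a week of logged prompts gives a first estimate of whether the local node really sees the 10-30% of load assumed above.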

When self-host is a wrong decision

Self-host is a mistake when (a) volume is below 10M tokens/month and the application will not grow steadily, (b) no in-house DevOps capacity exists and no managed-service contract is signed, (c) load is irregular (peaks with long idle phases), (d) content is not strictly confidential and can be processed with a simple DPIA + contract clauses in EU cloud.

Concretely: a 6-person fiduciary processing at most 5M tokens/month has no rational reason for own hardware. Even with maximum data-protection ambition, Mistral Large 2 in an EU region (USD 2/6) plus DPIA and contract clauses is markedly cheaper than any self-host. Over 12 months, cloud costs roughly CHF 22,000 less than self-host.

A 4-person architecture office running one larger AI analysis per quarter (building plan check, code review) has an extremely irregular load profile. A dedicated GPU would sit idle 95% of the time, so on-demand cloud LLM is the only rational choice.

General note: introducing self-host for "data protection" without simultaneously implementing audit trail, RBAC, backup, and update strategy delivers no data protection, only hardware. Cloud LLM with contract clauses and audit trail is the clean way in that case. Self-host is a discipline, not a switch.

Trade-offs

STRENGTHS

  • Cloud under 10M tokens/month unbeatably cheap – CHF 5-50/month at standard models, no hardware commitment
  • Self-host beyond break-even clearly cheaper – fixed cost stays flat and marginal cost per token is near zero up to GPU capacity
  • Hybrid combines privacy advantage with cloud economics – typical CHF 800-2,500/month for SME with mixed load
  • Mathematical clarity: token volume times cloud price vs. self-host fixed – no faith question

WEAKNESSES

  • Cloud pricing changes quarterly – pipeline must not be hard-bound to one provider
  • Self-host needs 4-12h/month DevOps or managed service CHF 800-2,500/month – personnel cost does not vanish
  • Hidden costs (depreciation, compliance, lock-in) distort naive calculations by factor 1.5-3
  • Data-protection requirements can force self-host even when math favours cloud

FAQ

What is the simple break-even formula?

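Break-even tokens (in millions per month) = self-host fixed cost per month / blended cloud price per 1M tokens. Example: self-host CHF 2,100, Cloud Sonnet effective USD 4.80 per 1M (weighted 85/15) = around CHF 4.30. Break-even: 2100 / 4.30 = 488M tokens/month. At Opus with effective CHF 22 per 1M: 2100/22 = 95M. At DeepSeek with CHF 0.34 per 1M: 6,200M. The formula assumes the fixed cost stays flat; above the 50-80M tokens/month a single A100-80 delivers, each additional GPU raises it and pushes break-even further out.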
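Written out as a formula, with symbols introduced here for illustration: input share s (0.85 on this page), per-1M prices p_in and p_out, monthly fixed cost F, and break-even volume V* in millions of tokens per month.

```latex
\[
p = s\,p_{\text{in}} + (1 - s)\,p_{\text{out}}, \qquad
V^{*} = \frac{F}{p}
\]
\[
\text{Sonnet: } p = 0.85 \cdot 3 + 0.15 \cdot 15 = 4.80\ \text{USD} \approx 4.30\ \text{CHF}, \qquad
V^{*} = \frac{2100}{4.30} \approx 488
\]
```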

Which hidden costs get forgotten most?

Cloud: logging storage (CHF 50-200/month with full request capture), rate-limit handling (architecture overhead at peaks), vendor lock-in (migration on price jump costs 2-4 weeks), compliance audit (DPIA and contract clauses one-time CHF 1,500-4,000). Self-host: GPU depreciation 30-40%/yr (hidden), on-call contract for weekend outages, update effort at each model release (every 2-4 months), peak power tariffs.

Is hybrid better than cloud-only or self-host-only?

Yes, for fiduciaries, law firms and SMEs with 5-50M tokens/month almost always. Hybrid means: a small local server (Llama 3.1 8B on RTX 4090, around CHF 200-400/month hardware rental) for PII filtering and the sensitive 10-20% of queries; the rest goes to cloud (Mistral EU, Claude Sonnet). Advantages: data-protection argument for sensitive queries, cost efficiency for standard queries, no full GPU server overhead. LiteLLM routing automates the split.

How often does the calculation flip?

Check quarterly, recompute fully annually. Cloud prices drop 10-30%/yr on established models. GPU prices drop 25-35%/yr (refurbished faster). Model efficiency rises: new Llama versions need less VRAM for the same quality, new cloud models add caching and off-peak discounts. A 2024 architecture decision is often no longer optimal in 2026.
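How the break-even moves depends on which side falls faster. The sketch below projects it forward under two simplifying assumptions that are not from the sources: only the hardware share of the fixed cost (CHF 1,050 of CHF 2,100) follows GPU prices, and both price drops compound at a constant annual rate. The names and default rates are illustrative.

```python
def project_break_even(years, hardware_chf=1050, other_fixed_chf=1050,
                       cloud_price_chf=4.30, gpu_drop=0.30, cloud_drop=0.20):
    """Break-even volume (M tokens/month) per year under constant annual drops.

    Assumption: only `hardware_chf` follows GPU prices; DevOps, power and
    monitoring (`other_fixed_chf`) stay flat.
    """
    results = []
    for year in range(years + 1):
        hardware = hardware_chf * (1 - gpu_drop) ** year
        price = cloud_price_chf * (1 - cloud_drop) ** year
        results.append((year, (hardware + other_fixed_chf) / price))
    return results


for year, volume in project_break_even(3):
    print(f"year {year}: break-even ~{volume:.0f}M tokens/month")

# Same projection for a top model whose price stays stable.
for year, volume in project_break_even(3, cloud_drop=0.0):
    print(f"year {year} (stable cloud price): ~{volume:.0f}M tokens/month")
```

With falling cloud prices the break-even drifts upward, because DevOps and power do not get cheaper along with the GPU. With stable cloud prices it drifts downward. That is why the calculation needs a regular re-run rather than a one-off decision.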


Sources

  1. a16z – Navigating the High Cost of AI Compute (break-even framework) · 2026-03
  2. Vast.ai – Cost Calculator & GPU Pricing · 2026-05
  3. Anthropic – Claude API Pricing & Cache Economics · 2026-05
  4. Hetzner – GPU Server Pricing Matrix · 2026-05
  5. Together AI – Inference Cost Benchmark (Llama 3.1, Mixtral, DeepSeek) · 2026-05
  6. fairlane.systems – Hybrid Setup Case Study (fiduciary office, 5 people, May 2026) · 2026-05
