fairlane.systems

GPU cost calculator 2026: T4, L4, A10, A40, A100, H100, H200 compared

Which GPU fits which model, what does it cost at which provider, on-demand vs. reserved? May 2026 prices from AWS, GCP, Azure, Hetzner, RunPod, Vast.ai.

As of: 2026-05

What is this about?

Choosing a GPU is the most expensive single decision when building your own LLM stack. The wrong choice means either hardware that cannot load the desired model (too little VRAM) or hardware that sits 90% idle (overspecified). Either error costs between CHF 5,000 and CHF 40,000 on the first hardware generation, depending on GPU class.

This page lists the eight relevant GPU classes in May 2026 with their typical VRAM, the matching model sweet spot, and on-demand plus reserved prices at eight providers: AWS, GCP, Azure, Hetzner, RunPod, Vast.ai, Lambda Labs, CoreWeave. All numbers are as of May 2026 – they shift quarterly, usually downward.

The rule of thumb: hosting Llama 3.1 8B or similarly small models (Phi-4, Gemma 9B) needs 16-24GB VRAM – a T4, L4, or RTX 4090 suffices. For Mistral Large or Llama 3.1 70B in 4-bit quantisation, you need 48-80GB – an A40 or A100-80. To run Llama 3.1 405B or large Mixture-of-Experts models (the current DeepSeek-V generation, Qwen 3.5 235B) locally, you cannot avoid an H100 cluster or H200.

Why the GPU choice is critical

Three errors regularly drive up GPU cost.

Error 1: VRAM underestimated. A model loads only when all weights plus KV cache plus activations fit in VRAM. Llama 3.1 70B in 16-bit float needs 140GB for the weights alone – one A100-80 is not enough. In 4-bit quantisation that drops to about 42GB including reserve – this fits one A100-80 or two RTX 4090 with tensor parallelism. In 8-bit the weights need 70GB – this barely fits an A100-80 and leaves little room for context. A wrong calculation buys a server that cannot start the model.
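The figures in Error 1 follow from simple arithmetic, sketched here in Python. The function names and the 20% reserve factor applied to the 4-bit case are illustrative assumptions; the sketch counts weights only and does not model how the KV cache grows with context length.

```python
def weights_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: one billion parameters at 8 bits is about 1 GB."""
    return params_billion * bits / 8


def fits(required_gb: float, vram_gb: float) -> bool:
    """True if the requirement fits into the available VRAM."""
    return required_gb <= vram_gb


llama_70b = 70  # parameters in billions

fp16 = weights_gb(llama_70b, 16)  # weights only
int8 = weights_gb(llama_70b, 8)   # weights only
# 4-bit weights plus an assumed 20% reserve for KV cache and activations
int4_with_reserve = weights_gb(llama_70b, 4) * 1.2

print(round(fp16), round(int8), round(int4_with_reserve))
print(fits(fp16, 80), fits(int8, 80), fits(int4_with_reserve, 48))
```

The 8-bit case passes the check against 80GB only because no reserve is included, which is why the text calls it a barely fitting configuration.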

Error 2: Throughput underestimated. A T4 (16GB) can load Llama 3.1 8B in 4-bit but delivers only 15-25 tokens/second. An A100-80 hits 80-200 tokens/second on the same model. Interactive latency under 300ms takes more GPU than the minimum needed to load the model.

Error 3: Reserved vs. on-demand miscalculated. On-demand prices are 2-5x higher than 1-year reserved. Running 24/7 on on-demand burns money. Running 6 hours/day on reserved burns it the other way: you pay for 18 idle hours every day.
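A minimal sketch of the comparison behind Error 3, assuming a reserved instance is billed for all 24 hours regardless of use. The function names are hypothetical; the example rates are the Lambda Labs A100 80GB figures quoted on this page (USD 1.79/h on-demand, USD 1.20/h reserved).

```python
def effective_rate(reserved_rate: float, hours_used_per_day: float) -> float:
    """Cost per hour of actual use when a reserved instance is billed 24 h/day."""
    if not 0 < hours_used_per_day <= 24:
        raise ValueError("hours_used_per_day must be in (0, 24]")
    return reserved_rate * 24 / hours_used_per_day


def cheaper_option(on_demand_rate: float, reserved_rate: float,
                   hours_used_per_day: float) -> str:
    """Return 'reserved' or 'on-demand', whichever costs less per day."""
    on_demand_daily = on_demand_rate * hours_used_per_day
    reserved_daily = reserved_rate * 24  # billed around the clock
    return "reserved" if reserved_daily < on_demand_daily else "on-demand"


# 6 hours of use per day on a USD 1.20/h reservation
print(round(effective_rate(1.20, 6), 2))
print(cheaper_option(1.79, 1.20, 6))
print(cheaper_option(1.79, 1.20, 24))
```

At 6 hours/day the reservation costs USD 4.80 per hour actually used, well above the on-demand rate; at 24 hours/day the reservation wins.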

GPU prices in May 2026 are volatile. H100 80GB has dropped 20-30% vs. May 2025 because H200 and B100 are on the market. A100 has dropped another 15%. A GPU bought today will be worth 25-35% less in 12 months than on the day of purchase. Reserved cloud contracts can hedge this risk.

GPU classes and prices in detail

T4 (16GB VRAM, Turing) – Entry class for inference. Llama 3.1 8B in 4-bit runs at 15-25 tok/s. On-demand: AWS USD 0.526/h, GCP USD 0.35/h, Vast.ai USD 0.20-0.30/h, RunPod USD 0.20/h. 1-year reserved: about 40% off.

L4 (24GB VRAM, Ada Lovelace) – Efficient inference, low power (72W). Llama 3.1 8B runs smoothly at 30-50 tok/s. On-demand: GCP USD 0.7/h, RunPod USD 0.45/h, Vast.ai USD 0.30-0.45/h. Sweet spot for cost-optimised 8B hosting.

A10 (24GB VRAM, Ampere) – Mid-range, often for embedding models or smaller LLMs. On-demand: AWS USD 1.006/h, Azure USD 1.0/h, RunPod USD 0.60/h, Vast.ai USD 0.40-0.75/h.

A40 (48GB VRAM, Ampere) – Solid mid-range option. Llama 3.1 70B in 4-bit fits, but only just (about 42GB of 48GB). Mistral Large 2 (123B) in 4-bit exceeds a single card. On-demand: AWS USD 1.5/h, RunPod USD 0.85/h, Vast.ai USD 0.65-1.10/h.

A100 40GB (Ampere) – Standard class for smaller models. Llama 3.1 70B does not fit on a single card even in 4-bit (about 42GB needed). On-demand: AWS USD 3.06/h, GCP USD 3.67/h, Azure USD 3.4/h, RunPod USD 1.30/h, Vast.ai USD 0.80-1.30/h, Lambda Labs USD 1.10/h. 1-year reserved: 30-50% off.

A100 80GB (Ampere) – Workhorse for 70B-class models. Llama 3.1 70B in 8-bit comfortable, in 16-bit not (140GB needed). On-demand: AWS USD 4.10/h, GCP USD 4.95/h, Azure USD 5.05/h, RunPod USD 1.80/h, Vast.ai USD 1.07-1.80/h, Lambda Labs USD 1.79/h, CoreWeave USD 2.21/h. 1-year reserved Lambda: USD 1.20/h. Hetzner: EUR 850-1,200/month fixed.

H100 80GB (Hopper) – High-end for production inference and small training loads. Llama 3.1 70B in 16-bit possible across two cards (140GB needed), very fast (200-400 tok/s). On-demand: AWS USD 12.29/h (DGX), Azure USD 10.0/h, GCP USD 11.06/h, RunPod USD 2.79/h, Vast.ai USD 2.50-4.50/h, Lambda Labs USD 2.99/h, CoreWeave USD 4.25/h. 1-year reserved: USD 2.00-2.50/h. Hetzner: not yet in regular programme May 2026.

H200 141GB (Hopper) – Top class. Llama 3.1 70B in 16-bit with long context across two cards (the 140GB of weights alone nearly fill one), or 405B in 4-bit (with tensor parallelism over 4 GPUs). On-demand: RunPod USD 4-7/h, Lambda Labs USD 4.50/h, CoreWeave USD 6.50/h. AWS/GCP/Azure still limited, USD 8-15/h.

Model-to-GPU table:

  • Llama 3.1 8B (4-bit): T4, L4, A10 ok. RTX 4090 perfect.
  • Llama 3.1 70B (4-bit): A100-80 or two RTX 4090. A40 tight.
  • Llama 3.1 70B (16-bit): two H100-80 or two A100-80 with NVLink. A single H200-141 only with minimal context.
  • Llama 3.1 405B (4-bit): 4x H100-80 or 2x H200. Not SME-relevant.
  • Mistral Large 2 (4-bit): A100-80, tight. Does not fit a single A40.
  • the current DeepSeek-V generation (MoE, 4-bit): H100 cluster or H200. All experts must sit in memory even though only a few are active per token.
  • Qwen 3.5 32B (4-bit): A40, A100-40 ok.
  • Phi-4 14B: L4, A10, A40 all ok.

GPU selection in 6 steps

  1. Fix the model: which LLM (name + size + quantisation) will you run in the next 12 months?
  2. Calculate VRAM need: weights + KV cache + 20% reserve. Rule: 4-bit = params * 0.6 GB, 8-bit = params * 1.1 GB, 16-bit = params * 2 GB.
  3. Throughput need: queries per second at peak times × average tokens per answer = tok/s needed.
  4. Match GPU candidates: from the table above, pick classes with enough VRAM and throughput.
  5. Decide on-demand vs. reserved: > 60% utilisation 24/7 = reserved or buy. Otherwise on-demand at RunPod/Vast.ai.
  6. Query provider prices: get hourly rates from at least 3 providers (Lambda Labs, RunPod, Vast.ai). Check Hetzner if a fixed price is preferred.
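Steps 2 and 4 can be sketched in a few lines of Python. The per-billion factors are the rule of thumb from step 2 and the VRAM sizes are the ones listed on this page; the function names and the `gpus_per_node` parameter are hypothetical, and the sketch filters on VRAM only, not on throughput.

```python
# GB of VRAM per billion parameters, from the rule of thumb in step 2
GB_PER_BILLION = {4: 0.6, 8: 1.1, 16: 2.0}

# VRAM per GPU class in GB, as listed on this page
GPU_VRAM_GB = {
    "T4": 16, "L4": 24, "A10": 24, "A40": 48,
    "A100-40": 40, "A100-80": 80, "H100-80": 80, "H200-141": 141,
}


def vram_need_gb(params_billion: float, bits: int) -> float:
    """Rule-of-thumb VRAM need for a model of the given size and precision."""
    return params_billion * GB_PER_BILLION[bits]


def candidates(params_billion: float, bits: int, gpus_per_node: int = 1) -> list[str]:
    """GPU classes whose combined VRAM covers the rule-of-thumb need."""
    need = vram_need_gb(params_billion, bits)
    return [name for name, vram in GPU_VRAM_GB.items()
            if vram * gpus_per_node >= need]


print(candidates(70, 4))                    # Llama 3.1 70B, 4-bit, single card
print(candidates(70, 16))                   # 16-bit, single card
print(candidates(70, 16, gpus_per_node=2))  # 16-bit, two cards
```

Summing VRAM across cards is a simplification: tensor parallelism adds overhead and needs a fast interconnect, so a candidate that only just passes should be treated as tight.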

When which GPU class

The choice follows four questions.

Question 1: Which model? The biggest model you will run in the next 12 months sets the VRAM minimum. Starting today with Llama 3.1 8B but planning to switch to 70B in 6 months means buying an A100-80 directly instead of an L4.

Question 2: How much throughput? At 200 queries/month a T4 suffices (throughput is not the bottleneck). At 200 queries/hour you need A100 class or tensor parallelism. At 200 parallel live sessions (voice agent, chat platform) H100 is minimum.
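To see why these three load levels differ so much, it helps to convert them into tokens per second. The sketch below is illustrative only: 500 tokens per answer and 10 tokens/second per live session are assumed values, not figures from this page, and the results are averages rather than peak load.

```python
SECONDS_PER = {"hour": 3600, "month": 30 * 24 * 3600}


def average_tok_per_s(queries: int, per: str, tokens_per_answer: int = 500) -> float:
    """Average generation rate implied by a query volume (assumed answer length)."""
    return queries * tokens_per_answer / SECONDS_PER[per]


def streaming_tok_per_s(sessions: int, tok_per_s_per_session: float = 10.0) -> float:
    """Aggregate rate when every parallel session streams at once (assumed rate)."""
    return sessions * tok_per_s_per_session


print(round(average_tok_per_s(200, "month"), 3))  # 200 queries/month
print(round(average_tok_per_s(200, "hour"), 1))   # 200 queries/hour
print(streaming_tok_per_s(200))                   # 200 parallel live sessions
```

Under these assumptions the three cases span roughly five orders of magnitude, from a small fraction of a token per second to about 2,000 tokens per second.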

Question 3: 24/7 or peaks? With 24/7 load, buying or reserved pays off. With < 8h/day load, on-demand at RunPod or Vast.ai is usually equally priced or cheaper.

Question 4: Latency target? Under 200ms time-to-first-token needs A100-80 or better. Under 100ms only H100/H200. Above 500ms an L4 suffices.
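The latency thresholds in Question 4 can be written as a small lookup. This is a sketch with a hypothetical function name; the text gives no recommendation for targets between 200ms and 500ms, so the function says so rather than guessing.

```python
def gpu_class_for_ttft(target_ms: float) -> str:
    """Map a time-to-first-token target to the GPU class named in the text."""
    if target_ms < 100:
        return "H100/H200"
    if target_ms < 200:
        return "A100-80 or better"
    if target_ms > 500:
        return "L4"
    # 200-500 ms is not covered by the thresholds in the text
    return "not specified (between L4 and A100-80)"


for target in (80, 150, 300, 800):
    print(target, "ms ->", gpu_class_for_ttft(target))
```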

Typical setups:

  • 10-person fiduciary with RAG over 5k documents, 200 queries/month: L4 or A10 suffices. Hetzner GPU server EUR 600/month or RunPod L4 USD 0.45/h x 200h = USD 90/month.
  • 80-attorney law firm with contract generator and research, 5k queries/month: A100-80 needed. Hetzner EUR 1,100/month or Lambda Labs reserved USD 1.20/h x 720h = USD 864/month.
  • Voice agent with 20 parallel sessions: H100 80GB, Lambda Labs USD 2.99/h x 720h = USD 2,150/month (continuous), or reserved USD 2.00/h = USD 1,440/month.
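The monthly figures in the typical setups are hourly rate times billed hours, with 720 hours standing for a month of continuous operation. A minimal sketch, with a hypothetical function name:

```python
def monthly_cost(rate_per_hour: float, hours_per_month: float = 720) -> float:
    """Monthly cost in the currency of the hourly rate."""
    return rate_per_hour * hours_per_month


print(round(monthly_cost(0.45, 200), 2))  # RunPod L4, 200 h of use
print(round(monthly_cost(1.20), 2))       # Lambda Labs A100-80 reserved, 24/7
print(round(monthly_cost(2.99), 2))       # Lambda Labs H100 on-demand, 24/7
print(round(monthly_cost(2.00), 2))       # H100 reserved, 24/7
```

The H100 on-demand result is USD 2,152.80, which the text rounds to USD 2,150.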

When no GPU at all

When token volume is below 5M per month and content is not sensitive, you need no GPU at all. A cloud LLM API (OpenAI, Anthropic, Mistral, DeepSeek) costs under USD 20/month at this load – no GPU on earth amortises below that.
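A rough comparison of API cost against the cheapest always-on GPU makes the gap concrete. The API price of USD 3 per million tokens is an assumed blended figure for illustration, not a quoted provider price; the GPU rate is the RunPod T4 on-demand price from this page (USD 0.20/h).

```python
def api_cost_usd(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly API bill for a given token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def gpu_cost_usd(rate_per_hour: float, hours_per_month: float = 720) -> float:
    """Monthly bill for a GPU instance that runs continuously."""
    return rate_per_hour * hours_per_month


def break_even_tokens(rate_per_hour: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which the always-on GPU costs the same as the API."""
    return gpu_cost_usd(rate_per_hour) / usd_per_million_tokens * 1_000_000


ASSUMED_API_PRICE = 3.0  # USD per million tokens, illustrative assumption

print(round(api_cost_usd(5_000_000, ASSUMED_API_PRICE), 2))
print(round(gpu_cost_usd(0.20), 2))
print(round(break_even_tokens(0.20, ASSUMED_API_PRICE)))
```

Under these assumptions the API costs USD 15 for 5M tokens while the T4 costs USD 144 per month, so the GPU only breaks even near 48M tokens per month.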

When the application only needs embeddings (vector DB build, semantic search without a generative component), a CPU server suffices. Embedding models like BGE-large or Multilingual-E5 run on CPUs at 100-300 tokens/second – enough for typical SME applications.

When the application only does classification (e.g. document recognition with fixed categories), a classical ML model on CPU often suffices instead of an LLM. XGBoost, scikit-learn, sentence-transformers – all run in production without a GPU.

Another typical error: buying a GPU to achieve "privacy" without implementing the contractual and audit prerequisites. Running a local Llama 3.1 70B without audit log, RBAC, and backup strategy gives you no privacy, only the illusion of it. A cloud LLM in an EU region with contract clauses and an audit trail is the cleaner choice in this case.

Trade-offs

STRENGTHS

  • Reserved 1-year prices cut on-demand 30-50% – predictable monthly cost
  • Hetzner GPU servers fixed at EUR 600-1,400/month – no hourly accounting, clear budget
  • Vast.ai on-demand for peak loads from USD 0.20-1.07/h depending on class – no lock-in
  • Refurbished A100/H100 30-40% cheaper than new – sensible with limited capital and 24/7 load

WEAKNESSES

  • Depreciation 30-40% in year one – a purchased GPU is worth about one third less after 12 months
  • Vast.ai reliability variable – no 99.9% SLA, risky for production workloads
  • AWS/GCP/Azure prices 2-3x above specialist providers – only sensible with free credits or specific integration
  • Hetzner H100/H200 May 2026 not regularly available – top class needs Lambda Labs, RunPod, CoreWeave

FAQ

What is the cheapest GPU for Llama 3.1 70B?

In 4-bit quantisation: two RTX 4090 (CHF 1,800-2,200 each) with tensor parallelism, or one A100-80GB. Cloud: Vast.ai A100-80 from USD 1.07/h, Lambda Labs reserved USD 1.20/h. Hetzner GPU server with A100 EUR 1,100-1,400/month. Self-host A100-80 amortises vs. cloud reserved after 14-16 months of 24/7 operation.

AWS or Hetzner – which is cheaper?

Hetzner is almost always cheaper for steady load. AWS A100-80 on-demand USD 4.10/h = USD 2,950/month. Hetzner GPU server A100 EUR 1,100-1,400/month = CHF 1,050-1,350. Even AWS reserved 1-year (USD 2.20/h = USD 1,585/month) is pricier than Hetzner. AWS wins only on spiky load (< 10h/day) or when you need other AWS services.
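The "< 10h/day" threshold can be checked with a break-even calculation between an hourly rate and a flat monthly price. This is a sketch with a hypothetical function name; it treats the Hetzner EUR price as roughly equal to the same amount in USD, which is a simplifying assumption, and uses a 30-day month.

```python
def break_even_hours_per_day(flat_monthly: float, hourly_rate: float,
                             days: int = 30) -> float:
    """Daily usage at which hourly billing costs the same as the flat price."""
    return flat_monthly / (hourly_rate * days)


AWS_A100_80_ON_DEMAND = 4.10  # USD/h, from the text

# Hetzner range of 1,100-1,400 per month, treated as USD (assumption)
low = break_even_hours_per_day(1100, AWS_A100_80_ON_DEMAND)
high = break_even_hours_per_day(1400, AWS_A100_80_ON_DEMAND)
print(round(low, 1), round(high, 1))
```

Below roughly 9 to 11 hours of use per day, AWS on-demand is the cheaper option under these assumptions; above that, the fixed-price server wins.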

Does Vast.ai play a role in production?

Only conditionally. Vast.ai offers the cheapest on-demand prices (A100-80 from USD 1.07/h, H100 from USD 2.50/h), but the GPU pool consists of individuals and small providers with variable reliability. It is good for batch jobs, inference peaks, or dev setups. For production workloads that need a 99.9% SLA, Lambda Labs, RunPod, CoreWeave, or Hetzner is safer.

How much depreciation per year?

First generation: 30-40% depreciation in year one, 25-30% in year two. Example: H100 80GB May 2025 CHF 48,000, May 2026 CHF 35,000-38,000, May 2027 expected CHF 24,000-28,000 (once B200 widespread). A100 80GB has fallen from CHF 28,000 in 2022 to CHF 17,000-22,000. Refurbished market is 30-40% cheaper than new but without manufacturer warranty.

Sources

  1. AWS – EC2 GPU Instance Pricing (P4/P5/G5/G6) · 2026-05
  2. Lambda Labs – GPU Cloud Pricing (A100, H100, H200 on-demand & reserved) · 2026-05
  3. RunPod – Secure Cloud GPU Pricing · 2026-05
  4. Vast.ai – Marketplace GPU Pricing · 2026-05
  5. CoreWeave – H100/H200 Pricing & SLA · 2026-05
  6. Hetzner – Dedicated GPU Server Matrix · 2026-05