TOKEN PRICING · COSTS
Token costs explained: input, output, cache, provider comparison May 2026
What is a token, how does input pricing differ from output, what does 1M tokens cost at which provider? Table with every relevant model.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is a token?
A token is the smallest unit a language model processes text in. It is not identical to a word or syllable. Tokens arise from a procedure called byte-pair encoding (BPE) or related algorithms that fuse frequent letter sequences into one unit. Rule of thumb for German: 1 token equals about 0.65 to 0.75 words. For English: 1 token equals about 0.75 to 0.8 words. A German A4 page of 450 words thus equals around 600-700 tokens.
Important consequence: German texts consume more tokens per word than English because German compounds (e.g. "Mandantenkorrespondenzverwaltung") often split into several tokens. Calculating in English and then running the same content in German typically costs 25-40% more. In multilingual pipelines (FR, IT, RU) the ratio shifts further – Russian and Chinese are often 2-3x more expensive per "content".
Tokens are billed in two classes: input tokens (what you send to the model, including system prompt, context, question) and output tokens (what the model returns). Output is always more expensive, factor 3-5 at most providers. Reason: output requires full inference load (autoregressive generation), input goes through the model only once.
Why token pricing matters
Token costs are the running costs of every cloud LLM pipeline and thus what you see every month for twelve months. Unlike one-time engineering effort or hardware purchase, token costs are variable and scale with pipeline success: more usage, more bill.
Three errors regularly inflate token bills.
Error 1: Model choice without pricing awareness. Claude Opus costs around USD 15/75 (input/output) per 1M tokens. Claude Haiku costs USD 0.80/4. The current DeepSeek-V generation USD 0.30/0.50. Using Haiku or DeepSeek for simple classification instead of Opus saves factor 20-50 – without quality loss for that task.
Error 2: System prompts with full instructions instead of cache. A 3,000-token system prompt sent on every request costs 3M tokens/day at 1,000 requests – at Claude Opus around USD 45/day or USD 1,350/month. With prompt caching (Anthropic, OpenAI, Google) that drops to 10% – USD 4.50/day, USD 135/month. Skipping cache burns money.
Error 3: Wasting the context window. Some pipelines dump entire documents into the prompt even when only 3 sentences are relevant. RAG (see retrieval-augmented-generation) solves this: instead of 50,000 tokens per query, only 2,000-5,000 tokens of relevant passages get sent – factor 10-25 saving.
Fourth point: context window pricing varies between providers. Anthropic charges the first 1024 tokens of an active cache block as "cache write" (1.25x standard) and all subsequent reads at 10%. OpenAI prompt caching: reads at 50% standard. Google Gemini context caching: 25% standard for cached tokens plus per-hour storage fee.
Provider price table May 2026
All prices per 1M tokens, USD, as of May 2026. Format: input/output. Sources below.
OpenAI - GPT-4o (gpt-4o-2026): 2.50/10.00. Cache read: 1.25. - GPT-4o mini: 0.15/0.60. Cache read: 0.075. - GPT-4 Turbo: 10.00/30.00. - o1: 15.00/60.00 (reasoning, plus reasoning tokens). - o1-mini: 3.00/12.00. - text-embedding-3-small: 0.02. text-embedding-3-large: 0.13.
Anthropic - Claude Opus: 15.00/75.00. Cache write: 18.75. Cache read: 1.50. - Claude Sonnet: 3.00/15.00. Cache write: 3.75. Cache read: 0.30. - Claude Haiku: 0.80/4.00. Cache write: 1.00. Cache read: 0.08.
Google - Gemini 2.5 Pro: 1.25/5.00 (up to 200k context), 2.50/10.00 (>200k). Cache: 0.31. - Gemini 2.5 Flash: 0.075/0.30. Cache: 0.019. - Gemini 2.5 Flash-Lite: 0.0375/0.15.
Mistral - Mistral Large 2: 2.00/6.00. EU region available. - Mistral Medium: 0.40/2.00. - Mistral Small: 0.20/0.60. - Codestral: 0.20/0.60. - Mistral Embed: 0.10.
DeepSeek - the current DeepSeek-V generation: 0.30/0.50 (cache read 0.07). Very aggressive pricing. - DeepSeek-R1 (reasoning): 0.55/2.19. - Off-peak discount: -50% between 16:30-00:30 UTC.
xAI Grok - Grok 4: 3.00/15.00. - Grok 4-mini: 0.30/1.50.
Cohere - Command R+: 2.50/10.00. - Command R: 0.50/1.50. - Embed Multilingual v3: 0.10. - Rerank 3: 2.00 per 1k requests.
Self-host (Llama 3.1 70B on A100-80 Hetzner EUR 1,100/month) - At 50% utilisation: about 30M tokens/month throughput. Unit cost: about USD 0.04 per 1M tokens, all-in. At 90% utilisation: USD 0.02 per 1M.
Together AI (hosted open-weight) - Llama 3.1 70B: 0.88/0.88 (same price in/out). - Llama 3.1 405B: 3.50/3.50. - Mixtral 8x22B: 1.20/1.20.
Sample: 200 queries/month, 8,000 input / 1,500 output each (fiduciary profile) Monthly: 1.6M input + 0.3M output. - Claude Sonnet: 1.6 x 3 + 0.3 x 15 = USD 9.30 - GPT-4o: 1.6 x 2.50 + 0.3 x 10 = USD 7.00 - Mistral Large 2: 1.6 x 2 + 0.3 x 6 = USD 5.00 - the current DeepSeek-V generation: 1.6 x 0.30 + 0.3 x 0.50 = USD 0.63 - Self-host Llama 70B: 0.08
Span is factor 100. Model choice by task (see multi-LLM routing strategies) is the most effective cost lever.
Optimise token costs in 6 steps
- 01Add pipeline logging: LiteLLM, Langfuse, or OpenAI usage log. Per request input tokens, output tokens, model, duration.
- 02Measure one week: which query classes exist? Which could go to a cheaper model?
- 03Add routing: LiteLLM router with a classifier stage that sends queries to Haiku/mini, Sonnet/4o, or Opus/o1.
- 04Activate prompt caching: Anthropic cache, OpenAI prompt caching, Google context caching – cache system prompts and stable contexts.
- 05Use RAG instead of full documents in context: 2-5k tokens of relevant passages instead of 50k per query.
- 06Use off-peak: run the current DeepSeek-V generation batch jobs between 16:30-00:30 UTC – 50% discount.
Which model when
Model choice follows three criteria: task difficulty, latency requirement, data protection.
Simple classification, tagging, extraction: Claude Haiku, GPT-4o mini, Mistral Small, the current DeepSeek-V generation. Price range USD 0.15-0.80 input. Latency under 500ms. Quality on these tasks essentially identical to top models.
Standard research, summaries, contract review: Claude Sonnet, GPT-4o, Mistral Large 2, Gemini 2.5 Pro. Price range USD 1.25-3 input. Latency 800-1500ms. Sweet spot for 80% of fiduciary and legal applications.
Demanding research, legal analysis, multi-step reasoning: Claude Opus, OpenAI o1, DeepSeek-R1. Price range USD 5-15 input, USD 50-75 output. Latency 5-30 seconds (reasoning). Only deploy when the task truly demands it.
Privacy-sensitive (PII, professional secrecy): Mistral Large 2 in EU region (USD 2/6) or self-host Llama 3.1 70B (USD 0.02-0.04/1M, all-in). For cloud, demand "no model training" contract clauses – standard at OpenAI Enterprise, Anthropic API, Mistral La Plateforme.
Mass processing (document recognition, batch classification): the current DeepSeek-V generation with off-peak discount (50% cheaper between 16:30-00:30 UTC). At 100M tokens/month this is USD 30 input + USD 50 output = USD 80 at full peak, USD 40 off-peak.
Routing recommendation: 70% Haiku/mini/small (simple), 20% Sonnet/4o (standard), 10% Opus/o1 (complex). Saves around 80% vs. 100% Opus.
When token pricing is irrelevant
At very small volume – below 500,000 tokens per month – price-based model choice is irrelevant. The gap between Opus (USD 15) and Haiku (USD 0.80) at 500k tokens is USD 7.10 – negligible vs. engineering effort of the pipeline. At such volume you choose by quality, not by price.
At very large volume – above 1B tokens per month – cloud pricing is not the decisive factor; what counts are volume contracts with the provider (Anthropic Enterprise, OpenAI Enterprise) or self-hosting. List prices are then negotiation baseline, not final.
Reasoning models (o1, DeepSeek-R1, Claude Sonnet Extended Thinking) hide reasoning tokens. A query "explain tax case X" can internally generate 20,000-50,000 reasoning tokens billed in addition to visible output. Price per query rises factor 3-10 vs. standard model output. Use these only when accuracy justifies the surcharge.
For embedding pipelines (RAG setup, semantic search), token price is nearly irrelevant. OpenAI text-embedding-3-small costs USD 0.02 per 1M tokens – 100M tokens (full indexing of 40,000 long documents) costs USD 2. Optimising here optimises the wrong line item.
Trade-offs
STRENGTHS
- Provider price span factor 100 between the current DeepSeek-V generation (USD 0.30) and Claude Opus (USD 75) – targeted model choice saves 80% in token cost
- Prompt caching reduces recurring system prompts 80-90% (Anthropic, OpenAI, Google)
- Off-peak discounts (the current DeepSeek-V generation -50% nights UTC) for batch workloads
- Self-host unbeatable at high volume: USD 0.02-0.04 per 1M tokens all-in
WEAKNESSES
- German and multilingual texts tokenise less efficiently – 25-40% more tokens than English for the same content
- Reasoning models hide tokens: o1, DeepSeek-R1, Claude extended thinking generate many internal tokens that get billed
- Providers change prices and rate limits quarterly – the pipeline must not be hard-bound to one provider
- Data protection options cost more: Mistral EU 4x DeepSeek China; self-host has engineering overhead
FAQ
Why is output more expensive than input?
Output requires autoregressive generation – each new token needs a full forward pass through the model. Input is processed once (prefilling phase, highly parallel). At Claude Opus the output/input ratio is 5x, at most others 3-4x. At Together AI it is equal (1x) – the provider subsidises the pricing model.
What does prompt caching actually deliver?
Anthropic gives 90% off cached tokens on read (Sonnet: 0.30 vs. 3.00). Cache write is 25% more than standard. Example: 3000-token system prompt, 100 queries/day. Without cache: 100 x 3000 = 300k tokens x USD 3/M = USD 0.90/day. With cache: 3000 tokens as write (USD 0.01) + 99 reads (297k x 0.30/M = USD 0.09) = USD 0.10/day. Saving 89%.
How many tokens fit into a model?
Context window May 2026: GPT-4o 128k, Claude Sonnet/Opus 200k (1M experimental), Gemini 2.5 Pro 2M, Mistral Large 128k, the current DeepSeek-V generation 64k, Llama 3.1 128k. Practice: even if 200k fits, that costs a lot per query. RAG with 5k active context is usually cheaper and qualitatively equivalent.
The current DeepSeek-V generation at USD 0.30 – what is the catch?
Three points. (1) Data protection: DeepSeek API runs on Chinese servers. For EU/CH personal data problematic – DPIA + third-country check needed (see third-country transfer TIA). (2) Model training: requests may be used for training unless explicit opt-out. (3) Quality: the current DeepSeek-V generation is strong in English and math, in German legal it lags 10-20% behind Claude Sonnet. For EU fiduciary rarely first choice, for code generation or anonymous batch classification very much so.
Related topics
Sources
- OpenAI – API Pricing (GPT-4o, o1, GPT-4o-mini, embeddings) · 2026-05
- Anthropic – Claude API Pricing (Claude Opus, Claude Sonnet, Claude Haiku, prompt caching) · 2026-05
- Google – Gemini API Pricing (2.5 Pro/Flash, context caching) · 2026-05
- Mistral – La Plateforme Pricing (Large 2, Medium, Small, Embed, EU regions) · 2026-05
- DeepSeek – API Pricing (V4, R1, off-peak discount) · 2026-05
- Together AI – Inference Pricing (Llama 3.1, Mixtral) · 2026-05