TOKEN · AI CONCEPT
What is a token? Tokenisers, cost, DE-vs-EN May 2026
A token is a word fragment – the smallest billing unit of an LLM. Explained: BPE, SentencePiece, Tiktoken, German overhead, May 2026 price examples.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is a token?
A token is the smallest unit in which a language model processes text – and the billing unit of most LLM APIs. Models read and write neither letter-by-letter nor word-by-word, but in tokens that range from a few letters to a whole short word.
Rule of thumb May 2026: in English and German ASCII texts 1 token corresponds to roughly 3.5-4.5 characters or 0.5-0.8 words. Umlauts, accents, special characters, code symbols and Chinese/Japanese characters raise the token cost noticeably – a single umlaut taking 1-3 tokens is normal. An A4 page of standard German text (around 350 words) typically counts 500-700 tokens. A DIN-A4 letter with salutation, paragraph, signature costs between 200 and 400 tokens.
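The character rule of thumb can be turned into a quick estimator. The following is a minimal sketch, not a tokeniser: the function name `estimate_tokens` and its defaults are illustrative assumptions, and it only applies the 3.5-4.5 characters-per-token range quoted above.

```python
def estimate_tokens(text: str,
                    chars_per_token_low: float = 3.5,
                    chars_per_token_high: float = 4.5) -> tuple[int, int]:
    """Return a (low, high) token estimate based only on the character count."""
    n_chars = len(text)
    # More characters per token means fewer tokens, so the high ratio yields the low bound.
    low = round(n_chars / chars_per_token_high)
    high = round(n_chars / chars_per_token_low)
    return low, high


sample = "x" * 2100  # stand-in for roughly 2,100 characters of plain text
print(estimate_tokens(sample))  # (467, 600)
```

For umlaut-heavy text, code or JSON the real count will sit above this range, so the estimate is a lower-end planning figure rather than a billing number.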
Token granularity is a compromise. Letter-by-letter processing would be too fine: sequences would become extremely long and the model would have to relearn how letters combine into words every time. Word-by-word processing would be too coarse: a German vocabulary with all conjugations and compounds would have millions of entries, blowing up memory and thinning out the statistics per entry. Tokens are the middle ground: frequent word fragments get their own token, rarer ones are assembled from multiple tokens. The chosen vocabulary typically holds 30,000-300,000 tokens, depending on model and tokeniser.
For an SME tokens matter for two reasons above all: they determine the cost of every model call, and they limit how much text fits into one request (see was-ist-context-window). Anyone who cannot reason about tokens over- or underestimates costs and capacity.
Why tokens matter practically
Tokens directly touch three business decisions: cost, latency and capacity.
Cost May 2026. Vendors bill input and output tokens separately. Typical prices per 1 million tokens: OpenAI GPT-4.1 USD 2.50 input / USD 10 output; OpenAI GPT-4o-mini USD 0.15 / USD 0.60; Anthropic Claude Sonnet USD 3 / USD 15; Anthropic Claude Opus USD 15 / USD 75; Google Gemini 2.5 Pro USD 1.25-2.50 / USD 5-15 (tiered); Mistral Large 2 EUR 2 / EUR 6; the current DeepSeek-V generation USD 0.27 / USD 1.10. Output tokens are usually 3-5x more expensive than input tokens – the universal rule of thumb in May 2026.
Concrete example: a fiduciary chatbot handles 80 client inquiries per day. Per inquiry typically 2,000 tokens input (system prompt, RAG context, question) and 600 tokens output (answer). With Claude Sonnet (USD 3 / USD 15 per million tokens) that is 80 * (2,000 * USD 3 + 600 * USD 15) / 1,000,000 = USD 1.20 per day, around USD 36 per month. With the current DeepSeek-V generation about USD 0.10 per day, around USD 3 per month. With Claude Opus USD 6 per day, around USD 180 per month. That is a factor of about 60 between the cheapest and the most expensive vendor – often at comparable quality for standard fiduciary questions.
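The arithmetic of this example can be reproduced in a few lines. This is a sketch under the stated assumptions: the prices are the per-million figures from the price list above, and the dictionary keys are illustrative labels, not official API model identifiers.

```python
# USD per 1 million tokens as (input, output), taken from the price list above.
PRICES_PER_MILLION = {
    "claude-sonnet": (3.00, 15.00),
    "claude-opus": (15.00, 75.00),
    "deepseek-v": (0.27, 1.10),
}


def daily_cost(requests: int, input_tokens: int, output_tokens: int,
               price_in: float, price_out: float) -> float:
    """Cost in USD for one day of identical requests."""
    per_request = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return requests * per_request


for name, (p_in, p_out) in PRICES_PER_MILLION.items():
    day = daily_cost(80, 2_000, 600, p_in, p_out)
    print(f"{name}: USD {day:.2f}/day, USD {day * 30:.2f}/month")
```

Swapping in measured token counts from your own logs instead of the 2,000 / 600 assumption is what turns this from an illustration into a budget.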
Latency. Output tokens are generated token-by-token. At typical generation speeds of 50-200 output tokens per second (May 2026) a 600-token answer means 3-12 seconds of waiting. Input tokens, by contrast, are processed in parallel – a 10,000-token input takes 0.5-3 seconds to the first output token (time-to-first-token, TTFT). To minimise latency: cap output ("answer in at most 200 tokens") and reduce input (RAG instead of full document).
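The waiting time can be approximated as time-to-first-token plus output length divided by generation speed. The sketch below assumes exactly that simple model; the function name and the sample values are illustrative and ignore network overhead and queueing.

```python
def response_time_seconds(output_tokens: int, tokens_per_second: float,
                          ttft_seconds: float) -> float:
    """Rough end-to-end latency: time to first token plus sequential generation time."""
    return ttft_seconds + output_tokens / tokens_per_second


fast = response_time_seconds(600, tokens_per_second=200, ttft_seconds=0.5)
slow = response_time_seconds(600, tokens_per_second=50, ttft_seconds=3.0)
capped = response_time_seconds(200, tokens_per_second=50, ttft_seconds=3.0)

print(fast, slow, capped)  # 3.5 15.0 7.0
```

On the slow end, capping the answer at 200 tokens roughly halves the wait, which is why an output cap is usually the first latency lever to try.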
Capacity. The context window measures tokens, not words or characters. Whoever has 128k tokens (approx. 180-250 A4 pages of German text) cannot pack "a bit more" in without the request being rejected. In code or in heavily formatted documents (tables, JSON, XML), the token-per-page rate is higher – 1,500-3,000 tokens per page are not unusual.
Strategic consequence. As of May 2026 the most important cost lever is not "cheapest vendor" but "right model tier per request". A multi-LLM gateway (see was-ist-llm-gateway) lets you route simple requests (language detection, short classification) to cheap models (DeepSeek, Gemini Flash, GPT-4o-mini) and only complex requests to expensive ones. That typically cuts token cost by a factor of 3-10 with no quality loss where quality matters.
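Tier routing can be illustrated with a toy calculation. Everything in this sketch is an assumption made for illustration: the task labels, the two tier names, the per-request costs and the traffic mix are hypothetical and do not describe any specific gateway product.

```python
# Hypothetical task types that a cheap model tier handles well enough.
CHEAP_TASKS = {"language_detection", "short_classification"}

# Assumed average cost per request in USD for each tier (illustrative values).
TIER_COST = {"cheap": 0.001, "premium": 0.015}


def pick_tier(task_type: str) -> str:
    """Route known simple tasks to the cheap tier, everything else to premium."""
    return "cheap" if task_type in CHEAP_TASKS else "premium"


def routed_cost(requests_by_task: dict[str, int]) -> float:
    """Total cost when every task type goes to the tier chosen by pick_tier."""
    return sum(count * TIER_COST[pick_tier(task)]
               for task, count in requests_by_task.items())


traffic = {"language_detection": 500, "short_classification": 300, "contract_analysis": 200}
all_premium = sum(traffic.values()) * TIER_COST["premium"]
routed = routed_cost(traffic)
print(f"all premium: USD {all_premium:.2f}, routed: USD {routed:.2f}, "
      f"factor {all_premium / routed:.1f}")
```

With this assumed mix the saving lands near the lower end of the 3-10x range; the factor grows as the share of simple requests grows.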
Tokenisers in detail
Three tokeniser families dominate the industry in May 2026. Knowing the differences avoids unpleasant surprises when switching vendors.
Byte Pair Encoding (BPE). Oldest of the three, originally from data compression. Idea: start from individual characters (or bytes), count adjacent pairs, merge the most frequent pair into a new token, and repeat until the target vocabulary size is reached. Result: frequent word fragments get compact tokens, rarer ones are built from smaller pieces. OpenAI uses BPE (tiktoken implementation, open-source since late 2022). Tiktoken supports several encodings, among them cl100k_base (GPT-3.5/4 family) and o200k_base (GPT-4o, GPT-4.1). The o200k encoding has a larger vocabulary and needs 5-15% fewer tokens than cl100k on multilingual text.
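The merge loop behind BPE fits in a short function. The following is a toy, character-level sketch for intuition only: production tokenisers such as tiktoken operate on bytes, pre-split the text with regular expressions and learn on the order of a hundred thousand merges. The function name `train_bpe` and the sample words are illustrative.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Every word starts as a tuple of single characters; duplicates become frequencies.
    corpus = Counter(tuple(word) for word in words)
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # nothing left to merge

        best = max(pairs, key=pairs.get)  # most frequent pair (first seen wins ties)
        merges.append(best)

        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus: Counter = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus

    return merges


print(train_bpe(["haus", "haus", "hausbau", "bau", "bauhaus"], num_merges=4))
```

Each learned merge becomes a vocabulary entry, which is why fragments that are frequent in the training corpus end up as single tokens while rare strings stay split.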
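SentencePiece. Google development (2018). Works directly on the raw, unsegmented text without language-specific pre-tokenisation, is language-agnostic and treats whitespace as a regular symbol. It is a library rather than a separate algorithm and implements both BPE and unigram language-model tokenisation. Gemini, Gemma, Llama 1/2, early Mistral models and many other open-source models use SentencePiece variants. Advantage: no a-priori language knowledge needed, good for multilingual models. Vocabularies keep growing: Gemma uses a SentencePiece vocabulary of about 256,000 tokens, one of the largest in May 2026, while Llama 3 moved to a tiktoken-style BPE tokeniser with about 128,000 tokens.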
WordPiece. Older (BERT family, Google 2018), in May 2026 still found mostly in encoder models (BERT, DistilBERT, mBERT). Similar to BPE, but merges are chosen by a likelihood criterion instead of raw pair frequency. No longer mainstream for generative models in May 2026.
Practical consequence for German. German texts typically need 20-30% more tokens than equally long English texts. Reason: umlauts and ß (ä, ö, ü, ß) are rarer in training corpora than the standard English letters, so they often do not become their own tokens – an "ö" often costs 2 tokens. Compounds ("Aktiengesellschaftsversammlungsbeschluss") are split into many pieces, as are long genitive and plural endings. Vendors know this and have optimised their tokenisers in recent years (o200k_base at OpenAI, Llama-3 tokeniser, Gemini tokeniser): the May 2026 generation is 10-25% more efficient for German than the 2022 generation.
Code and JSON. Code tokens are often very fine because programming-language syntax contains many special characters (`{`, `}`, `(`, `)`, `;`, `=>`). A 100-line JavaScript function can have 1,000-2,500 tokens. JSON is similarly verbose. For structured data formats: 1 line = 8-25 tokens, rule of thumb.
Practical token counting. As of May 2026 several tools are available: tiktoken (Python/JS library, for OpenAI models), Anthropic Token Counter API (HTTP endpoint, for Claude), Hugging Face Tokenizers (Python, for open-source models), Google AI Studio Token Counter (web UI, for Gemini), platform.openai.com/tokenizer (web UI). Every model has its own tokeniser – an input of 2,000 GPT-4 tokens does NOT automatically have 2,000 Claude tokens. In May 2026 differences in practice are 5-15% – calculate with a safety buffer.
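A counting helper with a safety buffer might look like the sketch below. It uses the real tiktoken calls `get_encoding` and `encode` when the library and its vocabulary file are available, and otherwise falls back to a 4-characters-per-token estimate. The helper names `count_tokens` and `with_buffer` and the 15% default are assumptions taken from the variance quoted above, and the result is only exact for OpenAI models that use the chosen encoding.

```python
import math


def count_tokens(text: str, encoding_name: str = "o200k_base") -> tuple[int, str]:
    """Return (token_count, method); method is "tiktoken" or "estimate"."""
    try:
        import tiktoken  # third-party package: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text)), "tiktoken"
    except Exception:
        # tiktoken is missing or could not load its vocabulary file:
        # fall back to the rough 4-characters-per-token rule of thumb.
        return round(len(text) / 4), "estimate"


def with_buffer(token_count: int, buffer_percent: int = 15) -> int:
    """Add a safety margin for tokeniser differences between vendors."""
    return token_count + math.ceil(token_count * buffer_percent / 100)


count, method = count_tokens("Guten Tag, wie geht es Ihnen?")
print(count, method, with_buffer(count))
```

For Claude, Gemini or open-source models the count should come from the vendor's own counter; the buffer only covers the gap when that is not practical.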
When token knowledge becomes active
Five concrete occasions make token knowledge indispensable.
Occasion 1: cost estimation before project start. Before commissioning a RAG assistant, an email-triage agent or a voicebot pipeline, you need a realistic estimate of monthly token costs. Rule of thumb: take volume from a typical week, multiply by tokens per request (estimated or measured) and by the token price of the target vendor. Realistic SME numbers May 2026: a fiduciary with 5-30 client chats per day reaches USD 20-150 per month in AI tokens, at medium RAG depth and Claude Sonnet class. A large law firm with document analysis reaches USD 500-3,000 per month. Whoever expects 10x more requests has a 10x higher token budget. Linear scaling is reality, not "volume discount magic".
Occasion 2: vendor switch. If you use Claude Sonnet today and want to switch to the current DeepSeek-V generation (factor 10 cost advantage), that does not mean factor 10 cost reduction. Different tokenisers give the same document different token counts. DeepSeek tokenises German slightly less efficiently than Claude (May 2026 status). Realistic advantage: factor 8-9, not 10. Only a concrete test with your real texts gives the exact number.
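The gap between list-price advantage and real advantage is a single division. In this sketch the token ratio of 1.15 (the new tokeniser needing 15% more tokens for the same text) is an assumed value within the 5-15% variance mentioned earlier, not a measured figure for any vendor.

```python
def effective_cost_factor(price_factor: float, token_ratio: float) -> float:
    """Real cost advantage after a vendor switch.

    price_factor: old price per token divided by new price per token.
    token_ratio:  new token count divided by old token count for the same text.
    """
    return price_factor / token_ratio


print(round(effective_cost_factor(10, 1.15), 1))  # 8.7
```

Measuring `token_ratio` on a sample of your own documents with both official tokenisers replaces the assumption with a real number.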
Occasion 3: context-window planning. When you build a RAG assistant that feeds 10-20 document chunks plus system prompt plus question plus conversation history per request, you must count whether that fits in the 128k-token standard or requires long-context mode. Rule of thumb: system prompt 200-2,000 tokens, RAG chunk 200-1,500 tokens depending on strategy, conversation history 0-30,000 tokens, question 50-500 tokens. A realistic client chat needs 8,000-25,000 tokens input – comfortable in 128k.
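A budget check for this kind of planning can be sketched as follows. The component values are picked from the ranges above for illustration, and reserving part of the window for the answer is an assumption that holds for models where input and output share one context window.

```python
CONTEXT_WINDOW = 128_000  # tokens, the standard window size used in the text


def request_tokens(system_prompt: int, chunks: int, tokens_per_chunk: int,
                   history: int, question: int) -> int:
    """Total input tokens for one RAG request."""
    return system_prompt + chunks * tokens_per_chunk + history + question


def fits(total_input: int, reserved_output: int = 1_000,
         window: int = CONTEXT_WINDOW) -> bool:
    """True if input plus the reserved answer budget stays inside the window."""
    return total_input + reserved_output <= window


typical = request_tokens(1_000, chunks=15, tokens_per_chunk=800, history=5_000, question=200)
worst = request_tokens(2_000, chunks=20, tokens_per_chunk=1_500, history=30_000, question=500)

print(typical, fits(typical))  # 18200 True
print(worst, fits(worst))      # 62500 True
```

Even the upper end of every range stays below half the window, so long-context mode only becomes a question with much larger chunks or whole-document input.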
Occasion 4: output budget. Output tokens are 3-5x more expensive than input. Capping answers ("answer in at most 150 tokens") saves measurably. Rule of thumb: 150 tokens = 1 paragraph, 500 tokens = 1 A4 page. Chat answers usually need 200-400 tokens, report generation 1,000-3,000 tokens, long texts 4,000-8,000 tokens. Setting a hard max_tokens limit in the API is mandatory – without it a verbose model can generate 4,000+ tokens and triple costs.
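The effect of a missing cap can be checked with a short calculation. The sketch assumes 2,000 input tokens per request and the Claude Sonnet prices of USD 3 / USD 15 per million tokens quoted earlier; the function name and the sample output lengths are illustrative.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float = 3.0, price_out: float = 15.0) -> float:
    """Cost of one request in USD; prices are per 1 million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


capped = request_cost(2_000, 600)     # answer held to about 600 tokens
runaway = request_cost(2_000, 4_000)  # verbose answer without a limit

print(f"capped: USD {capped:.3f}, runaway: USD {runaway:.3f}, "
      f"factor {runaway / capped:.1f}")
```

Because output is billed at the higher rate, the uncapped request costs several times the capped one even though the input is identical.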
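Occasion 5: compliance documentation. As of May 2026 EU AI Act Article 12 requires automatic event logging for high-risk AI systems, and DPIA duties under the revFADP call for documented processing. Neither text names tokens explicitly, but recording per-request token consumption (for cost accountability) and model choice is the practical way to meet these duties. Treat token consumption in the audit log as mandatory information for audit-ready AI applications in May 2026.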
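A per-request usage record could be written as one JSON line per call. The field names below are hypothetical and not prescribed by any regulation or vendor; the sketch only shows that token consumption and model choice are cheap to capture at call time.

```python
import json
from datetime import datetime, timezone


def usage_log_line(request_id: str, model: str,
                   input_tokens: int, output_tokens: int) -> str:
    """Build one JSON log line with model choice and token consumption."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
    }
    return json.dumps(record, sort_keys=True)


# "example-model" is a placeholder label, not a real model identifier.
print(usage_log_line("req-0001", "example-model", 2_000, 600))
```

Most vendor APIs return the input and output token counts with each response, so the values can be copied from the response instead of being recounted.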
Token micro-optimisation is often wasted time
Three cases where token tweaking adds no value – or causes harm.
First: micro-optimisation for a few percent. Spending 30 minutes to cut a system prompt from 400 to 350 tokens saves about USD 0.15 per month at 1,000 requests per month (50 tokens * 1,000 requests at USD 3 per million input tokens). That time is better spent elsewhere (data quality, RAG tuning, eval suite).
Second: aggressive compression into vagueness. A clear 600-token system prompt with role, task, prohibitions and format instructions can be compressed to 250 tokens – but then the model often lacks context. Hallucination risk rises, refusal behaviour becomes erratic. Rule of thumb: the system prompt may be 5-10% of the typical request token budget; clarity matters more than token saving.
Third: optimising tokens instead of model choice. Using the most expensive Claude Opus for a simple classification task and then saving 200 tokens optimises the wrong axis. Switching to Claude Haiku or DeepSeek brings 10-50x savings versus token tweaking within the Opus call. First check the model tier, then optimise.
Trap "we build our own tokeniser". As of May 2026 this makes no sense for SMEs. Tokenisers are trained jointly with the model; your own tokeniser means your own model training, which requires a budget in the millions. Whoever wants to lower token cost switches model or reduces request volume instead of building a tokeniser.
Trap "we can reverse-engineer tokens". Tokenisers change between model generations, often without much notice (OpenAI moved from cl100k to o200k with GPT-4o in 2024). Token counts based on third-party tokenisers (e.g. a tiktoken estimate for Claude) are 5-15% off. For hard budget calculations always use the official tokeniser of the target model.
Trade-offs
STRENGTHS
- Clear, predictable billing unit for LLM cost
- Tokeniser efficiency for German has improved by 10-25% from 2022 to 2026
- Official, free token counters are available from all major vendors
- Enables precise capacity and budget planning
WEAKNESSES
- Different vendors count tokens differently (5-15% variance)
- German needs 20-30% more tokens than English
- Output tokens 3-5x more expensive than input – easily underestimated
- Vendors change tokeniser versions between model generations without much notice
FAQ
How do I count tokens in practice?
OpenAI: platform.openai.com/tokenizer (web UI for click tests) and the Python/JS library "tiktoken". Anthropic: HTTP endpoint /v1/messages/count_tokens (token counter API, free, official since 2024). Google: AI Studio Token Counter in the Studio UI plus SDK function. Open-source models (Llama, Mistral, Qwen, DeepSeek): Hugging Face library "transformers" (AutoTokenizer.from_pretrained). Rule of thumb without a tool: 1 token = 3.5-4.5 characters for plain German/English text; add 20-30% for German text with many umlauts and compounds.
Why does German cost more tokens than English?
Tokenisers are trained statistically on training corpora. English is 6-10x more strongly represented than German, so frequent English word fragments get their own tokens while German compounds, umlauts and long endings are split into multiple tokens. In May 2026 the overhead in modern tokenisers (o200k, Llama-4, Gemini-2) is about 20-30% versus English; old tokenisers (cl100k, GPT-3 era) had 50-100% overhead. Practical consequence: calculate token budgets for German-language applications with a 25% safety buffer.
Are output tokens really more expensive than input tokens?
Yes, in May 2026 with all major vendors. Factor typically 3-5x. Reason: input is processed once, output must be generated autoregressively token-by-token – which costs markedly more GPU time per token. Anthropic Claude Sonnet: USD 3 input / USD 15 output (factor 5). OpenAI GPT-4.1: USD 2.50 / USD 10 (factor 4). Google Gemini 2.5 Pro: USD 1.25 / USD 5 (factor 4). Consequence: limit output, prescribe clear output formats, set max_tokens hard.
Are tokens and words the same?
No. An average English word has 1.2-1.4 tokens, an average German word 1.5-2.5 tokens (due to compounds, endings, umlauts). Very short words ("the", "and", "of") are often 1 token. Very long words ("Aktiengesellschaftsversammlungsbeschluss") can have 6-12 tokens. Whoever reasons in words off the cuff typically underestimates tokens by 30-50% – especially in German.
Sources
- OpenAI – Tiktoken Library and Tokenization Documentation · 2026-04
- Anthropic – Token Counting API Reference · 2026-05
- Kudo and Richardson – SentencePiece: A Simple and Language-Independent Subword Tokenizer (arXiv:1808.06226) · 2018-08
- Sennrich, Haddow, Birch – Neural Machine Translation of Rare Words with Subword Units (BPE) (arXiv:1508.07909) · 2015-08
- Hugging Face – Tokenizer Documentation and Comparison · 2026-03