What is the context window? Token limit, cost, recall curve May 2026
The context window is the maximum number of tokens per LLM request (input + output). In May 2026: 128k to 2m tokens at most providers (up to 10m for Llama 4 Scout), with a clear recall drop in the middle of long contexts.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is the context window?
The context window – also called context length – is the maximum number of tokens a language model can process in a single request. It counts everything together: system prompt, client data, previous history, current question AND the answer to be generated. If the window is exceeded, the provider rejects the request or truncates it.
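The "everything counts together" rule can be sketched as a simple budget check. This is a minimal illustration only: the function names are hypothetical, and the token counts are assumed to come from whatever tokenizer your vendor provides.

```python
def fits_context_window(input_tokens: int, max_output_tokens: int, window: int) -> bool:
    """Return True if the prompt plus the reserved answer budget fit in the window."""
    return input_tokens + max_output_tokens <= window


def remaining_output_budget(input_tokens: int, window: int) -> int:
    """Tokens left for the answer once the input has been counted (never negative)."""
    return max(0, window - input_tokens)


# Illustrative request: system prompt + client data + history + current question.
input_tokens = 2_000 + 25_000 + 5_000 + 500
print(fits_context_window(input_tokens, max_output_tokens=4_000, window=128_000))
print(remaining_output_budget(input_tokens, window=128_000))
```

Reserving the output budget up front avoids requests that are accepted but cut off mid-answer.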
In May 2026 window size is one of the vendors' main selling points. The leading models at a glance: OpenAI GPT-4.1 allows 128k tokens by default and 1m tokens in the long-context variant. Anthropic Claude Sonnet and Opus run with 200k tokens by default and in beta with up to 1m tokens (Sonnet) or 500k tokens (Opus). Google Gemini 2.5 Pro offers 2m tokens, Gemini 2.5 Flash 1m tokens. Mistral Large 2 stays at 128k tokens, as does the current DeepSeek-V generation. Llama 4 (Meta, April 2025) ranges from 128k to 10m tokens depending on the variant, with Llama 4 Scout at the top.
The raw window size is only half the truth. The other half is how well the model actually finds information within that window. As of May 2026 the answer is clear: not as well as the marketing number suggests.
A token is a word fragment (see was-ist-token). Rule of thumb for German: 1.3 tokens per word, 500-700 tokens per A4 page. 128k tokens correspond to roughly 180-250 A4 pages, 1m tokens roughly 1400-2000 A4 pages. In theory enough for a complete client file; in practice with caveats.
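The rule of thumb above translates into a few lines of arithmetic. The sketch below is an estimate under the stated assumptions (1.3 tokens per German word, 500-700 tokens per A4 page); a real tokenizer will give different counts, and the function names are illustrative.

```python
TOKENS_PER_WORD_DE = 1.3             # rule of thumb from the text, not a tokenizer
TOKENS_PER_PAGE_RANGE = (500, 700)   # sparse page .. dense page


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count."""
    return round(len(text.split()) * TOKENS_PER_WORD_DE)


def pages_for_tokens(tokens: int) -> tuple[int, int]:
    """Return (fewest, most) A4 pages that fit into the given token budget."""
    sparse, dense = TOKENS_PER_PAGE_RANGE
    return tokens // dense, tokens // sparse


print(pages_for_tokens(128_000))     # in line with "roughly 180-250 pages"
print(pages_for_tokens(1_000_000))   # in line with "roughly 1400-2000 pages"
```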
Why size – and its limits – matter
The context-window question decides the architecture, cost and quality of your AI application. Four effects are decisive in May 2026.
First: what fits without splitting. A 30-page tax inquiry with attachments fits comfortably in 128k tokens. A 600-page client dossier overwhelms classical windows but fits in Gemini 2.5 Pro (2m) or in the 1m beta variants. With long context you can skip some RAG steps and pass the whole document through – convenient but expensive.
Second: cost. Vendors bill by the token, with input and output priced separately. Typical May 2026 prices: GPT-4.1 USD 2.50 per million input tokens / USD 10 per million output tokens; Claude Sonnet USD 3 / USD 15; Claude Opus USD 15 / USD 75; Gemini 2.5 Pro USD 1.25-2.50 / USD 5-15 (tiered by input size); Mistral Large 2 EUR 2 / EUR 6; the current DeepSeek-V generation USD 0.27 / USD 1.10. A 100k-token request to Claude Opus costs USD 1.50 for input alone; the same request to the current DeepSeek-V generation costs USD 0.027, a factor of about 55. At thousands of such requests per day, the difference adds up.
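The cost comparison can be reproduced with a small calculator. The prices are the USD figures quoted in the paragraph above; the dictionary keys are illustrative labels, not official API model identifiers, and the tiered Gemini and EUR-priced Mistral entries are left out for simplicity.

```python
# (input, output) price in USD per million tokens, as quoted in the text.
PRICES_USD_PER_MTOK = {
    "gpt-4.1": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
    "claude-opus": (15.00, 75.00),
    "deepseek-v": (0.27, 1.10),
}


def request_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Cost of a single request in USD, input and output billed separately."""
    input_price, output_price = PRICES_USD_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


opus = request_cost("claude-opus", 100_000)
deepseek = request_cost("deepseek-v", 100_000)
print(f"Opus: {opus:.3f} USD, DeepSeek: {deepseek:.3f} USD, factor: {opus / deepseek:.1f}")
```

Multiplying the per-request figure by the expected daily volume gives a first estimate of the running cost before any caching discount.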
Third: recall loss in the middle. This is the most important and most often overlooked detail. Needle-in-a-haystack tests (Greg Kamradt, 2023) and the arXiv papers "Lost in the Middle" (Liu et al., 2023) and "RULER" (Hsieh et al., 2024) show that models reliably find information at the start and the end of a context, but recall drops in the middle. The effect intensifies for long contexts: from around 50-60% of the maximum window recall drops measurably, from 80-90% markedly. In May 2026 the best models (the current top Claude model, Gemini 2.5, GPT-4.1) are clearly better at this test than the 2023 generation, but no model is immune. Practical consequence: in a 1m-token request, information in the middle 400k tokens is missed more often than information in the first or last 100k.
Fourth effect: latency. Long-context requests take longer. A 200k-token request to Claude Sonnet takes 5-30 seconds from sending to the first output token (time-to-first-token, TTFT). A 10k-token request typically 0.5-2 seconds. For interactive apps this is noticeable.
For an SME the consequences are clear: larger context windows are useful but do not replace RAG. As of May 2026 RAG remains the more economical method, and the one with more consistent quality, for knowledge bases over 30-50k tokens (see retrieval-augmented-generation).
How the window technically works
The context window is a limit on attention computation. Self-attention scales O(n^2) in sequence length (see was-ist-attention-mechanismus): double the length = four times the compute. Vendors have combined three techniques to make the May 2026 sizes economically viable.
Technique 1: efficient attention algorithms. FlashAttention (Dao et al., 2022/24) computes exact attention with less memory overhead and speeds up computation 2-4x. Sliding-window attention (Mistral) and sparse attention (Longformer-style) avoid the O(n^2) cost by having each token attend only to a local window plus a few global tokens. Ring attention (Liu et al., 2023; reportedly central in Gemini 1.5/2.5) distributes the computation across many GPUs, which is what makes 1m-token windows practical in the first place.
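To make the difference between quadratic and windowed attention concrete, the following sketch counts how many query-key pairs have to be scored. It is a simplified model that assumes causal attention and ignores the global tokens mentioned above; the function names are illustrative.

```python
def full_causal_pairs(n: int) -> int:
    """Token i attends to all i+1 tokens up to itself: n(n+1)/2 pairs, i.e. O(n^2)."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Token i attends to at most `window` tokens: roughly n * window pairs, i.e. O(n)."""
    if n <= window:
        return full_causal_pairs(n)
    # The first `window` tokens see a growing prefix, the rest a full window each.
    return full_causal_pairs(window) + (n - window) * window


for n in (100_000, 200_000):
    print(n, full_causal_pairs(n), sliding_window_pairs(n, window=4_096))
```

Doubling the sequence length roughly quadruples the full-attention count but only doubles the sliding-window count, which is the effect described in the text.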
Technique 2: positional-encoding extension. Models are usually trained at a specific maximum length (e.g. 8k or 32k tokens) and then extended to longer contexts. Methods like YaRN (Yet another RoPE extensioN), position interpolation and dynamic NTK enable this extension without complete retraining. The model is usually less well tuned for the extended range than for the trained range, which is one cause of recall loss towards the upper end of the window.
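The simplest of these methods, linear position interpolation, can be sketched in a few lines: positions in the extended range are scaled down so that they land inside the range the model saw in training. The code below is a minimal illustration of that idea for RoPE angles only; the function names are hypothetical, and YaRN and dynamic NTK use more refined, frequency-dependent scaling that is not shown here.

```python
def rope_angles(position: float, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> list[float]:
    """Rotation angles for one position: (position * scale) * base^(-2i/head_dim)."""
    return [position * scale / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def interpolation_scale(trained_len: int, target_len: int) -> float:
    """Linear position interpolation: squeeze target positions into the trained range."""
    return min(1.0, trained_len / target_len)


scale = interpolation_scale(trained_len=8_192, target_len=32_768)
# The last extended position is mapped onto the last trained position.
extended = rope_angles(32_768, head_dim=64, scale=scale)
trained = rope_angles(8_192, head_dim=64)
print(scale, max(abs(a - b) for a, b in zip(extended, trained)))
```

Because neighbouring positions end up closer together than during training, some fine-tuning on long sequences is typically still needed, which fits the observation that the extended range is less well tuned.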
Technique 3: caching. Anthropic prompt caching (GA 2024), OpenAI prompt caching (2024) and Google cached content (2024) let you load a large static context (e.g. a 100k-token document) once at a higher price and reuse it more cheaply for many requests. As of May 2026 caching cuts input cost to 10-50% of the regular price, depending on vendor and cache strategy. That makes long context for repeated evaluations of the same document economically more attractive.
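Whether caching pays off depends on how often the cached context is reused. The sketch below compares both cases under loudly stated assumptions: the first request writes the cache at 1.25x the regular input price, every later request reads it at 0.10x, and the cache never expires in between. These multipliers are illustrative values within the range given above, not a quote of any vendor's price list.

```python
def cost_without_cache(context_tokens: int, n_requests: int, price_per_mtok: float) -> float:
    """Every request pays the full input price for the shared context."""
    return n_requests * context_tokens * price_per_mtok / 1_000_000


def cost_with_cache(context_tokens: int, n_requests: int, price_per_mtok: float,
                    write_multiplier: float = 1.25,   # assumed surcharge for the first load
                    read_multiplier: float = 0.10) -> float:  # assumed price of a cache hit
    """First request writes the cache, all later requests read it (cache assumed warm)."""
    if n_requests <= 0:
        return 0.0
    per_request = context_tokens * price_per_mtok / 1_000_000
    return per_request * write_multiplier + (n_requests - 1) * per_request * read_multiplier


for n in (1, 3, 50):
    print(n, cost_without_cache(100_000, n, 3.0), cost_with_cache(100_000, n, 3.0))
```

With these assumed multipliers, 50 queries against a 100k-token document at USD 3 per million input tokens come to about USD 1.85 with caching versus USD 15 without, while a single query is slightly more expensive with caching.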
Understanding recall behaviour. The needle-in-a-haystack test is an established benchmark in May 2026: a small, specific fact ("Karl Brunner lives in Aarau, postal code 5000") is placed at various positions in a large context and the model is asked to recall it. A model's recall heatmap typically shows more than 95% in the first 10% and the last 10% of positions and a valley in between, with the highest values in the last 5%. Hsieh et al. (RULER, 2024) extended the test to 13 tasks (NIAH variants including multi-needle, variable tracking, aggregation). On these "hard" tests even top models in May 2026 degrade markedly from 32k-64k tokens onward, far below the marketing maximum.
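A basic needle-in-a-haystack run can be sketched as follows. The harness is a simplified illustration, not Kamradt's or RULER's actual code: `ask_model` is a hypothetical callable that wraps whatever LLM client you use, and recall is judged by a plain substring match.

```python
from typing import Callable


def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler_sentences))
    return " ".join(filler_sentences[:index] + [needle] + filler_sentences[index:])


def recall_by_depth(ask_model: Callable[[str], str], filler_sentences: list[str],
                    needle: str, question: str, expected: str,
                    depths: list[float]) -> dict[float, bool]:
    """Ask the same question once per needle position and record whether it was found."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler_sentences, needle, depth) + "\n\nQuestion: " + question
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results
```

Running this over a grid of context lengths and depths, several times per cell, yields the recall heatmap described above.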
Practical rule. Plan on using only 50-60% of the advertised window and keep important information at the start or the end. When you send 200k tokens to the current top Claude model, put the system instruction and the actual task at the end, where the model finds them most reliably.
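The practical rule can be expressed as two small helpers: one that derives a conservative token budget from the advertised window, and one that orders the prompt so that the bulk material comes first and the instruction and task come last. Both function names and the 55% default are assumptions chosen for illustration.

```python
def effective_budget(advertised_window: int, usable_percent: int = 55) -> int:
    """Conservative token budget: only a share of the advertised window is planned for."""
    return advertised_window * usable_percent // 100


def assemble_prompt(documents: list[str], instruction: str, task: str) -> str:
    """Bulk documents first, instruction and concrete task at the very end."""
    return "\n\n".join([*documents, instruction, task])


print(effective_budget(200_000))
prompt = assemble_prompt(
    ["<contract text>", "<annex text>"],
    "Answer only from the documents above.",
    "List all termination clauses.",
)
print(prompt.splitlines()[-1])
```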
When long context makes sense
Three scenarios where a large context window is the right choice – weighed against RAG.
Scenario 1: single, closed documents. An 80-page client contract, a 200-page requirements spec, a quarterly audit report. The task is: "summarise", "find contradictions", "extract all clauses of type X". The document fits in 128k-200k tokens; the answer comes from the document itself. Here long context is elegant and fast – RAG would be overhead.
Scenario 2: code review and larger refactorings. A software module of 5,000-30,000 lines of code fits in 200k-1m tokens. Task: "check this function for consistency with the rest of the module", "suggest a refactoring strategy". Claude Code, Cursor Agent and Anthropic Computer Use use this mode productively in May 2026.
Scenario 3: many-shot prompting for classification. Instead of a few-shot prompt with 5 examples, a many-shot prompt with 200-500 examples. Effective for hard classification tasks (domain language, rare categories). Brown et al. and Anthropic research (2024) showed that quality keeps improving with the number of examples into the 100k range – for some tasks better than fine-tuning.
Economics: when it pays off. Rule of thumb for May 2026: if the same big document is queried more than 50 times, caching pays off and long context becomes economically viable. If it is queried only 1-3 times, RAG is usually cheaper. If the corpus grows dynamically (e.g. a knowledge base with new entries every day), RAG is the better structural fit, because long context would have to be re-injected on every request.
Hybrid strategy. In May 2026 many applications use a hybrid: RAG finds the 10-30 most relevant documents, packs them into 50-150k tokens of long context and lets the model synthesise the answer. This combines the scalability of RAG (the corpus can be arbitrarily large) with the coherence of long context (the model sees all relevant sources at once).
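The packing step of such a hybrid can be sketched as a greedy selection. The example assumes the retriever has already returned documents sorted by relevance together with their token counts; the function name and the skip-if-too-large policy are illustrative choices, not a standard algorithm from any RAG framework.

```python
def pack_documents(ranked_docs: list[tuple[str, int]], token_budget: int,
                   max_docs: int = 30) -> tuple[list[str], int]:
    """Greedily pack (doc_id, token_count) pairs, most relevant first, into the budget."""
    selected: list[str] = []
    used = 0
    for doc_id, tokens in ranked_docs:
        if len(selected) >= max_docs:
            break
        if used + tokens > token_budget:
            continue  # too large for the remaining budget, try the next candidate
        selected.append(doc_id)
        used += tokens
    return selected, used


ranked = [("memo-17", 60_000), ("dossier-3", 100_000), ("faq-8", 50_000), ("note-2", 40_000)]
print(pack_documents(ranked, token_budget=150_000))
```

Skipping an oversized document instead of stopping keeps the budget well used, at the price of occasionally preferring a less relevant but shorter source.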
When long context is the wrong answer
Four cases where long context produces cost without quality gain.
First: growing knowledge base. A fiduciary knowledge collection grows monthly by hundreds of new answers, receipt samples, client memos. Long context would have to re-inject the entire knowledge base on every request – quadratic cost, long latency, worse recall in the middle. RAG with a vector database (Qdrant, pgvector) scales linearly here and delivers exactly the relevant pieces.
Second: small requests. A simple question "what is the VAT rate for hairdressing services?" does not need 100k tokens of context. A direct call with a short prompt is 10-100x cheaper and faster.
Third: requests with high recall demand in the middle. If your application must reliably find every fact in the context (e.g. "extract ALL date mentions from this 500-page contract"), long context is risky because recall drops in the middle. Safer: split the document into 5-10 chunks, process each separately, aggregate results – map-reduce style. Slower but more reliable.
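The map-reduce approach can be sketched as follows. The document is split at paragraph boundaries, each chunk is processed separately, and the results are merged in order without duplicates. The `extract` callable stands for one model call per chunk; here a regex stub for dates in DD.MM.YYYY format takes its place so that the example runs on its own.

```python
import re
from typing import Callable


def chunk_paragraphs(text: str, n_chunks: int) -> list[str]:
    """Split text at blank lines into at most n_chunks groups of whole paragraphs."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    size = max(1, -(-len(paragraphs) // n_chunks))  # ceiling division
    return ["\n\n".join(paragraphs[i:i + size]) for i in range(0, len(paragraphs), size)]


def map_reduce_extract(text: str, extract: Callable[[str], list[str]],
                       n_chunks: int = 5) -> list[str]:
    """Map: run `extract` on every chunk. Reduce: merge results, keeping first occurrences."""
    found: list[str] = []
    for chunk in chunk_paragraphs(text, n_chunks):
        for item in extract(chunk):
            if item not in found:
                found.append(item)
    return found


def extract_dates_stub(chunk: str) -> list[str]:
    """Stand-in for a model call: finds dates written as DD.MM.YYYY."""
    return re.findall(r"\b\d{2}\.\d{2}\.\d{4}\b", chunk)
```

Splitting at paragraph boundaries rather than at a fixed character offset avoids cutting a date or clause in half, although facts that span two chunks can still be missed.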
Fourth: compliance-sensitive documents that should not go to the vendor in cleartext. Long context means the whole document goes in cleartext to OpenAI, Anthropic or Google. For client data under professional secrecy (SCC 321) or sensitive data under the revFADP, that is critical without the explicit consent of the data subjects or EU/CH hosting by the vendor. In these cases, use RAG with masking or a local model (see self-hosted-vs-cloud-llm).
Pitfall "we just take the biggest window". In May 2026 vendors sell long context as a remedy against RAG complexity. For an SME this is rarely true. Long context solves a different problem (single large documents, code reviews, many-shot), not the RAG problem (growing knowledge bases, dynamic sources). Whoever has both cases – and that is the normal case – builds both approaches and combines them.
Trade-offs
STRENGTHS
- Whole documents, whole code, many examples in one request
- Simple architecture – no RAG stack needed for single docs
- Many-shot prompting for classification quality
- Caching makes repeated evaluations economically viable
WEAKNESSES
- Recall drops from ~50% of window size (lost-in-the-middle)
- Higher cost and latency at large inputs (attention compute grows quadratically)
- Compliance risk: whole document in cleartext to the vendor
- Replaces RAG only in special cases – unsuitable for growing corpora
FAQ
How many pages of text fit in 128k tokens?
Rule of thumb: 1 A4 page of German text approximately 500-700 tokens, depending on font size and content density. 128k tokens correspond to roughly 180-250 A4 pages. 1m tokens roughly 1400-2000 A4 pages – about a four-volume work. Note: tables, receipts and OCR-PDFs often run higher in tokens per visible content because formatting characters and layout tokens count.
What is prompt caching and is it worth it?
Prompt caching lets you load a big static context (knowledge base, document, system prompt) once at a higher price and reuse it for follow-up requests at 10-50% of the regular input price. Anthropic (GA 2024), OpenAI (2024) and Google (2024) offer it. It is worthwhile when the same context is queried more than about 10 times per hour and the requests arrive while the cache is still alive (cache TTL typically 5 minutes at Anthropic, longer at Google). For infrequent requests caching adds nothing because the cache has expired by the next call.
Do I still need RAG with long context?
In most SME cases: yes. Long context replaces RAG only when your knowledge base a) is small enough (fits in 50-200k tokens and can be cached), b) is static (does not constantly grow), and c) can tolerate the middle-recall pitfall. As soon as the knowledge base becomes dynamic or large, RAG is structurally superior. In May 2026 the hybrid (RAG filters down to 50-150k tokens, long context synthesises) is the dominant architecture.
Which vendor has the best long-context behaviour in May 2026?
On the RULER benchmark in May 2026 Claude Opus and Gemini 2.5 Pro lead for "hard" long-context tasks (multi-needle, aggregation). GPT-4.1 is close. Llama 4 Scout with 10m tokens has the largest marketing number but noticeable weaknesses in recall tests from 1m tokens onward. Rule of thumb: for raw size take Gemini 2.5 Pro (2m, cheapest long-context prices), for high recall demands Claude Opus.
Sources
- Liu et al. – Lost in the Middle: How Language Models Use Long Contexts (arXiv:2307.03172) · 2023-07
- Hsieh et al. – RULER: What's the Real Context Size of Your Long-Context LMs? (arXiv:2404.06654) · 2024-04
- Anthropic – Claude model documentation: Context Limits and Prompt Caching · 2026-05
- Google DeepMind – Gemini 2.5 Long-Context Technical Report · 2026-04
- OpenAI – GPT-4.1 Pricing and Context Window Documentation · 2026-05