ATTENTION · AI CONCEPT
What is the attention mechanism? Query, key, value explained (May 2026)
Attention is the heart of modern language models: every position of a sequence may attend to every other. Explained: Q/K/V, self vs cross, multi-head.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is attention?
Attention is a computation mechanism that lets every position in a sequence absorb information from every other position – weighted by how relevant they are to each other. Since the paper "Attention is all you need" (Vaswani et al. 2017) it has been the heart of the transformer architecture and thereby the foundation of all modern language models (see was-ist-transformer-architektur).
The basic principle is simpler than the mathematical notation suggests. Imagine you read the sentence "Anna gave Bob the book that she had bought yesterday". At the word "she" your language understanding must decide: does "she" refer to Anna, to Bob, or to the book? Classical sequential models (RNN, LSTM) have to carry this information through a single hidden state across many steps, where it can fade before it is needed. An attention mechanism solves it in one step: the token "she" "looks at" every other token of the sentence, weighs them by relevance, and gathers the matching information – here with high weight on "Anna".
Formally there are three derived quantities per token: query, key and value. Three linear transformations of the token vector produce them. The query asks the question ("who am I, what do I want to attend to?"), the key offers itself as a potential answer ("this is what I can be referenced for"), and the value carries the actual information that is transferred. The similarity between query and key (typically a dot product followed by softmax) determines the weight with which the value flows into the output.
As of May 2026 attention is standard not only in language models but also in image models (Vision Transformers, DiT), audio models (Whisper, AudioLM) and multi-modal models (CLIP, GPT-4V, Gemini, Claude with image input). The concept spans the entire modern AI landscape.
Why it matters
Attention solves three problems where older architectures failed – and at the same time creates a concrete practical consequence SMEs can feel.
First: long-range dependencies. Language is full of references across distance. A clause on page 12 of a contract refers to a definition on page 2. A pronoun at the end of a long paragraph refers to the subject at the beginning. Classical RNNs and LSTMs lose such references through "vanishing gradients" – information fades along the way. Attention sees every position directly and can model any distance, as long as the context window allows (see was-ist-context-window). This capability explains why today's models understand documents on which 2018-era models failed.
Second: parallelisability. Attention computes all position pairs at once – matrix operations GPUs love. RNNs would have to compute step-by-step because every step depends on the previous one. On modern GPU clusters attention is 10-100x faster in training. That is the main reason language models have improved so dramatically in the last eight years: faster iteration on larger data with larger models.
Third: interpretability (conditional). Attention weights are visible. You can visualise which tokens paid strong attention to which, yielding insights into pronoun resolution, coreference or syntactic dependencies. As of May 2026 the research community is no longer as euphoric as it was in 2018-2020: attention weights are not necessarily causal explanations ("attention is not explanation", Jain & Wallace 2019), but they are a useful tool for model debugging.
Practical SME consequence: O(n^2) cost. Attention compute scales quadratically with sequence length: twice the input means four times the attention compute. This is the most important practical implication in May 2026. Whoever sends a 100k-token document does not pay 10x the cost of a 10k-token request – but rather 30-50x, depending on the vendor. The figure stays below the purely quadratic 100x because the other parts of the model scale linearly with input length. Practical rule: long context (see was-ist-context-window) is an economic decision, not a default. Whoever wants to save money shortens inputs via RAG instead of passing the whole corpus through.
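The arithmetic behind this rule can be sketched in a few lines. This is a back-of-the-envelope model, not a vendor price formula: the function names and the quadratic_share parameter are illustrative assumptions, and the share of 0.25 is a made-up value chosen only to show how a mix of linear and quadratic cost lands between 10x and 100x.

```python
def attention_cost_ratio(n_tokens: int, baseline_tokens: int) -> float:
    """Relative cost of the quadratic attention term alone."""
    return (n_tokens / baseline_tokens) ** 2


def blended_cost_ratio(n_tokens: int, baseline_tokens: int, quadratic_share: float) -> float:
    """Relative total cost when `quadratic_share` of the baseline cost is
    quadratic in length and the rest is linear (hypothetical split)."""
    scale = n_tokens / baseline_tokens
    return (1 - quadratic_share) * scale + quadratic_share * scale ** 2


print(attention_cost_ratio(20_000, 10_000))         # 4.0
print(attention_cost_ratio(100_000, 10_000))        # 100.0
print(blended_cost_ratio(100_000, 10_000, 0.25))    # 32.5
```

With a different assumed share the blended figure moves accordingly; the point is only that the total sits between the linear and the quadratic extreme.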
In May 2026 the field is dampening the O(n^2) complexity through algorithmic tricks. FlashAttention (Dao et al. 2022, FA-3 July 2024) makes the computation memory-efficient and 2-4x faster. Sliding-window attention (Mistral since 2023) breaks O(n^2) by connecting each token only to a local window instead of to every other token. Sparse attention (Longformer, BigBird, Reformer) keeps only a thin subset of connections, for example local windows plus a few global tokens. Ring attention (reportedly used in Gemini 1.5/2.5) distributes the computation across many GPUs. These tricks enable the 1-2M-token windows but are not free: certain dependency patterns are lost, recall in the middle drops (see was-ist-context-window).
Mechanics in detail
Attention has three stages running in every transformer block.
Stage 1: derive Q, K, V. From the input vector of each token (typically 768-12288 dimensions) three new vectors are produced – query (Q), key (K), value (V). Mathematically these are three linear transformations: Q = X * W_Q, K = X * W_K, V = X * W_V. The matrices W_Q, W_K, W_V are learned parameters found during training. Q and K typically share dimension d_k; V can have a different dimension but usually equals d_k.
Stage 2: compute similarity and weights. For every token pair (i, j) the dot product Q_i * K_j is computed – a similarity score. The score is divided by sqrt(d_k) (a scaling trick for training stability), and softmax turns each row of scores into a probability distribution. Result: attention weight alpha_ij – how much attention token i pays to token j. For a given query position i, the weights over all positions j sum to 1.
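Written out as a formula, stage 2 looks as follows. The notation follows the text (Q_i, K_j, d_k, alpha_ij); the sequence length n and the summation index l are symbols introduced here for illustration.

```latex
\alpha_{ij} = \frac{\exp\left(Q_i \cdot K_j / \sqrt{d_k}\right)}
                   {\sum_{l=1}^{n} \exp\left(Q_i \cdot K_l / \sqrt{d_k}\right)},
\qquad
\sum_{j=1}^{n} \alpha_{ij} = 1
```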
Stage 3: mix values. The output at position i is the weighted sum of values: output_i = sum_j(alpha_ij * V_j). Each token receives a new representation, a mixture of all tokens of the input weighted by relevance.
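The three stages can be sketched in plain Python. This is a minimal single-head illustration using only the standard library, not how a framework implements it: the helper names (matmul, softmax, attention), the toy input and the identity matrices standing in for the learned W_Q, W_K, W_V are all assumptions made for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Return (outputs, weights) for single-head scaled dot-product attention."""
    # Stage 1: derive Q, K, V with three linear transformations.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    outputs, weights = [], []
    for q_i in q:
        # Stage 2: scaled dot products, then softmax over all positions j.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        alpha_i = softmax(scores)
        # Stage 3: weighted sum of the value vectors.
        out_i = [sum(a * v_j[c] for a, v_j in zip(alpha_i, v)) for c in range(len(v[0]))]
        weights.append(alpha_i)
        outputs.append(out_i)
    return outputs, weights


# Toy input: 3 tokens with 2 dimensions each (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned W_Q, W_K, W_V
outputs, weights = attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 6))
```

Each printed row is the attention distribution of one token over all three tokens. Real models run the same computation as batched matrix operations on the GPU.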
Self-attention vs cross-attention. In self-attention Q, K, V all come from the same sequence – the standard case in decoder-only models (GPT, Claude, Llama) and in encoder-only models (BERT). In cross-attention Q comes from one sequence (e.g. the so-far generated output) and K, V from another (e.g. the input to be translated). Cross-attention is central in classical encoder-decoder models (T5, BART, Whisper). As of May 2026 self-attention is the dominant case.
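The difference comes down to where the three inputs come from. The sketch below uses a hypothetical helper attend(queries, keys, values) and skips the learned projections for brevity; the two toy sequences are made-up numbers.

```python
import math


def attend(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append(
            [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]
        )
    return outputs


source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # e.g. 4 encoder states
target = [[0.0, 1.0], [1.0, 0.0]]                          # e.g. 2 decoder states

self_out = attend(source, source, source)   # Q, K, V from the same sequence
cross_out = attend(target, source, source)  # Q from target, K and V from source
print(len(self_out), len(cross_out))        # 4 2
```

The output always has one vector per query, so cross-attention yields as many outputs as the querying sequence has positions, regardless of the length of the other sequence.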
Multi-head attention. Instead of one attention call 8-128 heads are computed in parallel. Each head has its own W_Q, W_K, W_V with smaller dimension. The outputs of all heads are concatenated and projected through another linear layer. Effect: each head can learn a different dependency pattern – one attends to syntax, one to pronoun resolution, one to semantic proximity, one to order. Empirical studies (Clark et al. 2019, "What does BERT look at?") show that certain heads consistently cover certain reference types. This diversity is a central reason for the understanding depth of modern models.
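A simplified sketch of the split-and-concatenate structure follows. It assumes the helper names attend, split_heads and multi_head_attention, and it leaves out the per-head W_Q, W_K, W_V and the final output projection that a real layer has; each head here simply works on its own slice of the token vector.

```python
import math


def attend(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append(
            [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]
        )
    return outputs


def split_heads(x, num_heads):
    """Slice every token vector into num_heads equally sized chunks."""
    head_dim = len(x[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x] for h in range(num_heads)]


def multi_head_attention(x, num_heads):
    """Run attention per head, then concatenate the head outputs per token."""
    head_outputs = [attend(h, h, h) for h in split_heads(x, num_heads)]
    return [sum((head_out[i] for head_out in head_outputs), []) for i in range(len(x))]


# Toy input: 3 tokens with 4 dimensions each (made-up numbers).
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 2.0]]
out = multi_head_attention(x, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Because each head computes its own weights on its own slice, two heads can weigh the same token pair differently, which is the mechanism behind the specialisation described above.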
Causal mask in decoder models. GPT-style models generate token-by-token – position t may only look at positions <= t, not into the future. A mask sets the scores for all j > i to minus infinity before softmax is applied, so that alpha_ij = 0 after softmax. This enforces the autoregressive property.
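A minimal sketch of causal masking on a small score matrix: future positions get a score of minus infinity before softmax, which gives them a weight of exactly zero afterwards. The function name and the score values are made up for illustration.

```python
import math


def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Positions j > i lie in the future: set their score to -inf.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [
    [0.5, 2.0, 1.0],
    [1.0, 0.0, 3.0],
    [0.2, 0.4, 0.6],
]
weights = causal_attention_weights(scores)
print(weights[0])  # [1.0, 0.0, 0.0]
```

The first token can only attend to itself, so its full weight lands on position 0, even though the unmasked score for position 1 was higher.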
FlashAttention. Dao et al. (2022) showed that the naive attention implementation fills GPU memory with large intermediate matrices. FlashAttention computes attention in blocks inside fast GPU SRAM without ever materialising the full n×n matrix. Result: 2-4x faster, dramatically lower memory, identical mathematical result. FA-2 (2023) and FA-3 (July 2024) brought further efficiency gains through async processing. As of May 2026 FA-2/FA-3 is standard in every serious inference stack (vLLM, TGI, SGLang, llama.cpp).
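The core idea, processing keys and values block by block with a running softmax, can be illustrated in plain Python for a single query. This is only a sketch of the online-softmax bookkeeping, not the actual GPU kernel: there is no SRAM tiling or parallelism here, and the function names and toy data are assumptions made for the example.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: compute all scores first, then softmax and weighted sum."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]


def blockwise_attention_row(q, keys, values, block_size):
    """Same result, but only one block of scores exists at any time."""
    d_k = len(q)
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(exps)
        acc = [
            a * correction + sum(e * v[c] for e, v in zip(exps, v_block))
            for c, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / denom for a in acc]


# Toy data: 10 key/value vectors with 4 dimensions (deterministic, made up).
keys = [[math.sin(i + c) for c in range(4)] for i in range(10)]
values = [[math.cos(i * c) for c in range(4)] for i in range(10)]
query = [0.3, -0.2, 0.8, 0.1]

reference = naive_attention_row(query, keys, values)
blocked = blockwise_attention_row(query, keys, values, block_size=3)
print(max(abs(a - b) for a, b in zip(reference, blocked)) < 1e-9)  # True
```

Both functions agree up to floating-point rounding, which is the sense in which the blockwise computation is mathematically identical to the naive one.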
Sub-quadratic tricks. Sliding-window attention (Mistral, Longformer) lets each token attend only to a local window (e.g. 4,096 preceding tokens). Sparse attention (BigBird) mixes global, local and random references. Linear attention (Performer, Linformer) approximates attention with linear instead of quadratic complexity – in some open-source models as of May 2026, but not adopted in top frontier models because quality drops slightly. Mamba and state-space models (Gu & Dao 2023, Mamba-2 2024) are an alternative architecture without attention, interesting for very long sequences – still experimental in the frontier class as of May 2026.
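How much a local window saves can be counted directly. The sketch below builds a causal sliding-window mask and counts the token pairs that are actually computed; the function names and the tiny sizes are illustrative, and the window is assumed to include the token itself.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when token i may attend to token j.

    Causal and local: token i sees itself and up to window - 1 predecessors.
    """
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Number of (i, j) pairs for which a score is computed."""
    return sum(sum(row) for row in mask)


n, window = 8, 3
full = attended_pairs(sliding_window_mask(n, n))        # window >= n: plain causal attention
local = attended_pairs(sliding_window_mask(n, window))
print(full, local)  # 36 21
```

For long sequences the full causal count grows roughly with n^2 / 2, while the windowed count grows roughly with n * window, which is where the sub-quadratic behaviour comes from.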
When this knowledge becomes practical
You do not implement attention. Frameworks (Hugging Face Transformers, PyTorch nn.MultiheadAttention, vLLM, TGI, SGLang) handle that. But four practical consequences concern you directly as an SME.
First: inference-stack selection. When you self-host open-source models (Llama, Mistral, Qwen, DeepSeek) you choose between stacks like vLLM, TGI (Hugging Face), SGLang or llama.cpp. All implement FlashAttention in the current version. As of May 2026 performance differences are typically 10-30%, depending on model size, batching behaviour and hardware. vLLM is the most popular stack for GPU servers, llama.cpp for smaller self-host scenarios on CPU or Apple Silicon. See vergleich-lokale-llm-runtimes.
Second: long-context selection. When you need long context, ask the vendor concretely: which attention variant runs in long-context mode? Gemini 2.5 (reportedly ring attention) delivers the best long-context recall as of May 2026. Mistral with sliding window is efficient but less accurate for some dependency patterns. OpenAI and Anthropic do not disclose their attention implementation in detail but score well on the RULER benchmark.
Third: hardware planning. Without optimisation, attention memory grows quadratically with sequence length. A 200k-token sequence on a 70-billion-parameter model needs 20-80 GB of GPU memory depending on the stack, just for attention intermediates (before FlashAttention it was 100-400 GB – FA saves dramatically here). Whoever self-hosts long context plans GPU RAM with a safety buffer.
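A back-of-the-envelope estimate shows where figures of this magnitude can come from. The sketch computes two memory terms: the key/value cache, which grows linearly with sequence length, and a single naively materialised n×n score matrix. The model configuration (80 layers, 8 key/value heads, head dimension 128, 2 bytes per value) is an assumed, roughly 70B-class example and not a figure from any specific vendor or stack.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values: 2 tensors per layer, linear in n_tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def naive_score_matrix_bytes(n_tokens, bytes_per_value=2):
    """Memory for ONE full n x n score matrix (one head, one layer)."""
    return n_tokens ** 2 * bytes_per_value


n = 200_000
# Assumed configuration for illustration only.
kv = kv_cache_bytes(n, n_layers=80, n_kv_heads=8, head_dim=128)
scores = naive_score_matrix_bytes(n)

print(f"KV cache:           {kv / 1e9:.1f} GB")      # 65.5 GB
print(f"one score matrix:   {scores / 1e9:.1f} GB")  # 80.0 GB
```

Actual requirements depend on precision, cache layout and the inference stack, so these numbers are an order-of-magnitude check and not a sizing guarantee.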
Fourth: multi-modal models. Models like GPT-4V, Gemini, Claude with image input use attention to combine image tokens (from the vision encoder) with text tokens. Whoever builds multi-modal applications (receipt recognition with a vision LLM, see ai-belegerkennung-ocr; product-image classification) should understand: the model mixes two token streams. Latency and cost scale with the sum of both – images are converted into 85-2,000 tokens depending on resolution.
Where attention is NOT the right explanation. When the model gives wrong answers the cause is almost never "the attention is badly trained". Causes are: training-data gaps (hallucinations, see halluzinationen-begrenzen), prompt clarity (see prompt-engineering-grundlagen), RAG quality (see retrieval-augmented-generation). Whoever searches at the attention level usually searches in the wrong place.
When attention depth does not help
Three cases where SMEs should not deal with attention.
First: you are looking for the reason behind bad model answers. As of May 2026, 95% of quality problems in the SME space are: bad prompts, bad RAG quality, bad data, wrong model choice. Very few are attention-related. Whoever says "we need to tune the attention better" has identified the wrong problem in 19 out of 20 cases.
Second: you want to "fine-tune" an existing model to change attention patterns. That is research work, not an SME project. Even LoRA fine-tuning (more efficient variant) usually delivers only 5-15% quality improvement over good prompt engineering plus RAG – at high engineering cost and new compliance obligations (see was-ist-fine-tuning-vs-rag).
Third: you want to choose an "attention-free" stack. As of May 2026 Mamba/state-space models are interesting but not yet mainstream. The frontier class (GPT-4.1, the current top Claude model, Gemini 2.5, Llama 4) is attention-based. Whoever wants to do without attention must hunt in open-source for special models (Mamba-Codestral, RWKV-7) – and accept that the quality curve for SME standard tasks trails mainstream models.
Recommendation. Read Vaswani et al. once, watch a 20-minute explanation (3Blue1Brown on YouTube or Karpathy "Let's build GPT") – and then leave attention alone. You have data to sort, prompts to write and eval suites to build.
Trade-offs
STRENGTHS
- Models references at arbitrary distance without loss
- Parallelisable – fits modern GPU architectures
- Interpretable (conditional) via visible weights
- Universally applicable to text, image, audio, multi-modal
WEAKNESSES
- O(n^2) complexity – long context expensive and slow
- Memory grows quadratically in naive implementations – FlashAttention avoids the n×n matrix, but compute stays quadratic
- Sub-quadratic tricks lose some dependency patterns
- Attention weights are not reliable causal explanations
FAQ
What is the difference between self-attention and cross-attention?
In self-attention query, key and value all come from the same sequence. A token attends to other tokens of the same input. Standard in decoder-only models (GPT, Claude, Llama) and encoder-only models (BERT). In cross-attention the query comes from one sequence (e.g. the so-far generated output) while key and value come from another (e.g. the input to be translated). Standard in encoder-decoder models (T5, BART, Whisper). As of May 2026 self-attention is the dominant case in text models; cross-attention remains central in translation, image-to-text and audio-to-text.
What does O(n^2) complexity mean in practice?
Double the input length means four times the compute and, in a naive implementation, four times the memory for the attention layer. A 20k-token request costs computationally four times as much as a 10k-token one – and 16x as much as a 5k-token one. As of May 2026 FlashAttention and sliding-window tricks dampen the effect but do not remove it. In practice, long context is an economic decision. If you frequently send 100k tokens, RAG (filtering the knowledge down to 10-20k tokens before the model call) is typically 5-15x cheaper than long context.
Why are there many heads in multi-head attention?
Every head learns its own dependency pattern. Empirical studies (Clark et al. 2019, "What does BERT look at?") show: certain heads consistently cover certain linguistic phenomena – pronoun resolution, subject-verb relations, semantic proximity, coreference. With 8-128 parallel heads a model can model many different reference patterns simultaneously. Increasing head count does not yield boundless gains – beyond 16-32 heads in mid-sized models the scaling curve flattens.
Does Mamba solve the attention problems?
Mamba (Gu and Dao 2023, Mamba-2 2024) is a state-space architecture without attention, with linear instead of quadratic complexity. As of May 2026 interesting for very long sequences (genome analysis, very long code bases) and in some hybrid models (Jamba by AI21, Mamba-Codestral). In SME practice as of May 2026 not mainstream – the top models (GPT-4.1, the current top Claude model, Gemini 2.5, Llama 4) remain attention-based with FlashAttention plus long-context tricks. Worth monitoring, not yet worth deploying.
Sources
- Vaswani et al. – Attention Is All You Need (arXiv:1706.03762) · 2017-06
- Dao et al. – FlashAttention-3: Fast and Accurate Attention with Asynchrony (arXiv:2407.08608) · 2024-07
- Clark et al. – What Does BERT Look At? An Analysis of BERT's Attention (arXiv:1906.04341) · 2019-06
- Gu and Dao – Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752) · 2023-12
- Hugging Face – Attention Implementation Documentation · 2026-04