LLM BASICS · AI CONCEPT
How does an LLM work? Autocomplete on steroids, explained for SMEs (May 2026)
A language model predicts the most likely next word fragment. Explained in five stations: tokenisation, embedding, transformer, logits, sampling – without maths.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is an LLM, really?
A Large Language Model, or LLM, is at its core a very large probability function. It takes a text prefix and estimates which word fragment will most likely come next. That is the entire magic – repeated millions of times per answer.
The sober analogy: autocomplete on steroids. Your phone suggests the next three words while you type. An LLM does technically the same, except it learns not from your personal typing history but from trillions of tokens of internet, book and code text. It writes the likely continuation token by token. If the input reads "Dear Ladies and", the model estimates that "Gentlemen" comes next with high probability, "Colleagues" with medium, "Men" with very low.
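The "Dear Ladies and" example can be pictured as a small table of candidates with estimated probabilities. A minimal sketch follows; the numbers are invented for illustration and do not come from a real model, and `most_likely` is a hypothetical helper name.

```python
# Invented, illustrative probabilities for the token after "Dear Ladies and".
# A real model produces one such estimate for every token in its vocabulary.
next_token_probs = {
    "Gentlemen": 0.90,   # high
    "Colleagues": 0.07,  # medium
    "Men": 0.001,        # very low
}


def most_likely(probs: dict[str, float]) -> str:
    """Return the candidate with the highest estimated probability."""
    return max(probs, key=probs.get)


print(most_likely(next_token_probs))  # Gentlemen
```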
Important for the business mindset: an LLM does not understand, it estimates. It has no world model, no intent, no memory of yesterday's request. What looks like understanding is a statistical echo of its training data. That fact explains both the strengths (linguistically fluent, often topically apt) and the weaknesses (hallucinates with full conviction, can confuse facts).
As of May 2026 the main representatives are the current top Claude model (Anthropic), the current top GPT model and GPT-4.1 (OpenAI), Gemini 2.5 Pro (Google), Llama 4 Maverick (Meta), an upcoming Mistral Large generation (Mistral) and the current DeepSeek-V generation (DeepSeek). All share the same base architecture – transformer with self-attention – and differ mainly in size, training data, tokeniser and fine-tuning. For the management of a fiduciary firm the architecture matters less than the consequences: what an LLM can do reliably, where it hallucinates, what it costs per request.
Why the mechanics matter
Whoever broadly understands the inside of an LLM makes better business decisions. Three points are decisive for fiduciary and SME leaders.
First: the model computes on probability, not truth. If the training material is wrong or covers a topic only thinly, the model still estimates the "likely" answer – and can miss by a wide margin. As of May 2026 this remains visible in tax detail questions, in date statements and in quota calculations. Whoever pulls a VAT quota from a raw language model, without RAG or tool use, accepts an error risk. That is not a "bug that will be fixed" but a consequence of the mechanics.
Second: piece-by-piece generation explains latency and cost. The model produces the answer token by token, autoregressively. A 600-token answer takes about 3-12 seconds of generation time depending on vendor and load. "Faster servers" alone cannot remove this limit, because each token depends on the ones before it: token order is sequential by design. Whoever needs real-time answers (e.g. voice bot) must cap answer length or use streaming.
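The latency figures above translate into a generation rate, which in turn gives a rough cap on answer length for a given time budget. A back-of-the-envelope sketch follows; the function names and the 2-second voice-bot budget are assumptions for illustration, not vendor guarantees.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation rate for an answer of the given length."""
    return tokens / seconds


def max_tokens_for_budget(latency_budget_s: float, rate_tokens_per_s: float) -> int:
    """Longest answer that still fits into the latency budget."""
    return int(latency_budget_s * rate_tokens_per_s)


fast = tokens_per_second(600, 3)    # 200.0 tokens per second
slow = tokens_per_second(600, 12)   # 50.0 tokens per second

# Hypothetical voice bot that must finish speaking-ready text within 2 seconds,
# planned against the slow end of the range.
print(max_tokens_for_budget(2.0, slow))  # 100
```

Streaming does not change the total generation time, but it lets the first tokens reach the user while the rest is still being produced.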
Third: the context window is finite. The model cannot keep "the entire client file" in view at once. As of May 2026 standard context windows range from 128,000 to 2,000,000 tokens – much, but not infinite. Too-long context lowers answer quality (lost-in-the-middle effect) and raises cost linearly. RAG (see retrieval-augmented-generation) is the right answer for large knowledge bases, not "stuff more into the prompt".
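Whether a document fits into a context window can be estimated before sending anything. The sketch below assumes about 4 characters per token (the middle of the 3-5 character range mentioned in this article) and a 128,000-token window; both are rough planning values, and the real count depends on the vendor's tokeniser.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (assumed ratio, not exact)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int = 128_000,
                    reserved_for_answer: int = 1_000) -> bool:
    """True if the text plus room for the answer fits into the window."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


client_file = "x" * 1_000_000          # placeholder for a large client file
print(estimate_tokens(client_file))    # 250000
print(fits_in_context(client_file))    # False
```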
Practically important for management: an LLM is no database replacement, no accountant, no lawyer. It is a text generator that becomes a reliable colleague with the right scaffolding (RAG, tool use, audit log, refusal prompt). The architecture explains why that scaffolding is needed – and why "we just use ChatGPT" is not enough.
Five stations of an LLM answer
From button press to finished answer a request traverses five stations. Each is well explainable without mathematics.
Station 1: tokenisation. The input text is split into tokens – small word fragments, typically 3-5 characters. A German "Mandantenanfrage" becomes for instance "Mand", "ant", "en", "anfrage" – four tokens. These tokens are the building blocks for everything that follows. Every model has its own tokeniser; what costs 80 tokens with an OpenAI model can be 75 or 85 with Claude. See was-ist-token for details.
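The split of "Mandantenanfrage" can be reproduced with a toy tokeniser. This sketch uses greedy longest-match against a tiny hand-written vocabulary, which is a simplification: production tokenisers typically learn their vocabulary from data (for example with byte-pair encoding). `VOCAB` and `tokenize` are hypothetical names.

```python
# Tiny hand-written vocabulary, chosen only to reproduce the example above.
VOCAB = {"Mand", "ant", "en", "anfrage"}


def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text greedily into the longest known pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("Mandantenanfrage", VOCAB))
# ['Mand', 'ant', 'en', 'anfrage']
```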
Station 2: embedding. Every token is converted into a vector – a list of typically 4,096-12,288 numbers (May 2026 status for large models). Similar tokens land near each other in the space: "Mandant" and "Klient" have very similar vectors, "Mandant" and "Pizza" have vectors that lie far apart. These vectors are what the model internally really "sees".
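"Near" and "far apart" can be made concrete with cosine similarity, a common way to compare vectors. The three-dimensional vectors below are invented for illustration; real embeddings have thousands of dimensions and are learned during training.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy embeddings (assumption: 3 dimensions instead of thousands).
embeddings = {
    "Mandant": [0.90, 0.80, 0.10],
    "Klient": [0.85, 0.75, 0.20],
    "Pizza": [0.10, 0.00, 0.95],
}

similar = cosine_similarity(embeddings["Mandant"], embeddings["Klient"])
distant = cosine_similarity(embeddings["Mandant"], embeddings["Pizza"])
print(f"Mandant vs Klient: {similar:.2f}")
print(f"Mandant vs Pizza:  {distant:.2f}")
```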
Station 3: transformer layers. The vectors traverse 32-128 so-called transformer layers (May 2026 typically 60-100 for top models). Every layer has two main components: self-attention (the model "looks" at all other tokens of the input to understand context) and feedforward network (computes intermediate results). Self-attention is the core; see was-ist-attention-mechanismus for the detail. After all layers the model has an "understood" vector for each position – saturated with context, grammar and factual knowledge from training.
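The "looking at all other tokens" step can be sketched as scaled dot-product attention. This is a deliberately reduced version under stated assumptions: a single head, no learned query/key/value projections (the input vectors are used directly) and no causal mask. Real transformer layers add all three.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn arbitrary scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: list[list[float]]) -> list[list[float]]:
    """For each position, mix all vectors weighted by similarity to that position."""
    dim = len(vectors[0])
    output = []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Weighted average of all vectors: the context-enriched result.
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        output.append(mixed)
    return output


contextualised = self_attention([[1.0, 0.0], [0.0, 1.0]])
print(contextualised)
```

Each output vector is a blend of all input vectors, which is how context from other positions flows into a token's representation.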
Station 4: logits. From the last vector the model computes a score for every possible next token. With a vocabulary of 100,000 tokens that is 100,000 scores – called "logits". High score = likely continuation, low = unlikely. These scores are converted into probabilities by the softmax function (all together summing to 100%).
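The conversion from scores to probabilities can be shown with a three-token vocabulary. The logit values below are invented; the softmax itself is the standard function named above.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Invented scores for the token after "Dear Ladies and".
logits = {"Gentlemen": 6.0, "Colleagues": 3.5, "Men": -1.0}
probs = softmax(logits)

for token, p in probs.items():
    print(f"{token}: {p:.1%}")
```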
Station 5: sampling. From the probability distribution a concrete token is chosen. Three strategies are common: greedy (always pick the most likely token), temperature (random choice, with a parameter that sharpens or flattens the distribution, see was-ist-temperature-top-p) and top-p (consider only the most likely tokens whose cumulative probability reaches p). May 2026 default for business applications: temperature 0.3-0.7. Low = more deterministic, high = more creative.
This cycle (embedding → transformer → logits → sampling) repeats for every output token. A 600-token answer goes through the cycle 600 times. That explains why generation is sequential and why output tokens are billed 3-5x more than input tokens: input tokens can be processed in parallel, whereas every output token needs its own sequential pass through the model, which makes output substantially more expensive to serve.
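The repetition can be made visible with a toy generation loop. The "model" here is a hypothetical lookup table that maps the last token to the next one; a real model would run the full pass through all layers at this point. The loop structure, one pass per new token, is the part that carries over.

```python
# Hypothetical stand-in for a real model: last token -> next token.
TOY_MODEL = {
    "Dear": "Ladies",
    "Ladies": "and",
    "and": "Gentlemen",
    "Gentlemen": "<end>",
}


def generate(prompt_tokens: list[str], max_new_tokens: int) -> tuple[list[str], int]:
    """Append one token per model pass until <end> or the length cap is reached.

    Assumes a non-empty prompt. Returns the tokens and the number of passes.
    """
    tokens = list(prompt_tokens)
    passes = 0
    for _ in range(max_new_tokens):
        next_token = TOY_MODEL.get(tokens[-1], "<end>")  # one "model pass"
        passes += 1
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes part of the input
    return tokens, passes


tokens, passes = generate(["Dear"], max_new_tokens=10)
print(tokens, passes)
# ['Dear', 'Ladies', 'and', 'Gentlemen'] 4
```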
Understand an LLM in 5 steps
- 01 Accept the principle: an LLM predicts the most likely next word fragment – no understanding, no world model, no memory across requests.
- 02 Separate the five stations mentally: tokenisation, embedding, transformer, logits, sampling. Each has its own costs and levers.
- 03 Check per use-case: does the task need language (yes → LLM) or exact calculation (no → tool use or rule engine)?
- 04 Estimate token volume and cost per request: input tokens times input price plus output tokens times output price. May 2026 typically CHF 0.001-0.05 per client answer.
- 05 Anchor critical answers with scaffolding: RAG for source citations, tool use for arithmetic, audit log for traceability, refusal prompt against hallucination.
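The cost formula from step 04 can be written out directly. In this sketch the token counts (2,000 input, 600 output) are assumed for illustration, and the prices of USD 3 and USD 15 per million tokens are the Claude Sonnet figures quoted in the FAQ of this article; substitute your vendor's current price list.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Input tokens times input price plus output tokens times output price."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Assumed request size: 2,000 input tokens and a 600-token answer.
usd = cost_per_request(2_000, 600,
                       input_price_per_million=3.0,
                       output_price_per_million=15.0)
print(f"USD {usd:.3f} per request")  # USD 0.015 per request
```

In this example the 600 output tokens cost more than the 2,000 input tokens, which is why capping answer length is often the most effective cost lever.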
When an LLM is the right choice
An LLM is the right choice when the task must produce or understand natural language, when the answer need not be 100% exact (or is anchored by RAG) and when the value of the answer exceeds token cost.
Concrete SME applications May 2026: triage client inquiries and pre-route (see ai-mandantenanfragen), email triage and draft replies, generate dunning letters, compare contract clauses, document recognition with vision LLM (see ai-belegerkennung-ocr), summarise meeting recordings, generate multilingual answers (DE/FR/IT/EN). In all these cases the model is a productivity lever: a fiduciary employee handles 2-5x more inquiries per hour with LLM support at comparable quality.
For evidence-bound tasks (tax detail answer, statute citation, accounting booking) an LLM is only right when augmented with RAG (source binding) and tool use (calculator, database query). A raw LLM must never be the "final answer" in a fiduciary workflow without a safety net.
When an LLM is NOT the right choice
Three cases in which an LLM is the wrong choice.
First: exact numerical calculation without tool use. An LLM often computes "3.45 times 27 plus VAT 7.7%" correctly – but not reliably. As of May 2026 LLMs are wrong in 5-15% of complex multi-step calculations. For accounting bookings, VAT quotas or tax calculation a calculator tool MUST be wired in (tool use, function calling, see was-ist-tool-use-function-calling), not the raw text generator.
Second: deterministic rule application. If the task is "if invoice sum greater than CHF 1,000, then review", a simple if-then rule in the accounting software is cheaper, faster and 100% correct. An LLM adds no value here; it costs tokens and may decide differently from one run to the next.
Third: highly sensitive data without compliance architecture. Whoever sends client files under professional secrecy (Art. 321 SCC) through a US model without a data residency contract risks a criminal complaint. Clarify the compliance architecture first (see dsgvo-und-llms, berufsgeheimnis-stgb-321-ki), then pick the model – not the other way round.
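A calculator tool of the kind meant here can be very small. The sketch below uses Python's `decimal` module so that amounts are computed exactly and rounded to cents in a defined way. `gross_amount` is a hypothetical function name, the rounding mode is an assumption, and how the function is registered as a tool depends on the vendor's function-calling API, which is not shown.

```python
from decimal import Decimal, ROUND_HALF_UP


def gross_amount(unit_price: Decimal, quantity: int,
                 vat_rate_percent: Decimal) -> Decimal:
    """Net amount plus VAT, rounded to two decimals (assumed: round half up)."""
    net = unit_price * quantity
    gross = net * (Decimal(1) + vat_rate_percent / Decimal(100))
    return gross.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The rate is passed in by the caller; 10 is a placeholder, not a legal rate.
print(gross_amount(Decimal("3.45"), 27, vat_rate_percent=Decimal("10")))
# 102.47
```

The language model then only decides when to call the tool and with which arguments; the arithmetic itself is deterministic.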
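For comparison, the review rule from this paragraph as plain code. The names are hypothetical; the point is that the same input always yields the same decision, at no token cost.

```python
from decimal import Decimal

REVIEW_THRESHOLD_CHF = Decimal("1000")


def needs_review(invoice_total_chf: Decimal) -> bool:
    """Rule from the text: review if the invoice sum is greater than CHF 1,000."""
    return invoice_total_chf > REVIEW_THRESHOLD_CHF


print(needs_review(Decimal("1000.00")))  # False (not greater than the threshold)
print(needs_review(Decimal("1000.05")))  # True
```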
More traps: tasks with hard reproducibility requirements (e.g. "same input must always produce same output") are problematic with non-deterministic sampling strategies – temperature 0 helps but with some vendors does not guarantee bit-exact reproducibility. And: an LLM is not a search index. Whoever searches "all contracts with clause X" needs a database query or vector index, not an LLM prompt with the whole contract collection in the context.
Trade-offs
STRENGTHS
- Universal tool for language tasks – understand, generate, translate, summarise
- Pay-per-token: no fixed-cost block, scales with usage
- Multilingual out-of-the-box (DE/FR/IT/EN with all top models)
- Mature API ecosystems with audit log, RAG and tool use (as of May 2026)
WEAKNESSES
- Hallucinates without a safety net – not for evidence-bound answers without RAG
- Output latency 3-12 seconds, sequential, not parallelisable
- Costs scale linearly with volume – no economy-of-scale magic
- Understanding is statistical echo, no world model – limits in multi-step reasoning
FAQ
Does an LLM really understand what it reads?
No, not in the human sense. It estimates very well which answer statistically fits an input – that looks like understanding. As of May 2026 there is still no scientific agreement on whether LLMs develop a weak form of understanding. For business applications the practical rule: treat the model as a very capable intern, not a responsible senior. Verify answers, do not let the model alone decide critical questions.
Why are LLMs so expensive in output?
Because every output token requires a full pass through all transformer layers. With 80-100 layers and a 70-billion-parameter model that is many compute operations per token. Input tokens are processed in parallel, output tokens sequentially – making output 3-5x more expensive. As of May 2026: Claude Sonnet USD 3 input / USD 15 output per million tokens, the current top GPT model USD 5 / USD 25.
Do I need my own model?
As of May 2026 for 95% of SMEs: no. Training your own model (from scratch) costs USD 50-500 million and needs a team of 30+ specialists. Fine-tuning an existing open-source model (Llama 4, Mistral) is feasible from CHF 5-50k and adapts the model to your style or domain vocabulary. For most applications an API model (Claude, GPT, Gemini) plus RAG is enough (see was-ist-fine-tuning-vs-rag).
Why do LLMs hallucinate?
Because they are optimised for probability, not truth. When the model has no strong evidence in the training echo for a question, it still estimates a plausible-sounding answer – even if that is invented. Countermeasures: RAG (source binding), refusal prompt ("if unknown, say so"), citation checks, low temperature. Hallucination is decreasing in May 2026 but not zero.
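Two of these countermeasures, source binding and a refusal prompt, can be combined when the prompt is assembled. The sketch below only builds the prompt text; the wording of the instructions and the function name `build_grounded_prompt` are assumptions, and retrieving the sources (the RAG step) and calling a model are left out.

```python
def build_grounded_prompt(question: str, sources: list[str]) -> str:
    """Combine numbered sources, a citation rule and a refusal rule into one prompt."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(sources, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n"
        "If the sources do not contain the answer, reply exactly: I do not know.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "Which review threshold applies to invoices?",
    ["Internal policy: invoices above CHF 1,000 require review."],  # example source
)
print(prompt)
```

A citation check can then verify that every `[n]` in the answer refers to a source that was actually supplied.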
Sources
- Vaswani et al. – Attention Is All You Need (arXiv:1706.03762, original Transformer paper) · 2017-06
- Anthropic – Model Card and Architecture Overview for the current top Claude model · 2026-05
- OpenAI – Technical Report and Pricing for the current top GPT model · 2026-04
- Stanford CRFM – Foundation Model Transparency Index 2026 · 2026-03
- Hugging Face – Open LLM Leaderboard v3 · 2026-05