What is pretraining? How an LLM learns its base capability May 2026

Pretraining is the self-learning phase in which a language model absorbs language, grammar and factual knowledge from 10-15 trillion tokens of text. Explained: data sources, cutoff, cost.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is pretraining?

Pretraining is the first and most expensive phase in the life cycle of a language model. In this phase the model reads billions of texts and learns to predict the most likely next word fragment. From this one learning task – "predict the next token" – everything follows indirectly: grammar, factual knowledge, languages, programming skills, reasoning patterns. That is the central insight of the LLM era: a single simple training signal suffices when the data volume is large enough.

The approach is called self-supervised learning. No one needs to label the training data manually. Every word in every sentence is its own label: the model sees the words before it and must predict the current one. The gap between the prediction and the actual word is measured as a loss, and the model adjusts its internal parameters (weights) to reduce that loss. With 10-15 trillion tokens of training data and 70-700 billion parameters, that yields trillions of individual token predictions to learn from.
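The "every word is its own label" idea can be sketched in a few lines. This is a minimal illustration, not a production tokenizer or training loop: whitespace splitting stands in for a real subword tokenizer, and the function names `next_token_examples` and `cross_entropy` are hypothetical.

```python
import math

def next_token_examples(tokens):
    """Turn one token sequence into (context, label) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, label):
    """Loss for one prediction: -log of the probability given to the true token."""
    return -math.log(predicted_probs[label])

# Assumption: whitespace splitting instead of a real subword tokenizer.
tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)
for context, label in examples:
    print(context, "->", label)

# A model that gives the true token 50% probability incurs a loss of log(2).
loss = cross_entropy({"cat": 0.5, "dog": 0.5}, "cat")
print(f"loss: {loss:.3f}")
```

A six-token sentence yields five training pairs without any manual labelling. The lower the probability the model assigns to the true token, the higher the loss and the larger the weight adjustment.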

As of May 2026 pretraining is a game for tech giants. Realistic training cost for frontier models: Llama 3.1-405B (July 2024) USD 60-100 million GPU time; the current top GPT model (early 2026) USD 300-500 million estimated; Gemini 2.5 Pro also in the USD 200 million class. These figures exclude research, data acquisition and compliance – full cost typically doubles. For an SME pretraining is therefore out of reach. Whoever wants an "own model" thinks of fine-tuning (see wie-trainiert-man-eigenes-modell), not pretraining.

After pretraining the model is a "base model" – language-capable but undirected. It completes any text without knowing what a question or an instruction is. The transformation into a helpful assistant happens in the next phase, instruction tuning and RLHF (see was-ist-rlhf).

Why pretraining matters for SMEs

Even if you never train a model yourself, pretraining choices directly affect daily work in a fiduciary office or SME. Three consequences follow.

First: the cutoff date. Pretraining ends on a fixed cutoff. All world events, legal changes and market data after that date are unknown to the model. May 2026: the current top Claude model cutoff January 2026, the current top GPT model cutoff October 2025, Gemini 2.5 Pro cutoff December 2025. Whoever asks the model about the revised VAT ordinance 2027 gets either an outdated answer or a hallucinated one. Consequence: time-critical fact questions must be anchored via RAG (own knowledge base), web search tool or tool use – not answered from model memory.

Second: the training material shapes bias. As of May 2026 English material dominates – estimates for Llama 3.1: 89% English, 1.8% German, 1.6% French. Consequence: German answers tend to be weaker than English answers – typically 5-15% quality gap in independent benchmarks (HELM, MMMLU, MEGA). Whoever wants highest German quality checks the current DeepSeek-V generation, an upcoming Mistral Large generation or Gemini 2.5 Pro (all with relatively high DE share in pretraining).

Third: data compliance is an open wound. Pretraining data contains CommonCrawl (web scrape without consent), GitHub code (with licence conflicts), possibly also books (lawsuit avalanche 2023-2026 in the US and EU). As of May 2026 the legal situation in Germany and Switzerland is in flux: the EU AI Act demands a "training-data summary" as mandatory transparency, the copyright reform package is expected 2026/27. For SME users practically: document which model you use and which compliance commitments the vendor makes. Vendors with clear pretraining documentation (Mistral, Anthropic, Cohere) have an edge in May 2026 over vendors with unclear data provenance.

Strategic consequence. Pretraining is the ceiling of what a model can do at all – fine-tuning can only sharpen within that ceiling. Whoever overlooks the pretraining data mix (DE/FR/IT share, code share, cutoff date) in the model-selection process selects in the dark.

Pretraining in detail

A full pretraining run breaks into four phases: data acquisition, data cleaning, training, evaluation.

Phase 1: data acquisition. As of May 2026 the standard sources are: CommonCrawl (5-7 trillion tokens of web text, 200+ languages, filtered), GitHub (1-2 trillion tokens of code, all common languages), ArXiv (200-400 billion tokens of scientific papers), Wikipedia (40-80 billion tokens of encyclopaedic text), book scans (300-700 billion tokens, high quality but licence-encumbered), Stack Exchange (100-200 billion tokens of Q&A), Reddit (300-500 billion tokens of dialogue, filtered by subreddit quality). Total typically 10-15 trillion tokens after deduplication – corresponding to about 50 million books.

Phase 2: data cleaning. From raw data the actual training corpus is distilled. Steps: language detection (discards wrong-language texts), quality filter (removes SEO spam, boilerplate, auto-generated text), toxicity filter (removes hate speech, violence descriptions), PII filter (removes personally identifiable data), deduplication (removes duplicates that would push the model to memorise repeated text), decontamination (removes texts contained in test benchmarks). From 50-100 trillion raw crawl tokens, 10-15 trillion training tokens remain. Data quality beats data quantity – the lesson of 2023-2026 in countless ablation studies (Llama 3 paper, Mistral reports, DeepSeek paper).
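A toy version of such a cleaning pipeline might look like the sketch below. It covers only four of the listed steps (quality filter, PII filter, deduplication, decontamination). Language detection and toxicity filtering are left out because they need trained classifiers. All function names, thresholds and sample documents are illustrative assumptions, not the method of any vendor named above.

```python
import hashlib
import re

def looks_low_quality(text, min_words=5, max_repeat_ratio=0.5):
    """Crude heuristic (assumption): too short, or dominated by one repeated word."""
    words = text.lower().split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio

def scrub_pii(text):
    """Mask e-mail addresses; a real PII filter covers far more than this."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)

def ngrams(text, n=5):
    """Set of lower-cased word n-grams used for overlap checks."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def clean_corpus(documents, benchmark_texts):
    """Apply quality filter, PII scrub, exact deduplication and decontamination."""
    benchmark_ngrams = set()
    for item in benchmark_texts:
        benchmark_ngrams |= ngrams(item)
    seen_hashes = set()
    kept = []
    for doc in documents:
        if looks_low_quality(doc):
            continue
        doc = scrub_pii(doc)
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a document already kept
        if ngrams(doc) & benchmark_ngrams:
            continue  # shares a 5-gram with a benchmark item
        seen_hashes.add(digest)
        kept.append(doc)
    return kept

raw = [
    "Pretraining teaches a model to predict the next token in text.",
    "Pretraining teaches a model to predict the next token in text.",
    "buy buy buy buy buy buy now",
    "Contact anna@example.com for the full quarterly report on revenue.",
    "What is the capital of France? The answer is Paris of course.",
]
benchmarks = ["What is the capital of France?"]
cleaned = clean_corpus(raw, benchmarks)
for doc in cleaned:
    print(doc)
```

Of the five sample documents two survive: the duplicate, the spam line and the benchmark overlap are dropped, and the e-mail address is masked. Production pipelines typically add near-duplicate detection on top of exact hashing.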

Phase 3: training. A cluster of 10,000-32,000 H100 or H200 GPUs processes the corpus over 2-4 months. A GPU-hour costs USD 2-6 on hyperscaler cloud (May 2026) and about USD 0.8-2 in a company's own datacenter (Meta, Microsoft, Google). Training runs autoregressively: the model sees 4,000-32,000 tokens of context, predicts the next token, compares it with the actual token and adjusts its parameters. This loop runs over all 10-15 trillion tokens, typically in 1-2 passes (epochs). Hardware requirement for Llama 3.1-405B: 16,000 H100 GPUs over 54 days, around USD 60 million pure compute (Meta, July 2024 report).
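The compute figures can be checked with simple arithmetic. The sketch below assumes every GPU runs around the clock for the whole period, which is a simplification. The helper names are hypothetical, and the hourly rates are taken from the ranges quoted in the text.

```python
def gpu_hours(gpus, days):
    """Total GPU-hours for a run that keeps all GPUs busy around the clock."""
    return gpus * days * 24

def compute_cost_usd(hours, usd_per_gpu_hour):
    """Pure compute cost, excluding research, data and personnel."""
    return hours * usd_per_gpu_hour

hours = gpu_hours(16_000, 54)        # figures quoted for Llama 3.1-405B
implied_rate = 60_000_000 / hours    # rate implied by the quoted USD 60 million

print(f"GPU-hours: {hours:,}")
for rate in (0.8, 2.0, 6.0):         # price points from the ranges in the text
    print(f"USD {rate:.2f}/h -> USD {compute_cost_usd(hours, rate):,.0f}")
print(f"Implied rate for USD 60M: {implied_rate:.2f} USD per GPU-hour")
```

Under these assumptions the run comes to about 20.7 million GPU-hours, and the quoted USD 60 million corresponds to roughly USD 2.9 per GPU-hour. The same two functions can be reused for any cluster size and duration.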

Phase 4: evaluation. Before release the model runs benchmark suites: MMLU (roughly 15,900 multiple-choice questions across 57 subjects), HumanEval and MBPP (code generation), GSM8K and MATH (mathematics), MMMLU (multilingual MMLU), HELM (broad-scope eval, over 40 scenarios), and proprietary vendor suites. As of May 2026 the eval landscape is fragmented – no single score captures model quality fully. For SME selection multilingual benchmarks (MMMLU DE subset, MEGA) and domain-specific evals (fiduciary, legal, accounting Q&A) matter more than the much-quoted general scores.
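A domain-specific eval of the multiple-choice kind boils down to comparing predicted answer letters with a gold key, broken down by subject. The sketch below is an assumption-laden illustration: the subject names and answer rows are invented, and `per_subject_accuracy` is a hypothetical helper, not part of any benchmark suite.

```python
from collections import defaultdict

def per_subject_accuracy(rows):
    """rows: iterable of (subject, predicted_letter, gold_letter)."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in rows:
        totals[subject][0] += int(predicted == gold)
        totals[subject][1] += 1
    return {subject: correct / total for subject, (correct, total) in totals.items()}

# Invented sample answers for two hypothetical in-house subjects.
rows = [
    ("tax_law_de", "B", "B"),
    ("tax_law_de", "C", "A"),
    ("accounting", "D", "D"),
    ("accounting", "A", "A"),
]
scores = per_subject_accuracy(rows)
overall = sum(predicted == gold for _, predicted, gold in rows) / len(rows)

print(scores)
print(f"overall: {overall:.2f}")
```

The per-subject breakdown is the point: an overall score of 0.75 hides that the model answers only half of the tax-law questions correctly in this made-up sample.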

Understand pretraining in 5 steps

01 Accept: pretraining is a past phase in which the model learned from 10-15 trillion tokens of world text. You use the result, not the process.
02 Check the cutoff date of every model you deploy – all world events thereafter are unknown to the model.
03 Check the data mix of the model (DE/FR/IT/EN, code, books) via the publicly available model card or datasheet.
04 Understand the ceiling: pretraining is the upper bound of model capability. Fine-tuning and RAG sharpen but do not break it.
05 Make the model choice with pretraining awareness: Claude/Mistral for DE quality, DeepSeek for cost, Llama for open-weight, Gemini for multilingual.

When pretraining knowledge becomes practical

Four concrete SME decisions in which pretraining knowledge tips the scale.

Decision 1: model selection. When picking between the current top Claude model, the current top GPT model, Gemini 2.5 Pro, Llama 4 and the current DeepSeek-V generation, pretraining mix is the main factor for German-language quality. As of May 2026 the current top Claude model and an upcoming Mistral Large generation typically lead in German fiduciary tests. The current DeepSeek-V generation is very cost-efficient but has more English bias. Gemini 2.5 Pro has broad multilingual presence. Llama 4 is open-weight (good for self-hosting) with solid German.

Decision 2: check cutoff date. Before using a model for a task, check the cutoff date. If the question to answer arose after that date (tax reform, statute change, market data), you need RAG or a web-search tool. May 2026 standard cutoffs: the current top Claude model Jan 2026, the current top GPT model Oct 2025, Gemini 2.5 Pro Dec 2025, Llama 4 Maverick March 2025, the current DeepSeek-V generation Sep 2025, an upcoming Mistral Large generation Dec 2025.
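The cutoff check can be turned into a small routing rule. This is a sketch under assumptions: the dictionary holds only the two models the text names concretely, the cutoff month is conservatively mapped to its first day, and `needs_external_data` is a hypothetical helper name.

```python
from datetime import date

# Cutoffs as stated in the text, at month precision.
# Assumption: the first day of the cutoff month is used, so anything in or
# after that month is routed to RAG or web search.
CUTOFFS = {
    "Gemini 2.5 Pro": date(2025, 12, 1),
    "Llama 4 Maverick": date(2025, 3, 1),
}

def needs_external_data(model, event_date):
    """True if the event falls on or after the model's cutoff."""
    return event_date >= CUTOFFS[model]

question_date = date(2025, 6, 1)  # e.g. a statute change in June 2025
for model in CUTOFFS:
    route = "RAG / web search" if needs_external_data(model, question_date) else "model memory"
    print(f"{model}: {route}")
```

The same question can be safe for one model and out of range for another, which is why the check belongs in front of every deployment rather than being done once.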

Decision 3: read compliance contracts. Vendors with transparent pretraining data (Mistral cards, Anthropic model cards, Cohere datasheets) enable audit-ready use. Vendors silent on data sources create risks in an EU AI Act audit (Art. 53 duty to publish a training-content summary). For fiduciary and law firms: pick vendors who at least disclose pretraining sources at category level.

Decision 4: expectation management. Pretraining creates language capability, not a world model. Whoever asks a model a detailed tax question whose answer appeared in no guideline in the pretraining corpus gets hallucinated answers – no model rescues that. Whoever has understood that builds a RAG connection instead of saying "let us try with the current top GPT model".

When pretraining know-how does not help

Three cases in which pretraining know-how adds no value – or becomes a trap.

First: SMEs plan no pretraining of their own. Even a "small" model (7-13 billion parameters) needs 50-200 H100 GPUs over 2-6 weeks – USD 500,000 to 3 million plus data acquisition, personnel and compliance. As of May 2026 pretraining is a game for 50+ organisations globally, not for fiduciary offices. Whoever thinks "we train our own model" practically always means fine-tuning or RAG (see wie-trainiert-man-eigenes-modell).

Second: the pretraining data mix cannot be "retrained". If you use an upcoming Mistral Large generation and want more German competence, you cannot "just load more German into pretraining". Pretraining is a past phase. Whoever wants more DE competence considers switching models or fine-tuning on German domain data.

Third: the pretraining cutoff cannot be shifted to "tomorrow". The cutoff date is the state of the training data – no API configuration changes that. Whoever needs current data builds RAG, tool use with web search or fine-tuning on own up-to-date data. "Please update your knowledge" as a prompt does not work.
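How RAG supplies current data at question time, rather than changing the model, can be shown with a deliberately naive sketch. Word overlap stands in for a real embedding search, the knowledge-base passages are invented placeholders, and `overlap_score` and `build_prompt` are hypothetical names.

```python
def overlap_score(query, passage):
    """Number of distinct words shared by query and passage (naive retrieval)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def build_prompt(query, knowledge_base, top_k=1):
    """Pick the best-matching passages and place them in front of the question."""
    ranked = sorted(knowledge_base, key=lambda p: overlap_score(query, p), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Invented placeholder passages standing in for an up-to-date internal knowledge base.
knowledge_base = [
    "Internal memo 2026-04: the expense policy limit is 80 francs per day.",
    "Office opening hours are 8 to 17 on weekdays.",
]
prompt = build_prompt("What is the expense policy limit per day", knowledge_base)
print(prompt)
```

The model's weights and cutoff stay untouched. The current fact reaches the model only because it is placed in the prompt, which is why the knowledge base has to be maintained by the user, not the vendor.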

Trap "free pretraining". Platforms like Hugging Face Hub host open weights – Llama, Mistral, Qwen, DeepSeek. The model is free, the pretraining was not. Whoever self-hosts open-weight models bears inference costs (hardware, electricity, maintenance), not pretraining costs. That gap is large: self-hosting a 13B model on an A100 costs about CHF 1.5-3 per hour of inference, while the pretraining would have cost about USD 1-3 million.

Trap "pretraining looks at the internet". As of May 2026 NO standard LLM has live internet access – everything is pretraining echo. For live data the model needs a web-search tool (Anthropic Brave Search integration, OpenAI browse tool, Perplexity, Gemini Google Search). That is tool use, not pretraining.

Trade-offs

STRENGTHS

Creates language and world knowledge from a simple training task (predict next token)
Self-supervised: no manual labelling of training data needed
Scales with data volume – more tokens = more capability (up to a ceiling)
The pretraining outcome (open-weight or API) is cheaply accessible to SMEs in May 2026

WEAKNESSES

Cutoff date: model knows nothing after the deadline – updates need re-training or RAG
Data bias: 60-70% English overshadows DE quality
Cost USD 60-500 million per frontier model – not for SMEs
Compliance risk: licence, copyright and PII disputes at sources

FAQ

What does pretraining really cost?

May 2026 for frontier models: Llama 3.1-405B USD 60-100 million pure compute (Meta report July 2024), the current top GPT model USD 300-500 million estimated, Gemini 2.5 Pro USD 200+ million. Full cost (data, personnel, research) typically doubles. For a 13-billion-parameter model USD 1-3 million, for a 70B model USD 8-25 million. Pretraining is tech-giant territory – SMEs build on top, not from scratch.

Why does the model not know current events?

Because pretraining ends on a cutoff date. All world events, statute changes or market data after that date are unknown to the model. May 2026: the current top Claude model cutoff Jan 2026, the current top GPT model Oct 2025, Gemini 2.5 Pro Dec 2025. For current knowledge the model needs RAG (own knowledge base), a web-search tool or tool use with a database connection. "Please update your knowledge" as a prompt does not work.

Why is German underrepresented in pretraining?

Because the internet, and with it CommonCrawl, is 60-70% English, and over 90% of other publicly available text (libraries, ArXiv, GitHub) is English. As of May 2026 German is typically the second or third largest share (1.5-3%) – a lot in absolute terms (50-300 billion tokens), little in relative terms. Consequence: DE quality is 5-15% weaker than EN quality with most models. Mistral and DeepSeek have relatively higher DE shares and are strong in the DE fiduciary domain as of May 2026.

Can pretraining be "forgotten"?

Partially. With unlearning techniques (May 2026 research from Anthropic, OpenAI, Meta) a model can selectively forget copyright-violating texts or PII. But a full "pretraining reset" equals new pretraining – USD 50-500 million effort. Practically for SMEs: irrelevant. Vendors solve pretraining problems via output filters (refusal model, content moderation), not actual unlearning.
