fairlane.systems

What are temperature and top-p? LLM sampling parameters explained May 2026

Temperature, top-p and top-k control how deterministic or how creative an LLM's answers are. Rules of thumb, May 2026: 0-0.3 for facts, 0.7 for standard chat, 1.0+ for creative work.

As of: 2026-05

What are temperature and top-p?

Temperature, top-p and top-k are sampling parameters that control how a language model picks the next token from its probability distribution. At every step an LLM emits a distribution over all vocabulary tokens – as of May 2026 typically 50,000 to 300,000 possible tokens. The next token must be chosen from this distribution, and the sampling parameters control how that choice is made.

Temperature. The most important lever. It warms or cools the probability distribution. Mathematically the logits are divided by the temperature value before softmax. Higher temperature = flatter distribution = more randomness in the choice. Lower temperature = sharper distribution = the most likely token almost always wins. Typical values May 2026: 0 (greedy, always the top token), 0.3 (very concentrated), 0.7 (a common chat setting), 1.0 (the unscaled distribution and the API default at several vendors), 1.5+ (markedly more creative, more random), 2.0 (very random, often incoherent). Note: the exact scale is vendor-specific – Claude and GPT do not behave identically at temperature 1.0.

Top-p (nucleus sampling). An additional filter. Instead of sampling from all tokens, the model sorts tokens by probability and keeps only the most likely ones until a fraction p of the probability mass is covered. Example: top-p = 0.9 means "keep the smallest set of most likely tokens whose cumulative probability reaches 90% – the remaining 10% of long-tail tokens are ignored". Holtzman et al. (2019) proposed top-p as an improvement over top-k – the "nucleus" adapts dynamically to the distribution.

Top-k. Older filter. Keeps only the k most likely tokens, ignores the rest. Typically k = 40 or 50. As of May 2026 less common than top-p because the choice of k is fixed across all contexts and does not fit the distribution as well.

The parameters interact. Temperature rescales the logits (each logit is divided by the temperature); top-p and top-k then cut the long tail. As of May 2026 most vendors set defaults of temperature 0.7-1.0, top-p 0.9-1.0, top-k not active. Whoever needs consistent, factually faithful output sets temperature to 0; because greedy decoding always picks the top token, top-p and top-k then have no effect and can be ignored.

Why these values matter in practice

Sampling parameters are the levers between "answer identical every day" and "answer different every day". Three business effects matter.

Effect 1: reproducibility. A fiduciary application classifying client inquiries should produce the same classification tomorrow given the same input. With temperature 0.7 (a common chat setting) you get 70-85% identical classifications. With temperature 0 the figure is 95-100% at nearly all vendors – the small residual variance comes from floating-point effects in GPU computation. For audit-ready applications (Art. 957a CO, EU AI Act Art. 12 logging), reproducibility is not "nice to have" but an audit precondition. The same holds for eval suites: without temperature 0, test results are unstable and regression tests are harder to interpret.
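
Agreement figures like these can be checked on your own data by replaying the same input several times and counting how often the most frequent answer appears. The following is a minimal sketch; `agreement_rate` and the sample labels are illustrative names and values, not part of any vendor SDK.

```python
from collections import Counter

def agreement_rate(outputs):
    """Share of runs that returned the most frequent output (1.0 = fully reproducible)."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

# Hypothetical labels from replaying one client inquiry four times.
runs = ["invoice_question", "invoice_question", "payroll_question", "invoice_question"]
print(agreement_rate(runs))
```

Running the same check before and after a vendor update gives a simple regression signal for classification tasks.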

Effect 2: hallucination probability. Higher temperature activates long-tail tokens – improbable, unusual continuations. For fact tasks ("name the VAT rate for hairdressing services") that raises the probability of hallucinations because the model drifts away from the "correct" answer. For fact tasks: temperature 0 or at most 0.3. Top-p 0.9 as an extra safeguard. For creative tasks the opposite – high temperature yields more varied text that feels "human" instead of mechanical.

Effect 3: consistency vs variety in production. A customer-support application that always returns the same wording feels unnatural (every client gets the literally identical reply). One that varies too much feels unstable. Rule of thumb May 2026: temperature 0.3-0.5 for SME support bots – slight wording variation, same substance. Higher only when marketing copy or creative text is needed.

Economic aspect. Temperature and top-p cost NOTHING – they change only the sampling, not the model compute. In other words: you pay the same token price at temperature 0 and at temperature 1.5. Good: you can experiment without cost consequence. But: at higher temperature the answer can be longer or shorter (the model picks the stop token differently often), which affects output cost. On average no significant effect though.

Compliance and safety aspect. For safety-critical applications (payment generation, booking suggestions, client-data disclosure) you should set temperature 0 and record that in the audit log. To answer a regulatory inquiry about reproducibility ("show that the model answers consistently") you need the combination of temperature 0, a pinned model version and prompt versioning. As of May 2026 this is increasingly relevant under EU AI Act Art. 26 deployer duties.

Mechanics in detail

Three steps explain how sampling parameters act.

Step 1: logits. The language model computes per position a vector of logits – one real number per vocabulary token. Higher logit = more likely token. Logits are the raw result of the last model layer.

Step 2: temperature. The logits are divided by the temperature value: scaled_logits = logits / temperature. At temperature = 1 nothing changes. At temperature < 1 differences between logits are amplified (the most likely token becomes relatively even more likely). At temperature > 1 differences flatten (all tokens become more equally likely). At temperature = 0 (mathematically undefined due to division by zero, in practice implemented as greedy sampling) the token with the highest logit is simply chosen.
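
The scaling step can be sketched in a few lines of plain Python. This is an illustration under simple assumptions: the logits are toy values rather than real model output, and the function names are made up for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as greedy selection")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """What temperature 0 means in practice: the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.0]  # toy values, not from a real model
for t in (0.3, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print("greedy pick:", greedy(logits))
```

The printed rows show the top token's share shrinking as the temperature rises, which is the flattening described above.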

Step 3: softmax and filter. Softmax is applied to the scaled logits – yielding a probability distribution. Top-p or top-k can then trim the distribution. Top-p: sort tokens descending, keep tokens until their cumulative probability exceeds p, zero the rest, renormalise. Top-k: keep only the k most likely tokens, zero the rest, renormalise. The remaining distribution is sampled – and that yields the next token.
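
The two filters and the final draw can be sketched as follows. This is a simplified illustration on a toy distribution; the function names are assumptions for this example, and production inference stacks usually apply the same cuts to tensors of logits rather than Python lists.

```python
import random

def _renormalise(probs, keep):
    """Zero every token outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)

def sample_token(probs, rng):
    """Draw one token index from the (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # toy distribution, already sorted for readability
rng = random.Random(0)          # fixed seed only to make the demo repeatable
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
print(sample_token(top_p_filter(probs, 0.9), rng))
```

With top-p = 0.9 the first three tokens survive because the first two only cover 80% of the mass; the 5% tail token can never be drawn.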

Important details.

Greedy vs sampling at temperature 0. As of May 2026 the convention at all major vendors is: temperature 0 = greedy = always the top token. OpenAI, Anthropic, Google, Mistral behave identically. DeepSeek too. In a tiny fraction of applications (very long generations, exotic stacks) there are floating-point deviations between runs – the industry accepts this as "near-deterministic".

Seed parameter. OpenAI introduced a seed parameter in late 2023 for reproducible generation at temperature > 0. Idea: same seed + same inputs + same parameters = same output. In practice, as of May 2026, this is not 100% reliable (background: hardware variability, vendor updates) but better than no seed. Support elsewhere is uneven: Google Gemini and Mistral document comparable seed settings, while Anthropic's Messages API has not exposed one.
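
The idea behind a seed can be shown locally with Python's own random generator. This is an analogy only, not a call to any vendor API: `sample_tokens` is a made-up helper and the distribution is a toy example.

```python
import random

def sample_tokens(probs, seed, length=8):
    """Draw `length` token indices from `probs` using a generator seeded with `seed`."""
    rng = random.Random(seed)
    indices = range(len(probs))
    return [rng.choices(indices, weights=probs, k=1)[0] for _ in range(length)]

probs = [0.5, 0.3, 0.15, 0.05]  # toy distribution
first = sample_tokens(probs, seed=42)
second = sample_tokens(probs, seed=42)
print(first == second)  # same seed and same inputs give the same draws
```

A hosted model adds sources of variation that this local sketch does not have, such as hardware differences and model updates, which is why vendor-side seeds are best effort.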

Vendor-specific quirks. OpenAI allows temperature 0-2, top-p 0-1. Anthropic Claude (May 2026) allows temperature 0-1 (NOT 0-2!) and top-p 0-1; values above 1 are ignored or cause errors. Google Gemini: temperature 0-2, top-p and top-k active. Mistral: temperature 0-1 recommended, technically up to 1.5 possible. Whoever builds code cross-vendor should cap temperature at 0-1 and maintain vendor mapping in the LLM gateway.
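
A gateway-side mapping along these lines could look like the sketch below. The limits are the ones quoted in the paragraph above and should be re-checked against current vendor documentation; the names `VENDOR_MAX_TEMPERATURE` and `resolve_temperature` are hypothetical.

```python
# Upper temperature bounds as quoted in the text (May 2026); verify before relying on them.
VENDOR_MAX_TEMPERATURE = {
    "openai": 2.0,
    "anthropic": 1.0,
    "google": 2.0,
    "mistral": 1.5,
}
PORTABLE_MAX_TEMPERATURE = 1.0  # the cross-vendor cap suggested in the text

def resolve_temperature(vendor, requested, portable=True):
    """Clamp a requested temperature to what the vendor (or the portable cap) allows."""
    if vendor not in VENDOR_MAX_TEMPERATURE:
        raise KeyError(f"unknown vendor: {vendor}")
    upper = VENDOR_MAX_TEMPERATURE[vendor]
    if portable:
        upper = min(upper, PORTABLE_MAX_TEMPERATURE)
    return max(0.0, min(requested, upper))

print(resolve_temperature("openai", 1.5))                     # capped for portability
print(resolve_temperature("openai", 1.5, portable=False))     # vendor limit only
print(resolve_temperature("anthropic", 1.5, portable=False))  # vendor limit applies
```

Clamping silently is one design choice; a stricter gateway might raise an error instead so that out-of-range requests are noticed during development.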

Further sampling parameters May 2026. "Min-p" (from the Llama community 2024), "Mirostat" (in some open-source stacks), "repetition penalty" (prevents the model from generating the same sentence multiple times), "frequency penalty" and "presence penalty" (OpenAI). These matter for special cases but are rarely relevant in SME daily use.

Rules of thumb per application

As of May 2026 clear values have become established for each application type.

Temperature 0 (deterministic). Code generation, fact extraction from receipts, VAT-calculation suggestions, classification, eval suite, audit-ready applications. All cases where the same input must yield the same result. Note: the model can still hallucinate – temperature 0 prevents sample variation, not content errors. Hallucinations need RAG (see halluzinationen-begrenzen).

Temperature 0.1-0.3 (very concentrated). Structured answer generation, JSON outputs, tool-call arguments, client-inquiry classification with slight variation. Top-p usually 1.0 or 0.95.

Temperature 0.5-0.7 (standard). General chat answers, FAQ bots, knowledge assistants with RAG. Some vendor defaults sit here as of May 2026 (Mistral at 0.7); others default to 1.0. Delivers slightly variable but substantively stable answers. Top-p 0.9-0.95.

Temperature 0.8-1.2 (creative). Marketing texts, slogan suggestions, brainstorming, free text generation, anything where an imaginative result is wanted. Top-p 0.95-1.0. Note: hallucination risk rises – only for tasks where content correctness is not decisive or where a human reviews.

Temperature 1.3-1.8 (very creative). Storytelling, poetry, experimental texts. Very rare in SME applications.

Temperature 2.0+ (random). Practically never in production. Only for demos or experiments.

Application examples.

*Fiduciary RAG chatbot:* Temperature 0.3, top-p 0.95. Answers consistent and faithful, slight language variation for natural feel.

*Receipt recognition with vision LLM:* Temperature 0. Extraction must be deterministic – same receipt = same data.

*Marketing slogan generator:* Temperature 1.0-1.2, top-p 0.95. Varied proposals, human picks.

*Code suggestion in an IDE:* Temperature 0.2, top-p 0.95. Concentrated on the most likely correct solution, slight variation in variable names allowed.

*Mandatory report generation (annual-report draft):* Temperature 0.5, top-p 0.9. Consistent language, but naturally readable.

*Sentiment classification:* Temperature 0. Classification tasks are discrete and should be deterministic.
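
The examples above can be kept as explicit presets in code so that no call relies on a vendor default. A minimal sketch, with hypothetical key names; where the text gives a range, the lower bound is used, and `None` means the text names no top-p value.

```python
# Presets copied from the application examples; None = no top-p value given in the text.
PRESETS = {
    "fiduciary_rag_chatbot": {"temperature": 0.3, "top_p": 0.95},
    "receipt_recognition": {"temperature": 0.0, "top_p": None},
    "marketing_slogans": {"temperature": 1.0, "top_p": 0.95},  # text gives 1.0-1.2
    "ide_code_suggestion": {"temperature": 0.2, "top_p": 0.95},
    "annual_report_draft": {"temperature": 0.5, "top_p": 0.9},
    "sentiment_classification": {"temperature": 0.0, "top_p": None},
}

def sampling_params(use_case):
    """Return only the parameters that should be sent explicitly for a use case."""
    preset = PRESETS[use_case]
    return {name: value for name, value in preset.items() if value is not None}

print(sampling_params("fiduciary_rag_chatbot"))
print(sampling_params("receipt_recognition"))
```

Keeping the presets in one place also makes it easy to log the exact sampling values alongside each request.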

Cases where sampling tuning is misplaced

Three cases where sampling tweaking addresses the wrong lever.

First: setting temperature 1.5+ in production without explicit reason. The rule of thumb May 2026: NEVER above 1.2 in productive SME applications, except for pure creative tasks with human review. High temperature raises hallucination risk, lowers consistency and makes eval suites unreliable. Whoever does this has often misunderstood the task.

Second: trying to solve hallucinations with sampling tuning. Temperature 0 makes the answer consistent but not necessarily true. If the model believes Swiss VAT is 19% (false), at temperature 0 it will ALWAYS answer 19% – consistently hallucinating is still hallucinating. Hallucinations are solved with RAG, refusal policy in the prompt, cross-check (see halluzinationen-begrenzen) – not with sampling.

Third: lowering top-p and temperature aggressively at the same time. Some combine temperature 0.2 with top-p 0.5 hoping for an "extra-safe" answer. Effect: the model has almost no room left and cannot choose legitimate alternative phrasings, and it suffers exactly where some variation is needed (e.g. deciding when to end an enumeration). Rule of thumb: lower ONE parameter (temperature OR top-p), not both aggressively.
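
How little room the combination leaves can be seen by counting the candidates that survive both steps. The sketch below uses made-up logits with several plausible continuations; `surviving_tokens` is an illustrative helper and expects a temperature above 0.

```python
import math

def surviving_tokens(logits, temperature, top_p):
    """Count the candidates left after temperature scaling and a top-p cut."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)

logits = [2.0, 1.8, 1.5, 1.0, 0.5]  # toy logits, five plausible candidates
for temperature, top_p in [(1.0, 0.5), (0.2, 0.95), (0.2, 0.5)]:
    print(temperature, top_p, surviving_tokens(logits, temperature, top_p))
```

On these toy values each setting alone still leaves a choice between two or three candidates, while temperature 0.2 combined with top-p 0.5 leaves a single one, which is greedy decoding in all but name.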

Pitfall "reasoning models". As of May 2026 OpenAI o1/o3, Claude Sonnet Thinking, Gemini 2.5 Pro Thinking and DeepSeek R1/V4 have special "reasoning" modes. For these models temperature 0 is NOT always the right choice – internal reasoning steps benefit from some temperature (DeepSeek, for example, recommends 0.5-0.7 for R1), and OpenAI's o1/o3 do not accept a custom temperature at all. Read vendor documentation!

Pitfall "vendor default". Default values are not universal. OpenAI default temperature 1.0; Anthropic Claude default temperature 1.0; Google Gemini default 1.0; Mistral 0.7. If you work without an explicit choice you get 1.0 – usually too high for fact applications. ALWAYS set values explicitly.

Pitfall "floating-point variation". Even at temperature 0 there are minimally different outputs between vendor hardware generations or after vendor updates. For hard eval reproducibility: pin the model version explicitly, test vendor-update behaviour, evaluate several vendors in parallel.

Trade-offs

STRENGTHS

  • Free lever – no extra token cost for changes
  • Temperature 0 yields quasi-deterministic, reproducible outputs
  • Top-p cuts hallucination risk from the long tail
  • Clearly established rules of thumb per application (May 2026)

WEAKNESSES

  • Vendor defaults differ – setting values explicitly is mandatory
  • Temperature 0 guarantees consistency, not correctness
  • High temperature markedly raises hallucination risk
  • Reasoning models have their own recommendations – vendor docs needed

FAQ

Which value is the default at OpenAI, Anthropic and Google?

May 2026: OpenAI default temperature 1.0, top-p 1.0. Anthropic Claude default temperature 1.0, top-p 1.0. Google Gemini 2.5 default temperature 1.0 (slightly model-specific), top-p 0.95, top-k 64. Mistral default temperature 0.7. The current DeepSeek-V generation defaults to temperature 1.0. Rule of thumb for SME applications: ALWAYS set the value explicitly and do not rely on defaults – defaults sometimes change silently with vendor updates.

Does temperature 0 guarantee 100% identical answers?

In theory yes; in practice roughly 95-99% of answers are identical. Floating-point computations on different GPU generations can produce minimal differences in logits, which occasionally lead to a different top-token choice. For short answers (1-50 tokens) typically 99%+ identical. For long generations (1,000+ tokens) a small early deviation can amplify over the course of the text into noticeable differences. For hard reproducibility: pin the model version, monitor vendor updates, optionally self-host a local model (see self-hosted-vs-cloud-llm).
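
Why length matters can be illustrated with simple arithmetic. The sketch assumes, purely for illustration, that each token independently matches the reference run with a fixed probability; the value 0.9999 is a made-up figure, not a measurement.

```python
def probability_all_identical(per_token_agreement, n_tokens):
    """Chance that every one of n_tokens matches, assuming independent tokens."""
    return per_token_agreement ** n_tokens

# Hypothetical per-token agreement of 99.99%.
for n_tokens in (50, 1000):
    print(n_tokens, round(probability_all_identical(0.9999, n_tokens), 4))
```

Under this assumption a 50-token answer is identical in about 99.5% of runs, while a 1,000-token answer is identical in only about 90%. Real generations are not independent token by token, since one early deviation changes everything after it, so the figures indicate the trend rather than predict it.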

Should I adjust temperature OR top-p?

Anthropic recommendation and May-2026 consensus: usually only tune one parameter actively. Start with temperature; leave top-p at default. If you want to concentrate answers, lower temperature. If you want to make answers more creative, raise temperature. Top-p is really only needed when you want to cap the long tail at high temperature – e.g. temperature 1.2 with top-p 0.9 for creative texts without extreme outliers.

What about reasoning models like o1 and Claude Thinking?

Reasoning models (OpenAI o1/o3, Claude Sonnet Thinking, Gemini 2.5 Pro Thinking, DeepSeek R1) run internal reasoning steps before producing the final answer. Vendor recommendations May 2026: o1/o3 from OpenAI do NOT accept temperature/top-p (fixed at 1.0) – other values are rejected or ignored. Claude with extended thinking is similarly restricted: Anthropic's documentation has described thinking as incompatible with temperature changes, so leave temperature at its default there unless the current reference says otherwise. DeepSeek R1 recommends temperature 0.5-0.7. With reasoning models do NOT set temperature 0 – internal reasoning steps benefit from variation. Check each vendor's documentation.

Related topics

  • PROMPTING · AI CONCEPT – Prompt engineering: foundations, patterns, anti-patterns
  • HALLUCINATIONS · AI CONCEPT – Limiting hallucinations: five countermeasures against fabricated AI answers
  • TOKEN · AI CONCEPT – What is a token? Tokenisers, cost, DE-vs-EN May 2026
  • SYSTEM PROMPT · AI CONCEPT – What is a system prompt? Role, security, best practices May 2026
  • RAG · AI CONCEPT – Retrieval-Augmented Generation (RAG): how AI answers from your own documents
  • ROUTING · AI CONCEPT – Multi-LLM routing: which model when, for how much
  • AUDIT TRAIL · AI CONCEPT – AI audit trail design: what to log so an AI answer stays audit-ready

Sources

  1. Holtzman et al. – The Curious Case of Neural Text Degeneration (Nucleus Sampling, arXiv:1904.09751) · 2019-04
  2. OpenAI – Reasoning Models Guide (Temperature Behavior in o1/o3) · 2026-04
  3. Anthropic – Claude Sampling Parameters Reference · 2026-05
  4. Google AI – Gemini Generation Parameters · 2026-05
  5. DeepSeek – Reasoning Model V4/R1 Sampling Recommendations · 2026-04
