fairlane.systems

RLHF · AI CONCEPT

What is RLHF? Reinforcement Learning from Human Feedback explained May 2026

RLHF turns a raw base model into a helpful assistant. Three phases: SFT, reward model, PPO. Plus comparison with DPO, Constitutional AI and RLAIF May 2026.

Researched & fact-checked by: · As of: 2026-05

What is RLHF?

RLHF, Reinforcement Learning from Human Feedback, is the training phase that turns a raw language model after pretraining (see was-ist-pretraining) into a helpful, polite, safe assistant. Before RLHF the model writes any text continuation – string of curses, self-harm guide, conspiracy theory. After RLHF it refuses harmful requests, answers helpful questions clearly and follows instructions.

The procedure was popularised by OpenAI with InstructGPT (January 2022) and ChatGPT (November 2022). Since early 2023 RLHF is standard with all top models: Claude, GPT, Gemini, Llama Instruct, Mistral Instruct. As of May 2026 the family has diversified – DPO (Direct Preference Optimization) has partially replaced the classic PPO step, Constitutional AI (Anthropic) and RLAIF (Reinforcement Learning from AI Feedback) scale the procedure without human raters.

The core idea stays: humans (or another model) say "answer A is better than answer B" to model outputs. From these preferences the model learns what "good answer" means – not via rigid rules but via statistical reward. That is a different learning signal than pretraining: not "predict the next token" but "produce the answer humans prefer". This second stage shapes the model character.

As of May 2026 RLHF is an umbrella term for a family of related procedures: classic PPO RLHF (OpenAI, Llama 2), DPO (Stanford research 2023, May 2026 standard at Mistral and open-source), Constitutional AI with RLAIF (Anthropic Claude family), and hybrid procedures with online learning and reward-model refresh. For SME users the practical rule: every serious business model is RLHF-trained, every base model is not.

Why RLHF matters for SMEs

Even without ever training a model yourself, RLHF touches your fiduciary or SME work directly. Four consequences.

First: usable answers instead of raw text echo. A base model often produces, after the question "What is the VAT rate for accounting services?", an answer that contains the word "accounting" – but no clear "8.1% standard rate". After RLHF the model answers the question structured, with context and disclaimer when unsure. This answer form is not self-evident – it is explicitly trained in. Whoever uses open-weight models must pick "Instruct" or "Chat" variants, not raw base models.

Second: refusal of harmful requests. An RLHF-trained model refuses requests for weapons construction, self-harm, criminal instructions. As of May 2026 this safeguard is active across all commercial models. For SMEs practically: no compliance risk from accidentally generated bomb-making instructions – but also potential frustration ("model is too cautious"). Whoever wants to adjust refusal behaviour can do so via system prompt or custom fine-tuning – within the vendor acceptable-use policy.

Third: model character and tone. Claude tends to be elaborate-careful, GPT tends to be terse-direct, Gemini tends to be formal, Llama Instruct tends to be pragmatic. These stylistic differences are no accident – they come from the RLHF training set and rater preferences. Whoever needs a specific tone for fiduciary client answers checks style match via a test suite (10-30 typical client questions, comparing several models).

Fourth: instruction following. RLHF-trained models follow instructions in the system prompt – "Answer only in German", "Answer in at most 200 tokens", "Use Sie form" – reliably. Base models do not. As of May 2026 instruction following is the strongest differentiator: top models (the current top Claude model, the current top GPT model) over 95% compliance in complex multi-instruction tests, mid-field (Gemini, Mistral) at 80-90%, weaker models at 60-75%.

Strategic consequence. RLHF is the phase in which the model turns from "text generator" to "employee tool". Whoever looks only at pretraining data (data mix, cutoff) in the model selection process overlooks RLHF quality – which is at least as important for business use. Fiduciary tests should run 20-30 typical inquiries across several models and rate: clarity, refusal behaviour, instruction following, hallucination rate.

RLHF in three phases

Classic RLHF from the InstructGPT paper (OpenAI, 2022) breaks into three sequential phases.

Phase 1: Supervised Fine-Tuning (SFT). Human demonstrations are collected. Annotators receive questions or tasks and write model-good answers. Typical data volume: 10,000-100,000 question-answer pairs. The base model is further trained on these pairs (classical supervised learning) until it learns the format "question → structured answer". After SFT the model is a "raw assistant" – can follow instructions but has no fine preference steering yet.

Phase 2: reward model training. Annotators receive 2-4 model answers each to the same question and rank them by preference (best, second-best, worst). From these preferences a reward model is trained – a second model that can predict a numerical reward for ANY answer. "Human raters would on average be satisfied with this answer at score 7.3/10." Data volume typically 50,000-500,000 comparison pairs. The reward model is NOT a final answer model – it is an auxiliary for phase 3.

Phase 3: reinforcement learning (PPO). The SFT model from phase 1 is further trained with Proximal Policy Optimization (PPO, a reinforcement-learning algorithm). Loop: model answers a question, reward model gives a score, model parameters are adjusted so future answers receive higher scores. A KL penalty term is used to prevent drifting too far from the SFT state (avoiding "reward hacking"). Phase 3 is the most expensive – it needs the reward model, the training model, multiple GPU copies for parallel sampling, typically 7-30 days wallclock on 100-1,000 GPUs. Estimated cost for a 70B-model PPO phase: USD 1-10 million.

Variant DPO (Direct Preference Optimization). May 2026 standard at many open-source models (Mistral, Qwen, Llama OS derivatives). DPO skips the explicit reward model and the PPO sampling. Instead a loss is computed directly from the preference pairs that trains the model towards the preferred answer and away from the dispreferred. Advantages: simpler to implement, 3-10x cheaper, more stable training dynamics. Disadvantages: less room for complex reward structures. Anthropic and OpenAI still use hybrid procedures (DPO-like building blocks plus classic PPO), DeepSeek and Mistral mostly DPO in May 2026.

Variant Constitutional AI (Anthropic, 2022 paper, May 2026 in the current top Claude model). Instead of human annotation of preferences, ratings are produced by an AI rater – based on a "constitution" of 30-100 explicit principles ("answers are helpful, harmless, honest"). This saves annotation cost and makes values transparently documentable. Variant RLAIF (Reinforcement Learning from AI Feedback) is the generic form: an AI rater replaces human raters completely or partially. As of May 2026 RLAIF is standard for scaling steps; full-human RLHF is used for "final polish".

Practical cost May 2026. Complete RLHF phase for a top model: USD 5-30 million (reward-model training, PPO compute, annotator cost). Annotator cost alone: USD 2-10 per comparison pair with qualified annotators (multilingual, domain-skilled), totalling USD 1-5 million for 500,000 pairs. Constitutional AI / RLAIF cuts these costs by factor 5-10.

Understand RLHF in 5 steps

  1. 01Understand the three phases: SFT (human demonstrations), reward model (preference annotation), PPO or DPO (RL training).
  2. 02Distinguish base model (raw text continuator) and instruct/chat model (RLHF-trained, assistance-capable) – for business always the RLHF variant.
  3. 03Check per model the instruction following with 10-20 real fiduciary requests – RLHF quality is not automatically pretraining quality.
  4. 04Understand the limits: RLHF shapes style and refusal, not factual knowledge. Facts come from RAG or fine-tuning on domain-specific material.
  5. 05Make the model choice with RLHF awareness: Anthropic Claude (Constitutional AI), OpenAI GPT (classic PPO + DPO mix), Mistral/Llama (DPO), DeepSeek (DPO mix).

When RLHF knowledge becomes practical

Three concrete occasions in which RLHF understanding tips the scale.

Occasion 1: open-weight model selection. When you self-host Llama 4, Mistral or DeepSeek, you MUST pick the Instruct or Chat variant, not the base model. Hugging Face lists both; the base model is typically named "llama-4-maverick-base", the RLHF-trained one "llama-4-maverick-instruct" or "-chat". Whoever accidentally loads the base model gets unusable text continuation instead of Q&A behaviour.

Occasion 2: fine-tuning your own model. When you fine-tune on an open-weight model (see wie-trainiert-man-eigenes-modell), you typically start from the Instruct version. That saves your own RLHF phase. But careful: aggressive fine-tuning can overwrite RLHF behaviour – the model becomes "rawer" and can lose refusal properties. In sensitive sectors (fiduciary, law, medicine) a compliance check should follow fine-tuning.

Occasion 3: system prompt engineering. Because RLHF trains instruction following, the system prompt is the most important steering tool. Top models follow 5-10 explicit instructions (language, tone, format, prohibition, refusal clause) reliably. Weaker models "forget" instructions after 200-500 tokens of conversation. Whoever writes system prompts tests on the target model and adapts count/complexity to RLHF quality.

Occasion 4: rating refusal quality. As of May 2026 models differ strongly in refusal behaviour. Too-strict refusals (Claude tendentially, the current top GPT model with "safety mode") block also legitimate requests ("How do I write a dunning letter with legally correct threats?"). Too-lax refusals (some open-source Mistral derivatives) create compliance risk. Fiduciary offices check via a test set: compare 10-20 borderline requests across 3-4 models, clarify whether behaviour fits the sector.

Occasion 5: hallucination understanding. RLHF reduces hallucination – but not to zero. The reward model rewards self-confident, fluent answers – which can lead to the model answering self-confidently even when it knows nothing. Constitutional AI with explicit "honesty" principles (Anthropic) and refusal training (OpenAI) cut that. For evidence-bound applications RAG (see retrieval-augmented-generation) remains the more important tool than RLHF alone.

When RLHF does not solve the problem

Three cases in which RLHF is not the right entry point.

First: factually wrong answers. RLHF shapes style, tone and instruction following – but it gives the model no new factual knowledge. If the model after pretraining believes the standard VAT rate is 7%, RLHF cannot fix that fact. Factual fidelity improves through RAG (source binding) and fine-tuning on domain-specific material, not through more RLHF preference data.

Second: hard deterministic rules. Whoever needs "if request is tax advisory, then refuse due to StBVG" as a HARD rule (legal duty) cannot rely on RLHF persuasion. The model typically learns it in 95-98% of cases but refuses incorrectly in 2-5%. For hard rules: output filter, refusal wrapper or external classifier before the model call.

Third: specific writing style. RLHF trains on broad rater preference – that is an average. If you want a specific style (fiduciary house style, law firm memo style), fine-tuning on 200-2,000 style examples is more effective than a long system-prompt try. RLHF gives you "helpful-careful", not "Swiss law-firm dryness".

Trap "RLHF makes the model safe". RLHF reduces obvious harms (bomb, self-harm, crime instruction), but as of May 2026 sophisticated jailbreaks (multi-step manipulation, role-play tricks) remain feasible. Whoever builds compliance-critical applications cannot rely on RLHF refusal alone – they need output filter, audit log, escalation to human review on suspicious input.

Trap "we train our own RLHF". Full RLHF is USD 5-30 million effort. Even DPO (cheaper) costs USD 200,000-2 million for a meaningful run. For SMEs RLHF self-training is not economical in May 2026. Instead: take an existing RLHF model and adapt with fine-tuning (LoRA, see wie-trainiert-man-eigenes-modell) for your tasks.

Trade-offs

STRENGTHS

  • Turns a raw language model into a helpful assistant
  • Implements refusal behaviour for harmful requests
  • Improves instruction following dramatically (from 30% to 95%+)
  • As of May 2026 standard in all commercial models

WEAKNESSES

  • Very expensive: USD 5-30 million for frontier models, USD 0.5-3M even for DPO
  • Brings no new factual knowledge – only style and behaviour
  • Reward hacking possible – model becomes confident in hallucinations
  • Annotator value corpus or constitution subtly shapes model behaviour

FAQ

What is the difference between RLHF and DPO?

Classic RLHF (PPO) first trains a reward model and uses it for reinforcement learning. DPO (Direct Preference Optimization, Stanford 2023) skips the reward model and trains the model directly on preference pairs. DPO advantages: 3-10x cheaper, more stable, simpler to implement. Disadvantages: less flexibility for complex reward structures. As of May 2026 the standard at open source (Mistral, Llama, DeepSeek), while Anthropic and OpenAI use hybrid procedures.

What is Constitutional AI?

Anthropic procedure (2022 paper, May 2026 in the current top Claude model). Instead of human annotation of preferences, an AI rater is used with a "constitution" (30-100 explicit principles like "helpful, harmless, honest"). Advantages: scales without annotator cost, values are transparently documentable, very consistent. Practical consequence: Claude family has characteristically clear refusal rules and is rather cautious – explicitly derived from the constitution principles.

Can I "turn off" RLHF?

Not via API parameter. RLHF is trained into the model weights and not deactivable at runtime. Whoever needs "unhinged" output must use a base model (from Hugging Face, "*-base" variants of Llama 4, Mistral, Qwen). As of May 2026 commercial API models (Claude, GPT, Gemini) are available only in RLHF form. Whoever wants to adjust refusal behaviour can do so via system-prompt engineering or custom fine-tuning (within the vendor AUP).

Is RLAIF worth it for SMEs?

No, for SMEs even RLAIF is too expensive. RLAIF (Reinforcement Learning from AI Feedback) cuts RLHF cost by factor 5-10 – but still USD 500,000-3 million per meaningful run. As of May 2026 the SME strategy is: take an existing RLHF model (Claude, GPT, Mistral Instruct) and adapt with fine-tuning (LoRA, USD 5-50k) to your own domain. RLAIF is relevant for vendors, not for users.

Related topics

LLM BASICS · AI CONCEPTHow does an LLM work? Autocomplete on steroids, explained for SMEs May 2026PRETRAINING · AI CONCEPTWhat is pretraining? How an LLM learns its base capability May 2026OWN MODEL · AI CONCEPTHow to train your own AI model? Fine-tuning, LoRA, QLoRA May 2026FINE-TUNING vs RAG · AI CONCEPTFine-Tuning vs RAG: which approach fits when? Status May 2026HALLUCINATIONS · AI CONCEPTLimiting hallucinations: five countermeasures against fabricated AI answersSYSTEM PROMPT · AI CONCEPTWhat is a system prompt? Role, security, best practices May 2026ANTHROPIC · LLM PROVIDERAnthropic Claude from a Swiss fiduciary perspective: residency, pricing, compliance

Sources

  1. Ouyang et al. – Training Language Models to Follow Instructions with Human Feedback (InstructGPT, arXiv:2203.02155) · 2022-03
  2. Bai et al. – Constitutional AI: Harmlessness from AI Feedback (Anthropic, arXiv:2212.08073) · 2022-12
  3. Rafailov et al. – Direct Preference Optimization (DPO, arXiv:2305.18290) · 2023-05
  4. Anthropic – the current top Claude model System Card and Alignment Disclosure · 2026-05
  5. Lee et al. – RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (arXiv:2309.00267) · 2023-09

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call