fairlane.systems

HALLUCINATIONS · AI CONCEPT

Limiting hallucinations: five countermeasures against fabricated AI answers

Why language models produce plausibly wrong answers, which five remedies reduce them, and how to make hallucinations measurable.

As of: 2026-05

What is a hallucination?

A hallucination is an answer from a language model that sounds statistically plausible but is factually wrong. The model is not "lying" – it is a probability generator for next tokens, not a knowledge index. When the training material on a question is thin, the model fills the gap with whatever would statistically "fit" the question. The result: invented court rulings, wrong statute numbers, non-existent studies, wrong personal names – linguistically clean, factually invented.

Research distinguishes three types. Factual hallucination: the statement is objectively wrong (e.g. the model cites a ruling "BGE 4A_123/2024" that does not exist). Context hallucination: the statement contradicts source texts supplied to the model (the RAG hit says A, the model answers B). Logical hallucination: the statement contradicts the model's own earlier sentences in the same answer.

As of May 2026, the problem is not solved, but it is easier to measure. On grounded summarisation tasks, top models like GPT-4o hallucinate at around 1.5%, Claude Sonnet at about 4%. On legal speciality tasks, rates rise dramatically – a Stanford study found over 75% hallucination on US case-law research with non-specialised models. Even the strongest configurations (Claude Opus or the current top GPT model with web search) still show an error rate of around 30% on the HalluHard benchmark.

Why it matters

A hallucination in a marketing email is embarrassing. A hallucination in legal advice, a tax filing, or a medical recommendation creates liability. Swiss attorneys have faced disciplinary proceedings since 2024 for citing AI-invented Federal Supreme Court rulings. Fiduciary offices that accept AI booking suggestions unchecked risk that records which must comply with GeBüV end up resting on invented receipts.

Regulation amplifies the issue. For automated individual decisions under Art. 21 revDSG, you must be able to disclose the logic – if the logic is "the model just said so", your position is weak. Under the EU AI Act (applicable to Swiss providers selling into the EU), high-risk applications (justice, personnel selection, credit assessment) without documented hallucination tests are not permitted.

The trust argument: employees who have been misled twice by an AI either stop using the tool or, worse, keep using it without checking because "it is mostly right". Both outcomes are bad. Anyone who limits hallucinations technically and measures them transparently builds durable trust.

Five effective countermeasures

There is no single fix – the combination works. In every production pipeline we use at least three of the following five remedies.

1. RAG with mandatory citations ("citation-required"). Instead of letting the model answer from training memory, you supply relevant passages from your document library with each query. The system prompt enforces citations: "Every factual sentence must end with [source-id, page]. Sentences without source are not allowed." A post-processing layer checks that every cited source actually appeared in the retrieval hits – if not, the answer is rejected or routed to a human for review. Effect: the hallucination rate on legal speciality tasks drops from over 75% to under 10%, in some setups under 3%.
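The post-processing check described in remedy 1 could look roughly like the following sketch. It assumes the citation format from the sample system prompt, `[source-id, page]`, and uses a deliberately naive sentence splitter. The names `check_citations` and `retrieved` are hypothetical and not tied to any particular RAG framework.

```python
import re

# Assumed citation format, taken from the sample system prompt: [source-id, page]
CITATION = re.compile(r"\[([^\[\],]+),\s*(\d+)\]")


def check_citations(answer: str, retrieved: set[tuple[str, int]]) -> list[str]:
    """Return a list of problems; an empty list means the answer passes.

    `retrieved` holds the (source-id, page) pairs that the retrieval step
    actually returned for this query (hypothetical data shape).
    """
    problems = []
    # Naive split after ., ! or ? followed by whitespace. Abbreviations such as
    # "Art. 21" would be split wrongly; a real pipeline needs a better splitter.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    for sentence in sentences:
        cites = [(sid.strip(), int(page)) for sid, page in CITATION.findall(sentence)]
        if not cites:
            problems.append(f"no citation: {sentence}")
        for cite in cites:
            if cite not in retrieved:
                problems.append(f"source not retrieved: {cite}")
    return problems


hits = {("contract-7", 3)}
print(check_citations("The notice period is 30 days [contract-7, 3].", hits))
print(check_citations("The fee is CHF 500 [contract-9, 1].", hits))
```

In this sketch an answer ships only if the returned list is empty; anything else is rejected or routed to a human.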

2. Refusal instruction in the system prompt. "If the answer does not clearly follow from the supplied material, say 'not in the material' and propose which source should be added." Without this instruction the model is trained to "always answer" – a root cause of many hallucinations. With clear permission to refuse, fabrication rates fall by half. Sample wording: "If you are unsure or the information is missing, say so explicitly. Speculation is forbidden."
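As a minimal sketch, the refusal instruction can be kept as a constant and the agreed refusal wording detected in the output, so that refusals can be counted and routed. The marker string comes from the sample wording above; `is_refusal` and `refusal_rate` are hypothetical helper names, and simple substring matching is an assumption that only works if the model sticks to the agreed phrase.

```python
# Refusal wording taken from the sample instruction in the text.
REFUSAL_MARKER = "not in the material"

SYSTEM_PROMPT = (
    "Answer only from the supplied material. "
    "If the answer does not clearly follow from the supplied material, "
    f"say '{REFUSAL_MARKER}' and propose which source should be added. "
    "If you are unsure or the information is missing, say so explicitly. "
    "Speculation is forbidden."
)


def is_refusal(answer: str) -> bool:
    """True if the answer contains the agreed refusal phrase (case-insensitive)."""
    return REFUSAL_MARKER in answer.lower()


def refusal_rate(answers: list[str]) -> float:
    """Share of answers that are refusals; 0.0 for an empty list."""
    if not answers:
        return 0.0
    return sum(is_refusal(a) for a in answers) / len(answers)


sample = ["Not in the material. Please add the 2023 lease.", "30 days [contract-7, 3]."]
print(refusal_rate(sample))
```

Running `refusal_rate` over test questions whose answers are known to be in the material gives a rough signal for a model that has become too defensive.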

3. Temperature below 0.3 for factuality. Temperature controls token-choice randomness. For creative tasks (marketing copy), 0.7 to 0.9 makes sense. For factual answers (accounting, law, medicine), use 0.0 to 0.3. Setting temperature to 0 makes answers largely deterministic – same prompt, same model, and in most cases the same answer (hosted APIs can still show small run-to-run variation). This makes bugs far easier to reproduce and reduces "the AI answered differently today" effects.
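One way to keep these settings consistent is a single lookup per task type, sketched below. The task labels, the concrete values inside the stated ranges, and the generic `request_params` dictionary are illustrative assumptions; the dictionary is not the request format of any specific provider.

```python
# Values chosen from the ranges named in the text; the labels are hypothetical.
TEMPERATURE_BY_TASK = {
    "creative": 0.8,  # marketing copy: 0.7 to 0.9
    "factual": 0.0,   # accounting, law, medicine: 0.0 to 0.3
}


def temperature_for(task_type: str) -> float:
    """Unknown task types fall back to the strict factual setting."""
    return TEMPERATURE_BY_TASK.get(task_type, 0.0)


def request_params(task_type: str, prompt: str) -> dict:
    """Build provider-neutral request parameters (illustrative shape only)."""
    return {"prompt": prompt, "temperature": temperature_for(task_type)}


print(request_params("factual", "Which VAT rate applies to this receipt?"))
```

Defaulting unknown task types to the factual setting is a design choice in this sketch: a misclassified creative task becomes slightly dull, while a misclassified factual task never runs hot.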

4. Cross-check between two models. On critical questions, run two different models in parallel (e.g. Claude Opus and the current top GPT model). A third instance compares: if they match, the answer passes; if they diverge, a human reviews. This cross-model consistency check is robust because two independently trained models rarely agree on the same fabricated fact. It doubles token costs but halves practical hallucination damage.
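The routing logic of the cross-check can be sketched as follows. The two models are passed in as plain callables so the example runs without any provider SDK, and a normalised string comparison stands in for the "third instance"; in practice that comparison would more likely be a smaller model call. All function names here are hypothetical.

```python
import re
from typing import Callable


def normalize(answer: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial differences do not count."""
    return re.sub(r"[^a-z0-9]+", " ", answer.lower()).strip()


def cross_check(
    question: str,
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
) -> dict:
    """Ask both models; pass on agreement, route to a human on divergence."""
    a = model_a(question)
    b = model_b(question)
    if normalize(a) == normalize(b):
        return {"status": "pass", "answer": a}
    return {"status": "human_review", "candidates": [a, b]}


# Stand-in models for illustration only.
result = cross_check("Notice period?", lambda q: "30 days.", lambda q: "30 Days")
print(result["status"])
```

Exact matching after normalisation is strict and will send many harmless paraphrases to review; that is the safe direction for liability-relevant workflows, but it raises the human workload.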

5. Output validation against schemas. If the answer must be structured (JSON, IBAN, date, case number), validate against a schema. An invented IBAN fails the check digit, an invented case-law reference fails a format regex, an invented statute fails a lookup against the real legal index. This blocks an entire class of hallucinations.
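For the IBAN case, the check-digit test is the standard mod-97 procedure: move the first four characters to the end, replace letters with numbers (A = 10 through Z = 35), and require a remainder of 1. The sketch below adds a simple ISO date check as a second structured field. It validates form only; it does not check country-specific IBAN lengths or whether the account exists.

```python
import re
from datetime import date


def iban_is_valid(iban: str) -> bool:
    """Check IBAN structure and the mod-97 check digits."""
    s = iban.replace(" ", "").upper()
    # Two-letter country code, two check digits, then 11 to 30 alphanumerics.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the check requires.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def is_iso_date(value: str) -> bool:
    """True if the value parses as a real calendar date in ISO format."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


print(iban_is_valid("CH93 0076 2011 6238 5295 7"))  # widely published example IBAN
print(iban_is_valid("CH93 0076 2011 6238 5295 8"))  # last digit altered
print(is_iso_date("2026-02-30"))                    # no such day
```

A model that invents an IBAN by pattern-matching will usually get the check digits wrong, which is why this cheap test catches that class of fabrication.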

Hallucination limiting in 7 steps

  1. Risk classification: rate each AI workflow by damage potential (low / medium / high / liability-relevant).
  2. Build a test set: 50 to 200 questions with known correct answers, covering the most common query patterns.
  3. Refusal instruction in the system prompt: explicitly "If unsure, say so. Speculation forbidden."
  4. Set temperature to 0 to 0.2 for all factuality pipelines.
  5. Set up RAG with citation-required + post-processing validation of sources.
  6. Output schemas for structured fields (IBAN, date, case number, tax number): validate in the pipeline.
  7. For liability-relevant workflows: cross-check with a second model and human-in-the-loop on divergence.

Which remedy when

The remedies are not equally expensive. We recommend this order: first refusal instruction (effort: 30 minutes, costs nothing), then temperature reduction (5 minutes), then output validation for structured fields (a few hours per schema), then RAG-with-citation (5 to 10 days setup), finally cross-check (pipeline extension plus double model cost).

For research tools in law firms and tax offices, all five remedies are mandatory. For accounting classification, refusal + temperature + output validation is enough. For marketing copy, temperature control is enough – hallucination here is often a feature, not a bug. For medical software, additional regulatory requirements (MDR, MepV) apply on top of hallucination reduction.
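The two preceding paragraphs can be combined into a small lookup: which remedies a workflow needs, returned in the recommended rollout order. The workflow keys and remedy labels below are hypothetical identifiers chosen for this sketch.

```python
# Recommended rollout order from the text: cheapest remedy first.
ROLLOUT_ORDER = (
    "refusal",
    "temperature",
    "schema_validation",
    "rag_citations",
    "cross_check",
)

# Remedy sets per workflow type as described in the text (keys are illustrative).
REMEDIES = {
    "legal_research": set(ROLLOUT_ORDER),
    "accounting_classification": {"refusal", "temperature", "schema_validation"},
    "marketing_copy": {"temperature"},
}


def rollout_plan(workflow: str) -> list[str]:
    """Return the remedies for a workflow, ordered by rollout priority."""
    needed = REMEDIES[workflow]  # raises KeyError for unclassified workflows
    return [remedy for remedy in ROLLOUT_ORDER if remedy in needed]


print(rollout_plan("accounting_classification"))
```

Raising an error for unclassified workflows is deliberate here: a workflow without a risk classification should not silently receive a default remedy set.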

Always measure. A typical pre-production test set has 50 to 200 questions with known correct answers. You let the model answer and compare automatically – how many answers contain invented sources, how many contradict RAG hits, how many deviate from the target output? If the rate is above 5%, the pipeline does not go live.
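A go-live gate of this kind can be sketched in a few lines. The example below compares answers by normalised exact match, which is a simplifying assumption; real test sets usually need separate checks for invented sources, contradictions with RAG hits, and deviations from the target output. `Case`, `hallucination_rate`, and `may_go_live` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    question: str
    expected: str


def hallucination_rate(cases: list[Case], answer_fn: Callable[[str], str]) -> float:
    """Share of test cases where the answer deviates from the expected answer."""
    if not cases:
        raise ValueError("test set is empty")
    wrong = sum(
        1
        for case in cases
        if answer_fn(case.question).strip().lower() != case.expected.strip().lower()
    )
    return wrong / len(cases)


def may_go_live(rate: float, threshold: float = 0.05) -> bool:
    """The 5% threshold comes from the text: above it, the pipeline does not ship."""
    return rate <= threshold


# Stand-in pipeline for illustration: answers every question with "8.1%".
cases = [Case("VAT standard rate?", "8.1%"), Case("VAT reduced rate?", "2.6%")]
rate = hallucination_rate(cases, lambda q: "8.1%")
print(rate, may_go_live(rate))
```

Note that this simple comparison also counts a correct refusal as a deviation, so a production version would track refusals separately.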

When tolerating hallucinations is okay

Not every AI use needs the full shield. For brainstorming, creative writing, headline generation, code sketches, and idea sorting, a degree of "hallucination tolerance" is part of the value. Slogan generation deliberately aims for text that exists nowhere else. Describing an image concept benefits from free association.

Rule of thumb: if the output goes directly to an external recipient (client, authority, patient, customer) without human review in between, you need the full hallucination shield. If the output serves as input to a human who edits it anyway, you can save on remedies.

Danger zone: hallucination remedies only work when enforced strictly. A RAG pipeline that only "warns" instead of "blocking" on the citation check still delivers fabricated answers in practice – employees routinely ignore warnings after 14 days. Keep the remedies hard: what fails the schema check does not ship.

Trade-offs

STRENGTHS

  • RAG with citation-required cuts hallucination rate in law/fiduciary to below 10%
  • Temperature and refusal instruction cost nothing and ship within hours
  • Schema validation blocks a whole class of errors (invented IBANs, wrong date formats)
  • Cross-check halves damage events in liability-relevant workflows
  • Measurability: pre-production test sets give hard numbers instead of gut feel

WEAKNESSES

  • Cross-check doubles model costs – worthwhile only for critical pipelines
  • Refusal instruction can make the model "too defensive" – saying "not in the material" when the answer actually was there
  • RAG citation check needs discipline in indexing – sloppy chunks lead to sloppy sources
  • Full elimination is not possible as of May 2026 – residual risk remains
  • Test-set maintenance is ongoing – every model update needs revalidation

FAQ

Does Claude hallucinate less than GPT?

Not in general. On grounded summarisation tasks, GPT-4o has held the lower rate on the Vectara leaderboard (around 1.5%); Claude Sonnet sat around 4%, Claude Opus slightly higher. On legal speciality tasks, the picture flips: in our own tests on Swiss legal questions, Claude Opus shows fewer invented case-law references than GPT-4o-mini. Rule of thumb May 2026: for law and fiduciary use Claude Opus with RAG; for general matters the current top GPT model or Claude Sonnet are comparable.

How do I detect a hallucination automatically?

Combine three techniques: (a) citation check – verify every source the model cites against the retrieval hits; (b) self-consistency sampling – ask the same model 3 times at temperature 0.7; if the answers diverge, distrust is warranted; (c) confidence scoring – modern models can be asked to return a confidence value per statement, although self-reported confidence is only a rough signal. No single technique is perfect; combined they catch 80-90% of hallucinations before delivery.
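Technique (b) can be sketched as a majority vote over the sampled answers. The function below takes the samples as plain strings, so it makes no assumption about how they were generated; the whitespace and case normalisation and the default rule "any divergence means distrust" are choices made for this sketch.

```python
from collections import Counter


def self_consistency(samples: list[str], min_agreement: float = 1.0) -> dict:
    """Compare repeated answers from the same model to the same question.

    With the default min_agreement of 1.0, a single diverging sample
    is enough to mark the answer as not trusted.
    """
    if not samples:
        raise ValueError("need at least one sample")
    normalized = [" ".join(s.lower().split()) for s in samples]
    top_answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return {
        "answer": top_answer,
        "agreement": agreement,
        "trusted": agreement >= min_agreement,
    }


# Three samples at temperature 0.7, as suggested in the text (hard-coded here).
print(self_consistency(["30 days", "30 Days", "60 days"]))
```

Free-text answers rarely match word for word, so this works best on short, structured outputs such as numbers, dates, or case references.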

What does the cross-check method cost?

You double token costs and need a third, smaller model call for the comparison (typically GPT-4o-mini or Haiku, around USD 0.50 per 1M tokens). For a pipeline with 1000 critical queries/month at 5000 tokens each, that is roughly USD 50/month extra – peanuts against the risk of a wrong legal answer. We recommend cross-check only for liability-relevant workflows, not for marketing.
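The arithmetic behind that estimate can be made explicit. The query volume and the comparison price come from the text; the price of the second top model is not stated there, so the USD 10 per 1M tokens used below is purely an assumption chosen to show how a figure near USD 50 can arise. The sketch also assumes the comparison call processes the same token volume as the second model.

```python
def cross_check_extra_cost(
    queries_per_month: int,
    tokens_per_query: int,
    second_model_usd_per_m: float,
    comparer_usd_per_m: float = 0.50,  # figure from the text
) -> float:
    """Extra monthly cost in USD: second model run plus comparison call."""
    tokens_in_millions = queries_per_month * tokens_per_query / 1_000_000
    second_model = tokens_in_millions * second_model_usd_per_m
    comparer = tokens_in_millions * comparer_usd_per_m
    return second_model + comparer


# 1000 queries x 5000 tokens = 5M tokens per month.
# second_model_usd_per_m=10.0 is an assumed blended price, not a quoted one.
print(cross_check_extra_cost(1000, 5000, second_model_usd_per_m=10.0))
```

Under these assumptions the second model accounts for almost all of the extra cost; the comparison call adds only a few dollars.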

Does fine-tuning help against hallucinations?

Only to a limited extent. Fine-tuning on domain data reduces hallucinations in your own domain by 20 to 40% but often degrades performance elsewhere ("catastrophic forgetting"). RAG plus the five remedies above is in most cases cheaper and more reliable than fine-tuning. Fine-tuning pays off if you need a very specific output format that a generic model does not reliably produce (e.g. a particular legal opinion format).

Related topics

  • RAG · AI CONCEPT – Retrieval-Augmented Generation (RAG): how AI answers from your own documents
  • PROMPTING · AI CONCEPT – Prompt engineering: foundations, patterns, anti-patterns
  • ROUTING · AI CONCEPT – Multi-LLM routing: which model when, for how much
  • OPENAI · LLM PROVIDER – OpenAI GPT models from a Swiss fiduciary perspective: residency, pricing, compliance
  • MISTRAL · LLM PROVIDER – Mistral AI from a Swiss fiduciary perspective: EU residency, pricing, sovereignty

Sources

  1. Lewis et al. – Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Meta AI) · 2020-05
  2. Vectara – Hallucination Leaderboard (grounded summarisation benchmark) · 2026-05
  3. HalluHard – Multi-turn Hallucination Benchmark (legal, medical, research, coding) · 2026-04
  4. Anthropic – Claude System Prompts and Refusal Patterns (guide) · 2026-03
  5. OpenAI – Hallucination Evaluation and Mitigation · 2026-02
