HALLUCINATION MEASUREMENT · AI CONCEPT

Detecting and measuring hallucinations: metrics, benchmarks and self-consistency

How to measure hallucinations in AI answers reproducibly: TruthfulQA, HaluEval, FActScore, self-consistency and citation grounding checks.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is hallucination measurement?

Hallucination measurement is the quantitative evaluation of how often a language model produces factually wrong, invented or context-contradicting statements. The question is not whether a single answer is correct, but how high the rate is: how many answers out of 1,000 contain at least one invented fact, a wrong citation or a contradiction with the supplied context?
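
As a minimal sketch of this rate, assume every answer has already been reviewed and annotated with error counts. The `AnnotatedAnswer` structure and its field names are illustrative assumptions, not part of any benchmark or library.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedAnswer:
    """One reviewed model answer with error counts (hypothetical schema)."""
    invented_facts: int = 0
    wrong_citations: int = 0
    context_contradictions: int = 0

    def is_hallucinated(self) -> bool:
        # An answer counts once, no matter how many errors it contains.
        total = self.invented_facts + self.wrong_citations + self.context_contradictions
        return total > 0


def hallucination_rate(answers: list[AnnotatedAnswer]) -> float:
    """Share of answers that contain at least one hallucination."""
    if not answers:
        raise ValueError("need at least one annotated answer")
    return sum(a.is_hallucinated() for a in answers) / len(answers)


sample = [
    AnnotatedAnswer(),
    AnnotatedAnswer(invented_facts=2),
    AnnotatedAnswer(wrong_citations=1),
    AnnotatedAnswer(),
]
print(f"{hallucination_rate(sample):.0%}")  # 2 of the 4 answers are affected
```

Counting per answer rather than per error keeps the number comparable across models that write answers of different length.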

Research distinguishes three categories. Intrinsic hallucinations contradict the supplied source (the model says B even though the retrieved chunk states A). Extrinsic hallucinations are invented statements without any source reference (a case-law citation that does not exist). Faithfulness violations occur when a summary misrepresents its source or adds to it.

As of May 2026, several established methods are available. TruthfulQA tests whether a model can resist common misinformation. HaluEval is a 35,000-sample dataset of generated hallucinations across dialogue, knowledge and summarisation. FActScore decomposes an answer into atomic facts and checks each against Wikipedia or another knowledge base. Self-consistency measures the variance when the same model answers repeatedly at temperature above zero. Citation grounding checks verify that each cited reference actually appeared in the retrieval result.

Why it matters

Without measurement there is no management. Anyone deploying AI in fiduciary, legal or insurance work is responsible for output quality. Gut feel is not enough: a model update, for example from the current Claude model to 4.7, can raise the hallucination rate on a specialised legal task from 8% to 14% without anyone noticing in everyday use.

The EU AI Act, most of whose provisions are scheduled to apply from August 2026, requires documented evaluation procedures for high-risk applications. Art. 9 mandates a risk management system; Art. 15 requires an appropriate, demonstrable level of accuracy. Without a measurement methodology you cannot meet these obligations.

Commercially the effect is direct. Clients who have twice fallen for an invented source terminate the engagement. A Zurich law firm lost a six-figure mandate in 2025 because a junior lawyer forwarded an AI answer with an invented case citation to a business client without checking. Hallucination measurement is the insurance against such incidents.

Internally, measurement provides the key lever for continuous improvement. Instead of guessing which prompt tweak helps, you read the effect of every change directly from the hallucination rate. That is engineering practice, not magic.

How it works

A complete hallucination measurement combines several methods because no single one catches all hallucination types.

TruthfulQA score. Published by Lin et al. in 2022, this benchmark contains 817 questions on common misconceptions (medical myths, urban legends). A correct answer resists the intuitive but wrong choice. As of May 2026, the current top GPT model reaches about 75% truthful + informative, Claude Opus about 71%, Mistral Large 2.1 about 63%. In practice, you test before production whether your deployed model answers tricky factual questions correctly.

HaluEval. Li et al. published this dataset in 2023: 5,000 general user queries with human-annotated ChatGPT responses and 30,000 task-specific examples with generated hallucinated answers across QA, knowledge-grounded dialogue and summarisation. You measure how often your model identifies a hallucination (detection accuracy) and how often it produces one itself (generation rate).
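
The detection side can be sketched as follows, assuming each sample carries a gold label (hallucinated or not) and the model under test returns a yes/no verdict. The function names and the toy labels are illustrative assumptions, not the official HaluEval evaluation script.

```python
def detection_accuracy(gold: list[bool], predicted: list[bool]) -> float:
    """Share of samples where the model's verdict matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def hallucination_recall(gold: list[bool], predicted: list[bool]) -> float:
    """Share of truly hallucinated samples that the model flagged."""
    flagged = [p for g, p in zip(gold, predicted) if g]
    if not flagged:
        raise ValueError("no hallucinated samples in gold labels")
    return sum(flagged) / len(flagged)


# True = "this answer is hallucinated" (toy labels for illustration only).
gold_labels = [True, True, False, False, True]
model_verdicts = [True, False, False, True, True]

print(detection_accuracy(gold_labels, model_verdicts))
print(hallucination_recall(gold_labels, model_verdicts))
```

Reporting recall next to accuracy helps when hallucinated samples are rare, because a detector that never flags anything can still score a high accuracy.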

FActScore. Min et al. 2023 developed a procedure that decomposes each generated answer into atomic claims ("Albert Einstein was born in 1879", "He received the Nobel Prize in 1921") and checks each against a knowledge source. The FActScore is the share of correct atoms. As of May 2026, the best models reach around 87% on biographical text, mid-tier models 70%, simple models below 60%.
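
The scoring step can be sketched like this, assuming the answer has already been split into atomic claims. In the real procedure a language model does the decomposition and the verification against retrieved passages; here the verifier is a toy lookup, and `fact_score`, `toy_verifier` and `KNOWN_FACTS` are hypothetical names chosen for illustration.

```python
from typing import Callable


def fact_score(atomic_claims: list[str], is_supported: Callable[[str], bool]) -> float:
    """Share of atomic claims that the knowledge source supports."""
    if not atomic_claims:
        raise ValueError("need at least one atomic claim")
    return sum(is_supported(claim) for claim in atomic_claims) / len(atomic_claims)


# Toy knowledge source; a real setup would query Wikipedia or an internal archive.
KNOWN_FACTS = {
    "albert einstein was born in 1879",
    "he received the nobel prize in 1921",
}


def toy_verifier(claim: str) -> bool:
    # Exact match after light normalisation, a stand-in for a judge model.
    return claim.strip().rstrip(".").lower() in KNOWN_FACTS


claims = [
    "Albert Einstein was born in 1879",
    "He received the Nobel Prize in 1921",
    "He was born in Vienna",  # not supported by the toy knowledge source
]
print(round(fact_score(claims, toy_verifier), 2))
```

Passing the verifier in as a function keeps the scoring logic identical whether the check is a lookup, a retrieval step or a judge-model call.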

Self-consistency (multiple sampling + vote). You ask the same model the same question three to five times at temperature 0.7. If the core statements agree, confidence is warranted. If they diverge, a hallucination is likely. Wang et al. 2022 showed that sampling several reasoning paths and taking a majority vote improves accuracy on reasoning benchmarks, by up to about 18 percentage points on arithmetic tasks such as GSM8K and by less on commonsense tasks.
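
A minimal sketch of the sampling and vote, assuming short answers that can be compared as strings. The `sample_answer` callable stands in for a real model call at temperature 0.7, and the `min_agreement` threshold of 0.6 is an assumption for illustration. Free-text answers would need a claim-level comparison instead of exact matching.

```python
from collections import Counter
from typing import Callable


def normalise(answer: str) -> str:
    """Crude normalisation so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def self_consistency(
    question: str,
    sample_answer: Callable[[str], str],
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> tuple[str, float, bool]:
    """Sample n answers, return (majority answer, agreement share, consistent?)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [normalise(sample_answer(question)) for _ in range(n_samples)]
    majority, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n_samples
    return majority, agreement, agreement >= min_agreement


# Canned replies instead of a real model call (hypothetical data).
_canned = iter(["1879", "1879", " 1879 ", "1878", "1879"])
majority, agreement, consistent = self_consistency(
    "In which year was Albert Einstein born?", lambda q: next(_canned)
)
print(majority, agreement, consistent)  # 1879 0.8 True
```

A low agreement share is a signal to route the answer to human review. A high share is no proof of correctness, since a model can be consistently wrong.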

Citation grounding check. Specifically for RAG setups: a post-processing layer checks that every source the model names (e.g. "[Tax Act Art. 13]") actually appeared in the retrieval hit and that the cited statement exists in that chunk. Anthropic introduced a native `with_citations` option for Claude in April 2026 that performs this check automatically.
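
A post-processing check of this kind might look as follows. The sketch assumes the model is prompted to write each cited statement as a quoted passage followed by a bracketed source label, and that retrieval results are available as a mapping from label to chunk text. The answer format, the function names and the sample chunk are assumptions for illustration; this is not Anthropic's API.

```python
import re

# Matches: "quoted passage" [Source label]
CITED_QUOTE = re.compile(r'"([^"]+)"\s*\[([^\[\]]+)\]')


def normalise(text: str) -> str:
    return " ".join(text.lower().split())


def check_citations(answer: str, retrieved: dict[str, str]) -> list[str]:
    """Return one problem description per citation that is not grounded."""
    problems = []
    for quote, label in CITED_QUOTE.findall(answer):
        chunk = retrieved.get(label)
        if chunk is None:
            problems.append(f"{label}: source not in retrieval result")
        elif normalise(quote) not in normalise(chunk):
            problems.append(f"{label}: quoted text not found in chunk")
    return problems


# Invented sample text, not real legal content.
retrieved_chunks = {
    "Tax Act Art. 13": "Contributions are deductible up to the statutory maximum.",
}
answer_text = (
    'Contributions are "deductible up to the statutory maximum" [Tax Act Art. 13], '
    'and "no limit applies" [Tax Act Art. 99].'
)
print(check_citations(answer_text, retrieved_chunks))
```

Substring matching only catches verbatim quotes. Paraphrased statements would need an entailment check or a judge model on top.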

Hallucination measurement in 6 steps

01 Build a domain-specific test set: 100-300 questions with verified correct answers and sources.
02 Define the metric set: FActScore for atomic facts, citation grounding for sources, TruthfulQA subset for general knowledge.
03 Set up a self-consistency pipeline: same model 3 times at temperature 0.7, automatic statement comparison.
04 Measure the baseline with the current model and document it – that is the comparison threshold.
05 Add a CI trigger for regression: run a full measurement on every prompt change or model update, alert on deviations of more than 2 percentage points.
06 Run a quarterly audit: full measurement, report for management and compliance, integrate new real-world cases.
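
The regression trigger in step 05 can be sketched as a comparison against the documented baseline. The sketch assumes all metrics are stored as percentages and reads the 2% alert threshold as 2 percentage points; the metric names and values are illustrative.

```python
def regression_alerts(
    baseline: dict[str, float],
    current: dict[str, float],
    max_delta_pp: float = 2.0,
) -> list[str]:
    """Compare current metrics (in percent) with the baseline and list deviations."""
    alerts = []
    for metric, base_value in baseline.items():
        if metric not in current:
            alerts.append(f"{metric}: missing in current run")
            continue
        delta = current[metric] - base_value
        if abs(delta) > max_delta_pp:
            alerts.append(
                f"{metric}: {base_value:.1f} -> {current[metric]:.1f} ({delta:+.1f} pp)"
            )
    return alerts


# Illustrative numbers only.
baseline_run = {"fact_score": 87.0, "citation_grounding": 96.0, "hallucination_rate": 8.0}
current_run = {"fact_score": 86.2, "citation_grounding": 95.5, "hallucination_rate": 14.0}

alerts = regression_alerts(baseline_run, current_run)
print(alerts)
# In a CI job, a non-empty list would fail the build, e.g. via sys.exit(1).
```

Flagging deviations in both directions is deliberate: an unexpected improvement often points to a broken test set or a changed judge model.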

When to measure

Measurement is mandatory in three situations. First: before every production deployment. A model whose hallucination rate exceeds your defined threshold (typically 3% for law/fiduciary, 8% for marketing) does not go live.
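
Such a release gate can be written down as a small piece of configuration. The sketch below reuses the typical thresholds named above; the domain keys and the function name are hypothetical.

```python
# Maximum tolerated hallucination rate per domain (values from the text above).
MAX_HALLUCINATION_RATE = {
    "law_fiduciary": 0.03,
    "marketing": 0.08,
}


def may_go_live(domain: str, measured_rate: float) -> bool:
    """A model passes the gate only if its rate does not exceed the threshold."""
    if domain not in MAX_HALLUCINATION_RATE:
        raise KeyError(f"no threshold defined for domain '{domain}'")
    return measured_rate <= MAX_HALLUCINATION_RATE[domain]


print(may_go_live("law_fiduciary", 0.05))
print(may_go_live("marketing", 0.05))
```

Raising an error for unknown domains avoids a silent default that would let an unreviewed use case slip through.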

Second: after every model update. Switching from the current Claude model to 4.7, from the current top GPT model to 5.2, changing the embedding model, changing the retrieval strategy – any of these can worsen the rate. Without before/after measurement you fly blind.

Third: on suspected real-world cases. When an employee reports a wrong answer, you add the case to the regression test set. That way your test set grows over time into a high-quality, project-specific benchmark.

For fiduciary and law firms we recommend a quarterly full measurement: you run 200-500 test questions through all deployed pipelines and document the result. That is your compliance documentation toward clients, supervisors and insurance.

When the effort is not worth it

When AI outputs serve only as inspiration for human editing – brainstorming, slogan generation, image sketches – a systematic hallucination measurement is overkill. Here a qualitative quarterly sample is enough.

Similarly for purely internal low-risk tools: an internal knowledge lookup that employees always check against the original anyway does not need a FActScore pipeline. The risk profile does not justify the measurement cost.

Beware of self-deception. Many firms claim employees "check anyway", but practice shows that after 14 days they no longer do. If your AI outputs go into a client or authority channel, they count as "external": measurement is mandatory regardless of which check steps are supposed to sit in between.

Not every hallucination test makes sense. TruthfulQA does not cover Swiss tax law. FActScore against Wikipedia is pointless when your knowledge base is the internal client archive. You need tests that fit your actual use case – generic benchmarks alone provide false confidence.

Trade-offs

STRENGTHS

Reproducible numbers instead of gut feel for model selection and pipeline decisions
Meets EU AI Act Art. 9 and 15 requirements for documented accuracy
Regression protection: every model and prompt change auto-checked against baseline
Compliance documentation toward clients, supervisors and professional indemnity insurance
Concrete lever for improvement – you see which prompt change produces which effect

WEAKNESSES

Test-set construction is time-intensive and needs real domain experts, not interns
Generic benchmarks (TruthfulQA, HaluEval) often miss domain specifics
Judge models (FActScore) add token cost on every run
Self-consistency misses systematic hallucinations where the model is consistently wrong
Maintenance overhead: every model generation needs test-set revalidation

FAQ

How large should my test set be?

Statistically meaningful statements require at least 100 examples per use case. For fiduciary accounting, 200 examples are realistic (booking receipts, assigning VAT, classifying dunning). For a law firm, 300-500 focused on the most common areas. Measuring with 20 examples produces gut feel, not statistics.
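
Why small test sets give little information becomes visible in the confidence interval around a measured rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); the choice of interval and the sample counts are assumptions for illustration.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed error rate of errors / n."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Same observed rate of 10%, measured on test sets of different size.
for errors, n in [(2, 20), (10, 100), (50, 500)]:
    low, high = wilson_interval(errors, n)
    print(f"n={n:>3}: observed {errors / n:.0%}, plausible range {low:.1%} to {high:.1%}")
```

With 20 examples the plausible range is so wide that a threshold such as 3% or 8% cannot be confirmed or ruled out. The range narrows noticeably at 100 examples and further at 500.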

Who builds the test set?

The domain experts – not IT. A senior fiduciary writes the bookkeeping questions with the correct answer and source. An experienced lawyer writes the legal questions. AI engineers help with formatting (JSON schema, columns for expected source) but the domain experts own the content. Otherwise you measure how well the AI imitates the AI.
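
The formatting help from the engineering side could be as simple as a fixed record structure with a completeness check. The field names below are an assumed schema for illustration, and the sample values are placeholders rather than real test content.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    """One expert-written test question (hypothetical schema)."""
    case_id: str
    question: str
    expected_answer: str
    expected_source: str
    author: str
    domain: str


def missing_fields(case: EvalCase) -> list[str]:
    """Names of fields the expert left empty."""
    return [name for name, value in asdict(case).items() if not str(value).strip()]


def to_jsonl(cases: list[EvalCase]) -> str:
    """Serialise cases as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in cases)


case = EvalCase(
    case_id="vat-001",
    question="<question written by the senior fiduciary>",
    expected_answer="<verified answer>",
    expected_source="Tax Act Art. 13",
    author="senior fiduciary",
    domain="fiduciary",
)
print(missing_fields(case))
print(to_jsonl([case]))
```

Recording the author with every case makes it visible later whether the content really came from a domain expert.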

What does a hallucination measurement cost?

A full measurement with 300 test questions, FActScore atomic decomposition (via a large model like the current top GPT model), citation check and self-consistency costs at May 2026 prices between USD 5 and USD 40 per run. The main cost drivers are the multiple model calls for self-consistency and the judge-model call for FActScore. Initial test-set construction: 1-3 person-days per 100 examples.
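
The cost structure can be estimated with simple arithmetic. The per-call prices in the sketch are placeholders chosen for illustration, not actual vendor prices; replace them with the figures from your own invoices.

```python
def estimate_run_cost(
    n_questions: int,
    samples_per_question: int,
    cost_per_answer_call: float,
    judge_calls_per_question: int,
    cost_per_judge_call: float,
) -> float:
    """Estimated cost in USD for one full measurement run."""
    per_question = (
        samples_per_question * cost_per_answer_call
        + judge_calls_per_question * cost_per_judge_call
    )
    return n_questions * per_question


# Placeholder prices (assumptions): USD 0.01 per answer call, USD 0.03 per judge call.
cost = estimate_run_cost(
    n_questions=300,
    samples_per_question=3,      # self-consistency samples
    cost_per_answer_call=0.01,
    judge_calls_per_question=1,  # FActScore decomposition and check
    cost_per_judge_call=0.03,
)
print(f"USD {cost:.2f}")
```

The formula shows why self-consistency dominates the bill: every additional sample multiplies the answer-call cost across the whole test set.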

Are vendor hallucination benchmarks trustworthy?

Treat them with caution. Vectara runs a public Hallucination Leaderboard with transparent methodology, which makes it a credible reference. Other vendor reports ("our model hallucinates 40% less") often use narrow, favorable benchmarks. Trust only measurements that disclose methodology, dataset and script. For your pipeline, only the measurement on your own use case ultimately counts.

Sources

Vectara – Hallucination Leaderboard (grounded summarisation benchmark) · 2026-05
Lin et al. – TruthfulQA: Measuring How Models Mimic Human Falsehoods · 2022-05
Min et al. – FActScore: Fine-grained Atomic Evaluation of Factual Precision · 2023-10
Anthropic – Claude with_citations API documentation · 2026-04
Li et al. – HaluEval Benchmark Dataset · 2026-02
