
LLM-as-a-judge: AI evaluates AI – methods, bias pitfalls, limits

GPT-4 and Claude as eval judges, pairwise vs pointwise scoring, position bias and self-preference, the G-Eval paper, when humans remain indispensable.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is LLM-as-a-judge?

LLM-as-a-judge means a language model evaluates the quality of another model's output (or of its own). The concept was systematically investigated by Zheng et al. in 2023 in the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" and has been the de facto method for scalable LLM evaluation ever since.

The idea is pragmatic: humans evaluate AI outputs reliably, but slowly and expensively (around USD 5-15 per rated answer). A large model like the current top GPT model or Claude Opus can deliver a comparable rating in about 2 seconds for about USD 0.01. If the judge is good enough, this scales to millions of answers.

As of May 2026, almost every eval framework (DeepEval, Ragas, Promptfoo) uses an LLM judge in the background for complex metrics: faithfulness, answer relevancy, helpfulness, toxicity. Anthropic introduced Constitutional AI in 2022 as a special approach in which a model judges its own outputs against a written constitution (RLAIF – Reinforcement Learning from AI Feedback).

Two variants dominate. Pointwise scoring: the judge rates a single answer on a scale (e.g. 1-5 for faithfulness). Pairwise comparison: the judge sees two answers (A and B) and decides which is better. Pairwise is more robust against scale drift but is computationally heavier.

Why it matters

Without an LLM judge, evaluation does not scale. Having humans fully evaluate a 500-case golden dataset on every model update costs USD 5,000 at USD 10 per rating – per model update. At monthly updates that is USD 60,000 a year just for evaluation. With an LLM judge the cost runs at USD 5-50 per full measurement.
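The arithmetic above can be reproduced in a few lines. This is only a sketch: the function name is illustrative, and the per-rating judge costs (USD 0.01 to 0.10) are back-calculated from the USD 5-50 per run quoted above.

```python
def annual_eval_cost(cases: int, cost_per_rating: float, runs_per_year: int) -> float:
    """Yearly cost of fully rating a golden dataset on every model update."""
    return cases * cost_per_rating * runs_per_year


# Figures from the text: 500 cases, one full run per monthly model update.
human = annual_eval_cost(cases=500, cost_per_rating=10.0, runs_per_year=12)
judge_low = annual_eval_cost(cases=500, cost_per_rating=0.01, runs_per_year=12)
judge_high = annual_eval_cost(cases=500, cost_per_rating=0.10, runs_per_year=12)

print(f"human: USD {human:,.0f} per year")
print(f"judge: USD {judge_low:,.0f} to {judge_high:,.0f} per year")
```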

The practical question is not "judge yes or no" but "how reliable is my judge?". Zheng et al. showed in 2023 that GPT-4 as judge on MT-Bench reaches over 80% agreement with human experts, about the same level as the agreement between the human raters themselves. As of May 2026, Claude Opus and the current top GPT model reach 85-90% agreement with senior experts on well-defined scales.

That is good enough for the majority of production pipelines. But: the 10-15% noise is not random – it sits systematically in certain case classes. These bias patterns must be understood before judge results are taken as truth.

For compliance the judge has a double role: it provides the quantitative eval pipeline EU AI Act Art. 15 demands, and at the same time the judge itself needs validation. You periodically measure the agreement rate of the judge against a senior human expert on a sample – if it falls below 70%, the judge is unsuitable for this use case and must be replaced by a human or a different judge.

How it works – methods and bias traps

Pointwise scoring. The judge sees the question, the generated answer and (optionally) a reference answer. It returns a score: "Rate the faithfulness of the answer to the source on a 1-5 scale. 1 = contradicts the source, 5 = fully supported by the source." Advantage: fast, one model call per answer. Disadvantage: scale drifts – what is a "3" today can be a "4" after the next model update.
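A pointwise judge call boils down to a prompt template and a strict parser for the reply. The sketch below is one possible shape, not a framework API: the template wording beyond the quoted scale, the "Score:" reply format and both function names are assumptions made for illustration. The actual model call is left out.

```python
import re
from typing import Optional

# The scale wording follows the example in the text; the reply format is an assumption.
JUDGE_TEMPLATE = """Rate the faithfulness of the answer to the source on a 1-5 scale.
1 = contradicts the source, 5 = fully supported by the source.

Question: {question}
Source: {source}
Answer: {answer}

Reply with a single line of the form "Score: <1-5>"."""


def build_pointwise_prompt(question: str, source: str, answer: str) -> str:
    """Fill the judge template for one answer (one model call per answer)."""
    return JUDGE_TEMPLATE.format(question=question, source=source, answer=answer)


def parse_score(judge_reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the score; return None if the reply is malformed or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None
```

Returning None instead of guessing keeps malformed judge replies visible, so they can be retried or counted rather than silently scored.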

Pairwise comparison. The judge sees the question plus two answers A and B and decides "A better", "B better" or "tied". More robust against scale drift but twice as expensive and not directly comparable across different pairs. Often combined with Elo rating (as in Chatbot Arena).
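To turn individual pairwise verdicts into a ranking, each verdict can feed a standard Elo update. The sketch below uses the textbook formula; the K-factor of 32 and the starting ratings are assumptions, and it is not meant to reproduce Chatbot Arena's exact procedure.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Apply one judge verdict.

    outcome_a is 1.0 for "A better", 0.0 for "B better", 0.5 for "tied".
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (outcome_a - exp_a)
    # Whatever A gains, B loses, so the rating sum stays constant.
    return rating_a + delta, rating_b - delta


# Two systems start level; the judge says "A better".
new_a, new_b = update_elo(1000.0, 1000.0, outcome_a=1.0)
print(new_a, new_b)
```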

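G-Eval (NLP paper 2023). Liu et al. propose a special pointwise variant: the judge first generates chain-of-thought evaluation steps, then fills in the score. The final score is the sum of the candidate scores, each weighted by the probability the model assigns to that score token, which yields a finer-grained value than a single integer. Reported result: about 0.51 Spearman correlation with human judgment on summarisation with GPT-4 as the judge, above the earlier automatic metrics the paper compares against.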
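The probability weighting in G-Eval amounts to an expected value over the possible score tokens. A minimal sketch, assuming the token probabilities for "1" to "5" have already been read from the judge's output (how to obtain them depends on the API and is not shown); the renormalisation step is an assumption for the case where the probabilities do not sum to 1.

```python
from typing import Dict


def weighted_score(score_probs: Dict[int, float]) -> float:
    """Expected score: sum of score * probability over all candidate scores."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("score probabilities must sum to a positive value")
    # Renormalise, since other tokens may have taken some probability mass.
    return sum(score * prob for score, prob in score_probs.items()) / total


# Hypothetical probabilities for one summary.
print(weighted_score({3: 0.2, 4: 0.5, 5: 0.3}))
```

Compared with taking only the most likely token (4 in this example), the weighted value separates answers that would otherwise all receive the same integer.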

Constitutional AI / RLAIF. Anthropic's method: the judge follows an explicit constitution ("is the answer helpful? harmful? correct?"). The model judges its own answer, generates self-critique and improves it. Used in Anthropic training itself and open source since 2025 as the Constitutional AI library pattern.

Position bias. When the judge always sees A before B in pairwise, it favours the first position by about 5-12 percentage points. Fix: run every question twice – once A-B, once B-A – and average results.
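The swap-and-average fix can be sketched as follows. The judge is passed in as a plain callable so the example stays independent of any vendor SDK; its signature and the return labels are assumptions made for illustration.

```python
from typing import Callable

# Hypothetical judge signature: (question, answer_in_slot_A, answer_in_slot_B)
# returning "A", "B" or "tie".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_1: str, answer_2: str) -> str:
    """Run the comparison in both orders and average the two verdicts."""
    first = judge(question, answer_1, answer_2)   # answer_1 shown in slot A
    second = judge(question, answer_2, answer_1)  # answer_1 shown in slot B

    score_1 = 0.0  # points for answer_1: win = 1, tie = 0.5, loss = 0
    for verdict, slot_of_answer_1 in ((first, "A"), (second, "B")):
        if verdict == "tie":
            score_1 += 0.5
        elif verdict == slot_of_answer_1:
            score_1 += 1.0

    if score_1 > 1.0:
        return "answer_1"
    if score_1 < 1.0:
        return "answer_2"
    return "tie"
```

A judge that always picks the first slot produces one win for each answer, which averages out to a tie, so pure position preference no longer decides the outcome.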

Self-preference bias. A GPT judge rates GPT-generated answers slightly higher than Claude-generated ones, and vice versa. Effect: 3-8% according to studies available as of May 2026. Fix: pick the judge model independent of the generator model, ideally from a different vendor.

Length bias. Judges prefer longer answers, often because more detail looks "competent", even when a fifth of the longer answer is hallucinated. Fix: explicitly state in the judge prompt "answer length is NOT a quality criterion" and consider length normalisation.

Self-consistency bias. A judge agrees more with an answer that matches the judge model's own world view. Particularly problematic in political, ethical or values-oriented questions.

Introducing LLM-as-a-judge in 6 steps

01 Choose the judge model: different vendor from the generator, high capacity (the current top GPT model, Claude Opus).
02 Write a judge prompt with clear scale definitions, examples for every score level and bias notes (for example, that length is not a quality criterion).
03 Decide pointwise or pairwise: pointwise for scalable metrics, pairwise for critical A/B comparisons.
04 Validate against humans: have 100-200 cases rated in parallel by the judge and a senior expert, then compute the agreement rate.
05 At agreement of 80% or more: deploy the judge in production with periodic sample validation (10% by humans).
06 At agreement below 80%: sharpen the judge prompt, swap the judge model or declare the use case to be human-rated.
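Steps 04 to 06 can be sketched as a small validation gate. The function names and the returned strings are illustrative; the 80% threshold comes from the steps above, and treating exactly 80% as a pass is an assumption.

```python
from typing import List


def agreement_rate(judge_scores: List[int], human_scores: List[int]) -> float:
    """Share of cases where the judge and the human expert gave the same score."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


def deployment_decision(rate: float, threshold: float = 0.80) -> str:
    """Step 05 vs. step 06; a rate exactly at the threshold passes (assumption)."""
    if rate >= threshold:
        return "deploy with periodic sample validation"
    return "revise prompt, swap judge or rate by humans"


# Hypothetical validation sample of five cases on a 1-5 scale.
rate = agreement_rate([1, 2, 3, 4, 5], [1, 2, 3, 4, 4])
print(rate, deployment_decision(rate))
```

Exact-match agreement is the strictest reading; on a 1-5 scale some teams also count scores within one point as agreement, which should then be stated alongside the number.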

When LLM-as-a-judge makes sense

You use LLM judges for scalable eval pipelines: faithfulness in RAG (answer supported by source?), answer relevancy (does the answer address the question?), helpfulness (is the answer usable for the user?), toxicity and bias (does the answer contain problematic content?).

For fiduciary applications: automatic scoring of dunning email drafts ("is the tone professional and not aggressive?"), classification quality of receipts ("is the booking category of the proposed answer plausible?"), client answer quality ("does the answer address the client's question understandably?").

In law firms: pre-filter for AI-generated brief drafts, quality check of clause suggestions, evaluation of legal research summaries.

You choose a judge by: (1) different vendor than the generator (against self-preference bias), (2) high capacity (typically the current top GPT model, Claude Opus, Gemini 2.5 Pro), (3) when possible pairwise instead of pointwise for critical decisions.

When the judge is not enough

For liability-relevant decisions, a human must additionally review. When an AI output goes to a client without a check and the client builds business decisions on it, "the judge says 4 of 5" is not enough. As of May 2026 the consensus among Swiss bar oversight and the FINMA environment is: judge for filtering and sorting fine, but final sign-off by a human at high stakes.

For value-laden questions (ethically, politically, culturally sensitive) the judge is unreliable. Which dunning tonality is "culturally appropriate" for a Ticino client vs. a Zurich client? Which tax-optimisation recommendation is "in the client's interest"? Here human judgment is indispensable.

For very new topics the judge model knows little about (e.g. a May 2026 tax reform after the training cutoff), the judge is clueless. You need either human evaluation or RAG attachment to current sources on the judge path too.

Rule of thumb: per 1000 answers have 50-100 reviewed by humans on a random sample and compared to the judge. If the agreement rate falls below 75%, the judge is too weak for this class of answers and must be supplemented with humans.
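The rule of thumb translates into a random audit sample plus a threshold check. A minimal sketch, assuming answers are identified by string IDs; the function names, the 5% default (the lower end of 50-100 per 1000) and the fixed seed are illustrative choices.

```python
import random
from typing import List


def draw_audit_sample(answer_ids: List[str], fraction: float = 0.05,
                      seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of answers for human review."""
    k = max(1, round(len(answer_ids) * fraction))
    rng = random.Random(seed)  # fixed seed so the audit can be re-run
    return rng.sample(answer_ids, k)


def needs_human_support(agreements: List[bool], threshold: float = 0.75) -> bool:
    """True if judge-human agreement on the audited sample is below the threshold."""
    if not agreements:
        raise ValueError("no audited cases")
    return sum(agreements) / len(agreements) < threshold


ids = [f"answer-{i}" for i in range(1000)]
sample = draw_audit_sample(ids)
print(len(sample))
```

Sampling at random, rather than picking cases by hand, matters because the audit is only meaningful if it reflects the real mix of easy and hard answers.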

Trade-offs

STRENGTHS

Roughly 100-1000x cheaper than human rating on well-defined scales
Scales to millions of answers in hours instead of weeks
Largely reproducible: same input + pinned judge version + temperature 0 = near-identical result
Integrated into all major eval frameworks (DeepEval, Ragas, Promptfoo)
Constitutional AI pattern enables self-correcting pipelines

WEAKNESSES

Systematic bias patterns: position, self-preference, length, self-consistency
Scale drifts across judge model versions – comparison over time needs pinning a specific model version
Value-laden and ethical questions still need human judgment
Judge-vs-human validation is ongoing, not a one-time setup
On rare or new topics the judge's knowledge cutoff becomes a problem

FAQ

Which model makes the best judge as of May 2026?

The current top GPT model and Claude Opus are typically about tied in published studies, at roughly 85-90% agreement with senior experts on well-defined scales. Gemini 2.5 Pro sits slightly below (82-87%). Mistral Large 2.1 as a judge is rather weak (70-78%) – Mistral's strength is in generation, not judging. Rule of thumb: combine generator and judge across vendors (generator OpenAI, judge Anthropic or vice versa) against self-preference.

How do I measure judge quality?

Measure the agreement between judge and human on a 100-200 case sample, using Cohen's kappa for categorical labels or Pearson correlation for numeric scores. Values above 0.75 are good, above 0.85 very good. Important: the sample must represent the real answer distribution, not just easy cases.
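Cohen's kappa corrects the raw agreement rate for the agreement two raters would reach by chance. A self-contained sketch of the standard (unweighted) formula follows; the labels in the example are made up. For a library version, scikit-learn provides `cohen_kappa_score`.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty label sequences of equal length")

    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


judge = ["good", "good", "bad", "bad"]
human = ["good", "bad", "bad", "bad"]
print(cohens_kappa(judge, human))
```

In this example the raw agreement is 75%, but kappa is only 0.5, because half of that agreement would be expected by chance given how often each rater uses each label.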

Can the generator model also be the judge?

Technically yes, but not recommended due to self-preference bias. If unavoidable (single-vendor setup), use a different model tier (generator: the mini variant of the current GPT line, judge: the current top GPT model) and be more conservative with score thresholds. For single-vendor Anthropic: generator Sonnet, judge Opus.

What does a judge run cost?

One Claude Opus rating at around 1500 tokens input + 500 tokens output costs about USD 0.015 as of May 2026, so a 200-case dataset run comes to roughly USD 3-5. The current top GPT model is slightly more expensive (USD 0.02 per rating); Gemini 2.5 Flash is much cheaper (USD 0.002), but its agreement rate is lower. Compared with human rating, this is money well spent.

Sources

Zheng et al. – Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · 2023-12
Liu et al. – G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment · 2023-05
Anthropic – Constitutional AI: Harmlessness from AI Feedback (paper) · 2022-12
Anthropic – RLAIF and Constitutional AI library (docs) · 2026-03
DeepEval – LLM-as-a-Judge metric implementations · 2026-05
