fairlane.systems

REGRESSION TESTING · AI CONCEPT

Regression testing for LLMs: CI/CD, snapshot tests and detecting model-update drift

CI/CD integration of LLM tests, snapshot testing for prompts, and diff testing between model versions, using the migration of a fiduciary pipeline from the current Claude model to 4.7 as the example.

As of: 2026-05

What is LLM regression testing?

Regression testing for LLMs is the automated check of whether a change to the prompt, model, retrieval logic or pipeline configuration has degraded answer quality compared to the previous version. "Regression" here means not only "a bug is back" (the classical software definition) but also "a quality metric has dropped".

Classical software regression testing checks deterministic behaviour ("function returns 42"). LLM regression testing checks semantic equivalence and statistical distribution ("faithfulness score sits at 92% ± 2%"). The check is never an exact match; it is always tolerance-based.

As of May 2026, regression testing for production LLM pipelines is the de facto standard in the more advanced fiduciary and legal setups. Tools like Promptfoo, DeepEval and Ragas have native CI integration. GitHub Actions, GitLab CI and CircleCI plug in directly.

The use case is clear: Anthropic releases a new Claude model, 4.7. You want to migrate because 4.7 is faster and slightly cheaper. But will booking classification accuracy on your 200-case dataset still be above 92%? Without a regression test you switch and only notice three weeks later that VAT assignment accuracy has dropped from 91% to 86%, with three weeks of wrong bookings in between.

Why it matters

LLM pipelines are more fragile than classical code. A three-word prompt change can double the hallucination rate. A model update can be better on 80% of cases and catastrophic on 5%. Without regression tests you see only the average – and the average is misleading.

In fiduciary practice the silent-drift problem is especially dangerous: your pipeline runs well for months, then the cloud provider quietly changes the model under the hood (a typical "model alias" like `claude-sonnet-latest` suddenly resolves to `4.7` instead of `4.6`), and your booking suggestions get worse – with no code change on your side. Regression tests running daily or at least weekly catch this.
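
One cheap guard against alias drift is to flag floating model aliases in the pipeline configuration before they reach production. The sketch below is a minimal illustration, not a vendor API: the suffix list, the step names and the pinned model ID are assumptions made for the example.

```python
# Hypothetical convention: an alias counts as "floating" if it ends in one of
# these suffixes. Adjust the list to the naming scheme of your provider.
FLOATING_SUFFIXES = ("-latest", "-stable", "-default")


def is_floating_alias(model_id: str) -> bool:
    """Return True if the model ID looks like an alias that can change silently."""
    return model_id.lower().endswith(FLOATING_SUFFIXES)


def find_floating_aliases(pipeline_config: dict[str, str]) -> list[str]:
    """Return the pipeline steps (sorted) that use a floating alias."""
    return sorted(
        step for step, model_id in pipeline_config.items()
        if is_floating_alias(model_id)
    )


if __name__ == "__main__":
    # Illustrative config: step name -> model ID (both invented for this example).
    config = {
        "booking_classifier": "claude-sonnet-latest",
        "vat_assignment": "example-model-2026-01-15",
    }
    print(find_floating_aliases(config))
```

A check like this does not replace the scheduled regression run; it only makes the risk visible in code review.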

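The regulatory view: the EU AI Act requires "post-market monitoring" for high-risk systems, i.e. surveillance during operation (Art. 72; Art. 17 makes this monitoring part of the provider's quality management system). Regression tests in CI are one technical building block for this duty. A high-risk pipeline without any such monitoring will struggle to demonstrate compliance.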

Economically, regression tests pay for themselves after the first detected failure. A fiduciary office reported in May 2026 that its regression test flagged a Claude Sonnet update to 4.7 as negative drift (balance-sheet plausibility accuracy fell from 89% to 81%). The office did not migrate and waited for 4.7.1, which fixed the issue. Estimated damage without the test: 30-50 hours of rework over three weeks.

How it works – patterns and tools

Snapshot testing for prompts. Like React component snapshots: a baseline run stores the generated answer per test case, and each CI run checks whether the new answer is "equivalent" to the stored snapshot. Equivalence is measured not by string match but by semantic similarity (embedding distance < threshold) or an LLM judge score (>= 4 of 5). Promptfoo and DeepEval support this pattern natively.
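
A minimal sketch of the similarity-based variant, written in plain Python rather than with any particular framework. The `toy_embed` function is a stand-in (word counts) so the example runs without a model; in practice you would pass in a real embedding function. The 0.8 threshold is an assumption, and "similarity >= threshold" is the same test as "distance < threshold" with distance defined as 1 minus similarity.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: lower-cased word counts. Replace with a real model."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_snapshot(
    snapshot: str,
    new_answer: str,
    embed: Callable[[str], Vector] = toy_embed,
    min_similarity: float = 0.8,  # assumed threshold, tune per pipeline
) -> bool:
    """True if the new answer is close enough to the stored snapshot."""
    return cosine_similarity(embed(snapshot), embed(new_answer)) >= min_similarity


if __name__ == "__main__":
    stored = "Book to account 6500 office supplies"
    print(matches_snapshot(stored, "Book to account 6500 office supplies"))
    print(matches_snapshot(stored, "Invoice is missing a date"))
```

Word counts only capture lexical overlap, so this stand-in will reject valid paraphrases; that is exactly the gap a real embedding model or an LLM judge closes.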

Diff testing between model versions. Example: the current Claude model → 4.7. You run 200 test cases through both models in parallel. For each case a third model acts as judge (e.g. the current top GPT model) and compares the answers: "4.7 better", "4.6 better", "tied". Result: a migration report showing on which case types 4.7 is better and on which it is worse. Concretely, on a May 2026 fiduciary project 4.7 was better on 78% of cases and worse on 9% (the remaining 13% were ties), and the 9% worse cases were all VAT special cases. Migration went ahead with an additional custom prompt for the problematic sub-use-case.
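
The migration report can be sketched as a tally of judge verdicts per case category. Everything here is illustrative: the models and the judge are passed in as plain callables (in practice they would wrap API calls), the verdict labels "new", "old" and "tie" and the 20% review threshold are assumptions.

```python
from collections import Counter, defaultdict
from typing import Callable

Model = Callable[[str], str]
Judge = Callable[[str, str, str], str]  # returns "new", "old" or "tie"


def migration_report(
    cases: list[dict[str, str]],
    old_model: Model,
    new_model: Model,
    judge: Judge,
) -> dict[str, dict[str, int]]:
    """Run every case through both models and count verdicts per category."""
    report = defaultdict(Counter)
    for case in cases:
        old_answer = old_model(case["input"])
        new_answer = new_model(case["input"])
        verdict = judge(case["input"], old_answer, new_answer)
        report[case["category"]][verdict] += 1
    return {category: dict(counts) for category, counts in report.items()}


def categories_to_review(
    report: dict[str, dict[str, int]],
    max_worse_share: float = 0.2,  # assumed threshold
) -> list[str]:
    """Categories where the old model wins more often than the threshold allows."""
    flagged = []
    for category, counts in report.items():
        total = sum(counts.values())
        if counts.get("old", 0) / total > max_worse_share:
            flagged.append(category)
    return sorted(flagged)
```

Grouping by category is what surfaces a pattern like "all worse cases are VAT special cases"; an overall win rate alone would hide it.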

Statistical tolerance testing. Instead of checking each individual case, the test checks the aggregated distribution: "faithfulness mean >= 0.85", "hallucination rate <= 5%", "P95 latency <= 4s". If aggregate metrics are in tolerance, the test is green even if individual cases look different than before. Sensible for test sets above 500 cases.
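
The three example thresholds can be sketched as one aggregate check. This is a minimal illustration with invented sample data; the P95 uses the simple nearest-rank method, which is one of several common percentile definitions.

```python
from statistics import mean


def p95(values: list[float]) -> float:
    """95th percentile by nearest rank (integer arithmetic avoids float rounding)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def check_tolerances(
    faithfulness: list[float],
    hallucinated: list[bool],
    latencies_s: list[float],
) -> dict[str, bool]:
    """Evaluate the aggregate thresholds named in the text; True means in tolerance."""
    return {
        "faithfulness_mean": mean(faithfulness) >= 0.85,
        "hallucination_rate": sum(hallucinated) / len(hallucinated) <= 0.05,
        "latency_p95": p95(latencies_s) <= 4.0,
    }


if __name__ == "__main__":
    # 20 invented cases: one is clearly bad, the aggregate still passes.
    results = check_tolerances(
        faithfulness=[0.9] * 19 + [0.4],
        hallucinated=[False] * 19 + [True],
        latencies_s=[2.0] * 19 + [9.0],
    )
    print(results, "green" if all(results.values()) else "red")
```

The sample data makes the trade-off concrete: one case with faithfulness 0.4 and 9 s latency does not turn the run red, which is why this pattern suits large test sets and should be paired with per-case checks for critical cases.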

CI/CD integration. GitHub Actions, GitLab CI, CircleCI: a YAML workflow triggers the eval run on every commit. On pass: merge allowed. On fail: merge blocked. A typical run with 200 test cases and 3 metrics takes 5-15 minutes and costs USD 3-10. The workflow is often restricted with a path filter (in GitHub Actions, `paths:` under the trigger) so that it only runs on prompt or model changes.
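
The gate logic itself is small. The sketch below shows it as a Python entry point that a CI job could call: it skips the eval when no relevant file changed and otherwise maps the eval result to an exit code (non-zero blocks the merge). The path patterns and the `run_eval` callable are assumptions for illustration; most CI systems can also do the path filtering natively.

```python
from fnmatch import fnmatchcase
from typing import Callable

# Hypothetical repository layout: changes under these paths trigger the eval.
EVAL_TRIGGER_PATTERNS = ("prompts/*", "retrieval/*", "config/models.yaml")


def needs_eval(changed_files: list[str]) -> bool:
    """True if any changed file matches a trigger pattern ('*' also spans '/')."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in EVAL_TRIGGER_PATTERNS
    )


def ci_exit_code(changed_files: list[str], run_eval: Callable[[], bool]) -> int:
    """Return 0 (merge allowed) or 1 (merge blocked) for the CI job."""
    if not needs_eval(changed_files):
        return 0  # nothing relevant changed, skip the paid eval run
    return 0 if run_eval() else 1
```

In a real job the result would be passed to `sys.exit`, and the list of changed files would come from the CI system or from `git diff --name-only`.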

Scheduled drift detection. In addition to the PR trigger: a cron run testing daily or weekly against the live model endpoint. Catches vendor-side model updates ("model alias has changed") that happen without your commit. Results land in a dashboard (Phoenix, TruLens, or your own Grafana) tracked over time.
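
For the scheduled run, one simple approach is to compare the latest score with a rolling mean of the previous runs. The window of 7 runs and the allowed drop of 0.03 are assumptions for this sketch, and the metric is assumed to be "higher is better".

```python
from statistics import mean


def detect_drift(
    history: list[float],
    latest: float,
    window: int = 7,        # assumed: number of previous runs to average
    max_drop: float = 0.03,  # assumed: allowed absolute drop
) -> bool:
    """True if the latest score is more than max_drop below the rolling mean."""
    recent = history[-window:]
    if not recent:
        return False  # no baseline yet, nothing to compare against
    return mean(recent) - latest > max_drop


if __name__ == "__main__":
    weekly_faithfulness = [0.91, 0.90, 0.92, 0.91]  # invented history
    print(detect_drift(weekly_faithfulness, latest=0.85))
    print(detect_drift(weekly_faithfulness, latest=0.90))
```

A rolling mean adapts to slow, accepted changes; if you want to catch gradual decay as well, compare against the frozen baseline in addition.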

Tools as of May 2026. Promptfoo (CLI-oriented, very fast to integrate), DeepEval (Python pytest-style), Ragas (RAG-specific), Phoenix (hosted dashboard). Anthropic released its own eval API in beta in April 2026 – directly in the Claude Workbench area, good for Anthropic-only setups.

Introduce regression testing in 6 steps

  1. Prepare golden dataset and eval framework (DeepEval/Promptfoo/Ragas) – baseline 100 cases, 2-4 metrics.
  2. Run baseline with current model version and freeze results (faithfulness 0.91, hallucination 3%, latency P95 3.2s).
  3. Define tolerance thresholds: typically a drop of 3 percentage points (absolute) as fail threshold.
  4. Build CI workflow (GitHub Actions): eval run on PR trigger, block on fail.
  5. Set up scheduled cron (daily/weekly) against live endpoint for vendor drift.
  6. Set up drift dashboard (Phoenix or your own Grafana): metrics over time, alarm on anomalies.
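
Steps 2 and 3 can be sketched as a frozen baseline plus a per-metric comparison. The baseline values are the ones from step 2; the metric names, the direction flags and the 0.5 s latency tolerance are assumptions for the example.

```python
import json

# metric name -> (which direction is better, allowed absolute worsening)
TOLERANCES = {
    "faithfulness": ("higher", 0.03),
    "hallucination_rate": ("lower", 0.03),
    "latency_p95_s": ("lower", 0.5),  # assumed tolerance, not from the text
}


def freeze_baseline(metrics: dict[str, float]) -> str:
    """Serialise the baseline so it can be committed next to the test set."""
    return json.dumps(metrics, sort_keys=True, indent=2)


def regressions(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Names of metrics that got worse by more than their tolerance."""
    failed = []
    for name, (better, tolerance) in TOLERANCES.items():
        delta = current[name] - baseline[name]
        worse_by = -delta if better == "higher" else delta
        if worse_by > tolerance:
            failed.append(name)
    return failed


if __name__ == "__main__":
    baseline = {"faithfulness": 0.91, "hallucination_rate": 0.03, "latency_p95_s": 3.2}
    frozen = freeze_baseline(baseline)
    current = {"faithfulness": 0.87, "hallucination_rate": 0.04, "latency_p95_s": 3.3}
    print(regressions(json.loads(frozen), current))
```

Keeping the direction per metric matters: a rise is bad for hallucination rate and latency but good for faithfulness, so a single signed threshold would misfire.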

When regression testing is mandatory

You need regression tests as soon as an LLM pipeline is live and outputs go to clients, authorities or direct booking systems. Concretely:

On every model update. Switching from the current Claude model to 4.7, from the current top GPT model to 5.2, or from Mistral-Large 2.0 to 2.1: mandatory in every case, even if the update is called "minor". Model providers are often optimistic when describing changes.

On every prompt change. A seemingly harmless rephrasing can shift answer quality. A regression test against the previous prompt behaviour shows it.

On every retrieval parameter update. Changing top-k, chunking size, embedding model, reranker – all run through the eval.

On every library upgrade. Switching LangChain 0.3 to 0.4, updating the OpenAI SDK – behaviour often shifts subtly.

Scheduled for vendor drift. At least weekly, ideally daily, run against the live model endpoint to detect vendor-side updates.

For SME fiduciary setups (1-2 pipelines, < 5 staff): once-a-week cron + eval run on every PR is the lower bound. For law firms with liability-relevant pipelines: daily cron + PR trigger.

When minimal effort is enough

For pilot projects below 4 weeks runtime, a fully integrated CI/CD regression is not needed. A weekly manual run of the test script and a short note in the team channel suffices.

For low-risk internal tools (brainstorming helper, internal keyword search, knowledge lookup) the rule is: no regression test pipeline needed, but a mechanism for employees to report problems ("thumbs down" button).

Beware "we will test when we have time". That does not happen. Anyone who has deployed once without a regression test tends not to add one until an incident hits. Build the pipeline before going live.

Another point: not every test run is a regression test. A 5-case quick check is a smoke test, not a regression test. Real regression requires a stable, statistically meaningful golden dataset (at least 100 cases) and reproducible metrics. Otherwise you measure noise.

Trade-offs

STRENGTHS

  • Catches model-update regression before production – damage visible in hours, not weeks
  • Supports EU AI Act post-market monitoring (Art. 72) technically
  • Builds migration confidence: model switches become engineering, not a bet
  • Catches vendor-side alias drift that happens without your commit
  • CI integration possible with tools already in the stack (GitHub Actions, GitLab CI)

WEAKNESSES

  • Initial setup 5-15 days depending on pipeline complexity
  • Ongoing total cost USD 500-1,500/month for a typical SME pipeline (tokens, dashboard hosting, maintenance)
  • Stochastic outputs need multi-sample runs – tests are never 100% reproducible
  • Test-set maintenance is ongoing (quarterly refresh)
  • False-positive rate: too-tight tolerances incorrectly block legitimate updates

FAQ

How does LLM regression differ from unit tests?

Unit tests are binary (pass/fail) and deterministic (same input = same output). LLM regression tests are statistical (tolerance-based, e.g. "faithfulness >= 0.85 ± 0.03") and stochastic (temperature above 0 produces variation, so you run 3-5 times and aggregate, e.g. by median). They complement each other: unit tests for code logic (pipeline glue), LLM regression for model behaviour.

What does a running regression test pipeline cost?

At 200 test cases and 3 metrics per run: USD 3-10 per CI run. At 30 PRs per month plus a weekly cron: USD 120-400 per month in token cost. Engineer time for maintenance: 4-8 hours per month. Dashboard hosting Phoenix/TruLens (if used): USD 200-800 per month. Total for an average fiduciary pipeline: USD 500-1,500 per month – peanuts compared to the risk of an undetected regression.
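
The token-cost part of this estimate is simple arithmetic: runs per month times cost per run. A small sketch makes the dependence on the run count explicit. The assumption of four cron runs per month and the helper name are ours; with exactly one run per PR the result is 34 runs, while a range of USD 120-400 corresponds to about 40 runs, for example when some PRs are re-run after fixes.

```python
def monthly_token_cost(
    pr_runs: int,
    cron_runs: int,
    cost_per_run_usd: tuple[float, float] = (3.0, 10.0),  # range from the text
) -> tuple[float, float]:
    """Return the (low, high) monthly token cost in USD."""
    runs = pr_runs + cron_runs
    low, high = cost_per_run_usd
    return runs * low, runs * high


if __name__ == "__main__":
    print(monthly_token_cost(pr_runs=30, cron_runs=4))  # one run per PR
    print(monthly_token_cost(pr_runs=36, cron_runs=4))  # some PRs re-run
```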

How do I handle non-deterministic answers?

Three strategies. First: temperature 0 in tests (close to deterministic, although identical outputs are not guaranteed, and artificial, since production often runs at 0.3-0.7). Second: multi-sample run (3-5 repetitions, take the median). Third: statistical tolerance (accept the variance over e.g. 5 runs as bandwidth). We recommend a mix: temperature 0 for sanity checks, multi-sample for critical metrics.
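
The multi-sample strategy can be sketched in a few lines. `score_once` is a hypothetical callable that runs the test case once and returns its metric score; the spread (max minus min) is returned alongside the median as a rough bandwidth for the third strategy.

```python
from statistics import median
from typing import Callable


def multi_sample_score(
    score_once: Callable[[], float],
    runs: int = 5,
) -> tuple[float, float]:
    """Run the scorer several times; return (median score, observed spread)."""
    scores = [score_once() for _ in range(runs)]
    return median(scores), max(scores) - min(scores)


if __name__ == "__main__":
    # Invented scores standing in for five runs of the same test case.
    samples = iter([0.90, 0.88, 0.93, 0.60, 0.91])
    score, spread = multi_sample_score(lambda: next(samples), runs=5)
    print(score, round(spread, 2))
```

The median is robust against a single outlier run (the 0.60 above), whereas a mean would be pulled down by it; a large spread is itself a signal worth tracking.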

How do I react to a drift alarm?

Three steps. (1) Verify: is the drift real or noise? Trigger a re-run. If drift persists: (2) localise cause – model alias changed? prompt change in the PR? library upgrade? Compare pipeline configuration with the last green version. (3) React: rollback (if possible, pin the model version) or fix (adjust prompt, validate new model state). Document the incident for the compliance file.
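
Step (2) is easier if every eval run records its pipeline configuration. A minimal sketch of the comparison with the last green version follows; the configuration keys and values are invented for the example.

```python
from typing import Any


def config_diff(
    last_green: dict[str, Any],
    current: dict[str, Any],
) -> dict[str, tuple[Any, Any]]:
    """Return {key: (old value, new value)} for every setting that differs.

    Keys missing on one side are reported with None on that side.
    """
    keys = sorted(set(last_green) | set(current))
    return {
        key: (last_green.get(key), current.get(key))
        for key in keys
        if last_green.get(key) != current.get(key)
    }


if __name__ == "__main__":
    last_green = {"model": "example-model-4.6", "prompt_hash": "abc", "top_k": 5}
    current = {
        "model": "example-model-4.7",
        "prompt_hash": "abc",
        "top_k": 5,
        "reranker": "v2",
    }
    for key, (old, new) in config_diff(last_green, current).items():
        print(f"{key}: {old!r} -> {new!r}")
```

For vendor-side drift the diff may be empty on your side; recording the model version reported by the API response, where the provider exposes one, helps close that gap.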

Related topics

  • EVAL FRAMEWORKS · AI CONCEPT: Eval frameworks for LLMs: DeepEval, OpenAI Evals, Promptfoo, Ragas, TruLens compared
  • GOLDEN DATASET · AI CONCEPT: Building a golden dataset: 50-500 test examples done right for SMEs
  • LLM-AS-A-JUDGE · AI CONCEPT: LLM-as-a-judge: AI evaluates AI – methods, bias pitfalls, limits
  • HALLUCINATION MEASUREMENT · AI CONCEPT: Detecting and measuring hallucinations: metrics, benchmarks and self-consistency
  • AI KPIS · AI CONCEPT: Measuring AI quality: KPIs for RAG, latency, cost and user satisfaction
  • ANTHROPIC · LLM PROVIDER: Anthropic Claude from a Swiss fiduciary perspective: residency, pricing, compliance
  • ROUTING · AI CONCEPT: Multi-LLM routing: which model when, for how much

Sources

  1. Promptfoo – Regression Testing for LLMs (docs) · 2026-05
  2. DeepEval – CI/CD Integration Guide · 2026-04
  3. EU AI Act, Article 17 (Quality Management System) and Article 72 (Post-Market Monitoring) · 2024-07
  4. Anthropic – Claude Workbench Eval API (beta) · 2026-04
  5. Arize Phoenix – Drift Detection for LLM Apps · 2026-03
