Measuring AI quality: KPIs for RAG, latency, cost and user satisfaction
SME dashboard for AI quality: faithfulness, answer relevancy, context precision, context recall, latency, cost-per-query, user satisfaction.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What are AI quality KPIs?
AI quality KPIs are quantitative metrics that continuously measure AI pipelines running in production. They replace "is the AI still working?" with concrete numbers like "faithfulness is at 0.91, median latency 1.4s, cost 0.018 CHF per query, user satisfaction 4.2 out of 5".
Four categories dominate dashboards as of May 2026. First: quality metrics (faithfulness, answer relevancy, hallucination rate, bias score). Second: performance metrics (P50/P95 latency, tokens-per-second, success rate). Third: cost metrics (cost-per-query, monthly burn, cost-per-user). Fourth: satisfaction metrics (user rating, thumb-up/down rate, escalation rate to humans).
For RAG pipelines specifically a four-metric suite has become standard, codified by Ragas (the leading RAG eval framework): faithfulness (answer supported by source?), answer relevancy (answer addresses the question?), context precision (are relevant chunks in the top-k retrieved?), context recall (were all relevant chunks found?).
For SME fiduciary and law setups as of May 2026 we recommend a dashboard with 8-12 KPIs. More than that overwhelms management. Fewer lets too much drift go undetected. The selection depends on the use case: a receipt capture system has different mandatory KPIs than a client response bot.
Why it matters
Without KPIs, AI is a black box. Employees occasionally complain that answers "got worse", but no one can say whether it is imagination or real drift. With a dashboard you see: faithfulness fell from 0.91 to 0.84 since last week's model update. That is fact, not gut feel.
For management and the board, KPIs are the common language. A quarterly report with "faithfulness 0.91, latency P95 2.1s, cost-per-query 0.014 CHF, NPS 42" gives the board something concrete to discuss. "The AI runs" does not. Anyone needing to justify AI investments needs KPIs.
On the regulatory side: the EU AI Act requires post-market monitoring with documented metrics for high-risk systems (Art. 72; Art. 17 requires the quality management system to include it). FADP Art. 24 requires breach notification – KPI dashboards with anomaly alarms are the technical base. FINMA Operational Resilience Circulars require continuous performance measurement for critical applications.
Commercially: clients and principals increasingly ask for KPI reports on AI tools. A Swiss fiduciary chamber added "AI quality KPI documentation" to its professional standard as expected practice in 2025. Without it, mandates go to competitors who have it.
Internally, KPIs provide the discussion basis for optimisation. Instead of guessing whether a prompt tweak or a model switch is better, you see the KPI movement after 7 days of observation. That makes AI engineering data-driven instead of opinion-driven.
How it works – mandatory KPIs May 2026
Faithfulness (RAG mandatory). Measures whether every statement in the answer is supported by the supplied sources. Computed via Ragas or DeepEval: LLM judge decomposes the answer into atomic statements and checks each against retrieval chunks. Score between 0 and 1. Fiduciary target: > 0.90.
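The scoring step can be illustrated with a minimal sketch. It assumes the answer has already been decomposed into atomic statements, and it replaces the LLM judge with a hypothetical `toy_judge` that only does substring matching. The function names are illustrative and are not the Ragas or DeepEval API.

```python
from typing import Callable, Sequence


def faithfulness_score(
    statements: Sequence[str],
    chunks: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Share of atomic statements that the judge finds supported by the chunks."""
    if not statements:
        raise ValueError("need at least one statement to score")
    supported = sum(1 for s in statements if is_supported(s, chunks))
    return supported / len(statements)


def toy_judge(statement: str, chunks: Sequence[str]) -> bool:
    """Hypothetical stand-in for the LLM judge: naive substring match."""
    return any(statement.lower() in c.lower() for c in chunks)


chunks = ["The VAT rate is 8.1%. The filing deadline is 30 June."]
statements = ["the VAT rate is 8.1%", "the filing deadline is 31 July"]
score = faithfulness_score(statements, chunks, toy_judge)
print(f"faithfulness = {score:.2f}")  # one of two statements is supported
```

In a real pipeline the judge callable would wrap a model call; keeping it injectable makes the arithmetic testable without one.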
Answer relevancy. Measures how well the answer addresses the question. LLM judge generates hypothetical questions from the answer and compares to the real question (embedding similarity). Score 0 to 1. Target: > 0.85.
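The comparison step can be sketched as the mean cosine similarity between the real question and the questions generated from the answer. The embedding vectors below are made-up two-dimensional placeholders; in practice they would come from whichever embedding model the pipeline uses, which this sketch does not assume.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


def answer_relevancy(
    question_vec: Sequence[float],
    generated_question_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between the real question and the generated ones."""
    if not generated_question_vecs:
        raise ValueError("need at least one generated question")
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Placeholder embeddings: one generated question matches, one is unrelated.
relevancy = answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(f"answer relevancy = {relevancy:.2f}")
```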
Context precision. Checks what share of top-k chunks was actually relevant to the question. High score = little "noise" in retrieval. Target: > 0.75.
Context recall. Checks whether the truly needed chunks were among top-k – compared against the golden dataset containing the correct answer plus source chunks. Target: > 0.85.
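Both retrieval metrics can be approximated with chunk IDs alone once a golden dataset exists. The sketch below follows the definitions given here (share of retrieved chunks that are relevant, share of needed chunks that were retrieved). Evaluation frameworks typically compute these with an LLM judge and may take ranking into account, so treat this as a simplified approximation; the chunk IDs are invented.

```python
from typing import Sequence


def context_precision(retrieved_ids: Sequence[str], relevant_ids: Sequence[str]) -> float:
    """Share of the retrieved top-k chunks that the golden dataset marks as relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for cid in retrieved_ids if cid in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids: Sequence[str], relevant_ids: Sequence[str]) -> float:
    """Share of the golden source chunks that appear among the retrieved top-k."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("golden dataset entry has no source chunks")
    return len(relevant & set(retrieved_ids)) / len(relevant)


retrieved = ["c1", "c7", "c3", "c9"]   # top-4 from the retriever
golden = ["c1", "c3", "c4"]            # chunks the correct answer needs
precision = context_precision(retrieved, golden)
recall = context_recall(retrieved, golden)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this example half of the retrieved chunks are noise, and one needed chunk (`c4`) was missed, so both values fall below the targets above.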
Hallucination rate. Share of answers with at least one invented statement. Measured via FActScore or via citation grounding check. Fiduciary target: < 3%. Marketing tolerance: < 10%.
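Unlike faithfulness, which averages over statements, this rate counts whole answers. A minimal sketch, assuming an upstream check has already produced the number of unsupported statements per answer (the sample counts are invented):

```python
from typing import Sequence


def hallucination_rate(unsupported_counts: Sequence[int]) -> float:
    """Share of answers that contain at least one unsupported statement."""
    if not unsupported_counts:
        raise ValueError("need at least one answer")
    flagged = sum(1 for n in unsupported_counts if n > 0)
    return flagged / len(unsupported_counts)


# 100 sampled answers, three of which contain invented statements.
counts = [0] * 97 + [1, 2, 1]
rate = hallucination_rate(counts)
print(f"hallucination rate = {rate:.1%}")
```

An answer with two invented statements counts once, which is why this metric can stay low while faithfulness drops, and vice versa.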
Latency P50 and P95. Median and 95th percentile response time. P50 shows the typical experience, P95 shows the "bad day". Fiduciary target: P50 < 2s, P95 < 6s. Stricter for client chat bots: P50 < 1s, P95 < 3s.
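As an illustration, both percentiles can be computed from raw response times with the nearest-rank method. This is one of several percentile definitions; monitoring tools may interpolate and report slightly different values. The sample latencies are invented.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


latencies_s = [0.8, 1.1, 1.2, 1.3, 1.4, 1.5, 1.7, 2.0, 2.6, 7.2]
p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
print(f"P50 = {p50}s, P95 = {p95}s")
```

Here the typical experience meets the 2s target, while a single slow request pushes P95 past 6s. That is exactly the kind of "bad day" the median hides.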
Cost-per-query. Token cost per query, including retrieval embedding plus LLM call plus judge model. As of May 2026 a typical RAG pipeline runs USD 0.005-0.030. Tracking via OpenLLMetry or your own telemetry.
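The arithmetic behind cost-per-query can be sketched as follows. All prices and token counts are placeholder assumptions chosen for illustration, not quotes from any provider; substitute the rates from your own contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Price in USD per one million tokens (assumed values, see below)."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call in USD."""
    return (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


# Hypothetical price list for the three components of one query.
EMBEDDING = Price(input_per_mtok=0.02, output_per_mtok=0.0)
LLM = Price(input_per_mtok=3.0, output_per_mtok=15.0)
JUDGE = Price(input_per_mtok=0.5, output_per_mtok=1.5)

cost_per_query = (
    call_cost(EMBEDDING, 50, 0)      # embed the user question
    + call_cost(LLM, 3000, 400)      # prompt with retrieved chunks, answer
    + call_cost(JUDGE, 3500, 200)    # judge reads prompt and answer
)
print(f"cost per query = USD {cost_per_query:.4f}")
```

With these assumed numbers the LLM call dominates and the total lands at roughly USD 0.017, inside the typical range quoted above.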
Throughput. Queries per second, hour or day – important for capacity planning. For larger fiduciary setups: 100-500 queries/day normal, 5000+ for automated receipt capture.
User satisfaction (thumbs up/down). The simplest UX metric: thumbs-down rate. Acceptable mark: under 8%. Over 15% is an alarm signal.
Escalation rate. Share of queries that had to be escalated to humans (AI refusal, confidence too low, user request). High rate = AI pipeline insufficient; low rate (< 5%) on critical use cases is suspicious (AI is answering where it should escalate).
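The two satisfaction signals above translate into simple status checks. In this sketch the thresholds come from the text, while the status labels ("ok", "watch", "alarm", "suspiciously low") and the function names are illustrative assumptions.

```python
def thumbs_down_status(thumbs_up: int, thumbs_down: int) -> str:
    """Classify the thumbs-down rate: under 8% is fine, over 15% is an alarm."""
    total = thumbs_up + thumbs_down
    if total == 0:
        return "no data"
    rate = thumbs_down / total
    if rate > 0.15:
        return "alarm"
    if rate < 0.08:
        return "ok"
    return "watch"  # assumed label for the band between the two thresholds


def escalation_status(escalated: int, total_queries: int, critical: bool) -> str:
    """Flag escalation rates below 5% on critical use cases as suspicious."""
    if total_queries == 0:
        return "no data"
    rate = escalated / total_queries
    if critical and rate < 0.05:
        return "suspiciously low"
    return "ok"


print(thumbs_down_status(thumbs_up=90, thumbs_down=10))
print(escalation_status(escalated=2, total_queries=100, critical=True))
```

What counts as a "high" escalation rate is not defined above, so the sketch leaves that upper bound to the team that owns the use case.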
Drift indicator. Mean of quality metrics over a 7-day window vs. a 30-day window. A drop of more than 3% triggers a drift alarm to engineering.
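A minimal sketch of the drift indicator, assuming one aggregated quality score per day and reading the 3% threshold as a relative change against the 30-day mean (the text does not say whether it is relative or in absolute points):

```python
from statistics import mean
from typing import Sequence, Tuple


def drift(
    daily_scores: Sequence[float],
    short_window: int = 7,
    long_window: int = 30,
    max_drop: float = 0.03,
) -> Tuple[float, bool]:
    """Return (relative change of short vs. long window mean, alarm flag)."""
    if len(daily_scores) < long_window:
        raise ValueError("not enough history for the long window")
    short_mean = mean(daily_scores[-short_window:])
    long_mean = mean(daily_scores[-long_window:])
    change = (short_mean - long_mean) / long_mean
    return change, change < -max_drop


# 23 stable days at 0.91, then a week at 0.84 after a model update.
history = [0.91] * 23 + [0.84] * 7
change, alarm = drift(history)
print(f"change = {change:.1%}, alarm = {alarm}")
```

Because the 7-day window is itself part of the 30-day window, the indicator reacts more slowly than a comparison against the preceding 23 days would.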
Tooling May 2026. Ragas for RAG-specific metrics. DeepEval for general LLM evaluation. Arize Phoenix or TruLens for live dashboards. Grafana with Prometheus for custom telemetry. Fiduciary SMEs with < 5000 queries/day do well with Grafana plus weekly Ragas reports.
Build a KPI dashboard in 6 steps
- 01 Pick use-case-specific KPIs: 8-12 metrics, at least 3 quality, 2 performance, 1 cost, 2 user satisfaction.
- 02 Define target thresholds: faithfulness > 0.90, P95 latency < 6s, hallucination rate < 3%, thumbs-down < 8%.
- 03 Instrument telemetry: OpenLLMetry or your own wrapper around LLM calls – collects latency, tokens, cost, output.
- 04 Set up the eval pipeline: Ragas/DeepEval runs daily against the golden dataset, writes results to DB.
- 05 Build the dashboard: Grafana (SME) or Phoenix/TruLens (larger) – all KPIs visualised, drill-down per query possible.
- 06 Wire alerting: Slack/Teams webhook on drift > 3%, P95 latency > target, hallucination spike > 5%.
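The threshold and alerting steps above can be sketched as a table of targets plus a check that returns one message per breach. The metric keys and the message format are illustrative assumptions; posting the messages to a Slack or Teams webhook is left out because it depends on your setup.

```python
import operator
from typing import Callable, Dict, List, Tuple

# metric name -> (comparison that must hold, target value)
TARGETS: Dict[str, Tuple[Callable[[float, float], bool], float]] = {
    "faithfulness": (operator.gt, 0.90),
    "p95_latency_s": (operator.lt, 6.0),
    "hallucination_rate": (operator.lt, 0.03),
    "thumbs_down_rate": (operator.lt, 0.08),
}


def breaches(snapshot: Dict[str, float]) -> List[str]:
    """Return one alert message per metric that misses its target."""
    messages = []
    for name, (holds, target) in TARGETS.items():
        if name not in snapshot:
            continue  # metric not collected for this pipeline
        value = snapshot[name]
        if not holds(value, target):
            messages.append(f"{name}={value} misses target {target}")
    return messages


today = {
    "faithfulness": 0.84,
    "p95_latency_s": 2.1,
    "hallucination_rate": 0.01,
    "thumbs_down_rate": 0.05,
}
for message in breaches(today):
    print(message)
```

Keeping the targets in one data structure means the dashboard, the daily eval job and the alerting all read the same numbers.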
When a KPI dashboard is mandatory
For every production AI pipeline with external addressees (client, authority, customer) KPI tracking is mandatory. Concretely:
Receipt capture pipelines: faithfulness, field accuracy, cost-per-receipt, throughput per day.
Client response bots: answer relevancy, hallucination rate, user satisfaction (thumbs), escalation rate, latency.
RAG knowledge search: all four Ragas metrics plus user satisfaction.
Classification pipelines (dunning tier, lead score): per-class accuracy, disparate impact per subgroup, confidence distribution.
For EU AI Act high-risk systems: KPI reporting in a form that the people responsible for oversight can actually use.
For FINMA-relevant applications: continuous performance monitoring per Circular 2024/4.
Minimum setup for SME fiduciary: Grafana + Prometheus + weekly Ragas reports + Slack/Teams alerts on anomalies. Effort: 5-10 days initial, then 0.5 days per week maintenance.
Larger setups (law firms, above 50 staff, multiple parallel pipelines): Arize Phoenix or TruLens Hosted. Cost USD 500-2000/month, in exchange for 24/7 dashboards plus alerting.
When minimal effort is enough
For pilot projects below 4 weeks, a full dashboard is overhead. A simple CSV file with "date, queries, fails, avg latency" and a weekly Slack update suffices.
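Such a pilot log needs nothing beyond the standard library. The sketch below appends one row per day and writes the header only when the file is new; the column names follow the list above, and the file name in the usage note is an arbitrary example.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["date", "queries", "fails", "avg_latency_s"]


def append_daily_row(
    path: Path, day: date, queries: int, fails: int, avg_latency_s: float
) -> None:
    """Append one day's numbers to the CSV log, creating it with a header if needed."""
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {
                "date": day.isoformat(),
                "queries": queries,
                "fails": fails,
                "avg_latency_s": avg_latency_s,
            }
        )
```

A call such as `append_daily_row(Path("kpi_log.csv"), date.today(), 120, 3, 1.4)` at the end of each day is enough to feed the weekly Slack update.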
For purely internal tools with fewer than 50 queries/week (e.g. internal knowledge lookup used only by senior staff), daily tracking is excessive. A monthly sample report by an engineer suffices.
For brainstorming tools, slogan generators or other "inspiration" use cases, most KPIs are irrelevant. What matters here is user satisfaction and throughput, not faithfulness.
Pitfall: "we will build the dashboard later". That does not happen. Anyone going live without KPI tracking flies blind. Build at least the basic metrics (throughput, latency, error rate) before production.
Another pitfall: too many KPIs. A dashboard with 30 metrics is not read. 8-12 focused KPIs are more effective than 30 "comprehensive" ones. Each month, drop whatever no one looks at.
Cost point: a full-stack dashboard (Phoenix/TruLens/own Grafana) costs 5-10 days initial plus 0.5-1 day/week maintenance. For very small pipelines (< 50 queries/day) that does not pay off: a central log file and a quarterly reviewer run do the job there.
Trade-offs
STRENGTHS
- Makes AI quality measurable – discussion on facts instead of opinion
- Meets EU AI Act Art. 17 (post-market monitoring) technically
- Anomaly alerts shorten detection time from weeks to hours
- Cost-per-query tracking exposes hidden cost bombs (oversized prompts, wrong model choice)
- Trend data over time provides improvement levers – data-driven engineering
WEAKNESSES
- Initial effort 5-10 days plus ongoing 0.5-1 day/week maintenance
- Hosted platforms cost USD 500-2500/month – often overkill for very small teams
- Eval pipelines add 5-15% in token cost
- Too many KPIs cause dashboard fatigue – 8-12 focused ones are better
- False-positive alerts (e.g. after model warm-up) need calibration
FAQ
How do I realistically measure user satisfaction?
Combine three mechanisms. First: thumbs up/down after every answer (typical response rate 5-15%, focused on negative outliers). Second: quarterly NPS survey of active users. Third: escalation rate (if users frequently click "actually ask a human", the AI is not good enough). Aggregated this gives a reliable user picture.
What does a KPI dashboard cost on an ongoing basis?
Open-source stack (Grafana + Prometheus + Ragas) self-hosted: only server cost about CHF 50-100/month plus 0.5-1 engineer day/week for maintenance. Hosted (Phoenix Pro, TruLens Snowflake, Arize Cloud): USD 500-2500/month by volume. Eval token cost: about 5-15% of production token cost since judges add calls.
How do I react to a KPI alert?
Three tiers. Tier 1 (small deviation): handle in engineering standup, localise cause (model update? data drift? bug?). Tier 2 (larger deviation, > 5%): fix or rollback within 24h. Tier 3 (critical, e.g. hallucination rate > 10% or data leak signal): take pipeline offline + incident response + check FADP notification duty. An escalation playbook with all tiers is mandatory documentation.
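The tier logic can be written down as a small function so that the playbook and the alerting code agree. This sketch assumes `deviation` is the relative shortfall of a KPI against its target, which is one possible reading of "deviation"; the parameter names are illustrative.

```python
def alert_tier(
    deviation: float,
    hallucination_rate: float,
    data_leak_signal: bool = False,
) -> int:
    """Map an alert to tier 1 (standup), 2 (fix within 24h) or 3 (take offline)."""
    if data_leak_signal or hallucination_rate > 0.10:
        return 3
    if deviation > 0.05:
        return 2
    return 1


print(alert_tier(deviation=0.02, hallucination_rate=0.01))
print(alert_tier(deviation=0.08, hallucination_rate=0.02))
print(alert_tier(deviation=0.01, hallucination_rate=0.12))
```

The critical conditions are checked first, so a data leak signal always lands in tier 3 regardless of how small the KPI deviation is.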
Which KPIs matter for management?
Four are enough: (1) user satisfaction (up or down?), (2) cost-per-query times throughput = monthly AI cost, (3) escalation rate (how many queries still need humans?), (4) drift status (is the AI stable or are there problems?). Engineering detail KPIs (faithfulness, P95 latency) belong in the engineering dashboard, not the management cockpit.
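The second management KPI is plain multiplication. As a worked example, taking the 0.018 CHF per query from the introduction, an assumed volume of 300 queries per day and an assumed 30-day month:

```python
def monthly_cost(cost_per_query: float, queries_per_day: int, days: int = 30) -> float:
    """Monthly AI cost as cost-per-query times throughput."""
    return cost_per_query * queries_per_day * days


cost_chf = monthly_cost(cost_per_query=0.018, queries_per_day=300)
print(f"monthly AI cost = CHF {cost_chf:.2f}")
```

With these inputs the result is about CHF 162 per month, a figure management can track without knowing anything about tokens.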
Sources
- Ragas – RAG evaluation framework (faithfulness, answer relevancy, context precision/recall) · 2026-05
- Arize Phoenix – LLM observability and KPI dashboards · 2026-05
- TruLens – production tracing and evaluation · 2026-04
- OpenLLMetry – OpenTelemetry-compatible LLM telemetry · 2026-03
- EU AI Act, Article 17 – Post-Market Monitoring · 2024-07