BIAS & FAIRNESS · AI CONCEPT
Bias and fairness audits for AI: Swiss equality law, EU AI Act Art. 10, BBQ and StereoSet
How SMEs measure bias in LLM outputs: Swiss equality law, EU AI Act Art. 10, BBQ benchmark, StereoSet, CrowS-Pairs and a fiduciary example on language accent.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What are bias and fairness audits?
Bias and fairness audits check whether an AI system treats people or groups systematically differently – based on characteristics like gender, age, language, origin, social status. "Bias" is the statistical skew in output, "fairness" the normative requirement that this skew does not lead to unlawful discrimination.
Bias sources in LLM pipelines are diverse. First: training data – if the 2023 web was dominantly male-Western-English-speaking, the model internalises matching patterns. Second: prompt design – phrasings can amplify expectations ("a successful CEO" is more often described as male). Third: retrieval setup in RAG – if indexed documents are predominantly German-language, French-speaking clients are disadvantaged. Fourth: decision threshold – a classification threshold well calibrated for one group can produce misclassifications for another.
As of May 2026 several established bias benchmarks are available. BBQ (Bias Benchmark for QA) from NYU tests social bias in QA settings across nine demographic categories. StereoSet tests stereotype susceptibility on gender, profession, race and religion. CrowS-Pairs (Crowdsourced Stereotype Pairs) measures a model's preference for stereotype-conforming versus anti-stereotypical sentences.
For Swiss practice these tests are a starting point, not an end. Swiss equality law (Const. Art. 8, Equal Opportunities Act) and EU AI Act Art. 10 (data quality, "free of biases") set the legal frame. Concrete application tests must be language- and cultural-region-specific.
Why it matters
Swiss law sets clear limits. Const. Art. 8 prohibits discrimination by origin, race, gender, age, language, social position, lifestyle, religious, ideological or political conviction, or physical, mental or psychological disability. The Equal Opportunities Act (GlG) makes this concrete for employment. Under FADP Art. 21 on automated individual decisions, the affected person has a right to reasoning – algorithmic discrimination must be disclosable and correctable.
EU AI Act Art. 10 requires for high-risk systems that training, validation and test data are "free of biases" to the extent technically achievable. Bias-test methodology must be presentable to oversight.
The fiduciary use case: client routing. If the AI auto-routes incoming client requests to case workers, sorting by "language quality", that can be discriminatory – a client with a Ticino Italian accent in a German-language email is systematically classified as "complex" and forwarded to junior staff, while a Zurich-German client goes to senior advisors. That is both legally problematic (Const. Art. 8 – language) and commercially (mandates are treated unequally).
Commercially fairness is a trust signal. Clients, employees and supervisors increasingly expect documented fairness tests. Without that you lose mandates to competitors who have them.
How it works – methods and benchmarks
BBQ (Bias Benchmark for QA). Parrish et al. 2022 (NYU). Test set with 58,000 questions across nine demographic categories (age, disability, gender identity, nationality, physical appearance, race/ethnicity, religious preference, social status, sexual orientation). Each question has two variants: ambiguous (insufficient info, correct answer = "don't know") and disambiguated (sufficient). Bias is measured when the model gives stereotype-conforming answers in the ambiguous case. Current frontier models reach very low bias scores near 0 (the goal) today, while older and open models sit considerably higher.
StereoSet. Nadeem et al. 2021 (Microsoft / Carnegie Mellon). Multiple-choice: given a context sentence, three possible completions: stereotype-conforming, anti-stereotype, unrelated. Measures whether the model prefers stereotype-conforming or anti-stereotypical completions – either is a bias signal. Ideal value: 50/50 between stereotype and anti-stereotype (model has no systematic preference).
CrowS-Pairs. Nangia et al. 2020. Crowdsourced Stereotype Pairs. 1500 sentences, each as a pair (e.g. "He works as a doctor" / "She works as a doctor") – bias score = share of model preferences for the stereotype pair. Simpler than StereoSet, hence often used as a quick check.
HELM (Holistic Evaluation of Language Models). Stanford CRFM, continuously updated. Aggregates several bias benchmarks (including BBQ) plus performance tests in a standardised framework. As of May 2026 the reference for academically rigorous bias assessment.
Custom bias tests for the Swiss context. Generic benchmarks are often English-only. For DE-CH applications we build custom tests covering Swiss realities: does the AI treat a "Frau Müller" differently from a "Herrn Müller"? Are queries with French vocabulary in German emails answered with equal quality? Does the AI classify dialect-tinted requests as "less serious"? Such tests are built as an extension of the golden dataset – with 20-50 cases per axis.
Disparate impact analysis. For classification pipelines: check threshold per subgroup. If the AI hits 90% accuracy for one group and 70% for another, that is disparate impact. With legitimate performance differences (e.g. less training data for a language), this must be openly reported and compensated – either with additional training or with human-review support for the less-covered group.
Mitigations. When bias is detected: (a) prompt tuning ("treat all queries equally regardless of ..."), (b) output filter with bias detector, (c) model switch to less biased model, (d) additional training data for the underrepresented group, (e) human-in-the-loop escalation for suspicious cases.
Bias audit in 6 steps
- 01Identify risk axes: for which attributes (gender, language, age, region) is bias likely problematic?
- 02Run standard benchmarks: BBQ, StereoSet, CrowS-Pairs on the deployed model – measure baseline.
- 03Build custom Swiss tests: 20-50 cases per axis, language- and cultural-region-specific.
- 04Disparate impact analysis on real application data: accuracy, latency, quality per subgroup.
- 05Prioritise and mitigate findings: prompt tuning, output filter, model switch, human review backstop.
- 06Quarterly repetition, oversight report annually or on substantial pipeline change.
When bias audits are mandatory
For automated individual decisions under FADP Art. 21 – always. When the AI decides on mandate acceptance, lead scoring, credit score or HR pre-selection, fairness testing is mandatory.
For applications on the EU AI Act high-risk list (Annex III) – always. Justice, HR, education, credit, insurance, critical infrastructure. EU AI Act Art. 10 is mandatory here.
For client routing and client-response pipelines in fiduciary and law firms – recommended even without hard regulatory pressure. Reputation damage on detected discrimination is high.
For multilingual setups (DE/FR/IT/EN) – always. Language disparity is the most common bias class in Swiss applications. What hits 92% accuracy in German often only reaches 78% in Suisse-Romand French.
On model switches – bias profile changes. Different models and generations have different bias patterns. Test before migration.
For purely internal tools with no external decision effect (brainstorming, knowledge lookup), an annual check suffices.
When the effort is lower
For purely generative use cases without decision impact (slogan generator, image concept sketches, internal note suggestions), a full bias audit suite is overkill. An annual quick check with CrowS-Pairs on the deployed model suffices.
For pipelines whose output goes exclusively to humans with full review competence (senior lawyer, senior fiduciary) without further automation: a one-off initial audit, then annually. The human is the fairness backstop.
Beware: bias mitigations can introduce new bias. A prompt instruction "treat all queries equally" can cause the model to drop legitimate differentiation (e.g. for legal complexity) and lose quality. Bias mitigation must always be evaluated measurably, not just "patched on".
Benchmarks like BBQ are not fully transferable. They are English-language and US-demographic. For Swiss use cases they are a hint, not proof. Building custom tests is mandatory.
Trade-offs
STRENGTHS
- Meets Const. Art. 8, GlG, FADP Art. 21 and EU AI Act Art. 10
- Detects subtle discrimination that manual review misses
- Quarterly repetition documents ongoing diligence
- Custom Swiss tests cover DE-FR-IT specifics that generic benchmarks ignore
- Disparate impact analysis provides concrete improvement levers
WEAKNESSES
- Initial audit 5-12 days + CHF 7-18k for a mid-size pipeline
- Generic benchmarks (BBQ) are English-oriented and only partly transferable
- Fairness vs accuracy tradeoff: mitigation can slightly lower quality
- Bias profiles change with every model update – maintenance needed
- Full bias elimination is impossible – residual risk remains
FAQ
What does a full bias audit cost?
Initial audit for a mid-size pipeline: 5-12 days of engineer and domain expert time, around CHF 7,000-18,000. Ongoing quarterly re-audits: 1-2 days. Tool cost: BBQ, StereoSet, CrowS-Pairs are open source and free. Token cost for a full scan on a commercial model: USD 50-200 per run.
Which model is least biased?
On the BBQ aggregate, current frontier models (latest generation from OpenAI, Anthropic, Google, Mistral) reach very low bias scores near 0 today, with only small gaps between them. Older and open models sit considerably higher. Concrete score values shift with every model generation – check the current HELM/BBQ leaderboard before choosing. Note: BBQ measures English-language bias patterns – for Swiss use cases custom tests must complement.
How do I concretely check language accent bias?
Example client routing pipeline. Take 50 real client emails and create two variants each: standard German and with dialect or foreign-language admixtures (Swiss-German vocabulary, French borrowings, Italian phrases). Identical content, only language presentation differs. Run both variants through the pipeline and compare: same classification? Same priority? Same case-worker assignment? Differences above 5% are bias suspicion.
What if bias mitigation lowers quality?
Classic fairness-vs-accuracy tradeoff. Three options: (a) accept the small quality drop as the price of fairness (mandatory for high-risk systems, Art. 10 EU AI Act); (b) additional data for the under-represented group, so fairness and quality both rise; (c) hybrid pipeline with group-specific thresholds plus transparent documentation. A 1-2% quality drop against an anti-discrimination duty is always legally required.
Related topics
Sources
- Parrish et al. – BBQ: A Hand-Built Bias Benchmark for Question Answering · 2022-05
- Nadeem et al. – StereoSet: Measuring stereotypical bias in pretrained language models · 2021-08
- Nangia et al. – CrowS-Pairs: A Challenge Dataset for Measuring Social Biases · 2020-11
- Stanford CRFM – HELM (Holistic Evaluation of Language Models) · 2026-04
- EU AI Act, Article 10 – Data and Data Governance · 2024-07
- EDÖB – Leitfaden zu automatisierten Einzelentscheidungen · 2025-11