Anonymisation and pseudonymisation: Presidio, Privacera, k-anonymity, differential privacy

Tools and techniques as of May 2026 for Swiss-DSG-compliant PII removal before LLM processing: Microsoft Presidio, Privacera, Anonymizer, k-anonymity and differential privacy compared.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is the difference?

Anonymisation and pseudonymisation sound similar but are legally very different. Pseudonymisation replaces an identifier (e.g. client name) with a pseudonym ("client 123"); the mapping table is stored separately. With that table, identity can be restored. Pseudonymised data remain personal data under the Swiss data protection act (revDSG, Art. 5 lit. c) and GDPR (Art. 4 No. 5) - they enjoy the same protection as the original.

Anonymisation permanently removes the link to the person; even with additional knowledge, re-identification is no longer possible. Anonymous data fall outside the scope of data protection.

This matters for AI pipelines: anyone sending client mails through Microsoft Presidio with replacement and believing they are now anonymous is mistaken. Replacing "Bachmann" with "person_42" is pseudonymisation. Real anonymisation needs much more: deletion instead of replacement, generalisation of quasi-identifiers (date to month-only, postcode to 2 digits), and mathematical guarantees against re-identification (k-anonymity, l-diversity, differential privacy).
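The deletion and generalisation steps named above can be sketched in a few lines. This is a minimal illustration, not a complete anonymisation routine: the field names and the function `generalise_record` are hypothetical, and generalisation alone gives no guarantee until the result is checked against a criterion such as k-anonymity.

```python
from datetime import date


def generalise_record(record: dict) -> dict:
    """Drop the direct identifier and coarsen the quasi-identifiers."""
    out = dict(record)                          # leave the caller's record untouched
    out.pop("name", None)                       # deletion, not replacement
    dob = out.pop("date_of_birth")
    out["birth_month"] = dob.strftime("%Y-%m")  # date -> month only
    out["postcode"] = out["postcode"][:2]       # postcode -> first 2 digits
    return out


# Hypothetical client record for illustration only.
client = {
    "name": "Bachmann",
    "date_of_birth": date(1958, 3, 12),
    "postcode": "8001",
    "profession": "notary",
}
print(generalise_record(client))
```

The profession is left untouched here on purpose: whether it also needs coarsening depends on how many people share the remaining combination.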

As of May 2026 the tool landscape is settled. Microsoft Presidio (MIT) is the open-source standard for PII detection and replacement. Privacera is enterprise cloud with a connector library for databases, data lakes and LLM gateways. Anonymizer (open source, May 2026 version 2.0) specialises in pre-LLM pipelines. For mathematically guaranteed anonymisation: ARX (open source, Java) for k-anonymity/l-diversity, OpenDP and Tumult Analytics for differential privacy.

Why it matters

Without clean PII handling, three risk scenarios loom.

First: cloud LLM providers see plaintext client data. Anyone sending a client mail to Claude or GPT-4 without masking names, addresses, AHV numbers and bank details hands these to a US provider. Without a data processing agreement with standard contractual clauses, a transfer impact assessment, and consent, this violates the Swiss DSG rules on cross-border disclosure.

Second: breach of professional secrecy. Lawyer correspondence and medical data (both Art. 321 SCC) and banking secrecy (BankA Art. 47) are protected under criminal law. A single leak can mean a professional ban. FINMA, the financial market supervisor, clarified in a November 2024 circular: AI use without adequate data protection is a supervisory violation.

Third: training leak. So-called opt-out agreements with cloud providers guarantee that inputs do not enter training, but they do not rule out caching, logs or internal audits. PII running through an LLM pipeline often appears in provider logs; if those logs leak (see Anthropic, April 2024), the information leaves the protection of client secrecy.

Anonymisation solves the problem by keeping critical data out of the cloud entirely. Pseudonymisation moves the problem by making re-identification controllable. Which fits depends on the use case. For client-inquiry triage (internal, no output to clients): pseudonymisation suffices. For an analytics pipeline (statistics across client groups): anonymisation is mandatory.

How it works

Microsoft Presidio (MIT): two components. Analyzer detects PII via regex, NER models (spaCy, Stanza, Flair) and custom recognisers. Anonymizer replaces, hashes, redacts or encrypts the findings. Out-of-the-box detection for 30+ PII types including EU-specific (IBAN, EU ID), Swiss extensions via custom recognisers (AHV number 13 digits, VAT number with CHE prefix). Open source, runs locally, default recommendation for Swiss firms.
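The two-stage split (detect first, then apply an operator to each finding) can be illustrated without any dependency. The sketch below uses only the Python standard library and is not Presidio's API: `Finding`, `RECOGNISERS`, `analyse` and `anonymise` are hypothetical names, and the two regexes check the format only.

```python
import hashlib
import re
from typing import NamedTuple


class Finding(NamedTuple):
    entity: str
    start: int
    end: int


# Hypothetical format-only recognisers (no check-digit validation).
RECOGNISERS = {
    "CH_AHV": re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b"),
    "CH_UID": re.compile(r"\bCHE-\d{3}\.\d{3}\.\d{3}\b"),
}


def analyse(text: str) -> list[Finding]:
    """Stage 1: return every recogniser match as a typed span."""
    findings = [
        Finding(entity, m.start(), m.end())
        for entity, pattern in RECOGNISERS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


def anonymise(text: str, findings: list[Finding], operator: str = "replace") -> str:
    """Stage 2: apply one operator to every finding."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        original = text[f.start:f.end]
        if operator == "replace":
            new = f"<{f.entity}>"
        elif operator == "redact":
            new = ""
        elif operator == "hash":
            new = hashlib.sha256(original.encode()).hexdigest()[:12]
        else:
            raise ValueError(f"unknown operator: {operator}")
        text = text[:f.start] + new + text[f.end:]
    return text


sample = "AHV 756.1234.5678.97, UID CHE-123.456.789."
print(anonymise(sample, analyse(sample)))
```

Note that an unsalted hash of a low-entropy identifier can be reversed by enumeration, so the "hash" operator should be treated as pseudonymisation at best.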

Privacera: cloud platform (with on-prem option), enterprise pricing. Connector library for Snowflake, Databricks, S3, LLM gateways. Useful for companies with a large data-lake landscape. Overkill for SME.

Anonymizer (open source, May 2026 v2.0): pre-LLM library focused on reverse pseudonymisation. Key material stays local; the LLM processes pseudonymised text; output is de-pseudonymised on return, i.e. the pseudonyms are mapped back to the original identifiers. Important for use cases where client identity must be restored in the output (a mail reply to "Mr Bachmann", not to "person_42").
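The round trip can be sketched as follows. This is an illustration of the idea under simple assumptions, not the Anonymizer library's API: the class `Pseudonymiser` is hypothetical, the names to mask are passed in explicitly instead of being detected, and the mapping lives in memory rather than in an encrypted store.

```python
import re


class Pseudonymiser:
    """Keeps the name <-> pseudonym mapping local and reversible."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def pseudonymise(self, text: str, names: list[str]) -> str:
        for name in names:
            if name not in self._forward:
                token = f"person_{len(self._forward) + 1}"
                self._forward[name] = token
                self._reverse[token] = name
            # Whole-word match so "Bach" does not hit inside "Bachmann".
            text = re.sub(rf"\b{re.escape(name)}\b", self._forward[name], text)
        return text

    def restore(self, text: str) -> str:
        # Match whole tokens so person_1 is never replaced inside person_12.
        return re.sub(
            r"\bperson_\d+\b",
            lambda m: self._reverse.get(m.group(0), m.group(0)),
            text,
        )


p = Pseudonymiser()
masked = p.pseudonymise("Dear Mr Bachmann, your file is ready.", ["Bachmann"])
llm_reply = "Reply to person_1 by Friday."  # stand-in for the model's output
print(masked)
print(p.restore(llm_reply))
```

Only `masked` would leave the machine; the mapping and the restored reply stay local.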

k-anonymity (ARX, Java OSS): aggregate datasets are generalised so every record shares quasi-identifier values with at least k-1 others. k=5 is a common standard. Example: postcode/date/profession combinations are coarsened so every combination matches at least 5 people.
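Checking the property is simpler than achieving it. The sketch below only measures k for an already generalised table; finding a good generalisation is the part that a tool like ARX automates. The function names and sample rows are hypothetical.

```python
from collections import Counter


def k_of(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def is_k_anonymous(records: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    return k_of(records, quasi_identifiers) >= k


# Hypothetical, already generalised rows.
rows = (
    [{"postcode": "80", "birth_decade": "1950s", "profession": "legal"}] * 5
    + [{"postcode": "30", "birth_decade": "1960s", "profession": "medical"}] * 2
)
qi = ["postcode", "birth_decade", "profession"]
print(k_of(rows, qi), is_k_anonymous(rows, qi, 5))
```

Here the second group has only two members, so the table is 2-anonymous and fails the k=5 target until that group is generalised further or suppressed.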

l-diversity and t-closeness are k-anonymity extensions that additionally protect against inference attacks via attribute distribution.
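Why k-anonymity alone can fall short is easiest to see with a group whose members all share the same sensitive value. The sketch below checks distinct l-diversity, the simplest variant; the function name and data are hypothetical.

```python
from collections import defaultdict


def l_of(records: list[dict], quasi_identifiers: list[str], sensitive: str) -> int:
    """Smallest number of distinct sensitive values in any equivalence class."""
    classes: dict[tuple, set] = defaultdict(set)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in classes.values())


# Hypothetical table: both groups are 3-anonymous on (postcode, birth_decade).
table = [
    {"postcode": "80", "birth_decade": "1950s", "diagnosis": "diabetes"},
    {"postcode": "80", "birth_decade": "1950s", "diagnosis": "diabetes"},
    {"postcode": "80", "birth_decade": "1950s", "diagnosis": "diabetes"},
    {"postcode": "30", "birth_decade": "1960s", "diagnosis": "asthma"},
    {"postcode": "30", "birth_decade": "1960s", "diagnosis": "diabetes"},
    {"postcode": "30", "birth_decade": "1960s", "diagnosis": "migraine"},
]
print(l_of(table, ["postcode", "birth_decade"], "diagnosis"))
```

The first group is 3-anonymous but only 1-diverse: anyone known to be in it has diabetes, with no need to single out a row.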

Differential privacy (OpenDP, Tumult Analytics): a mathematically provable guarantee. Per query, calibrated noise is added; the "privacy budget" epsilon bounds how much any single record can shift the probability of any answer (by at most a factor of e^epsilon, roughly 2.7 at epsilon = 1.0), so the presence of an individual record can be inferred from the answer only to that limited degree. Apple, Google and the US Census Bureau use DP in production. Mostly overkill for Swiss SMEs, mandatory for research projects with personal data.
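For a counting query the classic construction is the Laplace mechanism: add noise with scale sensitivity/epsilon, where the sensitivity of a count is 1. The following is a teaching sketch with hypothetical function names, not OpenDP or Tumult code; naive floating-point sampling like this has known weaknesses, which is one reason to rely on a vetted library in production.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count; one added or removed record changes the true count by at most 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)                       # fixed seed only to make the demo repeatable
ages = [34, 41, 58, 62, 67, 71]              # hypothetical data
print(dp_count(ages, lambda a: a >= 60, epsilon=1.0, rng=rng))
```

Each call spends epsilon from the budget; repeating the same query and averaging the answers erodes the protection, so the total epsilon across queries has to be tracked.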

In practice we combine Presidio for detection, Anonymizer for reverse pseudonymisation and ARX for analytics layers, with all PII-handling stages running locally. The cloud LLM sees only pseudonymised text; output is de-pseudonymised locally; analytics queries run on k-anonymised datasets.

PII protection workflow in 6 steps

01 Inventory: which PII types appear in the corpus (name, address, AHV, IBAN, date of birth, profession, VAT number)? Which are quasi-identifiers?
02 Classify the use case: pseudonymisation (plaintext output required) or anonymisation (statistical analysis)?
03 Detection setup: Presidio with Swiss custom recognisers (AHV, CHE VAT, canton abbreviations). Measure recall on a sample of 100 documents.
04 Replacement/generalisation: replacement for pseudonymisation; generalisation (date to month, postcode to 2 digits) for anonymisation.
05 Key custody: with pseudonymisation, store the mapping table in a separate encrypted DB (Postgres TDE or HashiCorp Vault).
06 Audit log: every pseudonymisation and de-pseudonymisation logged with timestamp, client ID and operator. Erasure removes mapping plus embeddings.
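The recall measurement in step 03 can be sketched as follows, assuming each sample document has hand-labelled gold spans and detector output as sets of (start, end, entity) tuples. The function name is hypothetical, and only exact span matches count as hits, which is a strict choice; partial-overlap scoring is a common alternative.

```python
Span = tuple[int, int, str]


def micro_recall(samples: list[tuple[set[Span], set[Span]]]) -> float:
    """Share of gold PII spans that the detector found, over all documents."""
    found = missed = 0
    for gold, detected in samples:
        found += len(gold & detected)
        missed += len(gold - detected)
    total = found + missed
    return found / total if total else 1.0


# Two hypothetical labelled documents: (gold spans, detected spans).
labelled = [
    ({(0, 8, "PERSON"), (20, 36, "CH_AHV")}, {(0, 8, "PERSON")}),
    (
        {(5, 20, "CH_UID"), (30, 34, "POSTCODE")},
        {(5, 20, "CH_UID"), (30, 34, "POSTCODE"), (40, 45, "PERSON")},
    ),
]
print(micro_recall(labelled))
```

The extra detection in the second document is a false positive; it lowers precision but not recall, and for PII protection a missed span is usually the costlier error.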

When to use what

Presidio plus Anonymizer (reverse pseudonymisation): default for any cloud LLM pipeline with client data. Plaintext stays local, the cloud sees only pseudonymised content.

k-anonymity via ARX: for analytics use cases (client statistics, industry analyses) when individual records will be aggregated anyway.

Differential privacy: for research projects, public reporting with personal data, or cross-border data exchange with provable privacy guarantees. Rarely relevant for fiduciary and legal work.

Privacera: for corporate structures with a data-lake architecture and compliance requirements that need centrally managed policies.

Full on-prem LLM processing (Ollama, vLLM): when client trust or professional secrecy require that plaintext PII never leaves the firm.

When not to use

For fully synthetic or publicly available data (e.g. legal texts, industry regulations): no anonymisation needed.

For on-prem LLM setups without any cloud touchpoint: pseudonymisation is optional because no external processors are involved. But audit log and access control remain important.

For use cases where client identity must appear in the answer context (e.g. contract drafting): pseudonymisation without a reverse function is unusable. Anonymizer with a local mapping table is the right choice.

For data without personal reference (e.g. building master data, cantonal tax rates): anonymisation is pointless.

Watch the quasi-identifier trap: redacting only the name often does not suffice. A client file with "date of birth 1958-03-12, postcode 8001, profession notary" is often re-identifiable despite the missing name (Latanya Sweeney k-anonymity paper, 2002). Quasi-identifiers need generalisation or deletion.

Trade-offs

STRENGTHS

Presidio: open source, runs locally, MIT licence, broad PII coverage
Anonymizer: reverse pseudonymisation keeps client context in the output
k-anonymity via ARX: mathematically guaranteed protection tier
Differential privacy: provable guarantee even with repeated queries

WEAKNESSES

Pseudonymisation alone does not take data out of DSG scope: it stays personal data
Quasi-identifier trap: redacting names is not enough
Swiss custom recognisers must be built (AHV, CHE VAT)
Differential privacy reduces data utility, tuning is laborious

FAQ

Is pseudonymisation enough for Swiss DSG?

Pseudonymised data remain personal data and stay subject to Swiss DSG. Pseudonymisation reduces risk but does not exempt you from purpose limitation, deletion duties, the duty to inform data subjects and processor agreements. Truly anonymised data (with quasi-identifier generalisation and k-anonymity) fall outside the scope of the DSG.

How do I detect Swiss-specific PII?

Presidio out of the box covers EU IBAN and EU IDs. For Swiss specifics, custom recognisers are needed: AHV number (756.xxxx.xxxx.xx, 13 digits with an EAN-13 check digit, i.e. modulo 10), VAT number (CHE-xxx.xxx.xxx), canton abbreviations, Swiss street patterns. We maintain an open collection of these recognisers in the fairlane.systems repository.
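A check-digit test cuts false positives from a format-only regex. The sketch below assumes the 13-digit AHV number carries a standard EAN-13 check digit (weights 1 and 3 alternating from the left, modulo 10); the function names are hypothetical and this is plain Python, not a Presidio recogniser.

```python
import re

# Format only: country prefix 756, dots optional.
AHV_PATTERN = re.compile(r"\b756\.?\d{4}\.?\d{4}\.?\d{2}\b")


def ean13_check_digit(first12: str) -> int:
    """EAN-13 check digit for the first 12 digits (weights 1, 3, 1, 3, ...)."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ahv(candidate: str) -> bool:
    digits = re.sub(r"\D", "", candidate)
    if len(digits) != 13 or not digits.startswith("756"):
        return False
    return ean13_check_digit(digits[:12]) == int(digits[12])


text = "Insured person 756.9217.0769.85, reference 756.9217.0769.84."
hits = [m.group(0) for m in AHV_PATTERN.finditer(text) if is_valid_ahv(m.group(0))]
print(hits)
```

In a real recogniser the same check could sit behind the regex as a validation step, so that reference numbers that merely look like AHV numbers are scored lower or dropped.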

What happens in a re-identification attack?

An attacker combines pseudonymised data with publicly available quasi-identifiers (date of birth, postcode, profession) to restore identity. Sweeney's study of 1990 US census data (published 2000) showed 87 percent of the US population to be uniquely identifiable by date of birth, 5-digit ZIP code and gender alone. Countermeasures: k-anonymity with k>=5, l-diversity on sensitive attributes.
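Such a linkage attack is little more than a join on the quasi-identifiers. The sketch below uses entirely fictional data and a hypothetical function name; it reports every pseudonym whose quasi-identifier combination matches exactly one entry in a public register.

```python
def link(pseudonymised: list[dict], register: list[dict], quasi_identifiers: list[str]) -> dict:
    """Map pseudonym -> name wherever the quasi-identifiers match a single person."""
    index: dict[tuple, list[str]] = {}
    for person in register:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for row in pseudonymised:
        candidates = index.get(tuple(row[q] for q in quasi_identifiers), [])
        if len(candidates) == 1:          # unique match means re-identified
            matches[row["pseudonym"]] = candidates[0]
    return matches


qi_fields = ["birth_date", "postcode", "profession"]
released = [
    {"pseudonym": "person_42", "birth_date": "1958-03-12", "postcode": "8001", "profession": "notary"},
    {"pseudonym": "person_43", "birth_date": "1970-01-01", "postcode": "3000", "profession": "teacher"},
]
public_register = [
    {"name": "A. Example", "birth_date": "1958-03-12", "postcode": "8001", "profession": "notary"},
    {"name": "B. Sample", "birth_date": "1970-01-01", "postcode": "3000", "profession": "teacher"},
    {"name": "C. Muster", "birth_date": "1970-01-01", "postcode": "3000", "profession": "teacher"},
]
print(link(released, public_register, qi_fields))
```

person_43 survives only because two register entries share the same combination, which is exactly the effect k-anonymity enforces for every record.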

How performant is Presidio at scale?

With a standard spaCy-based configuration (de_core_news_lg model for German text) Presidio processes about 50 to 100 pages per second on an 8 vCPU machine. At 100,000 pages per day a 16 vCPU server suffices. Very high volumes (millions of documents) parallelise via the Spark connector.
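As a rough plausibility check, the daily volume can be converted into pure processing time using the throughput figures quoted above. This is back-of-the-envelope arithmetic with a hypothetical helper; it ignores I/O, queueing and peak load, which are what the extra headroom is for.

```python
def batch_seconds(pages: int, pages_per_second: float) -> float:
    """Pure processing time for a batch at a constant throughput."""
    return pages / pages_per_second


DAILY_PAGES = 100_000
for rate in (50, 100):  # pages per second, the range quoted above
    minutes = batch_seconds(DAILY_PAGES, rate) / 60
    print(f"{rate} pages/s -> {minutes:.1f} min of processing per day")
```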

Sources

Microsoft Presidio - PII detection and anonymisation · 2026-05
ARX - data anonymisation tool (k-anonymity, l-diversity) · 2026-05
OpenDP - differential privacy library · 2026-05
EDÖB - guidance on anonymisation and pseudonymisation under revDSG · 2026-05
