AI audit trail design: what to log so an AI answer stays audit-ready

Which fields must be stored per LLM call so you stay clean under Art. 957a CO and professional secrecy – and with which tools.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is an AI audit trail?

An AI audit trail is a complete, tamper-evident record of every interaction between an employee, a language model, and the source data used. It answers four questions: who asked what and when? Which model answered? Which sources were drawn upon? Who approved the answer?

A classical IT audit trail (e.g. ERP system) gets by with user login plus action log. For AI use, that is not enough: the model is a probabilistic function whose output depends on prompt, model version, and extra context (RAG sources, tool calls). If an auditor in five years wants to retrace why an AI classified a booking as "cleaning supplies", you need all four building blocks: the original prompt, the model identity including version, the retrieval context, and the human approval.

The OpenTelemetry standard for GenAI moved out of experimental for client spans in early 2026 and defines uniform attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.response.id). Anyone building today should orient on this convention because it is independent of individual tool vendors.
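As a rough illustration of that convention, the sketch below maps one LLM call onto the gen_ai.* attribute names cited above. The function name, the example model string, and the response ID are hypothetical; only the attribute keys come from the text. It uses plain Python and does not depend on the OpenTelemetry SDK.

```python
from typing import Any


def genai_span_attributes(
    provider: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    response_id: str,
) -> dict[str, Any]:
    """Return one call's metadata keyed by the gen_ai.* attribute names."""
    return {
        "gen_ai.system": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.id": response_id,
    }


# Hypothetical values, for illustration only.
attrs = genai_span_attributes(
    provider="anthropic",
    model="example-model-20260101",
    input_tokens=812,
    output_tokens=164,
    response_id="resp_hypothetical_001",
)
```

With the OpenTelemetry Python SDK in place, a dict like this could be attached to the active span via span.set_attributes(attrs). Keeping the keys in one helper means a renamed attribute in a later version of the convention is a one-line change.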

Why it matters

Three legal anchors force clean record-keeping. First, Art. 957a CO and the Business Records Ordinance (GeBüV): anyone processing business-relevant facts with AI (bookings, tax filings, remuneration decisions) must keep those decision steps traceable for ten years. An AI without an audit trail is a black box and is not GeBüV-compliant.

Second, revFADP / revDSG: for automated individual decisions with significant impact on the data subject (Art. 21 revDSG), you must disclose the logic of the decision on request. Without a structured audit trail you cannot. Third, professional secrecy (Art. 321 SCC, BGFA, Medical Profession Act): when a protected professional uses AI, they must be able to prove that the data did not reach unauthorised parties – including staff at the model provider, such as OpenAI. The audit trail documents which data went to which provider.

Add the practical benefit: when an answer is wrong, you find the cause faster. Was it an outdated RAG hit? A model version that changed overnight? A prompt template bug? Without a trail you guess. With one you debug in minutes.

How it works

A production-ready AI audit trail logs at least these 14 fields per LLM call. We group them by who / what / when / with what.

Who: (1) user_id or pseudonymised employee ID, (2) client_id (for fiduciary or legal context, pseudonymised if logs are evaluated centrally), (3) human_approver_id and approval timestamp once the answer goes into productive use.

What: (4) prompt_hash (SHA-256 of the cleartext prompt – the cleartext is additionally stored only if it contains no personal data; for sensitive data the hash plus the token count is enough), (5) output_hash and optionally output_text (depending on confidentiality), (6) rag_source_ids – the list of document and chunk IDs supplied as context, (7) tool_calls (if the model invoked external functions: names and arguments).

When: (8) request_timestamp_utc with millisecond precision, (9) response_timestamp_utc, (10) latency_ms (the OTel metric gen_ai.client.operation.duration covers the same quantity, but is recorded in seconds).

With what: (11) model_provider (anthropic, openai, mistral, local), (12) model_id with full version (claude-opus-<version>, gpt-<version>, mistral-large-<version>), (13) input_tokens and output_tokens (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens), (14) cost_chf, calculated from token counts and the price valid at the time.
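The field list above can be sketched as a single record type. This is a minimal sketch, not a finished schema: the class name AuditRecord, the helper compute_cost_chf, the extra approved_at_utc and output_text attributes, and all prices and IDs in the example are assumptions made for illustration. Latency is derived from the two timestamps instead of being stored separately.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    # Who
    user_id: str                       # (1) pseudonymised employee ID
    client_id: str                     # (2) pseudonymised client ID
    # What
    prompt_hash: str                   # (4) SHA-256 hex digest
    output_hash: str                   # (5)
    rag_source_ids: tuple[str, ...]    # (6) document and chunk IDs
    tool_calls: tuple[dict, ...]       # (7) names and arguments
    # When
    request_timestamp_utc: datetime    # (8)
    response_timestamp_utc: datetime   # (9)
    # With what
    model_provider: str                # (11)
    model_id: str                      # (12) full version, never an alias
    input_tokens: int                  # (13)
    output_tokens: int                 # (13)
    cost_chf: Decimal                  # (14)
    # Filled in once the answer goes into productive use
    human_approver_id: Optional[str] = None      # (3)
    approved_at_utc: Optional[datetime] = None   # (3)
    output_text: Optional[str] = None            # (5) only if confidentiality allows

    @property
    def latency_ms(self) -> int:       # (10) derived, so it cannot drift
        delta = self.response_timestamp_utc - self.request_timestamp_utc
        return round(delta.total_seconds() * 1000)


def compute_cost_chf(
    input_tokens: int,
    output_tokens: int,
    chf_per_million_in: Decimal,
    chf_per_million_out: Decimal,
) -> Decimal:
    """Cost from token counts and the price list valid at request time."""
    total = (
        Decimal(input_tokens) * chf_per_million_in
        + Decimal(output_tokens) * chf_per_million_out
    ) / Decimal(1_000_000)
    return total.quantize(Decimal("0.000001"))


# Hypothetical call: IDs, hashes, and prices are placeholders.
record = AuditRecord(
    user_id="2026Q2:ab12",
    client_id="2026Q2:cd34",
    prompt_hash="0" * 64,
    output_hash="1" * 64,
    rag_source_ids=("doc-17#chunk-3",),
    tool_calls=(),
    request_timestamp_utc=datetime(2026, 5, 4, 9, 15, 0, 120000, tzinfo=timezone.utc),
    response_timestamp_utc=datetime(2026, 5, 4, 9, 15, 1, 870000, tzinfo=timezone.utc),
    model_provider="anthropic",
    model_id="example-model-20260101",
    input_tokens=1200,
    output_tokens=300,
    cost_chf=compute_cost_chf(1200, 300, Decimal("3.00"), Decimal("15.00")),
)
```

Using Decimal for cost_chf avoids floating-point drift when thousands of small amounts are summed per client or per workflow. The frozen dataclass makes accidental in-place changes to a logged record raise an error, which fits the tamper-evident intent.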

For pseudonymisation we recommend HMAC-SHA-256 with a rotating secret key (often loosely called a salt). HMAC is one-way, so "mapping back" means recomputing the pseudonym for a known ID internally, with the master key from the vault, and comparing; raw identifiers stay out of the cleartext log. Storage: PostgreSQL as primary source, plus WORM storage (S3 Object Lock or equivalent) for the 10-year archive. Tools: Langfuse self-hosted is a widely used kit for LLM tracing as of May 2026; alternatives are Helicone, Arize Phoenix, or a custom OpenTelemetry Collector with ClickHouse as backend.
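The pseudonymisation described above can be sketched with Python's standard hmac module. The secret kept in the vault serves as the HMAC key here. The function names, the key-version prefix, and the example key material are assumptions for illustration; in practice the keys would be loaded from the vault, never hard-coded.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, key: bytes, key_version: str) -> str:
    """Return '<key_version>:<hex digest>' so the right key can be found after rotation."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{key_version}:{digest}"


def matches(identifier: str, pseudonym: str, keys: dict[str, bytes]) -> bool:
    """Check whether a known identifier produced this pseudonym.

    HMAC cannot be reversed, so re-identification works by recomputing
    the pseudonym for a candidate ID and comparing.
    """
    key_version, _, _ = pseudonym.partition(":")
    key = keys.get(key_version)
    if key is None:
        return False
    candidate = pseudonymise(identifier, key, key_version)
    return hmac.compare_digest(candidate, pseudonym)


# Hypothetical key ring; real keys come from the vault.
KEYS = {"2026Q2": b"example-only-key-do-not-use"}
pseudonym = pseudonymise("client-4711", KEYS["2026Q2"], "2026Q2")
```

Storing the key version next to the digest is one way to keep older log entries resolvable after a rotation. The trade-off is that the same client gets a different pseudonym per key period, so cross-period analyses need the mapping step.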

Audit trail implementation in 8 steps

01 Inventory: which AI workflows carry which legal relevance? Classify as "internally discarded / externally addressed / business-relevant".
02 Field catalogue: define per workflow which of the 14 fields are mandatory. Professional-secrecy pipelines get all 14, the marketing playground only 3.
03 Pseudonymisation: apply HMAC-SHA-256 with a rotating secret key to client_id and user_id. Keep the master key in the vault.
04 Force model versioning: in every API call store the fully qualified model name (claude-opus with its exact current version), not the alias.
05 Choose storage: PostgreSQL as live backend plus WORM archive (S3 Object Lock, MinIO with Object Lock) for the 10-year holding.
06 Install tooling: Langfuse self-hosted via Docker – ingress endpoint /api/public/otel – or OTel Collector with ClickHouse for a custom build.
07 Use the OTel convention: set gen_ai.system, gen_ai.request.model, gen_ai.usage.* in the SDK – this lets you swap tools later without a data migration.
08 Periodic check: every quarter, sample 10 trail entries and replay them – can an outside auditor follow the decision?
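Steps 02 and 08 lend themselves to a small check script. The sketch below assumes trail entries are available as plain dicts and that each workflow class has a set of mandatory field names; the class names "business_relevant" and "playground" and the function names are hypothetical. It only tests completeness; the actual replay of a decision remains a manual review.

```python
import random
from typing import Optional

# Field 13 covers two values, hence 15 names for the 14 fields.
FULL_TRAIL = {
    "user_id", "client_id", "human_approver_id",
    "prompt_hash", "output_hash", "rag_source_ids", "tool_calls",
    "request_timestamp_utc", "response_timestamp_utc", "latency_ms",
    "model_provider", "model_id", "input_tokens", "output_tokens", "cost_chf",
}

# Hypothetical field catalogue per workflow class (step 02).
REQUIRED_FIELDS = {
    "business_relevant": FULL_TRAIL,
    "playground": {"user_id", "request_timestamp_utc", "input_tokens"},
}


def missing_fields(entry: dict, workflow_class: str) -> set[str]:
    """Mandatory fields that are absent or None in a trail entry."""
    required = REQUIRED_FIELDS[workflow_class]
    return {name for name in required if entry.get(name) is None}


def quarterly_sample(entries: list[dict], k: int = 10,
                     seed: Optional[int] = None) -> list[dict]:
    """Draw up to k entries at random for the quarterly review (step 08)."""
    rng = random.Random(seed)
    return rng.sample(entries, min(k, len(entries)))


# Illustration with synthetic entries.
complete = {name: "x" for name in FULL_TRAIL}
complete["tool_calls"] = []           # an empty list is a valid value
broken = dict(complete, model_id=None)
entries = [dict(complete) for _ in range(25)]
sample = quarterly_sample(entries, k=10, seed=1)
```

An empty tool_calls list counts as present, because "no tool was invoked" is itself a fact worth recording. Passing a seed makes the sample reproducible, which helps when an auditor asks how the reviewed entries were chosen.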

When an audit trail is mandatory

A structured audit trail is mandatory the moment an AI answer enters a business-relevant process: a booking, a dunning letter, a tax filing, a personnel decision, a diagnostic suggestion, legal advice. As soon as an employee copies the AI answer into a document that is later used in dealings with third parties, the 10-year retention obligation under Art. 958f CO kicks in.

Concrete areas: fiduciary bookings (which model classified which receipt, and how?), collections (why did the bot put this client at the "first reminder" stage?), law firm (which statute research did the model deliver for which question?), HR (why did the screening tool reject this application?), insurance (why was this claim marked "not covered"?).

Even outside legal duty the trail helps: on every bad output you have the whole chain at hand and can fix without guessing.

When less is enough

For internal experiments, marketing brainstorms, code generation in personal projects, and translation of publicly available content, simple logging is enough: user_id, timestamp, and token consumption for cost control. A full 14-field trail here is overkill and costs more than it gives.

Rule of thumb: if the AI answer gets thrown away at the end of the day or only serves as internal inspiration, you do not need a full trail. As soon as the answer has an external addressee (client, authority, customer, or an employee affected by a personnel decision), you do.
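The rule of thumb can be written down as a tiny decision function, so the choice is made once per workflow rather than ad hoc per call. The enum, the function name, and the two boolean inputs are assumptions for illustration; a real classification would likely come from the workflow inventory.

```python
from enum import Enum


class TrailLevel(Enum):
    MINIMAL = "minimal"   # user_id, timestamp, token consumption
    FULL = "full"         # all 14 fields


def required_trail_level(has_external_addressee: bool,
                         enters_business_process: bool) -> TrailLevel:
    """Full trail as soon as the answer leaves the building or feeds a business record."""
    if has_external_addressee or enters_business_process:
        return TrailLevel.FULL
    return TrailLevel.MINIMAL


# An internal brainstorm that is discarded at the end of the day.
brainstorm = required_trail_level(False, False)
# A dunning letter sent to a client.
dunning_letter = required_trail_level(True, True)
```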

A common mistake: trail over-engineering. Some teams log everything including cleartext prompt for internal marketing tasks, generate 20 GB of audit data per month, and are surprised their data protection officer gets nervous because the trail holds more personal data than the original workflow. Trail fields are not "more is better" – they are set deliberately.

Trade-offs

STRENGTHS

Traceability for tax authorities, audit, professional associations
Fast debugging of wrong AI answers
Comparability across model versions – what changed since the last update?
Cost transparency per client, per workflow, per model
Solid evidence for data protection access requests under Art. 25 revDSG

WEAKNESSES

Storage need: 200 MB to 2 GB per month for medium setups, more with full prompt cleartext
Initial implementation effort 3 to 8 days including schema design
A misconfigured trail itself can leak personal data – careful pseudonymisation is mandatory
Tool choice is still moving – OTel GenAI conventions are partly experimental

FAQ

Do I have to store the cleartext prompt?

Not mandatory. For sensitive content, a SHA-256 hash plus token count and the fields that let you check the decision (e.g. classification category, RAG source IDs) are enough. Keep the cleartext prompt if it contains no especially protected personal data and you need it for later model improvement. Clear schema separation: prompt_hash always, prompt_text optional and gated by sensitivity class.
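A minimal sketch of that schema separation, assuming three hypothetical sensitivity classes ("none", "personal", "special_category") and a caller that already knows the token count. The function name and the prompt_tokens field are illustrative. Unknown classes are treated as sensitive, so a typo in the class name cannot leak cleartext.

```python
import hashlib
from typing import Optional

# Hypothetical classes for which cleartext may be stored.
CLEARTEXT_ALLOWED = {"none"}


def prompt_fields(prompt: str, sensitivity: str, token_count: int) -> dict:
    """Always store the hash; store cleartext only for non-sensitive prompts."""
    prompt_text: Optional[str] = prompt if sensitivity in CLEARTEXT_ALLOWED else None
    return {
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_tokens": token_count,
        "prompt_text": prompt_text,
    }


public = prompt_fields("Translate our opening hours into French.", "none", 9)
sensitive = prompt_fields("Patient X, diagnosis Y ...", "special_category", 8)
unknown = prompt_fields("Patient X, diagnosis Y ...", "specal_categry", 8)
```

Note that a bare hash of a short or highly predictable prompt can be matched by guessing candidates, so the hash is a proof of what was sent, not a confidentiality measure on its own.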

How long must I retain this?

For business-relevant processes: 10 years under Art. 958f CO / GeBüV. For revDSG processing without bookkeeping link: as long as the purpose requires, plus the relevant limitation period (civil law often 10 years, Art. 127 CO). For especially protected personal data without bookkeeping link: as short as possible; document clear deletion deadlines in the trail schema.
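For the bookkeeping case, a deletion deadline can be computed and stored with each record. The sketch assumes that the 10-year period starts at the end of the financial year in which the record was created, and that the financial year ends on a fixed day that exists in every year (31 December by default). Both assumptions, and the function name, should be checked against your own situation before use.

```python
from datetime import date


def retention_end(record_date: date, years: int = 10,
                  fiscal_year_end: tuple[int, int] = (12, 31)) -> date:
    """Earliest deletion date: end of the record's financial year plus `years`."""
    month, day = fiscal_year_end
    fy_end = date(record_date.year, month, day)
    if record_date > fy_end:
        # The record falls into the financial year that ends next calendar year.
        fy_end = date(record_date.year + 1, month, day)
    return date(fy_end.year + years, month, day)


calendar_year = retention_end(date(2026, 5, 4))
mid_year_close = retention_end(date(2026, 7, 15), fiscal_year_end=(6, 30))
```

Here calendar_year evaluates to 2036-12-31 and mid_year_close to 2037-06-30.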

Is Langfuse enough or do I need OpenTelemetry?

Langfuse self-hosted is enough for 90% of Swiss SME setups. If you already run an OpenTelemetry-based observability platform (e.g. Grafana Tempo, Honeycomb, Datadog), instrument against gen_ai.* conventions and route LLM tracing into the existing pipeline. Langfuse itself accepts OTel traces via /api/public/otel – the two approaches do not exclude each other.
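Routing an existing OpenTelemetry exporter to Langfuse is mostly a configuration matter. The sketch below builds the two standard OTLP exporter environment variables for the /api/public/otel endpoint named above. It assumes that Langfuse authenticates OTLP requests with HTTP Basic auth built from the project's public and secret key; the host name and keys are placeholders, and the function name is hypothetical. Check the current Langfuse documentation before relying on it.

```python
import base64


def langfuse_otlp_env(base_url: str, public_key: str, secret_key: str) -> dict[str, str]:
    """Environment variables for an OTLP/HTTP exporter pointed at Langfuse."""
    credentials = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": base_url.rstrip("/") + "/api/public/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


# Placeholder host and keys, for illustration only.
env = langfuse_otlp_env("https://langfuse.example.internal/", "pk-demo", "sk-demo")
```

Because only environment variables change, the instrumented application code stays identical whether traces go to Langfuse or to another OTLP-compatible backend.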

What if the model does not return the model ID?

Some local setups lack an explicit version. Solution: at model deployment, set an immutable tag (e.g. llama-3-1-70b-instruct-q4-k-m-20260301), and in the calling layer (LiteLLM, Ollama proxy) actively write that identity into the trail. Never store only "llama" or "local" as the model ID – that helps no one in five years.
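A guard in the calling layer can refuse to write an unpinned model ID into the trail. The sketch below is a deliberately strict heuristic, assuming your own deployment tags always carry an eight-digit date stamp like the example tag above. The alias list and the function name are hypothetical, and vendor model IDs that are versioned differently would need their own rule.

```python
import re

# Hypothetical list of names that identify no specific model build.
BARE_ALIASES = {"llama", "local", "latest", "default"}
DATE_STAMP = re.compile(r"\d{8}")


def is_pinned_model_id(model_id: str) -> bool:
    """True if the ID looks like an immutable, date-stamped deployment tag."""
    name = model_id.strip().lower()
    if not name or name in BARE_ALIASES:
        return False
    if name.endswith(("-latest", ":latest")):
        return False
    return DATE_STAMP.search(name) is not None


pinned = is_pinned_model_id("llama-3-1-70b-instruct-q4-k-m-20260301")
alias = is_pinned_model_id("llama")
```

Failing the request when is_pinned_model_id returns False is harsher than logging a warning, but it prevents years of trail entries that nobody can attribute to a concrete model build.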

Sources

Art. 957a OR – Geschäftsbücher und Belegführung · 2026-01
Langfuse – OpenTelemetry (OTLP) ingress and LLM observability docs · 2026-04
OpenTelemetry – Semantic Conventions for GenAI · 2026-03
EDÖB – Leitfaden zur Datenschutz-Folgenabschätzung · 2026-02
