STRUCTURED EXTRACTION · AI CONCEPT

Structured extraction with grounding: cite-the-source from PDFs and emails

Extract data from PDFs and emails with source proofs: Gemini Citation API, Anthropic with_citations, manual citation linkers and audit-trail link.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is structured extraction with grounding?

Structured extraction with grounding combines two techniques: extracting data in a defined schema from unstructured sources (PDFs, emails, OCR scans, web pages) – and at the same time providing a verifiable source citation ("cite-the-source") for each extracted value. The result is not only "amount = 1,500 CHF" but "amount = 1,500 CHF, source: page 2, line 14".

The difference from plain structured extraction is the proof layer. An SME fiduciary system capturing receipts and writing to the books must be able to prove on a later audit where each value came from. With grounding, every extracted value has a reference: document ID + page + bounding box (for PDFs) or mail ID + character offset (for emails).

Practical application: receipt capture with audit trail. Per receipt the AI extracts date, amount, VAT, vendor, document number – plus a source pointer for each. When the client asks two years later "why was this booked at 7.7% VAT?", the system can answer: "on page 2 bottom left it reads 'VAT 7.7%'". That is GeBüV-compliant and audit-ready.

As of May 2026 the market has matured significantly. Anthropic extended the `with_citations` API in April 2026. Google Gemini 2.5 has a native Citation API. Both produce structured answers with span pointers into the source document. For local models there are open-source solutions like LlamaIndex Citation Query Engine or manual citation linkers based on embeddings.

Why it matters

Without grounding, structured extraction is problematic in regulated environments. An AI extracts "amount = 1,500 CHF" from a receipt. If the real receipt says "1,500 USD", that is a hallucination damage. With grounding it is immediately visible that the source says "USD" while the model wrote "CHF" – the error is detectable and correctable.

For GeBüV (Business Records Ordinance) and Art. 957a CO (bookkeeping duty): accounting data must be traceable back to the source. AI-supported receipt capture without grounding does not meet this. With grounding, every entry in the general ledger has a direct link to the receipt PDF – auditor inspection with one click.

For EU AI Act Art. 12 (record-keeping) and Art. 14 (human oversight): grounding is the technical basis for auditability. A supervisory authority wanting to understand how the system reached a classification gets not only the output but the source pointers as well.

In damage cases: if a client claims the AI captured data incorrectly, the fiduciary office can immediately check with grounding – "the source says exactly this, the error was on the client side" or "the source says something different, we have an error". Without grounding the evidentiary situation is murkier.

Process efficiency: in sample audits (internal review, external audit), the reviewer does not have to read the whole PDF to understand where a value came from. One click on the citation link jumps to the spot. Reviewer time per receipt drops 60-80%.

How it works – methods as of May 2026

Anthropic Claude with_citations. Introduced in 2024, extended in April 2026 to structured JSON outputs with citations. You send the model one or more documents (PDF text, OCR output) plus a question. The model answers with JSON output where each extracted value contains a `citations` array: `{"amount": 1500, "amount_citation": [{"document_id": "doc1", "start_char": 1240, "end_char": 1248, "cited_text": "1500.00 CHF"}]}`. Citation accuracy as of May 2026: about 95% correct span pointers.

Google Gemini 2.5 Citation API. Similar approach. Gemini 2.5 Pro with `groundingMetadata` enabled delivers token-level pointers. Via the `groundingSupports` block, extracted values are mapped to concrete doc indices and char spans. Well integrated with Google Drive and Workspace sources.

OpenAI citations via function calling. OpenAI has no native citation API as of May 2026, but the same behaviour is achievable via function calling. You define a schema with `extracted_value` and `source_quote` as required fields. The model must provide a source quote for every value. Citation accuracy: about 90% – weaker than native APIs because no token-level constraints.

LlamaIndex Citation Query Engine. Open source. Works with any LLM. Mechanism: document is split into chunks, each chunk gets an ID. On answer generation, the prompt is extended with "cite the used chunk ID after each statement". After the answer, embedding distance validates that the cited chunks were actually relevant. Very flexible, usable with all models.

Manual citation linkers. Pattern for production setups where you do not want to depend on vendor APIs. Two-pass approach: pass 1 the model extracts the structured JSON. Pass 2 a separate embedding-based linker finds the most likely source spot for each extracted value in the original document. Advantage: vendor-independent. Disadvantage: doubled effort, about 80-90% citation accuracy.

PDF bounding-box extraction. For legally robust audit trails a text offset is often not enough. Tools like pdfplumber or PyMuPDF combine text extraction with bounding-box coordinates. You store in the audit database not only "page 2, line 14" but also "(x: 142, y: 280, w: 80, h: 16)" – for pixel-precise highlighting on review.

OCR pipeline integration. For scanned receipts without digital text: OCR engine (Tesseract, Azure Form Recognizer, Google Document AI) provides a bounding box per recognised word. LLM gets the OCR output, produces structured output with citations that link to the OCR bounding boxes. Full pipeline: PDF → OCR (+ bounding boxes) → LLM structured extraction (+ citations) → audit DB.

Validation loop. After extraction + citation, a second step auto-checks: does the cited snippet actually exist at the specified spot in the original document? If not: mark the value as uncertain, trigger human review. Catches about 5% hallucinated citations.

Structured extraction with grounding in 6 steps

01Set up the source pipeline: PDF/mail/web input → text extraction with bounding boxes (pdfplumber, PyMuPDF, OCR).
02Define the extraction schema in Pydantic/Zod; each value field has a parallel `<field>_citation` field.
03Choose the LLM method: Anthropic with_citations, Gemini 2.5 Citation API, OpenAI function calling with mandatory quote, or LlamaIndex Citation Engine.
04Validation loop: after extraction auto-check that each cited_text actually exists in the source – substring match + embedding similarity.
05Audit DB schema: per extraction store value + citation (doc ID, char offset, bounding box, cited_text) + validation status.
06Build the review UI: clicking an audit entry opens the PDF at the right spot, highlights the bounding box.

When grounding is mandatory

For every extraction whose output feeds bookkeeping, contracts, tax filings or legally binding records, grounding is mandatory. Concretely:

Receipt capture for accounting. GeBüV requires traceability back to source. Without grounding, no compliance.

Contract clause extraction. When the AI extracts specific clauses from an 80-page contract, the lawyer must be able to trace exactly which spot is meant. With grounding: one click. Without: 30 minutes of searching.

Receivables and dunning. When the AI extracts a claim from email correspondence ("client owes us 4,500 CHF from order XYZ"), the source mail with concrete spot must be available – both for dunning escalation and for dispute.

Anti-money-laundering checks. For automated KYC/AML data extraction (PEP check, sanctions list match), the source must be provable. FINMA duty.

Clinical and medical applications. Here grounding is even a legal duty (MDR, MepV), but we do not build this ourselves – refer to specialised providers.

Less strictly required but recommended: every extraction regularly paired with human review. Grounding saves 60-80% reviewer time per document.

When grounding is less needed

For purely informational outputs that feed nowhere (brainstorming notes, quick knowledge lookups), grounding is overhead. An employee asking "summarise this 10-page PDF" and reading the result once needs no citation link.

For very short sources (1-2 pages) the reviewer looks at the original anyway – setup effort does not pay off. Manual verification is faster than a grounding pipeline.

For generative tasks without source reference (AI writes a rental contract from scratch), there are no sources to cite. Grounding is logically impossible.

Beware "grounding theatre": some vendor tools deliver citations that are formally present but hallucinated. A validation loop is therefore mandatory. A citation saying "page 2 line 14" without the cited text actually existing there is worse than no citation – it feigns trust.

Cost angle: grounding costs additional tokens (source text must be supplied fully) and some engineering time for the validation layer. For low-risk use cases with < 100 document volume per month the effort outweighs the value.

Trade-offs

STRENGTHS

GeBüV- and Art. 957a CO-compliant audit trail back to source
Reviewer time per document drops 60-80% via direct citation click
Hallucinated values are caught by the validation loop
EU AI Act Art. 12 (record-keeping) and Art. 14 (oversight) technically met
Dispute evidence: instantly provable where a value comes from on client query

WEAKNESSES

Setup effort 4-7 engineer days per pipeline
OCR bounding-box pipeline is hard to calibrate on poor scans
Vendor lock-in: Anthropic with_citations and Gemini Citation API have different schemas
Citation hallucinations possible (5%) – validation loop is mandatory
Token overhead 30-50% vs. plain structured output

FAQ

Which API is best for grounding in May 2026?

Anthropic the current top Claude model with with_citations (95% accuracy) and Google Gemini 2.5 Pro with groundingMetadata (93%) are at the top. The OpenAI function-calling setup is more flexible but somewhat weaker (90%). For EU sovereignty plus grounding, Mistral function_calling + LlamaIndex Citation Engine is a viable open-source alternative. Recommendation: Anthropic for law/fiduciary, Gemini for Workspace-integrated cases.

How do I validate that a citation is real?

Two steps. First: substring match – the cited_text must appear exactly (or with normalised whitespace) in the original document at the specified position. Second: for paraphrasing or OCR errors, an embedding similarity of cited_text to the original snippet at the position (cosine similarity > 0.85). Both checks automated, discrepancies → human review.

What does grounding cost extra?

Token overhead from citation fields in output: about 30-50% vs. plain JSON. Since structured output already saves tokens, net effect is usually neutral. Engineer effort for setup: 3-5 days for a receipt pipeline. Validation loop: 1-2 days. Total: grounding costs 4-7 days one-off, almost no running cost.

Does grounding work on OCR scans?

Yes, but pipeline setup is more involved. The OCR engine must provide a bounding box per recognised word (Tesseract, Azure Form Recognizer, Google Document AI). These bounding boxes are tied to citations. With poor OCR (handwriting, faxes, bad scans), grounding accuracy drops to 75-85%. Before production: measure OCR quality.

Sources

Anthropic – Claude with_citations API (structured output extension) · 2026-04
Google Gemini 2.5 – Grounding and Citations API documentation · 2026-04
LlamaIndex – Citation Query Engine guide · 2026-05
PyMuPDF – text extraction with bounding boxes (docs) · 2026-03
Azure AI Document Intelligence – Form Recognizer with bounding boxes · 2026-04

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call