EMAIL ARCHIVES · AI CONCEPT

Indexing email archives: IMAP, EWS, Microsoft Graph, MBOX and Swiss data-protection compliance

How to index 5 to 10 years of client correspondence for RAG: IMAP, EWS, Microsoft Graph and MBOX imports, attachment extraction and thread preservation, in line with the Swiss DSG and professional secrecy.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is email archive indexing?

Email archive indexing is the structured extraction and vectorisation of mail corpora to make them usable as a RAG source. As of May 2026 this is a standard pattern in fiduciary, legal and insurance firms because 60 to 80 percent of client correspondence historically sits in mailboxes - not in CRM systems.

The task is not trivial. A typical mailbox holds 20,000 to 100,000 emails. Every email has headers (From, To, CC, Subject, Date), body (HTML and plain-text variant), attachments (PDF, DOCX, images), thread relationships (In-Reply-To, References) and possibly encrypted content (S/MIME, PGP). Multilingual client correspondence mixes DE, EN, FR, IT in one conversation.

Protocol choice determines the pipeline. IMAP is universal but slow. EWS (Exchange Web Services) and Microsoft Graph are the right paths for Microsoft 365 / Exchange Server. MBOX import makes sense when an archive export already exists (e.g. after a client change or compliance export).

For Swiss firms, Swiss-DSG compliance is the decisive filter. Emails frequently contain especially sensitive personal data (health data, religious affiliation, political activity). Professional secrecy under Art. 321 SCC (lawyers, doctors) demands additional safeguards. Indexing must take place locally or in an auditable EU/CH cloud, with clear purpose limitation and a deletion concept.

Why it matters

A RAG system without email indexing is half-blind. The answer to a client question ("what did we write Mr Bachmann about the 2024 VAT correction?") is almost always in an email, not in a Word document. Anyone building AI assistance for staff while omitting emails forfeits roughly 70 percent of the possible value.

A manual mail search typically takes 15 to 45 minutes per case in a fiduciary office (searching, scrolling, assembling quote blocks). With an indexed archive plus RAG that drops to about 30 seconds. At 50 such searches per month in a 5-person fiduciary and an average of roughly 25 minutes per search, that is about 20 hours saved, or CHF 1800 per month at an internal rate of CHF 90 per hour.
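
The arithmetic behind these figures can be checked with a short sketch. The 24.5-minute average is an assumption chosen so that the result matches the 20 hours stated above; the text itself only gives the 15 to 45 minute range.

```python
def monthly_saving(searches, manual_min, rag_min=0.5, rate_chf=90):
    """Return (hours_saved, chf_saved) per month for a given search volume."""
    hours = searches * (manual_min - rag_min) / 60
    return hours, hours * rate_chf


# Lower bound, assumed average, upper bound of the manual search time.
for manual_min in (15, 24.5, 45):
    hours, chf = monthly_saving(50, manual_min)
    print(f"{manual_min:>5} min/search: {hours:5.1f} h saved, CHF {chf:,.0f}")
```

Across the full range the saving lies between about 12 and 37 hours per month, so the 20-hour figure sits in the lower middle of that band.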

Critical points: threading must be preserved. A single email is often meaningless ("see below" without context). Replies quote earlier mails, so naive chunkers index the same content 5 to 10 times - storage waste and retrieval noise. Attachments need their own path: a 30-page contract in an attachment must not ride along in the mail chunk but land independently in the RAG index, with a back-reference to the originating email.

Swiss DSG is not just compliance overhead but a business-model risk. Mail indexing that handles lawyer-client correspondence carelessly can be judged a breach of professional secrecy under Art. 321 SCC, with fines and a professional ban possible. The index must be encrypted at rest and processed locally, with an audit log and a working right-to-erasure implementation.

How it works

IMAP: universal standard, supported by virtually every mail server. Python's imaplib or the higher-level imap-tools library reads folders. Caveat: many servers limit the number of connections, so parallelism must be throttled. A 50,000-mail archive takes 4 to 12 hours for a full IMAP sync.
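
A minimal sketch of a throttle-friendly IMAP read with the standard library, assuming a single read-only connection and batched UID fetches. The function names, the batch size and the returned field names are illustrative choices, not part of any library.

```python
import email
import imaplib
from email import policy


def parse_headers(raw: bytes) -> dict:
    """Parse a raw header block into the fields the pipeline needs."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    return {
        "message_id": msg["Message-ID"],
        "subject": msg["Subject"],
        "from": msg["From"],
        "date": msg["Date"],
        "in_reply_to": msg["In-Reply-To"],
        "references": msg["References"],
    }


def fetch_headers(host, user, password, folder="INBOX", batch=200):
    """Yield parsed headers over one read-only IMAP connection, in batches."""
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, password)
        conn.select(folder, readonly=True)  # never alters \Seen flags
        _, data = conn.uid("SEARCH", None, "ALL")
        uids = data[0].split()
        for i in range(0, len(uids), batch):
            uid_set = b",".join(uids[i:i + batch]).decode()
            # BODY.PEEK fetches headers without marking the mail as read.
            _, parts = conn.uid("FETCH", uid_set, "(BODY.PEEK[HEADER])")
            for part in parts:
                if isinstance(part, tuple):  # (envelope info, raw header bytes)
                    yield parse_headers(part[1])
```

Fetching headers first keeps the initial pass cheap; bodies and attachments can then be pulled per UID in a second, slower pass.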

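EWS (Exchange Web Services): SOAP API for Exchange Server and Microsoft 365. Libraries: exchangelib (Python) or ews-java-api (Java). Typically less rate-limited than IMAP. Returns MIME mails directly or structured XML. Useful in Exchange on-premise migrations. For Exchange Online, Microsoft has announced that EWS requests will be blocked from October 2026, so new Microsoft 365 pipelines should target Graph.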

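Microsoft Graph: modern REST API for Microsoft 365. Python msgraph-sdk or direct HTTP. Returns JSON. Token-based auth (OAuth 2.0) via an app registration in Microsoft Entra ID (formerly Azure AD). Recommended for Microsoft 365 tenants. Pagination via @odata.nextLink. Throttling for mailbox access applies per app and mailbox: Microsoft's published Outlook service limits are 10,000 requests per 10 minutes and 4 concurrent requests. Application permissions allow service-to-service access across many mailboxes without a signed-in user.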
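
Pagination via @odata.nextLink can be sketched with direct HTTP and the standard library. Token acquisition is left out; the helper names and the injectable get_json callable are assumptions made so the loop can be tested without network access.

```python
import json
import urllib.request


def http_get_json(url, token):
    """GET a Graph URL with a bearer token and decode the JSON body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_messages(first_url, get_json):
    """Follow @odata.nextLink until the last page, yielding each message."""
    url = first_url
    while url:
        page = get_json(url)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")  # absent on the last page
```

A call might look like iter_messages("https://graph.microsoft.com/v1.0/users/{id}/messages?$top=100", lambda u: http_get_json(u, token)), where {id} and token are placeholders. A production loop would also honour the Retry-After header on HTTP 429 responses.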

MBOX/PST import: for archive exports (e.g. after a client change). Python's mailbox module reads MBOX; libpff, via its Python binding pypff, reads PST. Outlook offline archives (.ost) may need custom adapters.
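
Reading an MBOX export needs only the standard library. The sketch below assumes the export is a single mbox file; the function name and the returned field names are illustrative.

```python
import mailbox


def iter_mbox(path):
    """Yield the headers needed for threading from every message in an mbox file."""
    box = mailbox.mbox(path, create=False)  # fail instead of creating an empty file
    try:
        for msg in box:
            yield {
                "message_id": msg["Message-ID"],
                "subject": msg["Subject"],
                "date": msg["Date"],
                "in_reply_to": msg["In-Reply-To"],
                "references": msg["References"],
            }
    finally:
        box.close()
```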

Threading: In-Reply-To and References headers reconstruct conversation trees. Subject heuristics (Re:, Fwd:) are unreliable. The JWZ threading algorithm (Jamie Zawinski, 1997) is the standard. Per thread a conversation summary should be built, not every individual mail in isolation.
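
The header-based part of thread reconstruction can be sketched as follows. This is a simplified parent-link grouping, not the full JWZ algorithm (it skips JWZ's container pruning and subject-based merging), and the dictionary keys are assumed field names.

```python
from collections import defaultdict


def parent_id(msg):
    """The last References entry is the direct parent; fall back to In-Reply-To."""
    refs = (msg.get("references") or "").split()
    if refs:
        return refs[-1]
    irt = (msg.get("in_reply_to") or "").split()
    return irt[0] if irt else None


def build_threads(messages):
    """Group Message-IDs into threads keyed by the root Message-ID."""
    by_id = {m["message_id"]: m for m in messages}

    def root_of(mid):
        seen = set()
        while mid not in seen:  # guard against reference cycles
            seen.add(mid)
            msg = by_id.get(mid)
            parent = parent_id(msg) if msg else None
            if parent is None:
                return mid  # a true root, or a parent missing from the archive
            mid = parent
        return mid

    threads = defaultdict(list)
    for m in messages:
        threads[root_of(m["message_id"])].append(m["message_id"])
    return dict(threads)
```

Replies whose parent is missing from the archive are grouped under the missing parent's ID, so siblings of a lost mail still end up in the same thread.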

De-duplication: quote blocks in replies are stripped via heuristics (markers such as "Am ... schrieb ...", "On ... wrote", and line dividers). Tools: talon (Mailgun), email-reply-parser. Per mail only the "newly written" part is indexed; the quote part links to the predecessor.
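
To illustrate the heuristic, here is a deliberately small regex-based stripper. It is a stand-in for the idea, not the logic talon or email-reply-parser actually use, and the marker list is an assumption covering only a few English and German patterns.

```python
import re

# Lines that typically introduce a quoted predecessor mail.
QUOTE_MARKERS = [
    re.compile(r"^On .+ wrote:$"),
    re.compile(r"^Am .+ schrieb .+:$"),
    re.compile(r"^-{2,}\s*(Original Message|Ursprüngliche Nachricht)\s*-{2,}$", re.I),
    re.compile(r"^_{10,}$"),
]


def strip_quotes(body: str) -> str:
    """Keep only the newly written part of a plain-text reply."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if any(p.match(stripped) for p in QUOTE_MARKERS):
            break  # everything below the marker is quoted history
        if stripped.startswith(">"):
            continue  # inline quote line
        kept.append(line)
    return "\n".join(kept).strip()
```

Markers wrapped across two lines and HTML-only mails are not handled here, which is one reason a dedicated library is the better choice in production.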

Attachment path: attachments are extracted, processed by document loaders (see document-loaders-formate), indexed separately and linked via an "attached_to_mail_id" field. A RAG query "client Bachmann VAT" then finds both the email and the attached VAT statement.
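
A minimal sketch of the extraction step with the standard email package. Only the "attached_to_mail_id" field name comes from the text above; the other keys and the function name are illustrative.

```python
from email import message_from_bytes, policy


def extract_attachments(raw: bytes):
    """Yield one record per attachment, linked back to the originating mail."""
    msg = message_from_bytes(raw, policy=policy.default)
    mail_id = msg["Message-ID"]
    for part in msg.iter_attachments():
        yield {
            "attached_to_mail_id": str(mail_id) if mail_id else None,
            "filename": part.get_filename(),
            "content_type": part.get_content_type(),
            "data": part.get_payload(decode=True),  # decoded bytes for the loader
        }
```

Each record can then be handed to the matching document loader based on content_type, while the mail body is chunked on its own path.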

Storage and encryption: mail bodies and attachments stay in the original store (encrypted at rest, e.g. LUKS volume encryption or a PostgreSQL distribution with transparent data encryption, which stock PostgreSQL does not include). Only embeddings and metadata go into the vector DB. On a client erasure request, all embedding entries with that ID must be removed. Precondition: a payload-indexed client ID in Qdrant.
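
The erasure path can be sketched against Qdrant's REST API using only the standard library. The payload field name client_id, the helper names and the base URL are assumptions; the two request bodies follow Qdrant's payload-index and delete-by-filter endpoints.

```python
import json
import urllib.request


def payload_index_request(field="client_id"):
    """Body for PUT /collections/{name}/index: keyword index on the client ID."""
    return {"field_name": field, "field_schema": "keyword"}


def erasure_request(client_id, field="client_id"):
    """Body for POST /collections/{name}/points/delete, filtered by client ID."""
    return {"filter": {"must": [{"key": field, "match": {"value": client_id}}]}}


def send(base_url, method, path, body):
    """Send a JSON request to the Qdrant instance and return the decoded reply."""
    req = urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(body).encode(),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For a hypothetical collection named mails, the index is created once with send(url, "PUT", "/collections/mails/index", payload_index_request()) and an erasure runs send(url, "POST", "/collections/mails/points/delete", erasure_request("client-42")). The original mail store needs its own deletion step.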

Mail archive indexing workflow in 6 steps

01 Mailbox inventory: count, time span, languages, encryption share, attachment volume. Choose protocol (IMAP, EWS, Graph, MBOX).
02 Legal basis and Swiss-DSG concept: purpose limitation, storage period, deletion concept, client notification, DPA with processor.
03 Quote stripping and JWZ threading: per mail only extract the newly written part, reconstruct thread relationships.
04 Attachment pipeline: extract attachments separately, route through document loaders, link with "attached_to_mail_id".
05 Embedding and vector DB: chunk mail body (500 to 800 tokens), attach metadata (From, To, Date, client, confidentiality), index in Qdrant.
06 Deletion and audit pipeline: payload-indexed client ID; an erasure request removes all embeddings with that ID; every indexing run logged with timestamp and operator.
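
Step 05 can be sketched as a chunker that attaches the metadata to every chunk. Counting whitespace-separated words is a crude stand-in for real tokenisation, and the 650-token window, the 50-token overlap and the payload keys are assumptions within the ranges named above.

```python
def chunk_words(text, max_tokens=650, overlap=50):
    """Split text into overlapping windows, approximating tokens by words."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_points(mail, body):
    """Pair every chunk with the metadata needed for filtering and erasure."""
    return [
        {
            "text": chunk,
            "payload": {
                "mail_id": mail["message_id"],
                "from": mail["from"],
                "to": mail["to"],
                "date": mail["date"],
                "client_id": mail["client_id"],
                "confidentiality": mail["confidentiality"],
                "chunk_index": i,
            },
        }
        for i, chunk in enumerate(chunk_words(body))
    ]
```

Carrying client_id in every payload is what makes the erasure request in step 06 a single filtered delete.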

When to use it

Fiduciary, legal and insurance firms with historical email correspondence spanning 5+ years or 50,000+ mails. Manual search is the biggest time sink here.

Firms with client onboarding processes where old mail threads matter ("what did we discuss with this client three years ago?").

Customer service / support inboxes with recurring questions where AI-assisted reply suggestions make sense.

Compliance audits requiring searches of mail histories for specific topics (e.g. AMLA check, ESG disclosure tracking).

In legal contexts for forensic cases (e-discovery), where large mail volumes must be sorted quickly.

When not to use

Small mailboxes under 5000 mails: simple full-text search (Outlook, Thunderbird) suffices. RAG setup does not pay off.

Mail corpora with a high share of encrypted mails (S/MIME, PGP) that cannot be decrypted without keys. Resolve the key workflow first.

Customer-service setups with high privacy sensitivity (psychological counselling, addiction counselling), where re-identifying individual clients via the RAG system is a risk. These require strict pseudonymisation (see anonymisierung-pseudonymisierung) before indexing.

Third-party mailboxes for which no purpose limitation can be shown. Indexing third-party mail corpora without clear client authority violates Swiss DSG and potentially Art. 321 SCC.

Trade-offs

STRENGTHS

Makes 60 to 80 percent of historical correspondence usable for AI
Client research from 30 minutes to 30 seconds
JWZ threading and quote stripping cut index size significantly
Attachment path links mails to associated documents

WEAKNESSES

High Swiss-DSG and professional-secrecy risk, careful design needed
Encrypted mails (S/MIME, PGP) require a key workflow
IMAP full sync of large mailboxes takes hours
Quote stripping never perfect, individual duplicates remain

FAQ

IMAP or Microsoft Graph?

For Microsoft 365 always Graph (better documented, higher rate limits, modern auth). For Exchange on-premise: EWS; Graph targets Exchange Online and reaches on-premise mailboxes only in certain hybrid configurations. IMAP only for non-Microsoft mail servers (Gmail, small hosters) where Graph or EWS are unavailable.

How do I handle encrypted mails (S/MIME, PGP)?

Decrypt before indexing (with the client or firm key material), index plaintext, keep the encryption marker in metadata. Without keys: index mails only via headers (Subject, From, Date), the body stays encrypted and unsearchable.
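
The headers-only fallback can be sketched by checking the MIME type before touching the body. The marker values and function names are illustrative; treating every non-signed application/pkcs7-mime part as encrypted is a simplifying assumption, and inline PGP blocks inside plain-text bodies are not detected here.

```python
from email import message_from_bytes, policy


def encryption_marker(msg):
    """Return 'pgp-mime', 'smime', 'encrypted' or None for a parsed message."""
    ctype = msg.get_content_type()
    if ctype == "multipart/encrypted":
        if msg.get_param("protocol") == "application/pgp-encrypted":
            return "pgp-mime"
        return "encrypted"
    if ctype in ("application/pkcs7-mime", "application/x-pkcs7-mime"):
        if msg.get_param("smime-type") != "signed-data":
            return "smime"
    return None


def index_record(raw: bytes) -> dict:
    """Build an index record; encrypted mails keep headers but no body."""
    msg = message_from_bytes(raw, policy=policy.default)
    marker = encryption_marker(msg)
    body = None
    if marker is None:
        part = msg.get_body(preferencelist=("plain", "html"))
        body = part.get_content() if part else None
    return {
        "subject": msg["Subject"],
        "from": msg["From"],
        "date": msg["Date"],
        "encryption": marker,
        "body": body,
    }
```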

How do I prevent duplicates from quote blocks?

Quote stripping with talon (Mailgun) or email-reply-parser before embedding. Chunk only the newly written part per mail. The quoted part is referenced via thread relationship, not re-indexed. Saves 50 to 70 percent storage.

What does indexing a 50,000-mail archive cost?

Embedding cost (OpenAI text-embedding-3-small): about CHF 8 to 15 one-off. Qdrant storage: under CHF 3 per month. Attachment OCR (see ocr-für-belege-und-verträge) depending on volume: CHF 50 to 300 one-off. Server hosting for the pipeline: CHF 25 to 50 per month. One-off total: CHF 60 to 320, ongoing about CHF 30 to 60 per month.
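
The totals can be re-derived from the line items with a short sketch. The dictionary keys are illustrative, and "under CHF 3" is modelled as a range of 0 to 3.

```python
# (low, high) estimates in CHF, taken from the line items above.
one_off = {"embedding": (8, 15), "attachment_ocr": (50, 300)}
monthly = {"qdrant_storage": (0, 3), "pipeline_hosting": (25, 50)}


def total(items):
    """Sum the low and high ends of every cost range."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


print("one-off CHF:", total(one_off))
print("monthly CHF:", total(monthly))
```

The one-off sum of CHF 58 to 315 matches the rounded CHF 60 to 320; the monthly sum of CHF 25 to 53 sits slightly below the stated "about CHF 30 to 60".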

Sources

Microsoft Graph - Mail API documentation · 2026-05
Mailgun talon - email quote and signature extraction · 2026-05
JWZ - message threading algorithm · 2026-05
EDÖB - data protection guidance for email archives · 2026-05
