Data cleaning before RAG: duplicates, boilerplate, OCR artefacts, charset issues, watermarks

Why roughly 30 percent of a typical unprocessed RAG corpus is junk and how to remove it: duplicate detection, header/footer stripping, OCR correction, encoding repair and watermark removal with cleanlab and dedupe.io.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is pre-RAG data cleaning?

Pre-RAG data cleaning is the systematic removal of junk content from the corpus before it is chunked, embedded and indexed. "Junk" has six main forms: duplicates, blank pages, header/footer boilerplate, OCR artefacts, charset/encoding damage, and watermarks (as text or as a PDF image layer).

In practice, 20 to 40 percent of an unprocessed fiduciary or legal corpus is junk. Two direct consequences follow: first, storage and embedding cost rise linearly with corpus volume; second and worse, retrieval precision drops because identical or near-identical chunks flood the top-k results and crowd out thematically relevant but rarer chunks.

Data cleaning is not a nice-to-have but a precondition for usable answer quality. We measure the effect regularly: after systematic cleaning, retrieval recall@5 rises 15 to 30 percent and answer quality in blind tests rises half a step to a full step on a 5-point scale.

As of May 2026, established tools cover each step. cleanlab (open source plus cloud) finds duplicated, ambiguous or mislabelled data points. dedupe.io is a widely used library for fuzzy record matching. fuzzywuzzy or the faster RapidFuzz computes string distances. Custom regex handles boilerplate stripping. ftfy ("fixes text for you") repairs encoding damage. Tesseract-specific post-processing tools (PostOCR-Toolkit) correct typical OCR errors.

Why it matters

Three effects make data cleaning the most expensive omission in RAG construction.

First: retrieval bias from duplicates. If a 30-page lease lives in 12 different places across client folders (as original, copy, mail attachment, reply attachment, scan and OCR version), naive indexing stores every chunk of it 12 times. A question about this lease fills the top-12 results with the same content, and the model sees no alternative angles. Deduplication reduces this to one copy of each chunk plus 11 metadata references.

Second: boilerplate noise. Every client PDF carries a header and footer with company name, address, phone and date. Without removal you index the same address block thousands of times. A question such as "who is the managing director?" hits these blocks instead of the answer inside the contract body. Thousands of near-identical boilerplate vectors also cluster in one region of the vector space and surface for any query that lands near it.

Third: OCR artefacts as invisible textual bombs. Tesseract regularly confuses "1" and "I", "0" and "O", "rn" and "m". A typo in the embedding input shifts the vector measurably, so a relevant passage is no longer found. For balance-sheet figures this is catastrophic: "100,000" becomes "I00,000" and the model fails the assets/liabilities check.

For fiduciary and legal contexts a fourth dimension is added: legal data-quality duty. Art. 957a CO demands complete, correct, clear bookkeeping. A balance-sheet answer distorted by OCR errors that ends up in a tax return can create liability. Cleaning is therefore a compliance measure, not just performance tuning.

How it works

Duplicate detection: exact duplicates (same hash) are found in seconds via MD5 or SHA-256. Fuzzy duplicates (same content, slight formatting differences) are found via MinHash (datasketch library) or SimHash. Semantic duplicates (same meaning, different words) are found via embedding proximity (cosine similarity above 0.95). cleanlab and dedupe.io combine these strategies.
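The first two levels can be sketched with the standard library alone. This is a minimal illustration, not a production setup: the function names and the shingle size k=5 are assumptions, and the exact Jaccard similarity computed here is the value that MinHash (for example in datasketch) estimates at scale.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not change the hash."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_hash(text: str) -> str:
    """SHA-256 of the normalized text; equal hashes mean exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedupe_exact(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of every distinct normalized chunk."""
    seen: set[str] = set()
    kept = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def shingles(text: str, k: int = 5) -> set[str]:
    """Set of overlapping k-word windows (k=5 is an assumed default)."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 5) -> float:
    """Exact Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Comparing every pair with jaccard() is quadratic in the number of chunks, which is why larger corpora switch to MinHash with locality-sensitive hashing.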

Empty and mini chunks: chunks under 50 tokens usually contain only a page number, a date or a single heading and are filtered out. The same applies to chunks that consist of 80 percent or more whitespace or special characters.
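A filter for both rules might look like the sketch below. It assumes that whitespace-separated words are an acceptable stand-in for model tokens; is_junk_chunk and its parameter names are illustrative.

```python
def is_junk_chunk(text: str, min_tokens: int = 50, max_noise_ratio: float = 0.8) -> bool:
    """True for chunks that are too short or mostly whitespace/special characters."""
    if not text:
        return True
    # Assumption: whitespace-separated words approximate the tokenizer's count.
    if len(text.split()) < min_tokens:
        return True
    # Everything that is neither a letter nor a digit counts as noise.
    noise = sum(1 for ch in text if not ch.isalnum())
    return noise / len(text) >= max_noise_ratio
```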

Boilerplate stripping: recurring header and footer blocks are identified per client or source system. The approach: compare all chunks from the same source mailbox and cut prefix and suffix strings that appear in more than 80 percent of them. BoilerPy3 and trafilatura handle HTML sources; custom regex covers the rest.
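The 80 percent rule can be sketched as follows. As a simplification, this version only looks at the first and last non-empty line of each chunk rather than multi-line prefix and suffix strings; the function names and the sample company in the usage note are hypothetical.

```python
from collections import Counter


def find_boilerplate_lines(chunks: list[str], threshold: float = 0.8) -> tuple[set[str], set[str]]:
    """Return (headers, footers): lines that open or close more than `threshold` of the chunks."""
    first: Counter = Counter()
    last: Counter = Counter()
    for chunk in chunks:
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        if lines:
            first[lines[0]] += 1
            last[lines[-1]] += 1
    cutoff = threshold * len(chunks)
    headers = {ln for ln, n in first.items() if n > cutoff}
    footers = {ln for ln, n in last.items() if n > cutoff}
    return headers, footers


def strip_boilerplate(chunk: str, headers: set[str], footers: set[str]) -> str:
    """Remove a known header/footer line; blank lines are dropped as a side effect."""
    lines = [ln for ln in chunk.splitlines() if ln.strip()]
    if lines and lines[0].strip() in headers:
        lines = lines[1:]
    if lines and lines[-1].strip() in footers:
        lines = lines[:-1]
    return "\n".join(lines)
```

Run find_boilerplate_lines once per source mailbox or client folder, then apply strip_boilerplate to each chunk from that source, for example chunks that all start with "ACME AG, Bahnhofstrasse 1".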

OCR correction: heuristics for typical Tesseract confusions (0/O, 1/I/l, rn/m) are combined with a language-model spellcheck. Options are PostOCR-Toolkit or your own LLM postprocessing: Claude or Mistral models correct OCR output at a token cost of roughly 1 to 5 US cents per page, depending on the model.
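For the digit confusions, a rule-based first pass could look like this sketch. It only rewrites tokens that already contain at least one real digit, so ordinary words stay untouched; the regex and the mapping are assumptions and will still misfire on tokens such as chemical formulas, so results on financial figures need review.

```python
import re

# Letters that OCR commonly emits in place of digits.
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})

# A standalone token made only of digits, look-alike letters and number separators.
NUMERIC_TOKEN = re.compile(
    r"(?<![A-Za-z0-9])[0-9OoIl][0-9OoIl.,']*[0-9OoIl](?![A-Za-z0-9])"
)


def fix_numeric_confusions(text: str) -> str:
    """Map O/o to 0 and I/l to 1 inside tokens that look like numbers."""
    def repair(match: re.Match) -> str:
        token = match.group(0)
        # Skip tokens without any real digit, e.g. the word "Ill".
        if not any(ch.isdigit() for ch in token):
            return token
        return token.translate(DIGIT_LOOKALIKES)

    return NUMERIC_TOKEN.sub(repair, text)
```

With this rule, the "I00,000" from the balance-sheet example becomes "100,000" again.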

Charset/encoding repair: ftfy (fixes text for you) repairs common encoding damage (UTF-8 decoded as Latin-1 etc.). chardet detects the most likely encoding. Mojibake (multi-encoding damage) is usually detected and repaired automatically.
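To show what such a repair does under the hood, here is a standard-library sketch of the most common case, UTF-8 bytes that were decoded as Windows-1252 or Latin-1. It is not ftfy's implementation and covers far fewer cases; repair_mojibake is a hypothetical helper.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 decoded as Windows-1252/Latin-1' damage, if present."""
    for wrong_codec in ("cp1252", "latin-1"):
        try:
            candidate = text.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text does not round-trip, so it was not damaged this way.
            continue
        # A real repair merges multi-byte sequences, so the text gets shorter.
        if len(candidate) < len(text):
            return candidate
    return text
```

Clean text falls through unchanged, because correctly decoded umlauts do not form valid UTF-8 byte sequences when re-encoded.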

Watermarks: watermark text ("DRAFT", "CONFIDENTIAL", client watermarks with date) is detected via frequency analysis (it appears on every page of a file) and removed. For PDF watermarks stored as an image layer, text extraction with pdftotext often helps, because it reads only the text layer and ignores the watermark image; the -layout flag additionally preserves the physical page layout.
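The frequency analysis can be sketched as below, assuming the file has already been split into one string per page. The min_pages guard is an added assumption: with very few pages, a line on every page is not strong evidence of a watermark.

```python
from collections import Counter


def remove_watermark_lines(pages: list[str], min_pages: int = 3) -> list[str]:
    """Drop lines that occur on every page of one file."""
    if len(pages) < min_pages:
        return pages  # too few pages to tell a watermark from ordinary content
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    watermark = {ln for ln, n in counts.items() if n == len(pages)}
    return [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in watermark)
        for page in pages
    ]
```

Note that this also removes running headers and footers that repeat on every page, which overlaps with boilerplate stripping.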

Language filter: when a German client context is expected and a chunk comes back as English, Spanish or random characters, it is usually junk. langdetect or fastText language identification detects the language; off-language chunks go into a quarantine queue for manual review.
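The routing logic is independent of the detector. The sketch below uses a deliberately crude stopword count as a stand-in for langdetect or fastText, purely so that it runs without dependencies; the word lists and function names are assumptions, and a real pipeline would swap guess_language for a proper detector.

```python
import re
from typing import Optional

# Tiny illustrative stopword lists, NOT a real language model.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "auch", "den", "von"},
    "en": {"the", "and", "is", "not", "with", "for", "of", "to", "this", "that"},
}


def guess_language(text: str) -> Optional[str]:
    """Language with the most stopword hits, or None if there are no hits or a tie."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    top = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == top]
    if top == 0 or len(winners) > 1:
        return None
    return winners[0]


def route_chunks(chunks: list[str], expected: tuple[str, ...] = ("de",)) -> tuple[list[str], list[str]]:
    """Split chunks into (keep, quarantine) by detected language."""
    keep, quarantine = [], []
    for chunk in chunks:
        target = keep if guess_language(chunk) in expected else quarantine
        target.append(chunk)
    return keep, quarantine
```

Undetectable chunks land in quarantine rather than being deleted, which matches the manual-review step described above.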

Order matters. We recommend: first fix encoding, then OCR correction, then boilerplate stripping, then dedup, then language filter, finally mini-chunk filter. The wrong order creates phantom problems.

Cleaning workflow in 6 steps

01 Diagnosis: manually inspect a sample of 200 random chunks, measure junk rate per source system.
02 Encoding repair: apply ftfy to all texts, check with chardet, fix mojibake.
03 OCR postprocessing: heuristics per source system (Tesseract-typical, Mistral OCR-typical) plus LLM correction on low confidence.
04 Boilerplate stripping: per source mailbox / client folder detect frequent prefix/suffix strings and cut them.
05 Duplicate dedup: first exact (hash), then fuzzy (MinHash), optionally semantic (embedding cosine 0.97+, with metadata safeguard).
06 Language filter and mini-chunk filter: check expected languages, drop chunks under 50 tokens. Audit log for every removed chunk.
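The per-chunk steps above can be wired together with a small runner that enforces the order and writes the audit log. This is a sketch under the assumption that each step is a function from text to text and signals "drop this chunk" by returning an empty string; corpus-level steps such as dedup would run between passes. All names are illustrative.

```python
from typing import Callable

Step = tuple[str, Callable[[str], str]]


def run_pipeline(chunk_id: str, text: str, steps: list[Step], audit_log: list[dict]) -> str:
    """Apply cleaning steps in order; an empty result means the chunk was dropped."""
    for name, step in steps:
        cleaned = step(text)
        if cleaned != text:
            # Record every modification, including removals.
            audit_log.append({"chunk": chunk_id, "step": name, "dropped": cleaned == ""})
        text = cleaned
        if text == "":
            break  # later steps have nothing left to process
    return text
```

Because the step list is plain data, it can be versioned together with the cleaning rules and replayed at every re-indexing.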

When to use it

Always. There is no sensible RAG setup without cleaning. The question is just how elaborate.

For small corpora (under 1000 documents) a simple script with MD5 dedup, boilerplate regex and ftfy usually suffices. Setup: half a day.

For mid-size corpora (1000 to 50,000 documents) cleanlab or dedupe.io plus an OCR-correction pipeline pays off. Setup: 2 to 4 days.

For large or heterogeneous corpora (over 50,000 documents, many source systems): a comprehensive pipeline with quality gates and manual spot-check. Setup: 5 to 15 days.

Important: cleaning must never run only once. It is reapplied at every re-indexing; on pipeline updates, the underlying rules must be versioned.

When not to use

Data cleaning is never optional. But certain cleaning steps can be counterproductive.

Aggressive dedup logic must not confuse "almost identical" with "identical". Two contracts with identical boilerplate but different clients and terms must both be kept. Embedding cosine 0.95 as a dedup threshold usually does not suffice; metadata comparison (same client, same date) must also be checked.
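A metadata safeguard of this kind can be sketched as follows. The record layout (a dict with "client", "date" and "vector" keys) and the 0.97 default are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_semantic_duplicate(a: dict, b: dict, threshold: float = 0.97) -> bool:
    """Duplicate only if client and date match AND the vectors are close."""
    same_meta = a["client"] == b["client"] and a["date"] == b["date"]
    return same_meta and cosine(a["vector"], b["vector"]) >= threshold
```

Two contracts with identical wording but different clients therefore both stay in the index, however close their embeddings are.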

LLM-based OCR correction must be applied with care because the models themselves can hallucinate. On balance-sheet data, corrections with confidence under 99 percent go to a review queue, not directly into production.
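The review gate itself is simple to express. In this sketch the confidence value is assumed to come from the correction step (for instance a model-reported score); how it is produced is outside the scope here, and route_correction is a hypothetical name.

```python
def route_correction(original: str, corrected: str, confidence: float, threshold: float = 0.99) -> dict:
    """Accept a correction only at or above the threshold; otherwise keep the original and flag it."""
    if corrected == original:
        return {"text": original, "status": "unchanged"}
    if confidence >= threshold:
        return {"text": corrected, "status": "accepted"}
    # Low confidence: production keeps the original, a human sees the proposal.
    return {"text": original, "status": "review", "proposed": corrected}
```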

Boilerplate stripping must not remove important client address information when that is relevant to the RAG use case (e.g. CRM analytics).

In especially sensitive contexts (Art. 321 SCC) cleaning must never modify the original document. Cleaning always happens on a copy; the original stays unchanged in audit storage.

Trade-offs

STRENGTHS

Retrieval recall@5 typically rises 15 to 30 percent
Storage and embedding cost drop 20 to 40 percent
Prevents systematic wrong answers from OCR artefacts
Meets data-quality duties under Art. 957a CO

WEAKNESSES

Setup effort: half a day to 15 days depending on corpus size and diversity
Aggressive dedup can delete legitimate variants
LLM-based OCR correction can introduce model hallucinations
Pipeline must rerun idempotently on every re-indexing

FAQ

How much of the corpus is typically removed by cleaning?

In a fiduciary office with historical mailbox: 25 to 45 percent. In a curated guideline library: 5 to 10 percent. In a legal archive with OCR scans: 30 to 50 percent. We measure before and after cleaning - the figure is an important quality indicator.

When is LLM-based OCR correction worth it?

For accounting-relevant data (balance sheets, payslips, VAT statements): always, because every wrong character can have consequences. For prose-heavy documents: only on low OCR confidence. Cost: about 1 US cent per page with Claude Haiku, 5 cents with Sonnet.

What is the difference between cleanlab and dedupe.io?

cleanlab focuses on ML data quality (labels, outliers, duplicates for training sets). dedupe.io specialises in structured records (client database dedup, CRM cleaning). For pure text corpora, cleanlab or a custom MinHash solution fits better; for client database dedup use dedupe.io.

What does cleaning cost on an ongoing basis?

Server cost under CHF 50 per month for mid-size corpora. LLM cost for OCR correction scales with volume: at 10,000 pages per month about CHF 100 with Haiku, CHF 500 with Sonnet. Cleaning should be 5 to 10 percent of total RAG cost - anything above signals a skewed pipeline.
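The volume figures follow from the per-page prices. A quick check, assuming 1 and 5 cents per page and treating US cents and CHF at rough parity as the figures above implicitly do; both function names are illustrative.

```python
def monthly_llm_cost(pages_per_month: int, cents_per_page: int) -> float:
    """Monthly OCR-correction cost in whole currency units."""
    return pages_per_month * cents_per_page / 100


def cleaning_share(cleaning_cost: float, total_rag_cost: float) -> float:
    """Cleaning cost as a fraction of total RAG cost (target range: 0.05 to 0.10)."""
    return cleaning_cost / total_rag_cost


haiku = monthly_llm_cost(10_000, 1)   # 100.0
sonnet = monthly_llm_cost(10_000, 5)  # 500.0
```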

Sources

cleanlab - Find label issues and duplicates in any dataset · 2026-05
dedupe.io - Library for fuzzy record matching · 2026-05
ftfy - fixes text for you (encoding repair) · 2026-05
datasketch - MinHash and HyperLogLog for near-duplicate detection · 2026-05
