DOCUMENT LOADERS · AI CONCEPT
Document loaders: cleanly ingesting PDF, DOCX, XLSX, EML, HTML and Markdown into RAG
Which tools convert which document formats losslessly into a RAG pipeline: Unstructured.io, LlamaParse, MarkItDown (Microsoft, May 2026), PyMuPDF and pandoc compared head-to-head.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What are document loaders?
Document loaders are the first station of any RAG pipeline. They take a document from its source format (PDF, DOCX, XLSX, EML, HTML, Markdown, EPUB, RTF, CSV) and return structured text plus metadata. That sounds simple, but it is not: a Word document contains track changes, comments, embedded tables, footnotes and cross-references. A PDF has multi-column layouts, headers and footers, embedded fonts and occasionally scanned pages. An email brings MIME multipart structure, attachments, HTML and plaintext variants, and threading information.
As of May 2026, the tool landscape has settled into a clear division of labour. Unstructured.io is the pragmatic all-rounder with a hi-res mode for layout recognition. LlamaParse (LlamaIndex) delivers excellent PDF extraction quality for a fee. MarkItDown (Microsoft, GA March 2025, stable in May 2026) targets Microsoft Office formats and converts them to very clean Markdown. PyMuPDF (AGPL/commercial) is the fastest open-source PDF library. Pandoc remains the gold standard for Markdown/RST/LaTeX conversions.
The choice depends on the format mix and the licensing model. Anyone processing only Microsoft Office is fastest with MarkItDown. Anyone receiving whatever clients send needs a pipeline with multiple loaders and a format-detection step up front.
Why it matters
Loader quality decides everything downstream. A loader that interleaves the lines of a three-column PDF page across columns hands garbage to the chunker, which passes it on to the embedder, the retriever and finally the language model. Garbage in, garbage out, and every stage of the pipeline compounds the error.
For fiduciary, legal and government contexts, the loss question is especially relevant. Footnotes in a tax opinion, track changes in a contract draft, comments in an Excel worksheet often carry the legally decisive information. A loader that silently drops them undermines the evidentiary value of the AI answer.
Licensing matters too. PyMuPDF is AGPL: anyone using it in a production SaaS platform must release their own software under AGPL or obtain a commercial licence from Artifex Software. Unstructured.io has a permissive Apache licence for the core library, while the hosted API service is paid. LlamaParse is cloud-only and sends documents to LlamaIndex servers in the US, which is a problem under the Swiss revDSG and professional secrecy for sensitive client data.
For EU/CH hosting we recommend self-hosted Unstructured.io (Docker) or MarkItDown locally as the default pipeline, with LlamaParse or Reducto AI as an optional premium path for complex layouts with explicit client consent.
How it works
PDF: three classes. (1) Text PDFs with embedded text: PyMuPDF, pdfplumber, pdfminer.six return text in milliseconds. (2) PDFs with complex layout: Unstructured.io Hi-Res, LlamaParse or Marker with layout recognition. (3) Scanned PDFs: route through OCR first (see ocr-für-belege-und-verträge).
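The split between class (1) and class (3) can be approximated by checking how much text a fast extractor returns per page. The sketch below assumes the per-page strings come from such an extractor (for example PyMuPDF's `page.get_text()`); the names `route_pdf` and `PdfRoute` and both thresholds are hypothetical and untuned. It does not attempt to detect class (2), complex layouts, which needs a layout model.

```python
from dataclasses import dataclass


@dataclass
class PdfRoute:
    route: str              # "text" or "ocr"
    empty_pages: list[int]  # 0-based indices of pages without usable text


def route_pdf(page_texts: list[str], min_chars: int = 20,
              max_empty_ratio: float = 0.2) -> PdfRoute:
    """Decide between direct text extraction and an OCR pass.

    page_texts holds the text a fast extractor returned for each page.
    min_chars and max_empty_ratio are illustrative assumptions.
    """
    if not page_texts:
        return PdfRoute("ocr", [])
    # A page counts as "empty" when it carries almost no embedded text.
    empty = [i for i, text in enumerate(page_texts)
             if len(text.strip()) < min_chars]
    ratio = len(empty) / len(page_texts)
    return PdfRoute("ocr" if ratio > max_empty_ratio else "text", empty)
```

Returning the list of empty pages lets a pipeline OCR only those pages of an otherwise text-based PDF instead of the whole file.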
DOCX: python-docx reads paragraph styles and tables; track changes are not exposed by its high-level API and have to be read from the underlying XML. MarkItDown converts to Markdown and respects heading hierarchy. For complex Word documents with comment trails, headless LibreOffice remains the most reliable converter when fidelity to the original matters more than speed.
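A DOCX file is a ZIP container, and tracked changes are stored in `word/document.xml` as `w:ins` and `w:del` elements. The following is a minimal standard-library sketch for surfacing them; the function name `tracked_changes` and the output shape are hypothetical, and comments, moves and formatting changes are deliberately ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tracked_changes(docx_bytes: bytes) -> list[dict]:
    """Return tracked insertions and deletions found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    changes = []
    # Inserted text sits in w:t, deleted text in w:delText.
    for kind, tag, text_tag in (("insert", "ins", "t"),
                                ("delete", "del", "delText")):
        for node in root.iter(f"{{{W}}}{tag}"):
            text = "".join(t.text or "" for t in node.iter(f"{{{W}}}{text_tag}"))
            if text:  # skip markers that carry no text, e.g. paragraph marks
                changes.append({
                    "kind": kind,
                    "author": node.get(f"{{{W}}}author"),
                    "text": text,
                })
    return changes
```

Attaching these records as metadata, rather than merging them into the body text, keeps the accepted wording and the revision history separable downstream.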
XLSX: openpyxl or pandas (read_excel) extract cells. MarkItDown serialises a worksheet as a Markdown table. Caveat: one sheet at a time. Formulas are usually exported as values, not as formula text. Merged cells and pivot tables are problem cases.
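To illustrate the worksheet-to-Markdown step, here is a small sketch that renders rows as a Markdown table. It assumes the rows were already read, for instance with openpyxl's `ws.iter_rows(values_only=True)`; opening the workbook with `load_workbook(path, data_only=True)` yields cached values instead of formula text. The function name `rows_to_markdown` is hypothetical, and this is not how MarkItDown implements its conversion.

```python
def rows_to_markdown(rows: list[tuple]) -> str:
    """Render worksheet rows as a Markdown table; the first row is the header."""
    if not rows:
        return ""

    def cell(value) -> str:
        # Empty cells, including the non-anchor cells of a merged range,
        # arrive as None and are rendered as blank.
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ")

    # Pad ragged rows so every line has the same number of columns.
    width = max(len(r) for r in rows)
    padded = [list(r) + [None] * (width - len(r)) for r in rows]

    lines = ["| " + " | ".join(cell(v) for v in padded[0]) + " |",
             "| " + " | ".join("---" for _ in range(width)) + " |"]
    lines += ["| " + " | ".join(cell(v) for v in r) + " |" for r in padded[1:]]
    return "\n".join(lines)
```

Blank cells from merged ranges are exactly where context gets lost, so a real pipeline may prefer to copy the anchor value into the merged cells before rendering.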
EML/MSG: the mail-parser package (imported as mailparser), Python's built-in email module or unstructured.partition.email read MIME multipart, decode HTML and text bodies, and extract attachments (which are fed recursively back into the loader). The MSG format (Outlook) needs extract-msg; Outlook PST/OST archives need libpff.
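A minimal sketch of the EML path using only Python's standard `email` package. The function name `load_eml` and the returned dictionary layout are hypothetical; the sample message at the end exists only to make the example self-contained.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def load_eml(raw: bytes) -> dict:
    """Parse an RFC 5322 message into body text, threading headers, attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plaintext variant, fall back to HTML.
    body = msg.get_body(preferencelist=("plain", "html"))
    attachments = [
        {
            "filename": part.get_filename(),
            "content_type": part.get_content_type(),
            "payload": part.get_payload(decode=True),  # decoded bytes
        }
        for part in msg.iter_attachments()
    ]
    return {
        "subject": msg.get("subject", ""),
        "from": msg.get("from", ""),
        "message_id": msg.get("message-id", ""),
        "in_reply_to": msg.get("in-reply-to", ""),  # threading information
        "body": body.get_content() if body is not None else "",
        "attachments": attachments,
    }


# Build a sample message with one attachment and parse it again.
sample = EmailMessage()
sample["Subject"] = "Invoice"
sample["From"] = "client@example.com"
sample.set_content("Please see attached.")
sample.add_attachment(b"%PDF-1.7 fake", maintype="application",
                      subtype="pdf", filename="invoice.pdf")
parsed = load_eml(sample.as_bytes())
```

Each attachment payload can then be handed back to the format dispatcher, which is the recursion the paragraph above describes.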
HTML: BeautifulSoup combined with a boilerplate-removal library (trafilatura, readability-lxml) removes navigation, ads and footers. For Markdown conversion, html2text, markdownify or MarkItDown are suitable.
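The idea behind boilerplate removal can be shown with the standard library alone. The sketch below simply drops text inside a fixed set of tags; this tag list and the names `MainTextExtractor` and `extract_main_text` are assumptions for illustration. Libraries such as trafilatura use far more robust heuristics.

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (an illustrative choice).
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside a skipped tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```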
Markdown: usually directly processable. If front matter (YAML/TOML) is present, extract via python-frontmatter and attach as metadata to the chunk. Pandoc can convert Markdown to JSON AST, which allows structure-aware chunking.
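As a rough illustration of front-matter handling, the sketch below splits a `---` delimited block from the body and parses flat `key: value` pairs only. It is a simplification, not a YAML parser, and the name `split_front_matter` is hypothetical. With python-frontmatter installed, `frontmatter.loads(text)` returns an object with `.metadata` and `.content` and handles nested YAML properly.

```python
def split_front_matter(markdown: str) -> tuple[dict, str]:
    """Split '---' delimited front matter from the Markdown body.

    Only flat 'key: value' lines are parsed (an assumption of this sketch).
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, markdown  # unterminated block: treat everything as body
    meta = {}
    for line in lines[1:end]:
        if ":" in line and not line.lstrip().startswith("#"):
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body
```

The returned dictionary can be merged into the metadata attached to every chunk of the document.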
In practice you do not build a homogeneous pipeline but a dispatcher: detect file magic bytes via libmagic or python-magic, then call the matching loader. unstructured.partition.auto has this dispatch built in, and so does MarkItDown. Both have limits: complex PDFs are better off via LlamaParse, exotic mail formats (Lotus Notes, Apple Mail exports) via specialised parsers.
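The dispatcher pattern can be sketched without any third-party dependency by checking a handful of well-known file signatures by hand. This is a stand-in for libmagic, which knows far more formats; the names `sniff_format` and `dispatch` and the loader registry are hypothetical.

```python
import io
import zipfile

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_format(data: bytes) -> str:
    """Guess the format from leading bytes instead of the file extension."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # DOCX/XLSX/PPTX are ZIP containers; member names tell them apart.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = zf.namelist()
        except zipfile.BadZipFile:
            return "unknown"
        for prefix, fmt in (("word/", "docx"), ("xl/", "xlsx"), ("ppt/", "pptx")):
            if any(name.startswith(prefix) for name in names):
                return fmt
        return "zip"
    if data.startswith(OLE2_MAGIC):
        return "ole2"  # legacy .doc/.xls and Outlook .msg share this container
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


def dispatch(data: bytes, loaders: dict) -> str:
    """Call the loader registered for the detected format."""
    fmt = sniff_format(data)
    loader = loaders.get(fmt)
    if loader is None:
        raise ValueError(f"no loader registered for format {fmt!r}")
    return loader(data)
```

Raising on unknown formats, rather than falling back to a generic text loader, makes gaps in format coverage visible during the file inventory.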
Loader workflow in 6 steps
- 01 Build a file inventory: which formats appear, in what distribution, with what confidentiality?
- 02 Write a dispatcher: libmagic/python-magic for format detection, then the right loader per format.
- 03 Choose a default loader: unstructured.partition.auto for the mix, MarkItDown for Office-only.
- 04 Define a premium path: LlamaParse or Reducto AI for complex PDFs, with client consent documented.
- 05 Metadata extraction: source, creation date, author, client, confidentiality. Attach to every chunk.
- 06 Lossless test with 20 real documents per format: a human compares loader output against the original and fixes blind spots.
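Step 05 can be sketched as a small data model in which every chunk carries the document-level metadata. The class names, field names and the naive paragraph-based chunking below are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentMetadata:
    source: str
    created: str
    author: str
    client: str
    confidentiality: str


@dataclass
class Chunk:
    text: str
    index: int
    metadata: DocumentMetadata


def chunk_with_metadata(text: str, metadata: DocumentMetadata,
                        max_chars: int = 500) -> list[Chunk]:
    """Group paragraphs into chunks of at most roughly max_chars characters
    and attach the same metadata object to each chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pieces, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            pieces.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        pieces.append(current)
    return [Chunk(t, i, metadata) for i, t in enumerate(pieces)]
```

Keeping client and confidentiality on the chunk itself allows the retriever to filter by them before anything reaches the language model.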
When to use which loader
MarkItDown for pure Microsoft Office stacks (DOCX, XLSX, PPTX, OneNote): fast, clean Markdown, MIT licence, runs locally.
Self-hosted Unstructured.io for the heterogeneous client mix: 30+ formats, layout recognition in hi-res mode, Apache licence, Docker deployment. Default recommendation for fiduciary and legal contexts.
LlamaParse for complex PDFs with tables, multi-column layouts and footnotes, when quality matters more than data sovereignty. Obtain client consent because data goes to US servers.
Reducto AI as a premium alternative to LlamaParse with EU hosting option since May 2026.
PyMuPDF for high-throughput PDF pipelines if the AGPL licence aligns with the business model or a commercial Artifex licence is in place.
Pandoc for scientific texts, LaTeX, RST, EPUB. Gold standard in academia and publishing.
When not to use
Single already-structured JSON or XML sources need no loader. A direct parser and a mapping onto the chunk schema is enough.
Images, plans, diagrams: belong in a vision pipeline (multimodal embeddings), not a text loader. Exception: Document AI (Google), which does both.
Database content: direct SQL adapter (text2sql) is faster and more accurate than the detour via CSV export and loader.
Live data (APIs, webhooks): a loader makes no sense here; use a stream-ingestion setup with pub/sub instead.
Trade-offs
STRENGTHS
- Unstructured.io: many formats, layout mode, Apache licence, self-hostable
- MarkItDown: fast, MIT licence, ideal for Office stacks
- LlamaParse: highest PDF quality, good table extraction
- PyMuPDF: fastest open-source PDF library
WEAKNESSES
- Unstructured.io hi-res mode is compute-heavy, GPU recommended
- MarkItDown covers the Microsoft world well, less elsewhere
- LlamaParse cloud-only, data leaves to the US
- PyMuPDF AGPL licence, commercial licence needed for SaaS use
FAQ
Which loader is best for PDFs with tables?
As of May 2026, Marker (OSS, Apache licence) leads on purely open stacks, followed by Unstructured.io Hi-Res. Among cloud tools, LlamaParse Premium and Reducto AI deliver the highest quality, both above 95 percent accuracy on structured balance-sheet tables. Microsoft Table Transformer is the option when the model itself must be self-hosted.
May I use LlamaParse for client data?
Only with explicit client consent and a data processing agreement, since LlamaParse sends data to US servers. For data under professional secrecy (Art. 321 SCC) we advise against it. Self-hosted Unstructured.io or local MarkItDown are the legally safe alternatives.
What does a loader pipeline cost?
Open-source loaders (Unstructured core, MarkItDown, PyMuPDF on the OSS path): server costs only. A small fiduciary office with 50,000 pages a year runs this on a 4 vCPU server for CHF 25 a month, i.e. CHF 300 a year. LlamaParse Premium costs 3 US cents per page: 50,000 pages × USD 0.03 = USD 1,500, or about CHF 1350 a year. Reducto AI is priced similarly.
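The comparison can be reproduced as a short calculation. The exchange rate of CHF 0.90 per USD is an assumption inferred from the figures above (CHF 1350 for USD 1,500), not a quoted rate; the function name `yearly_cost_chf` is hypothetical.

```python
from decimal import Decimal


def yearly_cost_chf(pages_per_year: int, usd_per_page: Decimal,
                    chf_per_usd: Decimal) -> Decimal:
    """Per-page cloud pricing converted to CHF per year."""
    usd = pages_per_year * usd_per_page
    return (usd * chf_per_usd).quantize(Decimal("0.01"))


# 50,000 pages at USD 0.03; the 0.90 rate is an assumed value.
cloud = yearly_cost_chf(50_000, Decimal("0.03"), Decimal("0.90"))
# Self-hosted: CHF 25 per month for the server.
self_hosted = Decimal("25") * 12
```

Under these assumptions the cloud path costs about four and a half times the self-hosted server, before counting staff time for operating the server.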
What about Lotus Notes, old .doc, .wpd?
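antiword, wv or LibreOffice (legacy .doc), libwpd or LibreOffice (WordPerfect .wpd). Lotus Notes databases (NSF) are not covered by libpff, which reads Outlook PST/OST files; Notes mail generally has to be exported into a standard mail format first. These formats are rare and need dedicated adapters. As a fallback: convert the file to a more modern format (Office Online, LibreOffice CLI), then use the standard loader.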