
PDF table extraction: Camelot, Tabula, pdfplumber, Table Transformer, Marker

Which tool extracts balance sheets, VAT tables and payslips losslessly from PDFs in May 2026: Camelot, Tabula, pdfplumber, Microsoft Table Transformer and Marker compared head-to-head.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is PDF table extraction?

PDF table extraction means reading structured tables from PDFs into machine-readable formats (CSV, JSON, Markdown, Excel). It sounds trivial but is not. PDF is a layout format, not a data format: a PDF stores positioned text, lines and images but has no notion of "cell", "row" or "column". Every table recogniser must therefore reconstruct table boundaries, cell structure and reading order.
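To make the "layout format" point concrete, here is a minimal sketch of the reconstruction step. It assumes a parser has already produced words with bounding-box coordinates; the field names `text`, `x0` and `top` and the sample values are illustrative assumptions, not output from a real document.

```python
# Hypothetical input: words as a PDF parser might report them,
# each with its text and the left/top position of its bounding box.
words = [
    {"text": "Cash", "x0": 50.0, "top": 100.2},
    {"text": "12'500.00", "x0": 300.0, "top": 100.0},
    {"text": "Receivables", "x0": 50.0, "top": 115.1},
    {"text": "8'200.00", "x0": 300.0, "top": 115.0},
]


def group_into_rows(words, y_tolerance=2.0):
    """Cluster words with nearly equal top coordinates into rows."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Compare against the first word of the current row.
        if rows and abs(rows[-1][0]["top"] - word["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, reading order is left to right.
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]


rows = group_into_rows(words)
```

Even this toy version needs a tolerance parameter, which hints at why rule-based extraction is brittle: the right value differs from one document template to the next.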

As of May 2026, two families exist. Rule-based tools (Camelot, Tabula, pdfplumber) look for ruling lines or whitespace patterns and are fast and deterministic but brittle. Model-based tools (Microsoft Table Transformer, Marker, Unstructured Hi-Res, LlamaParse Premium) use transformer-based vision models for layout recognition and are robust to non-trivial layouts, but slower and GPU-hungry.

For fiduciary and accounting work, table extraction is a daily core task. Balance sheets, income statements, VAT tables, payslips, expense reports, account statements: all tables. A single cell error can produce a faulty tax return. Table extraction is therefore not a nice-to-have but the bottleneck for AI-supported accounting.

Why it matters

In day-to-day Swiss fiduciary work, thousands of PDF balance sheets, VAT statements and payslips arrive every month. Without automated table extraction, every value must be typed by hand. We estimate that a five-person fiduciary firm spends 80 to 120 hours per month on PDF data entry.

Good table extraction reduces this to 5 to 15 hours of review. The ROI is clear: at an internal hourly rate of CHF 90, a working pipeline saves roughly CHF 6000 to 9000 per month in personnel cost, against a typical one-off setup cost of CHF 4000 to 8000.
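The arithmetic behind that estimate can be sketched as follows. The sketch pairs the low ends and the high ends of the quoted ranges, which is our assumption; the text does not say how its CHF range was derived, so the result only lands near the rounded figures above.

```python
HOURLY_RATE_CHF = 90
manual_hours = (80, 120)        # hours per month spent on manual entry
review_hours = (5, 15)          # hours per month of review with a pipeline
setup_cost_chf = (4000, 8000)   # one-off setup cost

# Assumed pairing: low with low, high with high.
saved_hours = (
    manual_hours[0] - review_hours[0],
    manual_hours[1] - review_hours[1],
)
monthly_saving_chf = tuple(hours * HOURLY_RATE_CHF for hours in saved_hours)

# Most pessimistic payback: highest setup cost against lowest saving.
worst_case_payback_months = setup_cost_chf[1] / monthly_saving_chf[0]
```

Under these assumptions, even the most expensive setup pays for itself in under two months.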

Error tolerance is low. A balance sheet must not be "approximately" correct. The right strategy is therefore not "perfect algorithm" but "algorithm plus human in the loop": OCR/extraction proposes, a human verifies. Confidence scores decide which cells need review. For balance sheets we recommend a minimum 99.5 percent confidence per cell; anything below goes to the review queue.
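A minimal sketch of this routing rule, assuming the upstream tool supplies a per-cell confidence between 0 and 1. Not every extractor does, and the `Cell` structure here is hypothetical.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.995  # per-cell minimum for balance sheets, from the text


@dataclass(frozen=True)
class Cell:
    row: int
    col: int
    text: str
    confidence: float  # assumed to lie in [0, 1]


def split_for_review(cells, threshold=REVIEW_THRESHOLD):
    """Return (accepted, review_queue) based on per-cell confidence."""
    accepted = [cell for cell in cells if cell.confidence >= threshold]
    review_queue = [cell for cell in cells if cell.confidence < threshold]
    return accepted, review_queue


cells = [
    Cell(0, 0, "Cash", 0.999),
    Cell(0, 1, "12'500.00", 0.98),  # below threshold: a human must verify
]
accepted, review_queue = split_for_review(cells)
```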

Legal requirement: Art. 957a CO (bookkeeping) demands traceable data sources. The original PDF document, the extracted table data and any manual corrections must be stored immutably with timestamp and editor (see ai-audit-trail-design).
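One way to make stored records tamper-evident is a hash chain, sketched below. This is an illustration of the storage idea only, not a statement of what Art. 957a CO requires in detail; the `AuditRecord` fields and function names are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    pdf_sha256: str   # fingerprint of the original PDF
    extracted: str    # extracted or corrected table, serialised as JSON
    editor: str
    timestamp: str
    prev_hash: str    # hash of the previous record, "" for the first one

    def record_hash(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_record(log, pdf_bytes, table, editor):
    """Return a new log with one more record linked to the previous one."""
    record = AuditRecord(
        pdf_sha256=hashlib.sha256(pdf_bytes).hexdigest(),
        extracted=json.dumps(table, sort_keys=True),
        editor=editor,
        timestamp=datetime.now(timezone.utc).isoformat(),
        prev_hash=log[-1].record_hash() if log else "",
    )
    return log + [record]


def verify_chain(log):
    """True if every record still points at the unmodified previous record."""
    expected = ""
    for record in log:
        if record.prev_hash != expected:
            return False
        expected = record.record_hash()
    return True
```

Each manual correction becomes a new record rather than an overwrite, so the original extraction stays visible next to the corrected values.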

How it works

Camelot (MIT): Python library with two modes. "lattice" detects line-based tables (classic balance sheets with borders). "stream" detects whitespace tables. Fast, deterministic, well documented. Weak on complex multi-part tables or tables spanning page breaks.
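A minimal sketch of the two modes, assuming `camelot-py` is installed. The helper names and the 95 percent cut-off are our assumptions; note that Camelot's `parsing_report` gives a per-table accuracy figure, not a per-cell confidence.

```python
def pick_flavor(has_ruling_lines):
    """Map the document style to Camelot's flavor name."""
    return "lattice" if has_ruling_lines else "stream"


def accuracy_ok(parsing_report, minimum=95.0):
    """Check the per-table accuracy (a percentage) in a parsing report."""
    return parsing_report.get("accuracy", 0.0) >= minimum


def extract_with_camelot(pdf_path, pages="1", has_ruling_lines=True):
    """Extract all tables on the given pages as lists of rows."""
    import camelot  # third-party, imported lazily: pip install camelot-py

    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor=pick_flavor(has_ruling_lines)
    )
    return [
        {
            "rows": table.df.values.tolist(),  # table.df is a pandas DataFrame
            "report": table.parsing_report,
            "ok": accuracy_ok(table.parsing_report),
        }
        for table in tables
    ]
```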

Tabula (MIT): Java library with a Python wrapper (tabula-py). Similar to Camelot but better on simple whitespace tables. Standard tool since around 2014.

pdfplumber (MIT): Python library with low-level primitives (lines, rectangles, words). Highly flexible because table detection can be tuned by code. Requires writing code but gives full control. Default choice for fiduciary projects with recurring, uniformly formatted documents.
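As an illustration of tuning by code, the sketch below passes explicit `table_settings` to pdfplumber. The chosen strategies are an assumption for a template with vertical rules but no horizontal ones; a real template needs its own settings, and the helper names are ours.

```python
def clean_rows(rows):
    """Replace empty (None) cells with "" and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in rows]


def extract_first_table(pdf_path, page_index=0):
    """Extract the largest table on one page, or [] if none is found."""
    import pdfplumber  # third-party, imported lazily: pip install pdfplumber

    settings = {
        "vertical_strategy": "lines",   # columns from drawn lines
        "horizontal_strategy": "text",  # rows from text alignment
    }
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        rows = page.extract_table(table_settings=settings)
    return clean_rows(rows) if rows else []
```

Because the settings live in code, they can be versioned per client template and re-run deterministically on every new document.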

Microsoft Table Transformer (MIT): transformer-based object-detection model (built on DETR) that detects table layouts and outputs cell structure. Open-source, runs locally (slow on CPU, fast on GPU). Excellent on unusual layouts, weak on very dense balance-sheet layouts. A standard building block in self-hosted pipelines.

Marker (GPL-3, May 2026 state-of-the-art OSS): tool by Datalab that combines PDF parsing, layout detection and LLM post-correction. Converts entire PDFs (incl. tables) to clean Markdown. May 2026 benchmarks show Marker on par with or better than LlamaParse Premium on table extraction, while being open-source. GPU recommended.

LlamaParse Premium / Reducto AI: cloud-only premium tools. Highest quality on complex tables but data leaves Switzerland. Client consent is mandatory.

In practice you build a cascade: simple balance-sheet PDFs go through Camelot or pdfplumber (fast, free). More complex layouts escalate to Marker or Table Transformer. For non-standard bank statements we recommend LLM post-processing: the extracted JSON is shown to Claude Sonnet or Mistral Large, which checks semantic plausibility ("does the balance match the balance-sheet equation?"). Edge cases go into the review queue.
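The cascade can be sketched as a loop over extractors ordered from cheap to expensive. The extractor callables, the table-level confidence score and the 0.95 default are assumptions for illustration; real tools report quality in different ways, if at all.

```python
from typing import NamedTuple


class Extraction(NamedTuple):
    rows: list
    confidence: float  # assumed table-level score in [0, 1]
    tool: str


def run_cascade(pdf_path, extractors, min_confidence=0.95):
    """Try extractors in order; return (result, needs_review).

    extractors is a list of (name, callable) pairs, where each callable
    takes a path and returns (rows, confidence).
    """
    best = None
    for name, extract in extractors:
        try:
            rows, confidence = extract(pdf_path)
        except Exception:
            continue  # a failing tool simply escalates to the next one
        result = Extraction(rows, confidence, name)
        if confidence >= min_confidence:
            return result, False
        if best is None or confidence > best.confidence:
            best = result
    # Nothing was good enough: hand the best attempt to the review queue.
    return best, True


# Stub extractors standing in for real tools.
def cheap_tool(path):
    return [["Cash"]], 0.80


def strong_tool(path):
    return [["Cash", "12'500.00"]], 0.97


result, needs_review = run_cascade(
    "statement.pdf", [("cheap", cheap_tool), ("strong", strong_tool)]
)
```

Ordering the list by cost means the expensive tools only run on the documents the cheap ones could not handle.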

Table extraction workflow in 6 steps

01 Identify document types: balance sheet, income statement, VAT, payroll, account statement. One pipeline per type.
02 Choose a default tool: pdfplumber for recurring, Marker for heterogeneous, Table Transformer for self-hosted GPU.
03 Define premium escalation: if confidence under 95 percent or unreconstructible structure, escalate to LlamaParse Premium or Reducto AI.
04 Schema validation: check the extracted JSON against a schema defined per document type (Pydantic, Zod).
05 LLM post-correction: Claude Sonnet or Mistral Large checks plausibility (balance-sheet equation, VAT sums).
06 Review queue: every cell with confidence under 99.5 percent or failed schema validation goes to a human reviewer.
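Step 04 can be sketched with the standard library alone. The text names Pydantic and Zod; the dataclass below is a dependency-free stand-in, and the two-column `[label, amount]` schema and Swiss-style thousands separators are assumptions about the document type.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse amounts such as "12'500.00" or "1 234.50" into a Decimal."""
    cleaned = text.replace("'", "").replace("\u2019", "").replace(" ", "")
    return Decimal(cleaned)


@dataclass(frozen=True)
class BalanceLine:
    label: str
    amount: Decimal


def validate_rows(rows):
    """Return (valid_lines, errors); errors are (row_index, message) pairs."""
    valid, errors = [], []
    for index, row in enumerate(rows):
        if len(row) != 2 or not row[0].strip():
            errors.append((index, "expected [label, amount]"))
            continue
        try:
            valid.append(BalanceLine(row[0].strip(), parse_amount(row[1])))
        except InvalidOperation:
            errors.append((index, f"not a number: {row[1]!r}"))
    return valid, errors


valid, errors = validate_rows(
    [["Cash", "12'500.00"], ["Receivables", "8.2oo"], ["", "1"]]
)
```

Rows listed in `errors` would go straight to the review queue in step 06 instead of being silently dropped.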

When to use which tool

Camelot / pdfplumber: for recurring uniformly formatted documents (one bank, one payroll provider). Tune once, then extremely fast and deterministic.

Tabula: simple whitespace tables when a Java stack is already in place.

Microsoft Table Transformer: heterogeneous sources, self-hosted requirement, GPU available. Default for mid-size fiduciary stacks.

Marker: state-of-the-art OSS pipeline as of May 2026 when you can accept the GPL-3 licence and run Marker as a CLI. Best quality without cloud dependency.

LlamaParse Premium / Reducto AI: highest quality when cloud data flow is acceptable (client consent). Useful as an optional premium tier over an OSS default pipeline.

Unstructured Hi-Res: solid all-rounder if Unstructured is already your document loader.

When not to use

For very small volumes (under 50 documents per month) a dedicated setup does not pay off. Manual entry or a semi-automated tool such as Adobe Acrobat's export function is the right choice.

For free-text documents with only sporadic tables: prefer document-based chunking (see chunking-strategien-rag) with Markdown output - tables then become Markdown tables anyway.

For genuinely scanned PDFs (no embedded text): run OCR first (see ocr-für-belege-und-verträge), then extract tables from the OCR output. Direct table extraction without OCR is pointless here.
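Whether a PDF needs OCR can be checked before any table tool runs. The sketch below uses pdfplumber to read the embedded text per page; the 20-character minimum and the "more than half the pages" rule are assumed heuristics, not established thresholds.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """True if most pages carry (almost) no embedded text."""
    if not page_texts:
        return True
    empty_pages = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) > 0.5


def embedded_text_per_page(pdf_path):
    """Return the embedded text of each page (empty for scanned pages)."""
    import pdfplumber  # third-party, imported lazily: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

A typical call would be `needs_ocr(embedded_text_per_page("statement.pdf"))`, with the file name standing in for a real document.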

For purely user-generated PDFs with unpredictable layout (e.g. client-side Excel exports): ask the client for the source file (XLSX). Extracting tables from a PDF export of an Excel file that still exists is unnecessary effort.

Trade-offs

STRENGTHS

pdfplumber: maximum control, MIT, one-off tuning effort pays off
Marker: May 2026 SOTA open source, on par with premium cloud
Table Transformer: local, MIT, robust against unusual layouts
Camelot/Tabula: fast, deterministic, good for simple ruled tables

WEAKNESSES

Rule-based tools weak on complex or cross-page tables
Vision models (Marker, Table Transformer) are GPU-hungry
Marker GPL-3 licence: check compliance for SaaS deployment
Premium cloud sends data to the US: check against the Swiss Data Protection Act (DSG) required

FAQ

Which tool for balance-sheet tables?

For recurring fiduciary client setups (same template): pdfplumber, tuned once. For heterogeneous sources: Marker or Microsoft Table Transformer. Premium cloud (LlamaParse, Reducto) only with client consent. Always run the plausibility check (assets = liabilities + equity) via a downstream LLM.
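The balance-sheet equation itself is plain arithmetic, so a deterministic check can sit alongside the LLM plausibility step. A minimal sketch, assuming amounts are already parsed into `Decimal` and that a 0.05 tolerance for rounding is acceptable (our assumption):

```python
from decimal import Decimal


def balance_sheet_balances(assets, liabilities, equity, tolerance=Decimal("0.05")):
    """Check assets = liabilities + equity; return (ok, difference)."""
    zero = Decimal("0")
    difference = sum(assets, zero) - (sum(liabilities, zero) + sum(equity, zero))
    return abs(difference) <= tolerance, difference


ok, difference = balance_sheet_balances(
    assets=[Decimal("12500.00"), Decimal("8200.00")],
    liabilities=[Decimal("10000.00")],
    equity=[Decimal("10700.00")],
)
```

Using `Decimal` rather than floats avoids binary rounding artefacts in the comparison.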

What does Marker cost to run?

Marker is open source (GPL-3). Running cost is only hardware: a GPU machine (RTX 4090 or L4 cloud) handles about 30 to 50 pages per minute. At 100,000 pages per year: CHF 1500 to 2500 for GPU hosting on Hetzner or OVH.
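For orientation, the quoted throughput translates into compute time as follows. This is simple arithmetic on the figures above. The reading that the hosting cost mostly reflects keeping a machine available, rather than metered GPU hours, is our interpretation and not a figure from the sources.

```python
def gpu_hours(pages, pages_per_minute):
    """Pure processing time in hours at a given throughput."""
    return pages / pages_per_minute / 60


PAGES_PER_YEAR = 100_000
hours_at_30_ppm = gpu_hours(PAGES_PER_YEAR, 30)  # slower end of the range
hours_at_50_ppm = gpu_hours(PAGES_PER_YEAR, 50)  # faster end of the range
```

At these rates a year's volume amounts to a few dozen hours of actual processing.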

How do I handle tables spanning multiple pages?

Rule-based tools (Camelot, Tabula) often fail at page boundaries. Marker and LlamaParse Premium usually detect cross-page tables correctly. As a fallback: pdfplumber plus custom logic that recognises the header on the following page and joins tables. Schema validation required afterwards.
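The fallback logic can be sketched as below, assuming each page has already been extracted into a list of rows and that continuation pages repeat the header row. Rows split in the middle of a page break are not handled here.

```python
def normalise(row):
    """Lower-case and strip cells so header repeats compare equal."""
    return [cell.strip().lower() for cell in row]


def join_page_tables(page_tables):
    """Merge per-page tables into one, dropping repeated header rows."""
    tables = [table for table in page_tables if table]
    if not tables:
        return []
    header = tables[0][0]
    joined = [header]
    for table in tables:
        for row in table:
            if normalise(row) == normalise(header):
                continue  # header of the first page or a repeat on later pages
            if len(row) != len(header):
                raise ValueError(f"column count mismatch: {row!r}")
            joined.append(row)
    return joined


pages = [
    [["Account", "Amount"], ["Cash", "100"]],
    [["Account ", "amount"], ["Bank", "200"]],
]
joined = join_page_tables(pages)
```

The column-count check is a cheap first guard; the schema validation mentioned above should still run on the joined result.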

Do I need a GPU?

For Camelot, Tabula, pdfplumber: no. For Microsoft Table Transformer, Marker and Unstructured hi-res mode: recommended. At moderate volumes (under 10,000 pages per month) a cheap on-demand cloud GPU suffices. From 50,000 pages per month, a dedicated GPU machine pays off.
