fairlane.systems

OCR · AI CONCEPT

OCR for receipts and contracts: Tesseract, AWS Textract, Azure DI, Google DocAI, Mistral OCR, Reducto

Which OCR engine to choose in May 2026 for scanned contracts, receipts and forms: accuracy, price, EU hosting, Swiss data-protection readiness and use-case recommendations.

As of: 2026-05

What is OCR?

OCR (optical character recognition) turns images or scanned pages into machine-readable text. As of May 2026, OCR means more than recognising characters: it includes layout understanding of tables, multi-column text, handwritten annotations, stamps, signatures and checkboxes. Modern engines combine CNN-based character recognition with vision transformers for layout analysis and LLMs for semantic post-correction.

Five major cloud providers (AWS Textract, Azure Document Intelligence, Google Document AI, Mistral OCR, Reducto AI) dominate the premium segment. Tesseract remains the open-source standard for simple cases. For specialised domains (invoices, IDs) there are boutique solutions such as ABBYY, Rossum, Klippa that go beyond raw OCR with structured extraction.

The choice depends on four axes: accuracy on the target document type, price per page, EU or Swiss hosting option, and output format (plain text vs structured JSON with bounding boxes). For fiduciary and legal work, hosting is the decisive filter: local Tesseract, Mistral OCR with EU hosting (May 2026), Azure DI in the Switzerland region, or on-prem ABBYY are the legally safe options.

Why it matters

In every fiduciary office and law firm, scanned documents arrive daily: receipts from abroad, handwritten contract amendments, government letters with stamps, old client files from the archive. Without OCR these are invisible to a RAG pipeline. With bad OCR, client addresses land wrong, amounts get swapped (0/O, 1/I/l confusions), and the AI answer is built on phantom numbers.
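Some of the 0/O and 1/I/l confusions can be caught by simple rules once a field is known to hold an amount. The following is a minimal sketch under that assumption; the function name `normalise_amount`, the look-alike table and the amount pattern (Swiss-style apostrophe as thousands separator) are illustrative choices, not part of any engine's API.

```python
import re
from typing import Optional

# Letters that OCR engines commonly emit in place of digits.
# Illustrative assumption: extend or trim this table for your own corpus.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})

# Digits, optional apostrophe thousands separators, optional two decimals.
AMOUNT_PATTERN = re.compile(r"^\d{1,3}(?:'?\d{3})*(?:\.\d{2})?$")


def normalise_amount(raw: str) -> Optional[str]:
    """Map look-alike letters to digits in a field known to hold an amount.

    Returns the cleaned string, or None if the result still does not look
    like an amount, so the page can be sent to manual review.
    """
    cleaned = raw.strip().translate(LOOKALIKES)
    return cleaned if AMOUNT_PATTERN.match(cleaned) else None
```

Applying the mapping only to fields already classified as numeric matters: run over free text, it would corrupt ordinary words.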

Accuracy in May 2026 is significantly better than three years ago. Premium cloud engines reach 99 to 99.8 percent character accuracy on typeset text and 85 to 95 percent on handwriting. Tesseract sits at 92 to 97 percent on typeset (with a good preprocessing setup) but drops to 60 to 75 percent on handwriting.
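Character accuracy is commonly computed as one minus the character error rate: the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below assumes that definition; this article does not state which metric underlies the figures above, and the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance with a rolling row (insert, delete, substitute each cost 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped at 0. The reference must be non-empty."""
    if not reference:
        raise ValueError("reference text must not be empty")
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

Under this definition, 99 percent accuracy on a page of 2000 characters still means about 20 wrong characters, which is why amounts and account numbers need separate checks.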

What matters is not accuracy on clean pages but the handling of edge cases: skewed scans, fax distortion, bleed-through from the reverse side, dark zones from stamps, mixed languages (DE/FR on a tax form). This is where the engines differ. AWS Textract shines on tables, Azure DI on forms with key/value pairs, Google DocAI on multilingual government documents, Mistral OCR on layout-faithful Markdown extraction, Reducto AI on complex balance-sheet tables.

Swiss data-protection readiness is the hard filter. Receipts often contain personal data (name, address, AHV number, bank account). Cloud OCR providers outside Switzerland require a data processing agreement, a transfer impact assessment for US providers, and ideally an EU or Swiss region.

How it works

Tesseract (Apache 2.0): runs locally, no cloud risk. Preprocessing (binarisation, deskew, despeckle) dominates quality. Tesseract 5.x with the LSTM model and the right language pack (deu, fra, ita) is solid on typeset text. Layout recognition is weak.
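To illustrate the binarisation step, here is a minimal pure-Python sketch of Otsu's method, one common way to pick a global black/white threshold. It assumes the page is already available as a flat list of 8-bit grey values; the function names are illustrative, and the article does not say which binarisation method it has in mind.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is background from here on
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(pixels: List[int], threshold: int) -> List[int]:
    """Pixels at or below the threshold become black (0), all others white (255)."""
    return [0 if value <= threshold else 255 for value in pixels]
```

In practice an image library would do this on arrays; OpenCV, for example, offers it through `cv2.threshold` with the `THRESH_OTSU` flag. A single global threshold struggles with uneven lighting, where adaptive thresholding is usually the better fit.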

AWS Textract: cloud-only (US, Frankfurt region available). Returns plain text, table structure and form fields with bounding boxes. Price: about 1.50 USD per 1000 pages for text detection, 15 USD per 1000 for Analyze-Document with tables. Very good at tables.

Azure Document Intelligence: cloud (Switzerland North available since 2024). Pretrained models for invoices, receipts, IDs, account statements. Price: about 50 USD per 1000 pages for the layout model, 10 USD per 1000 for read. Best choice when EU/CH hosting is non-negotiable.

Google Document AI: cloud (EU region). Specialised processors for vendor invoices, payslips, tax forms. Price: 30 USD per 1000 pages for the base model. Strong on multilingual government documents.

Mistral OCR (May 2026, EU hosting): new engine with a long-context vision transformer, output as Markdown plus JSON. Price: about 3 USD per 1000 pages. Very strong at layout-faithful conversion. Swiss data stays on Mistral La Plateforme in the EU.

Reducto AI: US startup, premium tier for complex balance-sheet and finance tables. Price: 1 to 2 US cents per page. An EU hosting option was announced in May 2026.

For receipt-heavy pipelines (fiduciary), we recommend a tiered approach: run a cheap engine first (Mistral OCR, Tesseract), check the confidence scores, and route only low-confidence pages to a premium engine. Because most pages never reach the premium tier, this can roughly halve cost without a noticeable loss in quality. For contract OCR we recommend layout-faithful extraction (Mistral OCR, Azure DI layout) plus LLM post-correction that disambiguates characters such as "0/O" from context.
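The tiered approach can be sketched as a small routing function. This is a sketch under stated assumptions: `OcrResult`, `Engine` and `tiered_ocr` are hypothetical names, and each engine is modelled as a plain callable that a real integration would wrap around Tesseract, Mistral OCR, Azure DI or another service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the engine itself
    engine: str


# Hypothetical interface: an engine takes image bytes and returns an OcrResult.
Engine = Callable[[bytes], OcrResult]


def tiered_ocr(page: bytes, cheap: Engine, premium: Engine,
               threshold: float = 0.90) -> OcrResult:
    """Run the cheap engine first; escalate only low-confidence pages."""
    first = cheap(page)
    if first.confidence >= threshold:
        return first
    # Below the threshold: pay for the premium engine and use its result.
    return premium(page)
```

With a cheap price c and a premium price p per page, and a fraction f of pages escalated, the expected cost per page is c + f·p, so the saving depends entirely on how many pages fall below the threshold.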

OCR workflow in 6 steps

  1. Classify document types: receipt, invoice, contract, government letter, handwriting. Separate pipeline per type.
  2. Preprocessing: deskew, denoise, normalise resolution to 300 dpi. ImageMagick or OpenCV.
  3. Choose an engine per type: Mistral OCR as default, Tesseract for simple, Azure DI for forms, Reducto for complex tables.
  4. Use confidence scores: route pages under 90 percent confidence into a review queue or to a premium engine.
  5. Downstream structured extraction: an LLM (Claude Sonnet) extracts the key fields per document type (date, amount, VAT number).
  6. Audit trail: original image, OCR output, confidence score, LLM extraction, manual correction. Stored immutably for Art. 957a CO.
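One way to make the audit trail from the last step tamper-evident is a hash chain, where every entry stores the hash of the previous one. The sketch below is an assumption about how this could look, not a description of a specific product; `AuditEntry`, `append_entry` and `verify_chain` are hypothetical names, and truly immutable storage (for example write-once media) would still be needed underneath.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


@dataclass(frozen=True)
class AuditEntry:
    image_sha256: str
    ocr_text: str
    confidence: float
    extracted_fields: dict
    manual_correction: Optional[dict]
    previous_hash: str
    entry_hash: str


def _digest(payload: dict) -> str:
    """SHA-256 over a canonical (key-sorted) JSON rendering of the payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: List[AuditEntry], image: bytes, ocr_text: str, confidence: float,
                 extracted_fields: dict, manual_correction: Optional[dict] = None) -> AuditEntry:
    """Append one entry whose hash covers its content and its predecessor's hash."""
    payload = {
        "image_sha256": hashlib.sha256(image).hexdigest(),
        "ocr_text": ocr_text,
        "confidence": confidence,
        "extracted_fields": extracted_fields,
        "manual_correction": manual_correction,
        "previous_hash": log[-1].entry_hash if log else GENESIS_HASH,
    }
    entry = AuditEntry(**payload, entry_hash=_digest(payload))
    log.append(entry)
    return entry


def verify_chain(log: List[AuditEntry]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    expected_previous = GENESIS_HASH
    for entry in log:
        payload = asdict(entry)
        stored_hash = payload.pop("entry_hash")
        if entry.previous_hash != expected_previous or _digest(payload) != stored_hash:
            return False
        expected_previous = stored_hash
    return True
```

Storing only the image hash in the log keeps entries small; the original scan is archived separately and can be checked against `image_sha256` at any time.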

When to use which engine

Tesseract: under strict on-prem requirements, small volumes, simple documents. Preprocessing must be solid.

Mistral OCR with EU hosting: default recommendation in May 2026 for Swiss fiduciary and legal work. Layout fidelity, cheap, Swiss-DSG-compliant.

Azure Document Intelligence Switzerland North: when the Microsoft stack is already deployed and specialised processors for invoices and receipts are needed.

AWS Textract: when the AWS stack is in place and table extraction is the priority. Use the Frankfurt region.

Google Document AI: for multilingual government documents and for vendor invoices covered by its pretrained processors.

Reducto AI: for complex balance-sheet tables when client consent for US cloud exists.

ABBYY FineReader Engine on-prem: for highest accuracy on paper archives with mixed formats, when licence cost (from CHF 5000) is acceptable.

When not to use OCR

If the documents are already digital (text PDF, DOCX, EML): no OCR needed, a document loader (see document-loaders-formate) suffices. OCR on a text PDF is wasteful and reduces quality, because it replaces the exact embedded text with recognised characters that may contain errors.

For purely visual content without text (photos, diagrams, sketches) OCR is pointless. You need vision embeddings (multimodal models) or a dedicated image-caption pipeline.

For extremely poor scan quality (resolution under 150 dpi, heavy distortion, bleed-through from the reverse side) OCR on the raw scan is unreliable. Restore the image first (super-resolution, bleed-through removal, page clean-up with a tool such as ScanTailor), then run OCR.

Under strict professional-secrecy requirements (e.g. medical records under doctor-patient privilege) without an on-prem OCR option: transcribe manually or leave the document out of the digital index.

Trade-offs

STRENGTHS

  • Premium cloud engines: 99+ percent on typeset, very good table extraction
  • Mistral OCR: EU hosting, layout-faithful Markdown, very cheap
  • Tesseract: open source, local, no running cost
  • Hybrid pipeline (Tesseract + premium escalation): halves cost

WEAKNESSES

  • Cloud OCR sends data out: Swiss-DSG and professional-secrecy risk
  • Tesseract weak at layout, handwriting, multilingual content
  • Premium engines expensive at scale without hybrid strategy
  • Confidence scores not consistently calibrated, manual review still needed
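Whether an engine's confidence scores can be trusted is an empirical question that a small hand-checked sample can answer. The sketch below bins results by reported confidence and compares each bin with the observed accuracy; `calibration_table` is a hypothetical helper, and the input format (pairs of confidence and a correct/incorrect flag) is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibration_table(samples: Iterable[Tuple[float, bool]],
                      n_bins: int = 10) -> Dict[float, Tuple[float, float, int]]:
    """Group (reported_confidence, was_correct) pairs into equal-width bins.

    Returns {bin_lower_bound: (mean_confidence, observed_accuracy, count)}.
    """
    bins = defaultdict(list)
    for confidence, correct in samples:
        # A confidence of exactly 1.0 goes into the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, bool(correct)))

    table = {}
    for index in sorted(bins):
        members = bins[index]
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        table[round(index / n_bins, 10)] = (mean_confidence, accuracy, len(members))
    return table
```

If a bin reports a mean confidence of 0.93 but only half of its fields were correct, the routing threshold for that engine should be raised or the bin sent to manual review.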

FAQ

Tesseract vs cloud OCR: is Tesseract still worth it?

Yes, but only for simple typeset documents without tables or handwriting. With good preprocessing, Tesseract 5.x reaches about 95 percent character accuracy on invoices and receipts. Once tables, handwriting or multi-column layouts appear, cloud engines are clearly superior.

How do I handle handwritten contract amendments?

Google Document AI with handwriting model and Azure DI are the best options in May 2026 (85 to 92 percent accuracy). For especially critical text (signature-relevant clauses), human verification remains mandatory.

What does OCR cost for 100,000 pages per year?

Mistral OCR: about CHF 270 per year. Azure DI Read: about CHF 900. AWS Textract Analyze-Document: about CHF 1350. Reducto AI: about CHF 1500. On-prem Tesseract: server cost only, about CHF 300 per year for an 8 vCPU machine.
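The cloud figures can be reproduced from the per-1000-page prices listed under "How it works". The sketch below assumes an exchange rate of 0.90 CHF per USD, which is implied by the numbers rather than stated in this article; the variable and function names are illustrative.

```python
PAGES_PER_YEAR = 100_000
CHF_PER_USD = 0.90  # assumption: implied by the quoted figures, not stated

# List prices in USD per 1000 pages, as quoted in this article.
USD_PER_1000_PAGES = {
    "Mistral OCR": 3.0,
    "Azure DI Read": 10.0,
    "AWS Textract Analyze-Document": 15.0,
}


def annual_cost_chf(usd_per_1000: float, pages: int = PAGES_PER_YEAR,
                    rate: float = CHF_PER_USD) -> float:
    """Annual cost in CHF for a given page volume and per-1000-page price."""
    return pages / 1000 * usd_per_1000 * rate


for engine, price in USD_PER_1000_PAGES.items():
    print(f"{engine}: about CHF {annual_cost_chf(price):.0f} per year")
```

Reducto's 1 to 2 US cents per page corresponds to 10 to 20 USD per 1000 pages, or CHF 900 to 1800 per year under the same assumption, which brackets the CHF 1500 quoted above.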

Which engine is Swiss-DSG and professional-secrecy compatible?

On-prem (Tesseract, ABBYY FineReader Engine), Azure DI in Switzerland North, Mistral OCR with EU hosting contract. US-based engines (AWS Textract, Google DocAI, Reducto) require a DPA, transfer impact assessment, and preferably an EU region; for professional secrecy under Art. 321 SCC we advise against them.

Related topics

  • RAG · AI CONCEPT: Retrieval-Augmented Generation (RAG): how AI answers from your own documents
  • DOCUMENT LOADERS · AI CONCEPT: Document loaders: cleanly ingesting PDF, DOCX, XLSX, EML, HTML and Markdown into RAG
  • PDF TABLES · AI CONCEPT: PDF table extraction: Camelot, Tabula, pdfplumber, Table Transformer, Marker
  • RECEIPT OCR · USE CASE: AI receipt recognition for Swiss documents: structured capture of QR-bills, receipts and PDF invoices
  • DATA CLEANING · AI CONCEPT: Data cleaning before RAG: duplicates, boilerplate, OCR artefacts, charset issues, watermarks
  • CHUNKING · AI CONCEPT: Chunking strategies for RAG: fixed-size, recursive, semantic, late chunking
  • METADATA · AI CONCEPT: Metadata and filters in RAG: pre-filter vs post-filter, Qdrant payload index, pgvector WHERE

Sources

  1. Mistral OCR - product launch with EU hosting · 2026-05
  2. Azure AI Document Intelligence - Switzerland regions and pricing · 2026-05
  3. AWS Textract - pricing and feature comparison · 2026-05
  4. Reducto AI - structured document extraction benchmarks · 2026-05
