MULTIMODAL · AI CONCEPT

What is multimodal AI? Image, audio, video plus text May 2026

Multimodal models process not only text but also image, audio and video. May 2026: GPT-4o, Gemini 2.5 Pro, the current top Claude model, Llama 4. Use cases for document recognition and damage photos.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is multimodal AI?

Multimodal AI refers to models that process more than one data modality – typically text plus image, often also audio, increasingly video. Instead of two separate systems (one for OCR, one for language), a multimodal model understands all inputs in a unified internal representation and can relate them. Whoever gives a model a receipt PDF and the question "Which receipts are incomplete?" gets a sensible answer – the model has read the image and formulated the answer in text.

The architecture is in May 2026 typically a "late fusion" setup: a specialised vision encoder (often a variant of Vision Transformer, ViT, or CLIP) turns the image into a sequence of vectors. An adapter (small bridge) brings these vectors into the same format as the text tokens of the language model. The language model then processes image and text tokens in the same transformer-layer sequence. For audio it works analogously with an audio encoder (often Whisper-like). Video is typically treated as a frame sequence – 1-2 frames per second suffice for most tasks.

As of May 2026 the multimodal landscape is established:

- GPT-4o / the current top GPT model (OpenAI): natively multimodal since May 2024, text + image + audio + video. Audio understanding and generation in the same model instance. - Gemini 2.5 Pro / Ultra (Google): natively multimodal from the start, text + image + audio + video. May 2026 market leader for long video (60+ minutes understanding). - the current top Claude model (Anthropic): text + image since Claude 3, audio not official, video via frame extraction. - Llama 4 Maverick (Meta, May 2026): natively multimodal – text + image via frame sequence. Open-weight, self-hosting possible. - Mistral Pixtral 12B (Mistral): open-source text + image, much smaller but good for EU self-hosting. - Qwen 2.5 VL (Alibaba): open-source vision language, very strong in document OCR.

For SME users the most important consequence: a single model can cover document recognition, business card scan, damage image analysis and mixed Q&A workflows. Before 2024 that required 4-6 specialised tools (OCR engine, form parser, speech recognition, classifier). Today a multimodal LLM covers it in one API.

Why multimodal AI matters for SMEs

Multimodal AI solves classic Swiss SME problems directly. Four concrete areas.

First: document recognition instead of manual entry. A fiduciary employee typically spends 30-60 minutes per client per month on receipt entry (receipts, invoices, bank statements). With a multimodal LLM (see ai-belegerkennung-ocr) this drops to 2-5 minutes of checking. With 50 clients and CHF 80/h staff hourly rate: monthly 35-50 hours saved, around CHF 30,000-50,000 per year. As of May 2026 multimodal LLMs deliver over 95% accuracy on standard receipts (Coop receipt, ZKB statement, Migros invoice) – remaining control stays necessary, but effort shifts from entry to validation.

Second: damage assessment in insurance. Whoever sends a damage photo and damage report to Gemini 2.5 Pro or the current top GPT model gets a first assessment (damage severity, plausible repair cost, photo consistency with report) in 5-15 seconds. Swiss insurers integrate this in May 2026 into initial processing of hail, motor and contents claims. Staff make the final decision, the model accelerates preparation.

Third: meeting minutes from audio. A 60-minute board meeting typically yields 3-4 hours of manual minute work. With Whisper-like audio models plus LLM summary it is 5-15 minutes of checking. May 2026 standard stack: OpenAI Whisper Large v3 (open-source, self-hosting possible, free) plus the current top GPT model or the current top Claude model for structuring. Law firms use it for client conversations, SME boards for management meetings.

Fourth: business card scan and address entry. A simple but underestimated use case: photo of a business card → structured contact data into CRM. Multimodal LLMs do it with >97% accuracy. Built into a web app (image upload → JSON output), that is a one-person-day development effort.

Cost May 2026. Image tokens are billed separately. Typical prices: the current top GPT model about USD 1.30 per 1,000 images (768x768 pixels), Gemini 2.5 Pro about USD 0.50 per 1,000 images, Claude Sonnet about USD 4.80 per 1,000 images. A fiduciary with 300 receipts/month pays USD 0.15-1.50/month for image processing – negligible against staff hours saved.

Strategic consequence. Multimodal is no longer a premium feature in May 2026 but standard. Every project working with physical documents or images should evaluate multimodal AI – not "someday later". The architectural maturity is here.

Multimodal architecture in detail

Four building blocks make up a multimodal LLM: vision encoder, adapter, language model, training data.

Block 1: vision encoder. A specialised neural network turns images into vectors. Standard May 2026: Vision Transformer (ViT), often variant "ViT-Large" or "ViT-Huge". An image is split into patches (typically 14x14 or 16x16 pixels per patch), each patch is embedded, all patches traverse a small transformer (typically 24-32 layers). Output: a sequence of 196-1,024 vectors per image, depending on resolution and patch size. CLIP (Contrastive Language-Image Pre-Training, OpenAI 2021) is the standard variant – a ViT trained on image-text pairs so that similar image and text vectors lie near each other. As of May 2026 nearly all multimodal LLMs use CLIP-like vision encoders.

Block 2: adapter. Vision-encoder vectors are not in the same format as text tokens of the language model (different dimensions, different distribution). An adapter – typically a small MLP with 2-4 layers or a cross-attention layer – translates them. Result: the image becomes "image tokens" that the language model can treat like text tokens. As of May 2026 typically 196-512 image tokens per image, depending on architecture. This count matters for cost – an image costs as much as 196-512 text tokens.

Block 3: language model. The actual LLM (Llama, GPT, Claude, Mistral, Gemini architecture) now processes a sequence of image tokens and text tokens. From the model perspective the sequence is a unified input. Self-attention (see was-ist-attention-mechanismus) connects image and text – the model learns "word 5 refers to the object in the upper left of the image". The answer is always text – tokens generated autoregressively by the LLM.

Block 4: training data and procedure. Multimodal models have three additional training phases beyond pure text pretraining:

- Vision-encoder pretraining: CLIP trains on 400 million to 2 billion image-text pairs from the web – image captions, alt texts. Result: a robustly trained vision encoder with broad semantic understanding. - Multimodal pretraining: the LLM is further trained on image-text pairs so it learns to process image and text tokens in a shared context. Typical volume: 1-10 billion image-text pairs. - Multimodal fine-tuning and RLHF: the model is further trained on specific tasks (OCR, image captioning, visual Q&A, document understanding) and preference data.

Audio and video May 2026. Audio works analogously with an audio encoder (often a variant of Whisper, an OpenAI architecture for speech recognition). Audio is converted to mel spectrograms, then passed through the encoder, then via adapter into the LLM. Video is typically frame extraction plus image processing in May 2026 – 1-2 frames per second, each processed as image. True "video-native" models (with temporal attention) exist in May 2026 only at Google (Gemini 2.5 Pro/Ultra) – other vendors treat video as frame sequence.

Understand multimodal AI in 5 steps

01Understand the architecture: vision encoder (CLIP-like) plus adapter plus language model – image becomes tokens, language model answers in text.
02Check the vendor landscape May 2026: das jeweils aktuelle GPT-Spitzenmodell, Gemini 2.5 Pro, das aktuelle Claude-Spitzenmodell (cloud) plus Llama 4, Pixtral, Qwen 2.5 VL (self-hostable).
03Identify use cases in your own house: receipt recognition, damage photos, meeting audio, contracts, business cards.
04Estimate costs: typically USD 0.5-5 per 1,000 images at cloud vendors, self-hosting cheaper at high volume.
05Check compliance: client files and professional secrecy can force self-hosting – check EU zones, DPA and open-source alternatives.

When multimodal AI is the right choice

Five clear SME occasions where multimodal AI shortens effort and waiting time.

Occasion 1: receipt and invoice capture. Fiduciary, accounting, construction. Photo or scan of receipt/invoice → structured JSON data (amount, VAT, date, supplier, booking-category suggestion). As of May 2026 the current top GPT model and Gemini 2.5 Pro deliver >95% accuracy on standard Swiss receipts. Remaining control still needed but capture time falls by factor 5-10. See ai-belegerkennung-ocr for implementation details.

Occasion 2: damage photo analysis. Insurance, damage assessment. Damage photo + damage report → assessment (severity, repair cost estimate, photo-report consistency). Gemini 2.5 Pro leads in May 2026 in damage-assessment tests – very strong in recognising hail damage, motor dents, water damage. Output: a structured next-steps proposal for the human caseworker.

Occasion 3: contract and document analysis. Law, fiduciary, compliance. PDF contract (often mix of text and tables) → structured content summary with clause classification, risk hints, comparison to standard templates. Multimodal LLMs can process PDF images (scanned contracts) directly without separate OCR. Very useful with legacy contracts with handwritten remarks.

Occasion 4: meeting and appointment minutes. Board, client conversation, authority meetings. Audio recording → transcript (Whisper) → structured minutes with task list, decisions, next appointments. May 2026 standard latency: 60-min audio → finished minutes in 3-5 minutes compute. Whisper Large v3 plus the current top Claude model or the current top GPT model is the typical pipeline.

Occasion 5: multilingual visual communication. Tourism, hospitality, international SMEs. Photo of a multilingual sign/menu + question → translation and explanation. Photo of a menu + "what is vegan, gluten-free?" → structured answer. Photo of a use label + "how long does it keep?" → date extraction and interpretation. Multimodal LLMs in mobile apps are very strong in May 2026 for such applications.

Occasion 6: business cards and address capture. Sales, networking. Photo of business card → contact record in CRM. >97% accuracy on standard Swiss business cards. Very easy to implement – typically a one-person-day for web app integration.

When multimodal AI is not the right choice

Three cases in which classic reading tools or specialised OCR fit better.

First: highly structured industry receipts with existing OCR solution. Whoever already has a productive ABACUS/Bexio/Sage OCR with 99%+ accuracy for standard receipts often does not gain much by switching to multimodal LLM. Advantage emerges only with mixed receipts, handwriting or mixed layout. Check per receipt type whether the existing solution is actually the bottleneck.

Second: highly sensitive client files without compliance architecture. Multimodal LLMs in May 2026 are predominantly US cloud (GPT, Claude, Gemini). Whoever processes client files under professional secrecy (Art. 321 SCC) needs a compliance architecture (EU zones, DPA, sub-processor chain). Open-source alternatives (Llama 4 Multimodal, Pixtral, Qwen 2.5 VL) are self-hostable but 5-15% quality gap compared to the current top GPT model/Gemini.

Third: realtime mass-scan applications with hard latency constraints. Whoever needs 5,000 receipts per hour with under 200ms latency per receipt (industrial scan line) is better served with classic OCR (Tesseract, Google Document AI, AWS Textract). Multimodal LLMs are typically 1-5 seconds per image in May 2026 – good for 100-500 receipts/hour, not for 5,000/hour.

Trap "multimodal replaces all OCR tools". No. Specialised OCR for numerics (number extraction from tables), barcode scan, QR code, EAN remains faster and more accurate than multimodal LLM. Multimodal LLM complements classic OCR – combining is often the best architecture in May 2026.

Trap "multimodal LLM understands every language". As of May 2026 markedly better than 2024 but not perfect. Asian scripts (Chinese, Japanese, Korean) are strong in top models, Arabic and Hebrew acceptable, more exotic scripts (Thai, Vietnamese, Indic scripts) variable. For EU/CH/DE/FR/IT/EN applications no worry in May 2026.

Trap "multimodal LLM understands all images equally". Barcodes, OCR code fields, hand diagrams, technical plans still have weaknesses in May 2026. Photo realism is very good, abstract symbolic and technical notation moderate.

Trade-offs

STRENGTHS

One model replaces 4-6 specialised tools (OCR, form parser, speech recognition, classifier)
Directly understandable instructions ("extract amount and VAT from this receipt")
Very good quality on standard receipts and damage photos in May 2026
Self-hosting possible via Llama 4, Pixtral, Qwen 2.5 VL for compliance

WEAKNESSES

Image tokens cost extra – 1 image = 1.5-3 A4 pages text equivalent
Latency 1-5 seconds per image – not for mass-scan production lines
Specialised OCR (barcodes, EAN codes, technical notation) stays weak
Compliance risk at cloud vendors for highly sensitive data

FAQ

Which multimodal model is best for receipt recognition May 2026?

In independent tests in May 2026 Gemini 2.5 Pro leads (very strong on Swiss receipts, low price) and the current top GPT model (slightly more expensive but very reliable JSON output). The current top Claude model is solid but more expensive. For self-hosting: Qwen 2.5 VL (open-source, very strong on documents) and Llama 4 Maverick (open-weight, multimodal). Always test with 30-50 of your own receipts – synthetic benchmarks rarely match your receipt mix.

Can I send video directly to a multimodal LLM?

Partially. As of May 2026 Gemini 2.5 Pro/Ultra leads in native video processing – videos up to 60 minutes with temporal attention. Other vendors (the current top GPT model, the current top Claude model) process video as frame sequence: typically extract 1-2 frames per second and feed as images. Sufficient for 90% of applications. Truly video-native applications (motion analysis, audio-video correlation) remain Gemini territory in May 2026.

What does an image cost compared to text?

A standard image (768x768 or 1024x1024) is billed as 196-512 image tokens. At the current top GPT model USD 1.25 input per 1M tokens, an image costs about USD 0.0003-0.0006 – so an image = around 1.5-3 A4 pages of text. Gemini 2.5 Pro is cheaper (USD 0.0002-0.0004 per image), the current top Claude model more expensive (USD 0.001-0.002 per image). For fiduciary with 300 receipts/month: USD 0.10-0.60/month image costs.

Can I self-host multimodal AI completely?

Yes. As of May 2026: Llama 4 Maverick (open-weight, multimodal, 400B/17B MoE), Pixtral Large (Mistral, multimodal), Qwen 2.5 VL (very strong on documents). Hardware needs: 1-2 H100-80GB GPUs for mid-class models. For compliance-critical Swiss applications (client files, professional secrecy) self-hosting is the only clean option. Quality gap to cloud top models: 5-15% in independent tests.

Sources

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call