MULTIMODAL · TREND 2026

Multimodal LLM trend 2026: image, audio and video as standard inputs

May 2026: GPT-4o, the current top Claude model and Gemini 2.5 Pro read images, hear speech and understand video. What that means in practice for fiduciary and receipt workflows.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What does multimodal mean in May 2026?

Multimodal language models accept more than text as input. By May 2026 multimodality is the default at every top vendor, no longer a premium add-on. Concretely the leading models support:

Image (vision): GPT-4o (OpenAI, May 2024), Claude 3.5 Sonnet and the current top Claude model (Anthropic, May 2025 and March 2026), Gemini 2.5 Pro (Google, April 2026), Llama 4 Maverick (Meta, April 2025). All can interpret JPEG, PNG, PDF pages and screenshots – from receipt photos to table scans to whiteboard snapshots.

Audio (voice): the OpenAI Realtime API (GA October 2024, GPT-4o Audio) processes speech at sub-300 ms latency and can answer in audio. Gemini 2.5 Pro Live (March 2026) offers a comparable bidirectional interface. Anthropic announced a voice product for mid-2026 but as of May 2026 it is not GA.

Video: Gemini 2.5 Pro accepts video up to 2 hours in length (input context up to 2 million tokens, about 1 token per frame at 1 FPS). GPT-4o accepts video as a frame sequence (max 50 frames per request). The current top Claude model in May 2026 still supports only still-image sequences, no full video streams.

Why it matters

Multimodality changes three workflows noticeably in fiduciary and law firms.

First receipt processing: those who still run classic OCR pipelines (ABBYY, Tesseract) in May 2026 typically have three stages – scan, OCR, rule-based extraction. With the current top Claude model or GPT-4o Vision this collapses to one stage: the model sees the photo and extracts date, VAT rate, receipt number and a posting suggestion in a single call. Clients can photograph receipts on their phone without anyone smoothing or scanning them. Vendors like Bexio and Klara built this into their apps in Q1 2026.

Second voice meeting prep: a fiduciary could prepare the next hour on the way to a client meeting via voice dialogue in the car. "What were the open points from the last session? How has liquidity moved since?" Voice mode answers immediately. Privacy caveat: the captured audio stream usually goes to the provider – for client data under professional secrecy only allowed with a DPA and EU region.

Third video evidence review: in the legal field Gemini 2.5 Pro enables analysis of surveillance footage or meeting recordings up to 2 hours. Important for May 2026: such analyses are admissible in Switzerland under revFADP only if the persons appearing in the video were informed before recording and the data processing is documented.

How it works

Technically, multimodal models use the same transformer architecture as text-only models, plus specialised encoders for each input modality.

Image: a vision encoder (often ViT-based) splits the image into patches (e.g. 14x14 pixel blocks), turns each patch into a token and feeds the token sequence to the language model. GPT-4o uses roughly 170-1000 tokens per image depending on resolution; the current top Claude model about 1500-1600 tokens for an A4 document. Cost implication: an A4 receipt costs about as much as a 1000-word text request.

Audio: GPT-4o uses an audio encoder that converts 16 kHz mono audio into around 50 tokens per second. A 1-minute voice note costs about 3000 input tokens. The Realtime API streams bidirectionally so that voice output starts to return before the prompt is fully consumed.

Video: Gemini 2.5 Pro samples 1 frame per second by default plus the audio track. A 1-hour clip yields about 3600 image tokens plus the audio. With 2-million-token context there is room for instructions and answer. Important: Gemini samples uniformly; motion-heavy scenes can lose detail. Anyone needing frame-accurate analysis additionally passes the critical seconds as separate still images.

The output again comes back as a token sequence. If you need JSON, combine with structured outputs (see Output-Formatierung).

How to track and adopt this trend in 5 steps

01Market watch: monthly review of pricing and model change pages at Anthropic, OpenAI and Google Cloud. Track token cost per image/audio/second and new modalities.
02Use-case inventory: identify where in your firm someone today turns image, audio or video material into text by hand (receipts, voice notes, meeting recordings). Estimate unit volume.
03Pilot with the cheapest fitting model: for receipts test the current top Claude model or GPT-4o, not directly Opus or GPT-4.5. Specify structured JSON output with examples and schema.
04Privacy check: for each use case verify whether the data may even leave the building. Receipts without personal data – usually unproblematic. Client conversation audio – DPA mandatory, EU region.
05Compare against specialist tools: for pure receipt recognition benchmark cloud OCR (Google Document AI, AWS Textract) as baseline. Only roll out the multimodal LLM if it matches quality and the token cost justifies the workflow gain.

When to use multimodal models

Multimodal models are the right choice when (a) the original information does not exist in text form anyway, (b) converting it to text would be non-trivial, and (c) the business value per case justifies the image or audio token cost.

Concrete use cases running in Swiss SMEs as of May 2026: posting receipts to Bexio/Abacus by phone photo (image). Turning voice notes after client meetings into structured CRM entries (audio). Reading table scans from payroll without sending them through a separate OCR (image). Categorising damage photos in insurance (image).

For image workflows Claude (Vision) has a slight quality edge in May 2026 on layout-heavy structured documents (receipts, tables, forms). GPT-4o Vision is stronger on free image interpretation (damage photos, whiteboards, handwriting). For video Gemini 2.5 Pro is essentially without rival, for voice the OpenAI Realtime API.

When not to use

Multimodal models are the wrong choice when the input already exists as clean text – a plain text LLM is cheaper and faster. When image processing requires absolute precision (e.g. invoice amounts down to the cent without hallucination), benchmark the model output against classical OCR engines (Tesseract, Google Document AI, AWS Textract) – they often deliver the more reliable digits in booking lines, with the LLM doing the categorisation on top.

Other cases discouraged in May 2026: audio processing of client conversations through cloud APIs without a DPA – this violates SCC Art. 321 and revFADP. Video analysis of surveillance material in Switzerland without a documented legal basis. Streaming voice applications running over a WSL/VPN connection with high jitter – the Realtime API responds with disconnects and garbled answers.

Cost trap: image tokens vary widely by model. An A4 receipt on the current top Claude model (USD 3 input per 1M tokens) costs about USD 0.005, on GPT-4o (USD 2.50 input per 1M tokens) about USD 0.003. At 5000 receipts per month that is USD 15-25 – acceptable. The same volume on Claude Opus (USD 15 input per 1M tokens) ends up at USD 100+ without quality gain for this task.

Trade-offs

STRENGTHS

Saves the separate OCR step – photo to posting in one call
Voice mode brings latency below 300 ms – natural dialogue becomes possible
Video understanding up to 2 hours (Gemini) unlocks meeting recordings as a data source
Cost per receipt below CHF 0.01 with mainstream Sonnet/4o-class models

WEAKNESSES

Image token cost varies strongly by model – beware Opus-class on receipts
Audio streams under professional secrecy only acceptable with DPA and EU region
Video sampling can lose motion-heavy detail
Hallucination rate on free images (sketches, damage) still 10-20%

FAQ

Which model for receipts in May 2026?

First choice Claude Sonnet, alternative GPT-4o. On clean smartphone photos both deliver > 95% correct field extraction (date, amount, VAT). On poorly lit or crumpled receipts Claude drops to 88-92%, GPT-4o slightly lower at 85-90%. Tesseract as pure OCR baseline sits below 80%. If booking lines need to be cent-accurate, combine LLM vision with Document AI as a second source.

Can I use the OpenAI Realtime API for client conversations?

May 2026: technically yes, legally only with care. OpenAI offers a DPA and EU data residency for the Enterprise tier. Professional secrecy under SCC Art. 321 requires explicit client consent per recording, a documented deletion concept and ideally a Swiss hosting intermediary. Without that, use voice mode only for internal meetings without clients present.

What are the costs for 1 hour of video on Gemini 2.5 Pro?

One hour of video at default sampling (1 FPS) yields about 3600 image tokens plus the audio track. On Gemini 2.5 Pro (USD 1.25 input per 1M tokens up to 200k context, USD 2.50 above) that is roughly USD 0.10-0.15 input per hour. Output typically stays in the low thousands of tokens, under USD 0.05. Total per hour of video: about USD 0.15-0.25.

Does vision hallucinate more than text?

In structured documents (receipts, tables) the hallucination rate is comparable to text – under 5% with the current top Claude model. In free images with ambiguous interpretation (sketches, poor shots) it climbs to 10-20%. Countermeasures: structured JSON output with required fields and an "unknown" option, plus citation checking on receipts ("show me the field you read this value from").

Sources

OpenAI Platform – GPT-4o and Realtime API model docs · 2026-05
Anthropic – the current top Claude model model card and vision pricing · 2026-03
Google Cloud – Gemini 2.5 Pro multimodal documentation · 2026-04
Meta – Llama 4 multimodal release notes · 2025-04
Bexio – Receipt scan-by-phone announcement (release notes) · 2026-02

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call