REASONING · TREND 2026

Reasoning model trend 2026: o3, R1, Extended Thinking and the test-time-compute boom

May 2026: OpenAI o3, Claude with Extended Thinking, Gemini 2.5 Pro Thinking and DeepSeek-R1. When the several-times token premium is worth it for SMEs.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What are reasoning models in May 2026?

Reasoning models are language models that insert a longer, internally visible thinking phase before the actual answer. Instead of replying directly, they first generate a "chain of thought" – a step-by-step deliberation that itself becomes part of the answer material. Only after this phase does the finished answer follow. The term and method entered the market with OpenAI o1 (September 2024).

Four model lines matter in May 2026:

OpenAI o3 and o3-mini: o3 was introduced in late January 2025, GA in April 2025. Pricing in May 2026 (per the OpenAI pricing page): USD 60 output per 1M tokens, USD 15 input. o3-mini at USD 4.40 output / USD 1.10 input. Reasoning tokens count as output. successor models (per provider announcements) are set to bring better tool use at a similar price.

Anthropic Claude with Extended Thinking: available as a mode. Same model as Claude Sonnet, with an activatable "Extended Thinking" mode. Pricing stays at USD 3 input / USD 15 output per 1M tokens – reasoning tokens count as output. On average 5-12x more output tokens than without thinking, depending on complexity.

Google Gemini 2.5 Pro Thinking: available. Pricing USD 2.50 output / USD 1.25 input up to 200k context. Reasoning tokens visible as output – the developer sees the thinking phase in the response stream.

DeepSeek-R1: open-weight (MIT licence), released January 2025. R1 shocked the industry in January 2025 – comparable maths performance to o1, but considerably cheaper and open-weight. It is the only reasoning model named here that can be fully self-hosted.

Why it matters in 2026

The real breakthrough of reasoning models is not the individual model but a shifted scaling logic. Until 2024 the rule was: bigger model + more training compute = better answers. In 2025/2026 a second axis appears: more inference compute per query = better answers (test-time compute). OpenAI made this explicit in a September 2024 research note, Anthropic confirmed it with Extended Thinking.

Three consequences for SMEs.

First: on mathematical, logical and multi-step analytic tasks there is a clear quality jump. On AIME maths (American Invitational Mathematics Examination 2025) o3 reportedly reaches a multiple of the hit rate of non-reasoning models like GPT-4o; concrete values vary by test. On SWE-bench Verified (software engineering tasks) Claude with Extended Thinking reportedly scores markedly higher than without thinking. For fiduciary tasks with a clear logic component (complex VAT cases, international tax situations, liquidity scenarios) this is meaningful.

Second: cost is 5-15x higher. A standard request with the current top Claude model without thinking costs about CHF 0.01-0.03; with Extended Thinking on the same problem typically CHF 0.05-0.20. At 200 requests per day that is the difference between CHF 60 and CHF 600 per month. Whoever applies reasoning to every request finances very expensive answers that would be just as good without thinking.

Third: latency rises from 1-3 seconds to 10-90 seconds. Reasoning models are not for chat interfaces – they are for background processing, second-stage email triage, deep contract analysis.

How it works

Reasoning models emerge from a combination of training and API changes.

In training: the model is additionally trained with reinforcement learning on reasoning data. OpenAI o1/o3 uses an RL pipeline on maths and code datasets, rewarding the model when long thinking phases produce correct answers. DeepSeek-R1 (January 2025) described the procedure publicly – "GRPO" (Group Relative Policy Optimization) without a reward model. Claude with Extended Thinking uses a similar but proprietary approach per Anthropic.

In the API: the call takes a parameter that controls the thinking phase. Anthropic the current top Claude model: thinking.type: 'enabled' with budget_tokens (e.g. 16000 tokens). OpenAI o3: reasoning.effort: low|medium|high. Gemini 2.5 Pro: thinkingConfig with budget. The model then uses between 1000 and budget_tokens for thinking before producing the answer.

In the response stream: with Anthropic and Gemini the thinking phase is visible (thinking block in the response). With OpenAI o3 it is hidden – the developer only sees the token bill. Visible thinking advantage: it can be audited and debugged, error sources identified. Disadvantage: the thinking phase sometimes contains sensitive intermediate reasoning that should not be logged.

Best practice in May 2026: activate reasoning mode only on requests with multi-step logic. On simple writing or summarisation it does not raise quality, only cost. A routing layer (LiteLLM, OpenRouter, custom classifier) decides per request whether thinking is enabled.

How to track and adopt this trend in 5 steps

01Market watch: monthly review of pricing pages at OpenAI, Anthropic, Google and DeepSeek for new reasoning models and price changes. Track benchmark updates (AIME, GPQA, SWE-bench).
02Use-case inventory: identify 3-5 tasks in the firm that today require more than 30 minutes of manual analysis per case. These are reasoning candidates.
03Pilot with the cheapest reasoning model: start with DeepSeek-R1 or o3-mini, not directly o3. On Anthropic begin with a small thinking budget (4000 tokens).
04Build routing logic: before each request decide whether reasoning is needed. Classifier prompt ("is this a logic task with more than 2 steps?") or hard rules (only for tax questions, VAT special cases, international topics).
05Cost monitoring: log token use per request type (Langfuse, Helicone). As soon as the reasoning share exceeds 15% of total token budget, tighten routing.

When to use reasoning models

Reasoning models are the right choice when (a) the task requires multi-step logic (calculations with conditions, comparing tax scenarios, algorithm sketches), (b) errors are expensive and (c) 10-60 second latency is acceptable.

Concrete Swiss SME use cases in May 2026: complex VAT cases (reverse-charge with third country, margin taxation, own-use), international tax situations (Swiss employee in German home-office with Austrian family), liquidity scenarios with FX and seasonal swings, legal reasoning across several instances, code review for accounting scripts.

In each of these the token cost shifts from "noticeable" to "justified by the outcome". A VAT question a fiduciary would otherwise research for an hour (~CHF 150 effort), solved by Claude with Extended Thinking in 60 seconds with documented reasoning (CHF 0.20 token cost), is an obvious math.

Not every reasoning use case needs the most expensive model. DeepSeek-R1 reaches near o1 levels on maths benchmarks at one third or less of the cost. For open-weight compatibility (self-host, EU region) DeepSeek is the better fit.

When not to use

Reasoning models are the wrong choice when (a) the task is simple writing, summarising or classification, (b) latency must stay below 5 seconds or (c) the output will be reviewed by a human anyway.

Concrete avoidance cases in May 2026: stage-1 mail triage (bucket classification) – Sonnet/4o-mini suffice, reasoning is overkill. Receipt recognition – standard multimodal models pull the fields reliably, thinking adds nothing. Frontend client chat – latency above 5 seconds drives users away. Standard translation – reasoning does not improve language quality, only logic.

Special cost trap: applying reasoning to RAG requests where the answer sits directly in the retrieved material. The model thinks for 5000 tokens about something it could cite in 200 output tokens. Without prevention this inflates RAG pipeline token cost by a factor of 10. Check: if the answer is clearly contained in the retrieval results, turn reasoning off.

Problem with hidden reasoning tokens: OpenAI o3 charges reasoning tokens but does not show them. If max_tokens is set too low you get a truncated or empty answer. From the OpenAI cookbook (May 2026): set at least 25000 max_completion_tokens for o3. Anthropic is more honest here – the thinking phase is transparently billable.

Trade-offs

STRENGTHS

Clear quality jumps on maths, logic and code (50-80% improvement)
Transparent thinking phase at Anthropic and Gemini – auditable for compliance
DeepSeek-R1 as open-weight alternative – EU hosting possible
Per-case cost often below manual senior effort on complex questions

WEAKNESSES

Token cost 5-15x higher than standard models
Latency 10-90 seconds – not for live chat
OpenAI o3 hides reasoning tokens – harder cost control
Hallucinations become rarer but harder to spot inside long thinking phases

FAQ

Is o3 worth it for a 5-person fiduciary?

Rarely directly. For daily work Claude Sonnet or GPT-4o are enough. o3 pays off when 20-50 truly complex cases per month show up (international tax, multi-stage VAT) that today take a senior fiduciary 1-2 hours of manual analysis. At 30 cases / month x CHF 1 token cost / case = CHF 30 – versus 30 senior hours (CHF 4500). The maths works, provided the output is reviewed.

What is the difference between DeepSeek-R1 and o3?

Licence and hosting. DeepSeek-R1 (open-weight, MIT licence) can be hosted on your own hardware or in the EU (e.g. via Fireworks, Together, Hetzner GPU). o3 is available only via OpenAI, data flowing to the US. On quality R1 sits about 5-10 percentage points behind o3 on maths benchmarks, closer on code tasks. Whoever needs data sovereignty and EU hosting accepts that quality gap deliberately.

Can reasoning models hallucinate?

Yes, less but not zero. Observation in May 2026: long thinking phases can even amplify hallucinations when the checking phase inside the reasoning is flawed. Anthropic's Extended Thinking documentation (March 2026) explicitly warns that even thoughtfully worded answers can contain wrong facts. Countermeasure: combine reasoning with RAG – the model thinks but has hard sources.

Reasoning or multi-agent – which is better in 2026?

For a single complex problem: reasoning is usually more accurate, simpler and cheaper than a 3-agent setup with coordinator. For tasks that require external tools (DB query, API call): tool use with or without reasoning, multi-agent only when the task genuinely parallelises. The engineering consensus in May 2026: one good reasoning call beats three loose agents.

Sources

OpenAI Platform – o3 and o3-mini model docs and pricing · 2026-05
Anthropic – Claude mit Extended Thinking documentation · 2026-03
Google AI – Gemini 2.5 Pro Thinking guide · 2026-04
DeepSeek-AI – DeepSeek-R1 paper (GRPO method) · 2025-01
OpenAI Cookbook – reasoning best practices · 2026-05

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call