REASONING · AI CONCEPT
What is a reasoning model? o3, Claude Extended Thinking, the current DeepSeek-R generation May 2026
Reasoning models think internally in chain-of-thought before answering. More tokens for thinking = better answers in maths, code, logic. Costs 5-15x more than regular models.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is a reasoning model?
A reasoning model is a language model that executes an internal "thinking phase" with chain-of-thought before the final answer. Instead of directly generating the most likely next token, the model first produces a longer internal argumentation – writing down steps, posing hypotheses, checking intermediate results, correcting errors – and only then outputs the compact answer to the user. This internal argumentation can be 1,000-50,000 tokens long depending on the task, much more than the typical 200-token answer of a standard model.
The breakthrough came with OpenAI o1-preview (September 2024) and o1 (December 2024). Predecessors existed – chain-of-thought prompting (Wei et al. 2022) showed models with "think step by step" instructions become better. But o1 was the first model in which reasoning was explicitly incentivised in the training procedure with reinforcement learning – the model learns by itself how to build long internal argumentations.
As of May 2026 the family is established:
- OpenAI o3, o3-mini, o3-pro (May 2026): the most advanced OpenAI reasoning family. o3-pro leads in the hardest benchmarks (FrontierMath, GPQA Diamond). - Claude with Extended Thinking (Anthropic): reasoning is integrated into the normal model, optionally activated via API parameter "thinking". Lower or higher thinking budgets steerable. - the current DeepSeek-R generation (DeepSeek, April 2026): open-weight reasoning model, very strong in code and mathematics, markedly cheaper than o3. - Gemini 2.5 Pro Thinking (Google): similar architecture to Claude – reasoning optionally activatable. - Qwen 3 Thinking (Alibaba): open-source reasoning model, competitive in code and maths.
For SME users the most important consequence is: reasoning models are 5-15x more expensive per request than standard models but deliver markedly better results on specific tasks (mathematics, code debugging, multi-step logical analysis, difficult contract clauses). They are not "the new standard for all requests" but a specialised tool for hard problems. The right routing between standard and reasoning model (see was-ist-llm-gateway) is the most important efficiency lever in May 2026.
Why reasoning models matter for SMEs
Reasoning models touch SME business logic in three specific areas.
First: complex tax and accounting questions. When a client request lets several tax articles interact (e.g. "Is the distribution from my GmbH in Zug tax-free for me as a private person in Geneva, if I keep the participation rate below 10% but own it longer than 1 year?"), standard models in May 2026 often deliver answers with subtle errors. Reasoning models walk through the request step by step: check participation rate, check holding period, check cantonal special rules, clarify distribution definition. On multi-step questions error rate drops from 15-30% (standard) to 3-8% (reasoning). That is not zero but relevant for SME advisory.
Second: code generation and debugging. Whoever runs an SME with Excel VBA, Python accounting scripts or interface connectors profits massively from reasoning models. OpenAI o3-mini, Claude with Extended Thinking and the current DeepSeek-R generation score 30-60% above standard models in independent code benchmarks (SWE-Bench Verified May 2026, HumanEval+, LiveCodeBench). A 4-hour debugging problem (standard model cannot solve in 30 minutes of chat iteration) is often solved in 2-5 minutes with a reasoning model.
Third: contract and legal analysis. Law firms and fiduciary offices with contract-review tasks see marked quality improvements with reasoning models in May 2026. Examples: "Review this rental contract for clauses that are invalid under Swiss tenancy law and justify your assessment." Standard the current top Claude model delivers a competent list but misses 1-3 nuanced issues. Claude with Extended Thinking or o3-pro works the checklist systematically and typically finds 2-5 additional points. Not a majority of cases but valuable on critical contracts.
Fourth point: no miracle cure. Reasoning models are NOT better at language tasks (email reply, client communication, summary), at simple classifications or at creative writing. Here standard model output is equally good at factor 5-15 lower price. Whoever uses reasoning for everything burns money.
Cost May 2026. Typical prices per 1M tokens:
- OpenAI o3-mini USD 1.10 input / USD 4.40 output (regulated), o3 USD 15 / USD 60. - Claude Sonnet with Extended Thinking: standard price plus thinking tokens (thinking tokens are billed at the output price). - the current DeepSeek-R generation: USD 0.55 input / USD 2.19 output – by far the cheapest reasoning model in May 2026. - Gemini 2.5 Pro Thinking: USD 1.25 input / USD 10 output (with thinking).
For a typical fiduciary request (3,000 tokens input, 800 tokens output) with reasoning thinking budget of 5,000 tokens: o3-mini about USD 0.03, Claude Sonnet Extended Thinking about USD 0.10, the current DeepSeek-R generation about USD 0.02. Comparison Claude Sonnet without reasoning: about USD 0.02. The gap is factor 1.5-5 – not extreme but noticeable at 1,000+ requests/day.
Strategic consequence. Reasoning models are a specialised tool, not a universal upgrade. Whoever uses LLM gateway logic routes simple requests to standard models and only complex ones (with input patterns like "review", "compute", "debug") to reasoning models.
Reasoning models in detail
Three building blocks make up a reasoning model: extended chain-of-thought, reinforcement learning on reasoning trajectories, separated thinking tokens.
Block 1: extended chain-of-thought. Instead of producing an answer directly, the model internally generates a long argumentation sequence. This sequence contains explicit steps: "First, I need to clarify whether X holds. Second, for that I need Y. If Y holds, then Z. Let us check Y: ..." The sequence can be 1,000-50,000 tokens depending on difficulty and configuration. With the OpenAI o family and Gemini Thinking this sequence is typically invisible to the user – only the final answer is returned. With Claude with Extended Thinking it is optionally visible in API output. With the current DeepSeek-R generation fully visible (research transparency).
Block 2: RL on reasoning trajectories. The decisive training difference to standard models. In reasoning training the model is given hard problems with verifiable solutions (maths problems with numerical answers, code tasks with tests, logic puzzles with definitive answers). The model generates various reasoning trajectories. Those leading to correct solutions are rewarded; those failing are punished. Over millions of such episodes the model learns to build long reasoning sequences, recognise errors and integrate corrections. This procedure is called RLVR (Reinforcement Learning with Verifiable Rewards) and was popularised for o1.
Block 3: thinking tokens vs output tokens. With modern reasoning models thinking tokens and output tokens are billed separately. Both cost the output price (typically 3-5x more expensive than input). Vendors offer configuration parameters in May 2026:
- OpenAI o3: "reasoning_effort" with values "low", "medium", "high" – determines thinking budget. "low" about 2,000 tokens, "high" up to 50,000 tokens. - the current top Claude model: "thinking.budget_tokens" – direct token value, typically 1,000-32,000. - Gemini 2.5 Pro Thinking: "thinking_budget" parameter, similar to Claude. - the current DeepSeek-R generation: implicit, the model output contains first a reasoning block then the answer.
Concrete example. A request: "Calculate whether I as a Swiss with residence in Zug and main job in Liechtenstein use the double-taxation treaty optimally." Standard the current top Claude model answers in about 500 tokens with a listing. o3 with medium reasoning effort generates internally about 8,000 thinking tokens – check which DTA articles are relevant, run through a calculation example, walk through special cases (cross-border commuter, 183-day rule) – and then gives a 1,200-token answer. The answer is not necessarily "more correct" but typically more precise in special cases and cleaner in justification.
Latency May 2026. Reasoning models are slower. Standard the current top Claude model answers a request in 3-8 seconds. The current top Claude model with Extended Thinking (budget 8,000) takes 15-40 seconds. o3 with "high" effort can take 30-180 seconds. For interactive chat applications often acceptable, for realtime voicebots a no-go. Streaming API gives drip-answers at some vendors in May 2026 – user sees the reasoning live, reducing perceived latency.
Open-source reasoning May 2026. the current DeepSeek-R generation (April 2026, open-weight) is the leading self-hosting reasoning model. Hardware: 1x H100 for the 32B variant, 2-4x H100 for the 671B variant. Qwen 3 Thinking also open-source, competitive in maths and code. Llama 4 has no dedicated reasoning model in May 2026 (yet), but it is expected.
Understand reasoning models in 5 steps
- 01Understand the principle: reasoning models think internally in chain-of-thought (1,000-50,000 tokens) before answering.
- 02Check the vendor landscape May 2026: OpenAI o3/o3-mini, Claude mit Extended Thinking, die aktuelle DeepSeek-R-Generation, Gemini 2.5 Pro Thinking, Qwen 3 Thinking.
- 03Identify high-value use cases: hard tax/legal questions, code debugging, multi-step data analysis, contract review.
- 04Estimate cost: reasoning typically 5-15x more expensive than standard. Per 1,000 complex requests per month USD 50-500 extra.
- 05Build routing logic: simple requests (email, triage) to standard model, complex requests ("review", "compute", "debug") to reasoning model.
When to use reasoning models
Four concrete SME scenarios for reasoning models.
Scenario 1: hard fiduciary questions with nesting. When the request stacks several rule layers – DTA application with cantonal special rules, VAT treatment in cross-border supply chains through third countries, added-value accounting in holding structures – the standard model often feels uncertain. The reasoning model walks through the layers cleanly. Example: "Check whether the transfer from the GmbH in Zug to my private account in Germany constitutes a hidden distribution and what tax consequences this has in CH and DE." With o3 or Claude with Extended Thinking the answer quality is typically 30-50% more precise.
Scenario 2: code debugging. Whoever maintains an SME IT system – accounting interfaces, ERP plug-ins, Excel macros – regularly hits 1-4 hour debugging loops. Reasoning models in IDE integration (Cursor with Claude with Extended Thinking, GitHub Copilot with o3-mini, Cline with the current DeepSeek-R generation) often shorten this to 2-5 minutes. Investment pays off from 5+ hours of debugging per month. Recommendation May 2026: the current DeepSeek-R generation as cheap option (self-hosting or API), o3-mini for integrated OpenAI workflows, Claude with Extended Thinking for high-quality IDE workflows.
Scenario 3: contract review with risk justification. Law firms and fiduciary offices with contract-review tasks see marked benefit in May 2026. "Check this supply contract for the client against Swiss and EU rules and list the top 5 risks with paragraph references." Reasoning models go systematically through standard clauses, find subtle issues (convention penalty levels, warranty exclusions, data protection clauses) and justify with paragraph references. Standard models find typically 80% of problems, reasoning models 92-96%.
Scenario 4: multi-step data analysis. When a request requires data from multiple sources and produces logical linkages: "Compare the VAT quotas of my last 4 quarters and identify anomalies against industry mean." Reasoning models run the calculation cleanly, identify special cases and deliver clean justifications. Tool use (database query, calculator) is sensible to combine here.
Scenario 5: do not use – standard language tasks. Email replies, client newsletters, dunning letters, meeting-minute structuring: standard models (Claude Sonnet, Gemini 2.5 Pro, the current top GPT model) deliver equally good results at factor 5-15 lower price. Reasoning here is money-burning.
When reasoning is not the right approach
Three clear cases against reasoning models.
First: simple language tasks. Email triage, draft replies, client newsletters, dunning letters, content classification. Standard models are equally good here. Whoever uses reasoning for email replies pays factor 5-10 too much and gains nothing in quality.
Second: latency-critical applications. Voicebots, realtime chat in customer support, interactive UI help. Reasoning takes 15-180 seconds – feels dead in any realtime application. Prefer standard models or faster variants (Claude Haiku, Gemini 2.5 Flash, the current top GPT model Mini) here.
Third: mass-scale applications with token cap. When you have 100,000+ requests per day (e-commerce product descriptions, automatic tag generation), reasoning costs factor 5-15 more – at 100,000 requests/day that is USD 500-5,000/day extra cost without sensible quality gain.
Trap "reasoning is always better". As of May 2026 independent benchmarks show: reasoning models beat standard models in maths, code and logic by 20-50%. In language tasks (generation, style, empathy) differences are negligible or even slightly negative – reasoning models can phrase "overcautiously". Whoever classifies task type wrongly loses money and latency without value.
Trap "we use reasoning for safety". Reasoning models are not necessarily "less hallucinated". They are better in multi-step logic but hallucinate equally on factual questions not in the training corpus. Whoever wants to minimise hallucination builds RAG (see retrieval-augmented-generation) and citation check – reasoning alone does not solve that.
Trap "we train our own reasoning model". Reasoning training needs elaborate RLVR pipelines with verifiable tasks. As of May 2026 not realistic for SMEs – not because of compute costs (DeepSeek trained R1 for USD 5-6 million), but because of pipeline complexity. For SMEs: use existing reasoning models, do not build your own.
Trade-offs
STRENGTHS
- Markedly better quality in maths, code, logic (20-50% advantage)
- Clean step-by-step argumentation, more traceable than standard model output
- Self-hosting possible via the current DeepSeek-R generation and Qwen 3 Thinking
- Visible reasoning supports audit duties under EU AI Act
WEAKNESSES
- Costs 5-15x higher than standard models
- Latency 15-180 seconds – not for realtime
- No added value on simple language tasks
- Still hallucinates on factual questions without training knowledge
FAQ
Do I see the reasoning of the models?
Varies. OpenAI o3 hides reasoning completely – user sees only the answer. Claude with Extended Thinking is optionally visible (API flag). The current DeepSeek-R generation is fully visible (research transparency). Gemini 2.5 Pro Thinking optionally visible. Practical consequence: in applications with audit duty (EU AI Act) visible reasoning output is a plus for traceability.
Does reasoning make the model more reliable?
In structured tasks (maths, code, logic) yes – error rate drops markedly. On factual questions from the real world (tax special cases, current legislation) not guaranteed better. Reasoning corrects its own argumentation errors but hallucinates the same facts as standard models when knowledge is missing from training. For reliability on factual questions: combine RAG and reasoning.
How do I steer the thinking budget?
Vendor-specific. OpenAI o3: reasoning_effort = low/medium/high (about 2,000/8,000/30,000 tokens). The current top Claude model: thinking.budget_tokens = 1,024 to 32,000. Gemini 2.5 Pro: thinking_budget. The current DeepSeek-R generation: implicit, no parameter. Rule of thumb: low budget (2-4k) for 80% of requests, high budget (16-32k) only for stubborn problems. Higher budget costs linearly more.
Can I self-host the current DeepSeek-R generation?
Yes. The current DeepSeek-R generation is open-weight (April 2026 release) and available on Hugging Face. Hardware for the 32B variant: 1x H100-80GB. For the 671B variant: 4-8x H100. Quality May 2026: comparable to o3-mini, in some maths benchmarks even better. For compliance-critical CH/EU applications (client files) self-hosted the current DeepSeek-R generation is a strong option – data residency guaranteed.
Related topics
Sources
- OpenAI – Learning to Reason with LLMs (o1 announcement) · 2024-09
- Anthropic – Claude mit Extended Thinking Reference · 2026-05
- DeepSeek – DeepSeek-R1 Technical Report (arXiv:2501.12948, R2 follow-up April 2026) · 2025-01
- Wei et al. – Chain-of-Thought Prompting Elicits Reasoning in LLMs (arXiv:2201.11903) · 2022-01
- Artificial Analysis – Reasoning Models Benchmark Leaderboard · 2026-05