fairlane.systems

QWEN 3 · TECH

Qwen 2.5 and Qwen 3: Alibaba's open-weight family with maths and code strength

Qwen 2.5 and Qwen 3 from Alibaba Cloud. Apache 2.0 for smaller models. Strongly multilingual, leading in maths and code. Self-host resolves the PRC concern.

Researched & fact-checked by: · As of: 2026-05

What is Qwen?

Qwen (from Chinese "Tongyi Qianwen", "1000-fold answer") is Alibaba Cloud's open-weight language model family. First releases in 2023, with Qwen 2.5 as the mature family since summer 2024 and Qwen 3 as the next generation since early 2026.

The Qwen family is broad. As of May 2026, Qwen 2.5 includes models in sizes 0.5B, 1.5B, 3B, 7B, 14B, 32B and 72B, each in base and instruct variants. Qwen 3, released in the first wave in March 2026, additionally offers MoE variants (Qwen3-30B-A3B with 3B active parameters, Qwen3-235B-A22B with 22B active) and improved reasoning via a "Thinking Mode" – similar to the DeepSeek-R1 or o3-mini approach, where the model runs an extensive reasoning step before the actual answer.

Licence situation May 2026: models up to and including Qwen2.5-72B are Apache 2.0. Models above 72B (such as Qwen2.5-72B-Plus, some Qwen 3 premium variants) sit under the Tongyi Qianwen License, a custom licence with commercial use up to 100 million monthly active users. For Swiss SMEs the 100M threshold is irrelevant – both licences are commercially usable in practice.

Availability: Hugging Face (Qwen/Qwen2.5-72B-Instruct, Qwen/Qwen3-30B-A3B-Instruct etc.), Alibaba Cloud DashScope API (with Chinese or Singapore hosting), Together AI, Fireworks AI, plus self-host via vLLM, TGI, Ollama, llama.cpp.

The Qwen family additionally includes specialised variants: Qwen2.5-Coder for programming (near top frontier models on SWE-Bench as of May 2026), Qwen2.5-Math for mathematics (top among open-weight on MATH and AIME), QwenVL for vision-language and Qwen-Audio for speech processing. This specialisation is a clear differentiator in May 2026.

Why Qwen matters for Swiss data

Qwen has four concrete arguments and two important caveats for Swiss setups as of May 2026.

First: maths and code as class-best. Qwen2.5-Math and Qwen2.5-Coder clearly beat all other open-weight families on the respective benchmarks. For a fiduciary firm with complex tax calculation pipelines (special VAT cases, international transfer pricing, pension fund mathematics), Qwen2.5-Math is a productive help. For an internal tool development team having scripts and microservices generated, Qwen2.5-Coder is on the level of Claude Sonnet in the code area.

Second: multilingual strength and Qwen3 Thinking Mode. Qwen is trained competently in around 30 languages, with particular focus on Mandarin (logical for Alibaba), English, German, French, Spanish and Japanese. On German, Qwen is productively usable in May 2026, though not quite at Mistral level. Qwen3 Thinking Mode delivers results on hard logic tasks close to frontier models – interesting for complex fiduciary or legal reasoning cases.

Third: Apache 2.0 for relevant sizes. Qwen 2.5 up to 72B as Apache 2.0 is the cleanest licence configuration. Self-host without commercial restrictions, fine-tuning allowed, modification allowed. Attractive for SME compliance setups.

Fourth: specialised models cover specific use cases optimally. Whoever needs receipt-photo processing with a vision-language model has a top option in QwenVL. Whoever needs internal code assistance has a premium variant with Qwen2.5-Coder. This specialisation saves the search for individual best-of-breed models.

Caveat one: PRC origin. Alibaba is a Chinese company. With direct API use through DashScope, requests go either to the mainland China data centre or the Singapore centre. For Swiss clients under professional secrecy per Art. 321 SCC this is excluded – and for GDPR-compliant setups too, third-country transfer entails additional TIA duties. Self-host via Hugging Face solves the problem: the weights are open-weight and run in own rack without any request reaching Alibaba.

Caveat two: political risk and sanctions situation. The US-EU-PRC relationship is tense in May 2026. Future sanctions restricting Hugging Face downloads of PRC models cannot be ruled out. Whoever builds on Qwen should have a backup strategy (secure model weights locally, keep Apertus or Mistral as plan-B models ready).

Qwen in practice

Architecture. Qwen 2.5 is a dense transformer-decoder model with grouped-query attention, rotary position embeddings and SwiGLU activation – very similar to the Llama family. Context window on main models 128k tokens. Qwen 3 introduces MoE variants (Qwen3-30B-A3B: 30B total, 3B active) and the Thinking Mode, activated via the /think prompt tag, which prompts the model to run an extensive internal reasoning step.

Setup example with Ollama. Qwen 2.5 72B on two H100 or one H100 with quantisation:

``` ollama pull qwen2.5:72b-instruct-q4_K_M ollama run qwen2.5:72b-instruct-q4_K_M "Calculate Swiss VAT on EUR 12500 at 8.1 percent." ```

Performance: on two H100 80GB in 4-bit AWQ around 30-50 tokens/s, on one H100 with GGUF Q4_K_M around 15-25 tokens/s.

Setup example with vLLM. Qwen3-30B-A3B (MoE) on one H100:

``` docker run --gpus all -p 8000:8000 \ vllm/vllm-openai:v0.6.3 \ --model Qwen/Qwen3-30B-A3B-Instruct \ --max-model-len 32768 \ --gpu-memory-utilization 0.92 \ --enable-prefix-caching ```

MoE architecture means: 30B total parameters, but only 3B active per token. Result: low inference cost on a single H100, comparable to a dense 3B model – but quality more on the 14B level.

Thinking Mode Qwen 3. Activation via system prompt:

``` System: /think User: Solve this task step by step: A company has 12 employees, of whom 3 are part-time. Per full-time position CHF 500 pension contribution per month can be claimed. What is the annual sum? ```

The model first generates a "thinking block" with extensive reasoning, then the final answer. The thinking block is visible in the output and can be logged for audit purposes – important for EU AI Act Art. 15 logging duties.

Code workflow with Qwen2.5-Coder. Qwen2.5-Coder exists in 7B, 14B and 32B. On an RTX 4090, Qwen2.5-Coder-32B runs in 4-bit AWQ quantisation at 30-50 tokens/s – productive for internal code assistance. Integration via Continue.dev or VS Code extension with an OpenAI API endpoint.

RAG setup with Qwen embeddings. Qwen has its own embedding models (Qwen3-Embedding-8B leading on MTEB benchmark as of May 2026). In a Swiss RAG pipeline, Qwen3-Embedding-8B can be loaded into LocalAI or TEI, with Qdrant as the vector DB and Apertus 70B or Mistral Large 2 as the generator model.

Hosting recommendation. Apache 2.0 Qwen models run best self-host: Hugging Face download once, then fully offline-capable. For EU/CH compliance, this variant is the clean choice. DashScope API via Singapore can be an option for non-sensitive workloads, but a GDPR TIA and FADP third-country review are mandatory.

Qwen to production in 5 steps

  1. 01Compliance check: assess PRC origin in the context of client sensitivity and compliance policy. Self-host is the clean variant; API use via DashScope requires TIA and FADP third-country review.
  2. 02Model choice: Qwen2.5-72B as general workhorse (Apache 2.0), Qwen2.5-Math for maths, Qwen2.5-Coder-32B for code, Qwen3-30B-A3B for efficient MoE inference, Qwen3 Thinking variant for complex reasoning.
  3. 03Hardware check: Qwen2.5-72B in 4-bit AWQ needs ca. 45 GB VRAM (one H100 or two RTX 4090). MoE variants are memory-efficient – Qwen3-30B-A3B fits an RTX 4090.
  4. 04Self-host via vLLM or Ollama, OpenAI-compatible endpoint, LiteLLM in front with logical model names (qwen-math-local, qwen-coder-local, qwen3-thinking-local).
  5. 05Use-case benchmark against Apertus 70B, Mistral Large 2 and the current top Claude model. Derive routing rules from it: maths-intensive queries to Qwen-Math, code generation to Qwen-Coder, sensitive CH language to Apertus, general reasoning to Mistral or Claude.

When to use Qwen

Qwen is the right choice when (a) maths or code-specialised workloads are central, (b) Apache 2.0 licence is central, or (c) Qwen3 Thinking Mode is needed for complex reasoning.

Concrete cases: fiduciary with complex tax calculation pipelines – Qwen2.5-Math as self-host for maths-intensive workloads. Software consulting boutique with internal code generation – Qwen2.5-Coder-32B on an RTX 4090. SME with a RAG setup wanting to keep embeddings local – Qwen3-Embedding-8B in LocalAI or TEI.

For Swiss setups that exclusively self-host and prioritise an Apache 2.0 licence, Qwen 2.5 up to 72B is a direct competitor to Apertus 70B. Apertus is ahead on CH-specific language; Qwen is ahead on maths and code. A multi-provider strategy with both is sensible.

When not to use

For setups with highly sensitive client data (professional secrecy, strict FINMA mandates), the DashScope API variant is excluded – even Singapore hosting remains third-country. Self-host stays open, but the supra-political argument (PRC origin) must be addressed in internal discussion. Whoever does not want this goes to Apertus, Mistral or Llama 4 as alternatives.

For Romansh or Schwizerdütsch workloads, Qwen is not trained. Apertus remains the right choice here.

For top frontier reasoning at the peak (math olympiad level beyond Qwen2.5-Math, complex legal four-step argument), the current top Claude model or the current top GPT model are still ahead. Qwen3 with Thinking Mode is close but not quite at frontier level.

For setups where a US or EU provider commitment is desired for compliance reasons (e.g. a Swiss bank that only accepts Western providers), Qwen is the wrong choice irrespective of technical quality.

Trade-offs

STRENGTHS

  • Apache 2.0 for models up to 72B – clean licence for commercial self-host setups
  • Class-best on maths (Qwen2.5-Math) and code (Qwen2.5-Coder)
  • Qwen3 Thinking Mode delivers top results on hard reasoning cases
  • MoE variants (Qwen3-30B-A3B) are memory- and cost-efficient at good quality

WEAKNESSES

  • PRC origin – API use via DashScope unsuitable for professional secrecy mandates
  • Romansh and Schwizerdütsch not trained – Apertus stays ahead for CH language
  • Political risk from potential sanctions – backup strategy needed
  • German productive but not quite at Mistral level for legal precision

FAQ

What distinguishes Qwen 2.5 from Qwen 3?

Qwen 2.5 is the mature dense family, broadly tested, stable. Qwen 3 brings three important additions in May 2026: MoE architecture for efficient inference (Qwen3-30B-A3B, Qwen3-235B-A22B), Thinking Mode for step-by-step reasoning generation, and improved multilingual capability. For standard workloads, Qwen 2.5 suffices; for reasoning-intensive cases, Qwen 3 pays off.

Is Qwen via DashScope FADP-compliant?

Conditionally. DashScope offers Singapore hosting – a third country relative to Switzerland and the EU. A TIA (transfer impact assessment) is mandatory; a legal basis per FADP Art. 16-18 (or GDPR Art. 44-49) must be in place. Standard contractual clauses per EU template are available via DashScope. For highly sensitive client data, self-host remains the clean choice; for non-sensitive workloads (public texts, generic code generation), DashScope is usable.

What performance does Qwen2.5-Math deliver?

On the MATH benchmark in May 2026, Qwen2.5-Math-72B achieves around 85 points, markedly ahead of Llama 3.3 70B (around 56) and Apertus 70B (around 62). On AIME (American Invitational Mathematics Examination), Qwen2.5-Math-72B with Thinking Mode delivers results comparable to Claude 3.5 Sonnet. For fiduciary maths workloads (pension, tax calculation, transfer pricing), Qwen2.5-Math is clearly the best open-weight choice.

How secure is long-term availability?

The open-weight weights are freely available via Hugging Face and many mirrors. As of May 2026, they are not subject to sanctions. Risk scenario: future US or EU sanctions could restrict Hugging Face hosting of PRC models or affect software supply chains. Precaution: secure model weights locally once (e.g. via huggingface-cli), backup model strategy with Apertus or Mistral, annual compliance clause review.

Related topics

APERTUS · COMPLIANCEApertus: the open Swiss AI model from ETH Zurich, EPFL and CSCS – status May 2026OPEN-WEIGHT MODELS - COMPARISONOpen-weight models compared: Llama 3.3/4, Mistral, DeepSeek, Qwen, Gemma, Phi-4, Command R, Falcon, GLM, ApertusDEEPSEEK · TECHDeepSeek (V and R lines): the Chinese MoE reasoning model with self-host optionMISTRAL LARGE · TECHMistral Large 2 and Mistral Small 3.1: the EU model pair with FR/DE/IT strengthVLLM · TECHvLLM: production serving for open-weight LLMs with high throughput and PagedAttentionOLLAMA · TECHOllama: local LLMs on your own hardware – where it works and where it does notSELF-HOSTED VS. CLOUD · AI CONCEPTSelf-hosted vs. cloud LLM: a decision framework for SMEs and fiduciaries

Sources

  1. Qwen – official model collection on Hugging Face · 2026-05
  2. Qwen 3 – release notes and Thinking Mode introduction · 2026-03
  3. Qwen2.5-Math – paper and benchmarks · 2026-04
  4. Alibaba Cloud DashScope – Qwen API documentation · 2026-05

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call