MIXTURE OF EXPERTS · AI CONCEPT

What is Mixture of Experts (MoE)? Sparse models explained May 2026

MoE models activate only a fraction of their parameters per token, which yields comparable quality at several times less compute per token. Examples in May 2026: Llama 4 Maverick, the current DeepSeek-V generation, Mixtral.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Mixture of Experts?

Mixture of Experts, or MoE, is an architectural variant of large language models in which not all parameters are activated per token. Instead of a single large feedforward network in each transformer layer, there are many "experts" (typically 8, 16, 64, 128 or 256), and a small router selects for every token which few experts become active (usually 1-2, in fine-grained designs up to 8). The model has very many parameters in total, but only a fraction is computed per token.

The opposite term is "Dense": all parameters are active for all tokens. Llama 3 70B is Dense: all 70 billion parameters are computed for every token. Llama 4 Maverick is MoE: 400 billion parameters in total, but only 17 billion active per token, about 4% of the model. Per token that is roughly a quarter of the compute of Llama 3 70B at comparable (or better) quality, and it is the main reason MoE has gone mainstream since 2024.

The idea is not new: the first MoE papers date from the early 1990s (Jacobs, Jordan et al., 1991). In the LLM world MoE became relevant with Google's sparsely gated MoE layer (2017) and the Switch Transformer (2021), but stayed research-grade until 2023. The production breakthrough came with Mistral's Mixtral 8x7B (December 2023), the first open-weight MoE model to gain mainstream attention. As of May 2026 MoE is standard for top models:

- Llama 4 Maverick (Meta, April 2025): 400B parameters total, 17B active per token, 128 experts, top-1 routing. Its smaller sibling Llama 4 Scout: 109B total, 17B active, 16 experts.
- The current DeepSeek-V generation (DeepSeek, April 2026): 670B total, 37B active, 256 experts, very aggressive sparsity (one of the sparsest MoE architectures in May 2026).
- Mixtral 8x22B (Mistral, April 2024, still in use): 141B total, 39B active, 8 experts, top-2 routing. An upcoming Mistral Large generation is closed-source and presumably MoE, based on architecture hints.
- Gemini 2.5 Pro/Ultra (Google): MoE, details proprietary.
- The current top Claude model and the current top GPT model: architecture not officially disclosed, but MoE is suspected in both.

For SME users the most important consequence is that MoE models offer frontier quality at lower API prices. The current DeepSeek-V generation (USD 0.27 input / USD 1.10 output per million tokens) is typically 5-10x cheaper than Claude or GPT in May 2026 at comparable answer quality for standard tasks. That gap is not the vendor "being nice" but largely an architectural consequence of MoE.

Why MoE matters for SMEs

MoE touches SME decisions directly in four areas.

First: price per request. As of May 2026 MoE models (the current DeepSeek-V generation, Mixtral, Llama 4 via self-hosting) are typically 5-10x cheaper than Dense models of the same quality class. Concretely: a fiduciary chatbot with 5,000 requests/month costs around USD 30-50/month on Claude Sonnet (presumed Dense) and USD 3-6/month on the current DeepSeek-V generation (MoE), at comparable quality for standard requests. This gap is often overlooked in May 2026: users stick with OpenAI or Anthropic out of habit, although MoE alternatives solve 80% of tasks equally well.
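The monthly figures can be reproduced with a small calculation. This is a minimal sketch: the per-million-token list prices are the ones quoted in this article, while the average request size (1,000 input and 300 output tokens) is an assumption chosen for illustration and should be replaced with your own measurements.

```python
def monthly_api_cost(requests, input_tokens, output_tokens,
                     usd_in_per_m, usd_out_per_m):
    """Monthly API cost in USD for a fixed average request size."""
    per_request = (input_tokens * usd_in_per_m
                   + output_tokens * usd_out_per_m) / 1_000_000
    return requests * per_request


# Assumption (not from the article): average request = 1,000 in / 300 out tokens.
REQUESTS = 5_000
IN_TOK, OUT_TOK = 1_000, 300

# List prices per 1M tokens as quoted in this article.
sonnet = monthly_api_cost(REQUESTS, IN_TOK, OUT_TOK, 3.00, 15.00)
deepseek = monthly_api_cost(REQUESTS, IN_TOK, OUT_TOK, 0.27, 1.10)

print(f"Claude Sonnet: USD {sonnet:.2f}/month")
print(f"DeepSeek-V:    USD {deepseek:.2f}/month")
print(f"Price factor:  {sonnet / deepseek:.1f}x")
```

With these assumptions the result lands inside the USD 30-50 and USD 3-6 ranges given above; longer prompts or answers shift both figures proportionally.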

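Second: self-hosting becomes realistic. Llama 4 Maverick (open-weight, MoE, 400B/17B active) must keep all 400B parameters in memory even though only the active ones are executed. The weights alone take about 800 GB in BF16, about 400 GB at 8-bit and about 200 GB at 4-bit, so the 200-280 GB range applies to a 4-bit deployment including KV cache and runtime overhead. That does not fit on 2x H100-80GB (160 GB) or 1x H200-141GB; plan for 3-4x H100-80GB or 2x H200-141GB. A 70B Dense model needs far less memory (about 35-45 GB at 4-bit) but computes all 70B parameters per token, roughly four times Maverick's 17B, and parameter ratios such as 70/400 say nothing direct about quality. MoE trades memory for compute, which makes self-hosting economical at top quality. A law firm with 30 staff can host the smaller Llama 4 Scout (109B/17B, about 55-70 GB at 4-bit) on a 1-2 GPU server (CHF 60-100k purchase) or Maverick on a larger one, with no recurring API spend and full data control.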

Third: latency vs quality. MoE models often have slightly higher latency per token than Dense models of the same active parameter count because routing adds a small overhead. Practical level in May 2026: 60-150 tokens per second for both a Dense 17B model and a 400B/17B MoE model, with the MoE model tending towards the lower end. This is irrelevant for chat applications but can matter for latency-critical voicebots.

Fourth: specialisation phenomenon. The individual experts in MoE models develop spontaneous specialisations during training – expert 7 becomes "the code expert", expert 32 "the German expert", expert 89 "the law expert". As of May 2026 this is well documented (the current DeepSeek-V generation paper, Llama 4 paper). Practical consequence: MoE models often have better code and multilingual capabilities than Dense models of the same active parameter count. For fiduciary applications with Excel formula generation and multilingual client communication that is a bonus.

Strategic consequence. Anyone starting a new AI project in May 2026 should explicitly evaluate MoE models as an option. Standard evaluation: compare three models per task, namely one proprietary top model (Claude, GPT), one open-weight MoE model (the current DeepSeek-V generation, Llama 4 Maverick) and one EU MoE model (an upcoming Mistral Large generation). For 80% of standard SME tasks the quality differences are below 5%, while the MoE price advantage is a factor of 5-10.
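The three-model comparison can be turned into a simple selection rule: take the cheapest candidate whose measured quality is within 5% of the best one. The sketch below is one possible way to encode that rule; the `Candidate` class, the model names and all scores and costs are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float        # score on your own task test set, 0..1
    usd_per_month: float  # projected cost at your request volume


def pick_model(candidates, tolerance=0.05):
    """Cheapest candidate whose quality is within `tolerance` (relative) of the best."""
    best = max(c.quality for c in candidates)
    good_enough = [c for c in candidates if c.quality >= best * (1 - tolerance)]
    return min(good_enough, key=lambda c: c.usd_per_month)


# Placeholder numbers for illustration only.
candidates = [
    Candidate("proprietary-top-model", 0.90, 40.0),
    Candidate("moe-open-weight", 0.87, 4.0),
    Candidate("moe-eu-model", 0.84, 8.0),
]
choice = pick_model(candidates)
print(choice.name)
```

The tolerance is the business decision here: with `tolerance=0.0` the rule always returns the highest-scoring model regardless of price.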

MoE architecture in detail

A MoE layer replaces the standard feedforward network in a transformer layer with a family of experts plus a router.

Layout of a MoE layer. Instead of one large FFN (e.g. 17 billion parameters) there are N experts. If each expert were as large as that FFN, N=128 would yield 2.176 trillion FFN parameters in total; that is only the theoretical upper bound, because in practice the individual experts are much smaller. A small router (a few million parameters) takes the input vector of each token and decides which K of the N experts become active. Typical configurations:

- Mixtral 8x7B: 8 experts, top-2 (K=2), so 2 of 8 experts are active per token. 47B parameters total, about 13B active.
- Mixtral 8x22B: 8 experts, top-2. 141B total, 39B active.
- Llama 4 Maverick: 128 experts, top-1. 400B total, 17B active per token.
- The current DeepSeek-V generation: 256 experts plus 1 always-active "shared expert", top-8 among the 256. 670B total, 37B active.
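How sparse each configuration is follows directly from the total and active parameter counts. The short sketch below computes the active share using figures stated in this article (in billions of parameters); the dictionary and function names are illustrative.

```python
# (total parameters, active parameters per token), in billions, as stated in this article.
MODELS = {
    "Mixtral 8x22B": (141, 39),
    "Llama 4 Scout": (109, 17),
    "Llama 4 Maverick": (400, 17),
    "DeepSeek-V (current)": (670, 37),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters that is computed for one token."""
    return active_b / total_b


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name:<22} {share:6.1%} active "
          f"({total / active:4.1f}x more parameters stored than computed)")
```

The active share says how much compute a token costs; the total count says how much memory the model occupies. The two numbers answer different questions and should not be mixed.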

Routing algorithm. The router is a simple feedforward layer: input vector (typically 4,096-12,288 dimensions) → logits over N experts → softmax → top-K selection. The K selected experts process the token, and their outputs are combined as a weighted sum, with the router probabilities as weights. Each token therefore gets its own mixture: tokens from code tend to land at code experts, German tokens at experts specialised in German.
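The routing pipeline can be sketched in a few lines of plain Python. This is a toy illustration, not any vendor's implementation: the experts are simple scaling functions, the router is a bare weight matrix, and renormalising the weights over the selected experts is an assumption (a common variant; the description above only states that router probabilities are used as weights).

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, k):
    """Return [(expert_index, weight), ...] for the k most probable experts."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalise over the top-k
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, router_weights, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(token)
    for idx, weight in route(token, router_weights, k):
        expert_out = experts[idx](token)  # only k experts are ever called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: 4 experts that scale the input by 1, 2, 3, 4; 3-dimensional tokens.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
token = [2.0, 1.0, 0.0]
print(route(token, router_weights, k=2))
print(moe_layer(token, router_weights, experts, k=2))
```

Note that `moe_layer` never calls the unselected experts, which is exactly where the compute saving comes from; all four experts still have to exist in memory.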

Load balancing. Without countermeasures the router tends to overload a small subset of experts and almost never use the others, which wastes capacity. The fix is an auxiliary loss during training that pushes towards even use. Standard in May 2026: the load-balancing loss from the Switch Transformer paper (Google, 2021), plus newer variants such as "auxiliary-loss-free load balancing" (DeepSeek V3 and later).
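As a rough illustration, the Switch Transformer style auxiliary loss multiplies, per expert, the fraction of tokens routed to it by the mean router probability it receives, sums over experts, and scales by the number of experts and a small coefficient. The sketch below assumes top-1 routing and a coefficient of 0.01; function and variable names are illustrative.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i(f_i * p_i) for top-1 routing.

    router_probs: one list of N router probabilities per token.
    f_i: fraction of tokens whose top-1 expert is i.
    p_i: mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts
    for probs in router_probs:
        winner = max(range(num_experts), key=lambda i: probs[i])
        f[winner] += 1 / num_tokens
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced batch: each of 4 tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed batch: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(load_balancing_loss(balanced))
print(load_balancing_loss(collapsed))
```

The collapsed batch yields the larger loss, so gradient descent on this term nudges the router back towards even expert usage.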

Inference behaviour. At inference the set of active experts changes per token. Hardware consequence: all experts must sit in VRAM/RAM (otherwise every expert switch triggers disk IO), but only K experts are computed per token. That is memory-heavy and compute-light. Llama 4 Maverick (400B/17B active) needs about 800 GB for its weights in BF16 without quantisation, yet computes per token only like a 17B model. With quantisation (see was-ist-quantisierung), 4-bit weights bring this down to about 200 GB, or 200-280 GB including KV cache and runtime overhead, which fits on 3-4 H100-80GB or 2 H200-141GB GPUs.
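Weight memory is simply the total parameter count times the bytes per parameter, and for MoE the total count applies, not the active one. The calculator below is a back-of-the-envelope sketch; the 20% allowance for KV cache and activations is an assumption, and real deployments vary with context length and batch size.

```python
import math

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_b, precision):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); all experts count."""
    return total_params_b * BYTES_PER_PARAM[precision]


def gpus_needed(total_params_b, precision, gpu_gb, overhead_pct=20):
    """GPU count including an assumed allowance for KV cache and activations."""
    need = weight_memory_gb(total_params_b, precision) * (100 + overhead_pct) / 100
    return math.ceil(need / gpu_gb)


for precision in ("bf16", "int8", "int4"):
    gb = weight_memory_gb(400, precision)
    print(f"400B at {precision}: {gb:.0f} GB weights, "
          f"{gpus_needed(400, precision, 80)}x 80 GB GPUs")

print("109B at int4:", gpus_needed(109, "int4", 80), "x 80 GB GPU")
```

The same arithmetic applies to Dense models; the difference is that a Dense model also computes every one of those parameters per token.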

Specialisation in training. Experts develop specialisations during the pretraining run without vendors assigning them explicitly. The current DeepSeek-V generation documents that expert #23 is "Python-specialised" (over 60% of its activations are on Python code) and expert #87 is a "Chinese-language specialist". These specialisations emerge from load-balancing pressure and the statistical patterns of the training data; they are not hard-coded, yet they are stable.
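A statement like "over 60% of activations on Python code" is an activation share: for one expert, the portion of its routed tokens that belongs to a given content category. A minimal sketch of that measurement follows; the log format and the numbers are synthetic and purely illustrative, not data from any real model.

```python
from collections import Counter, defaultdict


def activation_shares(routing_log):
    """routing_log: iterable of (expert_id, category) pairs, one per routed token.

    Returns {expert_id: {category: share of that expert's activations}}.
    """
    per_expert = defaultdict(Counter)
    for expert_id, category in routing_log:
        per_expert[expert_id][category] += 1
    return {
        expert: {cat: n / sum(counts.values()) for cat, n in counts.items()}
        for expert, counts in per_expert.items()
    }


# Synthetic log for illustration only.
log = ([(23, "python")] * 7 + [(23, "prose")] * 3
       + [(87, "chinese")] * 4 + [(87, "prose")] * 1)
shares = activation_shares(log)
print(shares[23])
print(shares[87])
```

Producing such a log requires access to the router outputs of an open-weight model; it is not available through hosted APIs.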

Practical vendor trends May 2026. DeepSeek uses the most fine-grained design (256 experts, top-8, about 5.5% of parameters active). Meta's Llama 4 uses 16-128 experts with top-1 routing (about 4% active for Maverick, about 16% for Scout). Mixtral is the least sparse of the open models listed here (8 experts, top-2, about 28% active). Google Gemini is MoE, details proprietary. Anthropic Claude and OpenAI GPT: architecture not disclosed. As of May 2026 much points to hybrid architectures (MoE mixed with Dense layers, or MoE only in certain layer blocks), but that is proprietary and not publicly verifiable.

Understand MoE in 5 steps

01 Distinguish Dense and Sparse: Dense computes all parameters per token, Sparse (MoE) only a fraction (about 4-28% in the models listed here).
02 Understand the vendor landscape in May 2026: Llama 4 Maverick (400B/17B), the current DeepSeek-V generation (670B/37B), Mixtral 8x22B (141B/39B), an upcoming Mistral Large generation (closed-source, presumably MoE).
03 Check per task: for standard SME tasks (chat, triage, summary) prefer MoE for price; for top-end reasoning prefer Dense.
04 Check self-hosting hardware needs: a 400B/17B MoE model needs about 200-280 GB at 4-bit (3-4x H100-80GB or 2x H200-141GB); smaller models (109B/17B) fit on 1x H100-80GB at 4-bit.
05 Make the API-vs-self-hosting decision: below 100k tokens/day the API is cheaper, above 1M tokens/day check self-hosting.

When MoE is the right choice

Five clear SME scenarios for MoE models.

Scenario 1: high request volume, cost sensitivity. When you have 1,000+ requests per day (client chat, email triage, RAG answers) and token cost becomes a real factor, the current DeepSeek-V generation or Mixtral is the cheapest high-quality option in May 2026. At 30,000 requests per month (mid-size fiduciary) the cost difference is USD 50-200/month in favour of MoE, or USD 600-2,400 per year and USD 3,000-12,000 over five years.

Scenario 2: self-hosting for data residency. When revDSG, the EU AI Act or professional secrecy (Art. 321 SCC) prevents you from sending client data to a US cloud, self-hosting is mandatory. As of May 2026 the realistic self-hosting models for frontier quality are all MoE: Llama 4 Maverick (400B/17B), Llama 4 Scout (109B/17B), Mixtral 8x22B, DeepSeek V3.1 open-weight. Open-weight Dense models of comparable quality are rare above 70B in May 2026; the notable exception, Llama 3.1 405B, computes all 405B parameters per token and is correspondingly expensive to run.

Scenario 3: multilingual applications. MoE models typically have better multilingual performance in May 2026 than Dense models of the same active parameter count. Reason: experts specialise on single languages or language families. For Swiss SMEs with DE/FR/IT/EN mix (fiduciary, insurance, tourism) this is a concrete advantage. Mistral and DeepSeek score here; Llama 4 is more solid than Llama 3 but still rather English-centric.

Scenario 4: code generation. MoE models typically have dedicated code experts and score well in code benchmarks (HumanEval, MBPP, SWE-Bench). For fiduciary applications with Excel formula generation, SQL query creation or API scripts, the current DeepSeek-V generation or Codestral (Mistral code model) is an alternative to Claude/GPT at factor 3-10 lower cost.

Scenario 5: burst loads. When you have irregular peak phases (e.g. tax season February-April), the low pay-per-token API prices of MoE models are practical: there is no need to commit to a fixed API volume tier, and cost scales linearly with usage during peaks.

When MoE is not the best choice

Three cases in which Dense models are preferred over MoE.

First: highest reasoning quality on the hardest tasks. As of May 2026 the hardest mathematics and reasoning benchmarks (FrontierMath, GPQA Diamond, MATH-500) are still led by top-class models presumed to be Dense (Claude Opus, the Pro tier of the current top GPT model) and by pure reasoning models (see was-ist-reasoning-modell). MoE delivers 80-95% of that quality at 10-20% of the price, which is sufficient for 95% of SME tasks but not for the remaining 5% of hardest cases.

Second: ultra-low latency. When you need a time-to-first-token under 100 ms (e.g. realtime voicebot), the top MoE models (400B+/17B active) are too slow: router logic plus VRAM bandwidth push latency to 200-400 ms. Small Dense models (Claude Haiku, the Mini tier of the current top GPT model, Gemini Flash) often deliver 80-150 ms TTFT here. For latency-critical realtime applications, prefer Dense.

Third: VRAM-tight self-hosting. With only a single A100-40GB you cannot load a 400B MoE model at all (200+ GB of VRAM even at 4-bit). Here a 7-8B Dense model (Llama 3.1 8B, Mistral 7B Instruct) is the right choice: smaller, but fully in VRAM. MoE self-hosting only pays off from 1x H100-80GB for a 4-bit Llama 4 Scout, and from 3-4x H100-80GB or 2x H200-141GB for a 4-bit Maverick.

Trap "MoE is always cheaper in self-hosting". True only when the hardware is already in place. A 400B MoE model needs about 800 GB of VRAM in BF16 or 200-280 GB with 4-bit quantisation. Even a 2x H100-80GB server, enough for a 4-bit Llama 4 Scout or Mixtral 8x22B, costs around CHF 60-80k to buy plus CHF 4-6k per year for electricity and cooling. Over 5 years that is CHF 80-110k total cost of ownership, and a Maverick-class setup costs more. Whoever processes fewer than 100k tokens per day is better off with an API.
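The break-even point can be estimated by spreading the total cost of ownership over its lifetime and dividing by the API price. The sketch below uses the mid-range of the figures above (CHF 70k purchase, CHF 5k per year running cost, 5 years); treating CHF and USD as roughly equal and using a single blended price per million tokens are simplifying assumptions, and staff time and utilisation are ignored.

```python
def self_hosting_cost_per_day(purchase, opex_per_year, years=5):
    """Total cost of ownership spread evenly over the lifetime, per day."""
    return (purchase + opex_per_year * years) / (years * 365)


def break_even_tokens_per_day(cost_per_day, api_price_per_m_tokens):
    """Daily token volume above which self-hosting is cheaper than the API."""
    return cost_per_day / api_price_per_m_tokens * 1_000_000


# Mid-range of the article's figures; assumes CHF is roughly equal to USD.
per_day = self_hosting_cost_per_day(purchase=70_000, opex_per_year=5_000)
print(f"Self-hosting: about {per_day:.0f} per day")

# Blended API prices per 1M tokens (illustrative, taken from the output list prices).
for label, price in (("premium API at 15.00", 15.00), ("MoE API at 1.10", 1.10)):
    tokens = break_even_tokens_per_day(per_day, price)
    print(f"Break-even vs {label}: {tokens / 1e6:.1f}M tokens/day")
```

Under these assumptions the break-even volume sits in the millions of tokens per day, which is why low-volume users are better served by an API.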

Trap "all MoE models are equal". As of May 2026 there are large quality differences between MoE models. Llama 4 Maverick is solid but not top in German; the current DeepSeek-V generation is very strong mathematically but somewhat formal in tone; Mixtral 8x22B is reliable but about two years old; an upcoming Mistral Large generation is closed-source, and it is unclear whether it is MoE. Per-task evaluation remains mandatory.

Trap "MoE is architecture magic". End users only care whether the model solves the task. Whether it is Dense or MoE is an implementation detail. Marketing language like "modern MoE architecture" should not distract from actual task evaluation.

Trade-offs

STRENGTHS

Frontier quality at 10-20% of Dense price
Self-hosting realistic for 100-400B models on 1-4 GPUs (after 4-bit quantisation)
Better multilingual and code abilities than Dense models of same active parameter count
Specialisations emerge spontaneously during training

WEAKNESSES

High VRAM demand – all experts must be loaded
Slightly higher latency per token than Dense of same active parameter count
More complex training algorithm (load balancing, router stability)
Behind top Dense and reasoning models on hardest reasoning benchmarks in May 2026

FAQ

Why do MoE models need so much VRAM if only 17B are active?

Because all experts must be available in VRAM: the router can pick any of them for each token, and disk IO on every pick would be far too slow. Practical consequence for Llama 4 Maverick: all 400B parameters sit in VRAM (about 800 GB in BF16), but only 17B of them are executed per token. Memory-heavy, compute-light. With 4-bit quantisation the footprint shrinks to about 200-280 GB, which fits on 3-4 H100-80GB or 2 H200-141GB GPUs.

Is the current DeepSeek-V generation really so much cheaper than Claude?

Yes. May 2026: the current DeepSeek-V generation costs USD 0.27 input / USD 1.10 output per 1M tokens, Claude Sonnet USD 3 input / USD 15 output. That is a factor of 11-14 at comparable quality for standard requests. Background: MoE architecture with aggressive sparsity (256 experts, top-8), plus Chinese datacenters with lower electricity costs. Check compliance first: DeepSeek is a Chinese vendor, which is sensitive for EU/CH fiduciaries handling client data. For internal tools (code help, triage without PII) it is often a good choice.

Are Claude and GPT MoE or Dense?

Neither vendor has officially disclosed the architecture. Indirect indicators (performance per USD, latency profiles, model family structure with Haiku/Sonnet/Opus or Mini/Pro) strongly suggest MoE or hybrid architectures in May 2026. GPT-4 has been widely reported to be MoE, but OpenAI has not confirmed this. For the current top GPT model and the current top Claude model it is likewise suspected but not confirmed. For SMEs this is practically irrelevant; what counts is task evaluation.

Can I steer experts deliberately?

No, not via standard APIs. Routing is internal and not user-controllable. As of May 2026 there are research attempts to address experts deliberately ("expert steering"), but nothing of the kind is available in production APIs. In practice you influence routing only indirectly through the prompt: "answer as a tax advisor" changes the tokens the router sees, and with them which experts tend to fire, without giving you control over the choice.
