fairlane.systems

LLAMA 4 · TECH

Llama 4 Scout and Maverick: Meta's MoE family with 10M context and 400B total parameters

Llama 4 Scout (17B active, 10M context) and Maverick (17B active, 128 experts, 400B total) – released 5 April 2026 under the Llama Community License.

Researched & fact-checked by: · As of: 2026-05

What is Llama 4?

Llama 4 is the fourth generation of Meta AI's open-weight language model family, released on 5 April 2026. Unlike the predecessors (Llama 2, Llama 3, Llama 3.1, Llama 3.3), all Llama 4 models are designed as mixture-of-experts (MoE). The licence is the Meta Llama Community License – commercially usable for companies below 700 million monthly active users, which practically covers all Swiss fiduciary and law office setups. Important: this is NOT Apache 2.0 but its own licence with specific usage clauses.

The Llama 4 family comprises three models as of May 2026. Llama 4 Scout (17B active parameters, 109B total, 16 experts) is the edge and single-GPU variant. Llama 4 Maverick (17B active parameters, 400B total, 128 experts) is the flagship variant for serious reasoning workloads. Llama 4 Behemoth (288B active, 2T total) was announced as of May 2026 but not yet released – a model at the scale of the current top GPT model and the current top Claude model, intended for research use.

The technically remarkable aspect: Scout has a context window of 10 million tokens. That is by far the largest productive context in the open-weight space in May 2026 and even against closed-source models (the current top GPT model 1M, the current top Claude model 1M, Gemini 2.5 2M), it is a factor of 5 ahead. Practical meaning: entire contract corpora or multi-year client files fit in a single prompt without a RAG pipeline.

Training data: Llama 4 was trained on around 30 trillion tokens, with – per Meta – over 200 languages, with markedly better multilingual capability than Llama 3. German, French, Italian and English are productively strong; Romansh is not officially trained (Apertus remains the right choice here). Native tool use, vision capability (multimodal pre-training with images), and JSON output with grammar constraints are integrated.

Availability May 2026: Hugging Face (meta-llama/Llama-4-Scout-17B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct), Meta's own hosted variant via llama.com, AWS Bedrock, Azure AI Foundry, Google Cloud Vertex, Groq, Together AI, Fireworks AI. Local self-host options: vLLM, Text Generation Inference, Ollama, llama.cpp.

Why Llama 4 matters for Swiss data

Llama 4 has four concrete usage arguments for Swiss fiduciary and law setups, plus two important caveats.

First: long context as a game-changer. Scout with 10M tokens solves an old problem for legal practices. A complete file with all pleadings, contracts, email correspondence and court decisions over seven years typically fits in 2-5 million tokens. With Llama 4 Scout this file can be loaded into a single prompt and queries like "Which client statements about asset position contradict later pleadings?" can be answered directly. A RAG pipeline with embeddings and chunk search is not needed here. That simplifies the architecture markedly.

Second: self-host capability. Whoever runs Llama 4 Scout on two H100 80GB themselves (in 4-bit AWQ quantisation it fits on one H100, with tensor-parallel on two H100 for higher throughput) keeps full data sovereignty. Relevant for mandates under professional secrecy per Art. 321 SCC. Maverick, by contrast, demands eight H100 – operationally sensible only for large firms.

Third: vision capability out-of-the-box. Llama 4 is natively multimodal: images can be passed directly in the prompt. For contract photo scanning, signature recognition, OCR pre-processing and receipt classification in fiduciary workflows this is practical – no separate vision model needed.

Fourth: multilingual improvements. Llama 4 is markedly better in German, French and Italian than Llama 3.3. Mistral Large 2 remains slightly ahead for EU languages, but the gap is small. For Swiss fiduciary setups that do not primarily need Romansh, Llama 4 is a serious option.

Caveat one: the Llama Community License is NOT Apache 2.0. There is an Acceptable Use Policy with concrete restrictions (no illegal activity, no discrimination, no manipulation of critical infrastructure). A 700M MAU clause demands a separate licence from Meta for very large firms – not relevant for Swiss SMEs. Still, the licence must be reviewed before any self-host setup.

Caveat two: Llama 4 has no Swiss origin. Anyone valuing data sovereignty with CH origin (strict FINMA setups, mandates involving Swiss group holdings) stays with Apertus 70B as the first choice. Llama 4 as self-host in your own CH data centre comes very close, but the training data provenance point remains.

Llama 4 in practice

Model architecture. Llama 4 is an MoE transformer. For each token prediction, a "router" selects a small subset of experts (typically 2 out of 16 in Scout, 2 out of 128 in Maverick). That means only a fraction of weights is active per forward pass. Result: inference cost scales with the "active parameters" number (17B), not with total parameters (109B or 400B). Hence the sweet spot: high quality, moderate inference cost.

Setup example with vLLM on two H100 80GB:

``` docker run --gpus all -p 8000:8000 \ vllm/vllm-openai:v0.6.3 \ --model meta-llama/Llama-4-Scout-17B-Instruct \ --max-model-len 1000000 \ --tensor-parallel-size 2 \ --quantization awq \ --gpu-memory-utilization 0.93 ```

With --max-model-len 1000000 (1M tokens) a significant reserve remains for KV cache per request. For the full 10M context, --enable-prefix-caching must be set and GPU memory actively managed – in practice, 1-2M tokens per request is the sensible range.

Maverick on 8x H100. Llama 4 Maverick at 400B total parameters is large. In 4-bit quantisation it needs around 220 GB VRAM active, plus KV cache. Setup on an H100 SXM5 box (8x 80GB = 640 GB VRAM) with --tensor-parallel-size 8 is the usual configuration. Operating cost May 2026: around CHF 25,000-35,000 per month in rent or roughly CHF 250,000-350,000 for hardware purchase.

With Ollama on RTX 4090. Llama 4 Scout in 4-bit quantisation just fits on an RTX 4090 24GB:

``` ollama pull llama4:scout ollama run llama4:scout "Briefly explain FADP Art. 6" ```

Ollama recognises the MoE architecture automatically. Performance: around 30-50 tokens/sec, context comfortable up to 128k, up to 1M with adapted KV cache management.

Tool use. Llama 4 has native tool calling in the OpenAI schema. An example prompt with a JSON schema for a Bexio receipt query works directly. This makes Llama 4 attractive for n8n workflows and LangChain agents.

Vision capability. Llama 4 accepts images as Base64-encoded JSON fields or URLs per the OpenAI Vision specification. Practical example: upload a photo of a receipt, "classify this receipt as Entertainment, Travel, Office, Client Disbursement" – Llama 4 reads the receipt and answers in structured form.

Hosting May 2026. Three productive paths. Path 1: self-host in own CH rack on two H100s – full sovereignty, high initial cost. Path 2: rental option via Infomaniak GPU instances – Swiss data residency, medium cost. Path 3: cloud API via AWS Bedrock Frankfurt or Together AI EU – fastest start, lower sovereignty.

Llama 4 to production in 5 steps

  1. 01Licence check: read the Meta Llama Community License, align the Acceptable Use Policy with the planned use case, involve compliance and legal if needed.
  2. 02Model choice: Scout for long context and single-GPU setups, Maverick for top reasoning on 8x H100, Behemoth not productive (May 2026).
  3. 03Hosting path: self-host via vLLM on two H100s (sovereignty), Infomaniak GPU instances (CH residency), AWS Bedrock Frankfurt (EU residency).
  4. 04Integration: configure Hugging Face token, load the model via meta-llama/Llama-4-Scout-17B-Instruct under vLLM or TGI, LiteLLM proxy in front.
  5. 05Use-case test: run 50-100 real client-typical queries against Apertus 70B, Mistral Large 2 and Llama 4 Scout, measure hit rate, derive routing rules.

When to use Llama 4

Llama 4 Scout is the right choice when (a) very long contexts are needed (multi-year files, complete contract corpora), (b) vision capability out-of-the-box is desired, or (c) multilingual DE/FR/IT quality with GPU self-operation is required.

Concrete cases: law firm with multi-year client files – Llama 4 Scout on two H100s with 1-2M tokens per request, full files without RAG pipeline. Fiduciary group with receipt photo workflow – Llama 4 Scout with vision capability for direct receipt classification from smartphone photos. Consulting boutique with complex contracts – Scout for long-context analysis, Maverick for difficult comparison cases.

Llama 4 Maverick is the right choice for (a) very demanding reasoning that requires the current top GPT model or the current top Claude model level, (b) multi-language requirements with high precision, and (c) sufficient GPU budget (8 H100). Maverick is on par with GPT-4o on many benchmarks and close to Claude 3.5 Sonnet – by 2026 standards that is the upper-middle range in closed-source comparison.

When not to use

If you need Romansh or Schwizerdütsch capability, Apertus 70B is the right choice, not Llama 4. Meta has no Romansh officially in the training data – quality is correspondingly weak.

If licence hygiene takes precedence in compliance review and Apache 2.0 is the desired standard, Apertus, the Mistral Small Apache variant or Phi-4 are the cleaner options. The Llama Community License is commercially usable but not as simple as Apache 2.0 or MIT.

For pure throughput workloads with short queries (classification, triage), Maverick at 400B total parameters is oversized. Here Apertus 8B, Phi-4 or Mistral Small 3.1 are more efficient – same quality on the specific task, markedly lower GPU load.

For setups where US origin is perceived as problematic (strict FINMA mandates, federal administrations, holdings with exclusively EU data policy), Apertus or Mistral remains the right choice. Self-host softens the argument but does not fully resolve it (training data origin remains US/Meta).

Trade-offs

STRENGTHS

  • 10M token context on Scout – entire client files in a single prompt
  • Native multimodal capability for vision tasks without a separate model
  • MoE architecture – inference cost scales with active parameter count (17B), not total (109B/400B)
  • Self-hostable on two H100s (Scout) – full data sovereignty possible

WEAKNESSES

  • Llama Community License – commercially usable but not as simple as Apache 2.0 or MIT
  • No official Romansh training – for CH-RM cases Apertus remains the right choice
  • Maverick demands 8 H100 or 4 H200 – operationally expensive for small to medium offices
  • Training data provenance at Meta – for strictly sovereign Swiss setups Apertus stays ahead

FAQ

What is the practical value of 10M context?

As of May 2026, 10M context is still experimental – KV cache memory and reasoning quality across such long distances are not perfect in every request. Realistic productive: 1-2M tokens per request, which covers complete client files or multi-year contract collections. Advantage over RAG pipelines: no chunk strategy, no retrieval latency, all information in one context.

How does Llama 4 stack up against Apertus 70B?

On MMLU and general reasoning, Llama 4 Scout sits slightly ahead of Apertus 70B (around 81 vs 80 points). On long-context tasks, Llama 4 is clearly ahead with 10M. On Romansh and Schwizerdütsch, Apertus is clearly ahead. On training data transparency, Apertus is clearly ahead (fully disclosed vs Meta-internal). For Swiss fiduciary standard work, Apertus is the more natural choice; for complex long-context cases, Llama 4 Scout.

Do I really need 8 H100s for Maverick?

For comfortable productive load yes. Maverick in 4-bit quantisation theoretically fits on 4 H100 80GB (220 GB VRAM active plus KV cache) but tensor-parallel split across 4 GPUs creates more communication overhead than across 8. Operational practice: 8x H100 SXM5 in a DGX-like configuration, alternatively 4x H200 (141 GB per GPU) as a more compact solution. Rental option: AWS p5.48xlarge or comparable at Together AI.

What EU AI Act duties apply to Llama 4?

Llama 4 is a general-purpose AI model per the EU AI Act. Art. 50 requires transparency: the model must be flagged as an AI system, technical documentation must be available. Meta provides a model card and a technical report – that covers the model-side duties. The operator must additionally classify the use case (check Annex III), conduct a conformity assessment if high-risk, and prepare a DPIA per GDPR Art. 35 / FADP Art. 22.

Related topics

APERTUS · COMPLIANCEApertus: the open Swiss AI model from ETH Zurich, EPFL and CSCS – status May 2026OPEN-WEIGHT MODELS - COMPARISONOpen-weight models compared: Llama 3.3/4, Mistral, DeepSeek, Qwen, Gemma, Phi-4, Command R, Falcon, GLM, ApertusVLLM · TECHvLLM: production serving for open-weight LLMs with high throughput and PagedAttentionTGI · TECHText Generation Inference (TGI): production serving from the Hugging Face universeMISTRAL LARGE · TECHMistral Large 2 and Mistral Small 3.1: the EU model pair with FR/DE/IT strengthOLLAMA · TECHOllama: local LLMs on your own hardware – where it works and where it does notSELF-HOSTED VS. CLOUD · AI CONCEPTSelf-hosted vs. cloud LLM: a decision framework for SMEs and fiduciaries

Sources

  1. Meta – Introducing Llama 4 (official blog, 5 April 2026) · 2026-04
  2. Llama 4 Scout – Hugging Face model card · 2026-05
  3. Llama 4 Maverick – Hugging Face model card · 2026-05
  4. Meta Llama Community License (current version) · 2026-04

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call