Apertus as a tool: operating the Swiss LLM from ETH, EPFL and CSCS in practice
Apertus 8B and 70B under Apache 2.0. Self-host, Swisscom API or Hugging Face. 15T tokens, 1000+ languages including Swiss German and Romansh.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is Apertus from a tool perspective?
Apertus is the first fully open Swiss language model, developed by ETH Zurich, EPFL and the Centro Svizzero di Calcolo Scientifico (CSCS) and released on 2 September 2025. Availability as of May 2026: two model sizes (Apertus-8B and Apertus-70B), both under Apache 2.0, both accessible via Hugging Face (swiss-ai/Apertus-8B-Instruct, swiss-ai/Apertus-70B-Instruct), the Swisscom API and the Public AI Network. The official portal is apertus.ethz.ch.
This article views Apertus as a tool: how it is deployed in a production CH setup, what hardware it needs, which runtime is the right one, how it integrates into LiteLLM routing and what monitoring looks like. The strategic compliance perspective is treated separately in the article "Apertus Swiss AI Modell".
Training data: around 15 trillion tokens, of which 40 percent are non-English. The training corpus covers over 1,000 languages, with particular focus on the Swiss national languages: Standard German with Swiss helvetisms, French (incl. Suisse Romande peculiarities), Italian (incl. Ticino peculiarities), Romansh (the idioms Sursilvan, Surmiran, Puter and Vallader plus the standardised written form Rumantsch Grischun) and Schwizerdütsch in the main dialects (Bernese, Zurich, Wallis, Grisons). This CH-specific language coverage is unique as of May 2026: no other frontier model, open or closed, has this profile.
Architecture: transformer decoder, very similar to Llama 3, with grouped-query attention and rotary position embeddings. Context window: 128k tokens for both variants. Vocabulary: around 256k tokens, optimised for European languages plus Romansh. Training pipeline: first stage pre-training on 15T tokens, then supervised fine-tuning (SFT) and direct preference optimisation (DPO) for chat capability, RLHF component from ETH-internal annotator teams.
The quantisations available on Hugging Face (May 2026): FP16 (original), AWQ 4-bit, GPTQ 4-bit, GGUF Q4_K_M, GGUF Q5_K_M, GGUF Q6_K and GGUF Q8_0. This breadth allows setups from edge hardware (Apertus 8B Q4 on an RTX 4060 with 8 GB VRAM) to production GPU clusters (Apertus 70B FP16 on two H100s with tensor parallelism).
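The hardware pairings above follow from simple arithmetic on weight size. A rough sketch, assuming weight memory is parameter count times bits per weight and ignoring KV cache, activations and runtime overhead. The bit widths are nominal assumptions (real GGUF Q4_K_M files use somewhat more than 4 bits per weight), and `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Nominal bit widths: assumptions for illustration, not measured file sizes.
for name, params, bits in [
    ("Apertus 8B, 4-bit", 8, 4),
    ("Apertus 70B, 4-bit", 70, 4),
    ("Apertus 70B, FP16", 70, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB of weights")
```

Under these assumptions the FP16 weights of the 70B model alone exceed a single 80 GB H100, which is why that variant is paired with two cards, while the 4-bit 8B model leaves headroom on an 8 GB consumer GPU.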
Why Apertus matters operationally
Apertus is the most-discussed model in Swiss fiduciary and law setups as of May 2026. Five operational reasons.
First: Apache 2.0 licence without additional clauses. No commercial restrictions, no MAU threshold, no separate research variant under a different licence. Self-hosting, fine-tuning, modification and commercial redistribution are all allowed. For SME compliance, this is the cleanest licensing situation.
Second: CH language domain. Anyone with client correspondence in Schwizerdütsch, contracts in Ticino Italian or advisory work in Romansh will find no other production-grade model at this level; Apertus is the only frontier model covering this language mix. For fiduciary firms in Grisons, advisory practices in the Engadin and family-law mandates in Ticino, Apertus is the only operational option.
Third: three hosting paths with different cost-sovereignty profiles. Path 1 (Swisscom API): commercial API with CH data residency, lowest entry barrier, prices around CHF 0.4-1.5 per 1M tokens. Path 2 (Infomaniak GPU instance): self-host on rented Swiss GPU, medium cost (CHF 6,000-12,000/month), full compliance control. Path 3 (own rack): higher initial investment (CHF 80,000-150,000 for 2x H100), maximum sovereignty.
Fourth: tight integration into the CH cloud stack. Swisscom is a recognised service provider in the Swiss market with its own DPA templates and SLA structures. Infomaniak, as a second anchor, offers GPU instances in Geneva under Swiss data protection. This infrastructure is established in production fiduciary and law setups in Switzerland as of May 2026.
Fifth: training data transparency. ETH/EPFL fully publish the training data sources, training setup and evaluation suite. Apertus is the world's best-documented frontier model as of May 2026, an operational advantage for FINMA SN 08/2024 Pillar 3 (model validation) and for the EU AI Act's transparency obligations for GPAI models (Art. 53). DPIAs and model validation reports are much faster to produce with Apertus than with US or PRC models.
Deploying Apertus: setup, runtime, integration
Path 1: Swisscom API. Fastest start. Sign a Swisscom contract, get an API key, query:
```python
import os

from openai import OpenAI

# OpenAI-compatible client pointed at the Swisscom endpoint
client = OpenAI(
    base_url="https://api.swisscom.ch/apertus/v1",
    api_key=os.environ["SWISSCOM_APERTUS_KEY"],
)

response = client.chat.completions.create(
    model="apertus-70b",
    messages=[{"role": "user", "content": "What is Art. 321 SCC?"}],
)
print(response.choices[0].message.content)
```
The API is OpenAI-compatible. CH data residency is contractually guaranteed. SLA tiers are negotiable.
Path 2: self-host with vLLM on Infomaniak. Apertus 70B in AWQ 4-bit on an Infomaniak H100 instance:
```
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:v0.6.3 \
  --model swiss-ai/Apertus-70B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.93 \
  --quantization awq \
  --api-key sk-apertus-prod
```
This command launches Apertus 70B on a single H100 80GB with AWQ 4-bit quantisation (around 45 GB VRAM active). Performance: 25-35 tokens/s per request, aggregated 50-80 tokens/s across all parallel requests on one card.
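A minimal client sketch for the container above, using only the Python standard library. It assumes the server is reachable on localhost:8000 (the port mapping in the docker command), that vLLM serves the model under the name passed to `--model` (its default when no separate served name is configured), and that the bearer token matches `--api-key`. `build_chat_request` and `send` are hypothetical helpers, not part of any library:

```python
import json
import urllib.request

# Assumption: local port mapping from the docker command above.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
# vLLM exposes the model under the --model name unless told otherwise.
MODEL = "swiss-ai/Apertus-70B-Instruct"


def build_chat_request(prompt: str, api_key: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("Summarise Art. 321 SCC in two sentences.", "sk-apertus-prod")
print(req.get_method(), req.full_url)
```

Calling `send(req)` against a running container returns the reply text. In a real deployment the key would be read from an environment variable or secret store rather than written into the command line or the client.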
Path 3: self-host with Ollama. Apertus 8B on smaller hardware (RTX 4070 12GB or Apple Silicon workstation):
```
ollama pull apertus:8b-instruct-q4_K_M
ollama run apertus:8b-instruct-q4_K_M "Classify this email: ..."
```
On a fiduciary office workstation with 16-32 GB RAM and no GPU, Apertus 8B in Q4_K_M quantisation runs at 10-20 tokens/s, which is enough for production classification and simple generation tasks.
LiteLLM routing. Apertus integration in LiteLLM config.yaml:
```yaml
model_list:
  - model_name: apertus-70b-ch
    litellm_params:
      model: openai/apertus-70b
      api_base: https://api.swisscom.ch/apertus/v1
      api_key: os.environ/SWISSCOM_APERTUS_KEY
  - model_name: apertus-8b-local
    litellm_params:
      # The part after "openai/" is sent to the endpoint as the model name,
      # so it has to match what the local server exposes (e.g. the Ollama tag).
      model: openai/apertus-8b
      api_base: http://localhost:11434/v1
      api_key: dummy
router_settings:
  routing_strategy: simple-shuffle
```
With this configuration, an application can use "apertus-70b-ch" for complex cases via Swisscom and "apertus-8b-local" for simple classification via the local Ollama endpoint.
Monitoring. Collect Prometheus metrics from vLLM or Ollama, ship LiteLLM audit logs to Loki, and build a Grafana dashboard showing p95 latency and model-choice distribution. For FINMA SN 08/2024 Pillar 3, each request additionally logs a prompt hash (not the plain text) and stores the model output together with a confidence score.
Fine-tuning. The Apache 2.0 licence allows modification. LoRA fine-tuning of Apertus 8B on one H100 takes 4-8 hours for a 50,000-example dataset. Important in practice: training data must be prepared in compliance with the FADP (client data without clear consent is excluded). Boutiques such as LatticeFlow and Inspire AI Schweiz offer fine-tuning services with a defined data-protection pipeline.
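The per-request audit record described above could be sketched as follows. The field names and the `audit_record` helper are hypothetical, and the confidence score is assumed to be supplied by the calling application; the source does not say how it is derived:

```python
import hashlib
import json
import time


def audit_record(prompt: str, model_alias: str, output: str, confidence: float) -> str:
    """Return one JSON log line; the prompt is stored only as a SHA-256 hash."""
    record = {
        "ts": time.time(),
        "model": model_alias,
        # Hash instead of plain text, so the log never contains the prompt itself.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output": output,
        "confidence": confidence,  # assumption: computed upstream by the application
    }
    return json.dumps(record, ensure_ascii=False)


line = audit_record(
    prompt="Client asks about the retention period for payroll records.",
    model_alias="apertus-70b-ch",
    output="Draft reply text",
    confidence=0.87,
)
print(line)
```

One JSON object per line is a format that log shippers can typically forward to Loki without custom parsing. A plain hash of a short or guessable prompt can be reversed by brute force, so a keyed hash (HMAC with a secret) may be the safer variant where prompts are highly sensitive.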
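The consent rule above can be enforced mechanically before any training run. A minimal sketch, assuming each raw record carries a boolean `consent` flag plus `prompt` and `response` fields; these field names and both helpers are hypothetical, and a flag check alone does not establish FADP compliance:

```python
import json


def filter_consented(records: list[dict]) -> list[dict]:
    """Keep only examples whose source record carries explicit consent."""
    return [r for r in records if r.get("consent") is True]


def to_chat_jsonl(records: list[dict]) -> str:
    """Serialise examples as one chat-format JSON object per line."""
    lines = []
    for r in records:
        example = {
            "messages": [
                {"role": "user", "content": r["prompt"]},
                {"role": "assistant", "content": r["response"]},
            ]
        }
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines)


raw = [
    {"prompt": "Classify: invoice reminder", "response": "accounting", "consent": True},
    {"prompt": "Classify: divorce filing", "response": "family law", "consent": False},
    {"prompt": "Classify: VAT query", "response": "tax"},  # no flag: excluded
]
print(to_chat_jsonl(filter_consented(raw)))
```

Records with a missing flag are dropped rather than kept, so the default is exclusion.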
Apertus to production in 5 steps
- 01 Choose hosting path: Swisscom API for fastest start, Infomaniak GPU instance for medium sovereignty, own rack with 2x H100 for maximum control.
- 02 Model variant: Apertus 70B for standard workloads (complex reasoning, client correspondence, legal analysis), Apertus 8B for edge use cases (classification, triage, fast answers).
- 03 Runtime setup: vLLM for production throughput on GPU, Ollama for workstation setups or Apple Silicon. Both OpenAI-compatible.
- 04 LiteLLM wiring: enter Apertus as a provider, define routing rules ("sensitive client data to Apertus, FR-specific queries to Mistral, hard reasoning cases to Claude").
- 05 Monitoring and compliance: Prometheus on runtime, Loki for audit logs, Grafana dashboard with p95 latency and model-choice distribution. Update FINMA quarterly report and DPIA every quarter.
When to use Apertus
Apertus is the right choice when (a) CH data sovereignty or professional secrecy per Art. 321 SCC is central, (b) Romansh, Schwizerdütsch or other CH-specific language must be handled, (c) an Apache 2.0 licence for commercial self-hosting is required, or (d) training data transparency for FINMA SN 08/2024 or the EU AI Act's GPAI obligations (Art. 53) is mandatory.
Concrete cases: law firm with client correspondence in Schwizerdütsch – Apertus 70B self-host on two H100 in Geneva or own rack. Fiduciary firm in Grisons with mixed DE-IT-RM mandates – Apertus 70B via Swisscom API as default, Apertus 8B self-host for triage. Insurance contract pipeline with Swiss legal framework – Apertus 70B with LoRA fine-tuning on insurance-specific vocabulary. Wealth management office with Swiss top-tier mandates – Apertus 70B on-premises, no traffic egress from the CH perimeter.
For pragmatic multi-provider setups in May 2026, Apertus is the sovereign anchor: highly sensitive workloads to Apertus, FR/IT-specific workloads to Mistral, top frontier reasoning fallback to Claude. A routing rule in LiteLLM dispatches by request classification, an audit trail lands in Loki, a FINMA report is generated quarterly.
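The dispatch rule described above can be sketched as a small classifier-driven function. This is an illustration under assumptions: the request flags are presumed to come from an upstream classifier, the aliases `mistral` and `claude` are hypothetical names for entries in the LiteLLM model list, and the precedence of hard reasoning over language is a design choice not stated in the text:

```python
from dataclasses import dataclass


@dataclass
class RoutedRequest:
    sensitive: bool       # client data or material under professional secrecy
    language: str         # ISO 639-1 code, e.g. "de", "fr", "it", "rm"
    hard_reasoning: bool  # flagged by an upstream classifier (assumption)


def route(req: RoutedRequest) -> str:
    """Return a model alias; sensitivity always wins over other criteria."""
    if req.sensitive:
        return "apertus-70b-ch"  # sovereign anchor for sensitive workloads
    if req.hard_reasoning:
        return "claude"          # hypothetical alias for the frontier fallback
    if req.language in {"fr", "it"}:
        return "mistral"         # hypothetical alias for FR/IT workloads
    return "apertus-70b-ch"      # default, including Romansh and Swiss German


print(route(RoutedRequest(sensitive=True, language="fr", hard_reasoning=True)))
print(route(RoutedRequest(sensitive=False, language="it", hard_reasoning=False)))
```

Checking sensitivity first means a sensitive French request with a hard reasoning flag still stays on Apertus, which matches the sovereignty priority of the setup.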
When not to use
For complex top frontier reasoning at math olympiad level or for four-step legal argumentation, the current top Claude and GPT models are still markedly ahead. Apertus 70B scores 78-82 points on MMLU: solid, but not at the peak.
For pure code generation, Qwen2.5-Coder, the current DeepSeek-V generation or the current top Claude model are stronger in practice. Apertus is not primarily trained as a code model.
For long-context workloads beyond 128k tokens (complete multi-year client files), Llama 4 Scout with 10M context is the right choice, not Apertus.
For vision-language cases (receipt photo classification, contract scanning), other models (Llama 4 Scout, Pixtral, Qwen-VL) must be used until an Apertus Vision release, which has no official date as of May 2026 (speculatively Q2-Q4 2027).
For setups with very low budget and under 5 million tokens per month, the Apertus self-host hardware investment is not amortisable. Here Apertus via Swisscom API or a cheaper cloud model is the more economical choice.
For latency-critical real-time chat UI, Apertus self-host TTFT (time to first token) is typically 300-600ms on two H100s – acceptable for streaming chat, rather weak for voice bots. Cloud models with optimised inference pipelines (Claude, the current top GPT model via Groq) are faster here.
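TTFT figures like these are worth verifying on one's own hardware. A minimal sketch, assuming the client exposes the response as an iterable of text chunks (as streaming clients typically do); `measure_ttft` is a hypothetical helper and the simulated stream stands in for a real endpoint:

```python
import time
from typing import Iterable


def measure_ttft(stream: Iterable[str]) -> tuple[float, str]:
    """Return (seconds until the first chunk arrived, full concatenated text)."""
    start = time.perf_counter()
    chunks = iter(stream)
    first = next(chunks, "")          # blocks until the first chunk is produced
    ttft = time.perf_counter() - start
    return ttft, first + "".join(chunks)


def simulated_stream() -> Iterable[str]:
    """Stand-in for a streaming response; the delay is arbitrary."""
    time.sleep(0.05)
    yield "Grüezi"
    yield " mitenand"


ttft_s, text = measure_ttft(simulated_stream())
print(f"TTFT: {ttft_s * 1000:.0f} ms, text: {text!r}")
```

The timer starts before the first chunk is requested, so with a lazily evaluated stream the measurement includes request setup as well as model prefill.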
Trade-offs
STRENGTHS
- Apache 2.0 licence without restrictions – commercial self-host and fine-tuning allowed
- Only frontier model with Romansh, Schwizerdütsch and Ticino-IT capability
- Three hosting paths with different sovereignty profiles (Swisscom API, Infomaniak, on-premises)
- Fully documented training data eases FINMA SN 08/2024 and EU AI Act audits
WEAKNESSES
- Top frontier reasoning markedly behind the current top Claude model and the current top GPT model
- No official vision or audio model until the Apertus 2 components ship (no official date; speculatively Q4 2026 - Q4 2027)
- On-premises hosting of 70B in FP16 requires 2x H100 with an initial investment of CHF 80-150k
- Code generation weaker than Qwen2.5-Coder or the current DeepSeek-V generation
FAQ
What does Apertus self-host cost vs Swisscom API?
Swisscom API: around CHF 0.40-1.50 per 1M tokens (70B), CHF 0.05-0.20 per 1M tokens (8B). At 5 million tokens per month, that is CHF 2-7.5 for 70B. Self-host on Infomaniak: CHF 6,000-12,000 per month for an H100 instance. Self-host in own rack: around CHF 80,000-150,000 initial for 2x H100, plus CHF 500-1,000 per month for power and maintenance. Rule of thumb: below 10M tokens/month, Swisscom API pays off; above 50M tokens/month, self-host pays off.
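The arithmetic behind these figures can be reproduced in a few lines. A sketch using only the prices quoted above; the helper names are hypothetical, and the break-even ignores staff time, utilisation limits and non-price factors such as sovereignty:

```python
def api_cost_chf(tokens_millions: float, price_per_million_chf: float) -> float:
    """Monthly API cost for a given token volume."""
    return tokens_millions * price_per_million_chf


def break_even_tokens_millions(fixed_monthly_chf: float, price_per_million_chf: float) -> float:
    """Monthly volume (in millions of tokens) at which API spend equals a fixed hosting fee."""
    return fixed_monthly_chf / price_per_million_chf


# 5M tokens/month on the 70B price band quoted above
low = api_cost_chf(5, 0.40)
high = api_cost_chf(5, 1.50)
print(f"API cost at 5M tokens: CHF {low:.2f} to {high:.2f}")

# Cheapest quoted Infomaniak instance against the highest quoted API price
volume = break_even_tokens_millions(6000, 1.50)
print(f"Price-only break-even: {volume:.0f}M tokens/month")
```

With these inputs the price-only break-even lands in the billions of tokens per month, well above the rule-of-thumb thresholds, which suggests those thresholds reflect more than per-token price.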
Which runtime is the best choice for Apertus?
For production with high throughput: vLLM (25-35 tokens/s per request, aggregated 80-130 tokens/s across all parallel requests on 2x H100). For smaller setups or workstations: Ollama (convenient, OpenAI-compatible, good on Apple Silicon). For multimodal setups: LocalAI with Apertus as LLM backend. For hub proximity and multi-model testing: Text Generation Inference. Apertus runs in production on all four.
Can I run Apertus 70B on one H100?
Yes, in AWQ 4-bit quantisation. Memory need around 45 GB VRAM active plus KV cache, fits comfortably on one H100 80GB. Performance: 25-35 tokens/s per request, aggregated 50-80 tokens/s across all parallel requests. For higher throughput, tensor-parallel on 2x H100 pays off (aggregated 80-130 tokens/s). FP16 original demands two H100.
When does Apertus 2 arrive?
Official release date not communicated as of May 2026. ETH/EPFL published a February 2026 roadmap with components Apertus Voice (CH-DE dialect capability for audio), Apertus Code (programming), Apertus Vision (multimodal), possibly Apertus MoE (200B/active-30B). Reasonable speculation: Voice and Code first (Q4 2026 - Q1 2027), Vision and MoE later (Q2-Q4 2027). Until then Apertus 8B and 70B are the productive versions.