META LLAMA · LLM PROVIDER
Meta Llama in Swiss practice: open-weight model, self-host or provider
Llama 4 Scout/Maverick and Llama 3.3 70B as the open-weight option. Licence, hardware needs, prices at Groq/Together/Fireworks and self-host reality.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is Meta Llama?
Llama is Meta's LLM family with publicly downloadable weights. In May 2026 three generations are relevant: Llama 3.3 70B (late 2024, the stable workhorse), Llama 4 Scout (April 2025, MoE architecture with 17B active out of 109B parameters, 10M-token context) and Llama 4 Maverick (April 2025, MoE with 17B active out of 400B parameters, 1M context, top of the open-weight world).
Important: open-weight is not open-source. Llama models are released under the Llama Community License, not Apache 2.0 or MIT. The licence allows commercial use as long as the service stays under 700 million monthly active users. A "Built with Llama" attribution is required, and Meta's Acceptable Use Policy must be followed. For a Swiss SME or fiduciary the 700M cap is irrelevant; the attribution lives in the imprint or footer. EU multimodal restriction applies: image/video capabilities are limited in the EU (DSA/AI Act caution by Meta).
Deployment options are three: (a) self-host on your own GPU hardware or a Hetzner GPU server, (b) provider API at Groq, Together AI, Fireworks AI, DeepInfra, Lambda, Novita or Sambanova – all run Llama 4 at prices from USD 0.15 / USD 0.60 per 1M tokens, (c) cloud bridge via Vertex AI (Llama 3.3 70B available) or AWS Bedrock. Which path makes sense depends on workload volume and compliance.
Why it matters
Llama solves a problem closed models structurally cannot: full control over the data flow. Whoever downloads the weights and runs them on their own hardware has no API calls into foreign clouds. For a fiduciary office bound by professional secrecy (Art. 321 SCC) that is the only way to process certain sensitive data without asking the client for separate consent.
The second point is cost predictability. Your own Hetzner GPU box (e.g. an A100 80GB for roughly CHF 600/month) allows unlimited inference for Llama 3.1 70B Q4. With a cloud provider at USD 0.30 per 1M output tokens you reach the same bill around 2M output tokens per month – and you are locked into a usage model. Whoever wants predictable billing is better off with self-host.
Third point: vendor lock-in is zero. You can run weights on a Hetzner GPU today, on AWS G5 next month, hybrid with burst-to-Groq the month after. The interface is OpenAI-compatible and identical. Compared with OpenAI or Claude lock-in (model gone = stack gone), Llama is insurance.
Fourth: pace of innovation. Llama 4 Scout has a 10M-token context window – more than any closed model at release. Open-weight models do not fully close the gap to the current top GPT model and the current top Claude model, but they are stable enough for 70% of fiduciary use cases.
How it works
Self-host path: download weights from llama.com or Hugging Face (licence acceptance required), pick quantisation (Q4_K_M is the standard for hardware efficiency), stand up an inference server (Ollama or vLLM for production), expose an HTTP endpoint (OpenAI-compatible).
Hardware requirements as of May 2026: Llama 3.1 8B / 3.3 8B runs Q4 in 6-12 GB VRAM, so on an RTX 3060 12GB or a small cloud GPU. CPU-only is possible but slow (under 10 tokens/second on a strong server). Llama 3.3 70B needs around 42 GB Q4 – an A100 80GB, an H100, a Mac Studio M2 Ultra or two paired RTX 3090 24GB. Llama 4 Maverick with 400B total parameters needs around 240 GB in Q4; that is 8x A100 or equivalent – for an SME effectively reachable only via cloud GPU or provider API. Llama 4 Scout (109B) sits between 70B and Maverick, about 60-70 GB in Q4.
Provider API path: open an account at Groq/Together/Fireworks, add a card, generate an API key. The call is identical to OpenAI, only api_base changes: https://api.groq.com/openai/v1 or https://api.together.xyz/v1. LiteLLM routes transparently – same code, different endpoint. Prices May 2026: Llama 4 Maverick from USD 0.15 input / USD 0.60 output per 1M tokens (DeepInfra, Together), at Groq faster (~500 tokens/second) but slightly pricier.
For a Swiss fiduciary with revDSG concerns Groq/Together are US-hosted – not ideal. Hetzner GPU servers in Falkenstein (DE) or Helsinki (FI) are the EU variant; servers in Swiss datacentres (Infomaniak, Exoscale) are the revDSG-clean variant but more expensive.
CIO decision: self-host or provider?
- 01Estimate volume: expected input and output tokens per month. Under 1M tokens/month: provider API. Over 10M: check self-host economics.
- 02Data classification: which data flows through the model? Highest confidentiality tier decides whether a US provider is acceptable or EU/CH self-host is required.
- 03Model size: 8B is enough for classification and simple QA. 70B for legal analysis. Maverick (400B total) only via provider, not SME self-host.
- 04Licence review: have legal sign off on the Llama Community License. Plan the attribution requirement in imprint/footer. Document the EU multimodal restriction.
- 05Provider choice: Together or DeepInfra for EU hosting + lowest prices. Groq for latency-critical applications (live chat).
- 06Self-host variant: Hetzner GPU (A100 80GB from CHF 600/month) or Swiss provider (Infomaniak, Exoscale). vLLM or Ollama with OpenAI-compatible endpoint.
- 07LiteLLM gateway in front: routing rule decides whether Llama, Gemini Flash or Claude is called. Central failover and logging.
When to use Llama
Llama is the right choice when (a) self-hosting is required and no other provider offers the needed control, (b) vendor lock-in should be avoided, or (c) the workload has a large code share – Llama 4 is strong on code tasks.
Concrete uses: a client FAQ bot that sends nothing abroad (Llama 3.3 8B on a Swiss server, Q4, plus Qdrant RAG). Code generation for internal tools (Llama 4 Maverick via Together API, pay-as-you-go). High-volume classification where OpenAI pricing kills the case (Llama 3.3 70B on your own Hetzner GPU, marginal cost near zero).
Vs other open-weight options: Mistral (EU provider) is the alternative with a better EU compliance profile but smaller models. Qwen (Alibaba) and DeepSeek (PRC) are technically strong but legally problematic for Swiss fiduciary work. Llama remains the open-weight default for code and general reasoning.
When not to use
Llama is the wrong choice when (a) the task demands top-end reasoning quality (Claude Opus or the current top GPT model are better), (b) no GPU hardware is available and the workload stays small (a Vertex AI call to Gemini Flash-Lite is cheaper than a Hetzner GPU setup from CHF 600/month), or (c) an Apache 2.0 or MIT licence is mandated by legal – the Llama Community License is not OSI-compliant.
Further cases: if multimodal capabilities (image, video) are to be used in the EU, Meta has restricted Llama 4 multimodal features in the EU due to DSA/AI Act caution – a closed solution like Gemini 2.5 Pro or the current top GPT model is more complete here.
Self-host without MLOps routine is a trap. Loading the model is easy; running production (GPU monitoring, updates, security patches for inference servers, A/B tests of new versions) is a job. Without that routine, stay with a provider – Together, Fireworks or DeepInfra in their EU region.
Mind the 700M MAU clause and EU multimodal restriction: no EU consumer app with Llama 4 Vision without legal review.
Trade-offs
STRENGTHS
- Open-weight: full control over data flow, no API lock-in
- Self-host possible, Swiss sovereignty realistically achievable
- Llama 4 Scout: 10M-token context window, more than any closed model
- Provider competition pushes prices: USD 0.15/0.60 per 1M tokens for Maverick
- Code capabilities in Llama 4 on par with the current top GPT model and the current top Claude model
WEAKNESSES
- Llama Community License is not OSI-compliant – legal review required
- EU multimodal restriction: image/video limited in the EU
- Self-host requires MLOps routine, GPU hardware, ongoing maintenance
- 70B/Maverick demand serious VRAM budget (A100 80GB+ or 8x H100)
- Reasoning peak (legal logic, math proofs) lags Claude/GPT
FAQ
Can I use Llama commercially?
Yes, as long as your service stays under 700 million monthly active users – never a concern for a Swiss fiduciary or SME. You must carry "Built with Llama" in your imprint or footer and comply with Meta's Acceptable Use Policy (no weapons, no abuse, no illegal activity). The licence is not OSI-compliant, so if your legal team explicitly requires Apache 2.0 or MIT, Llama is out.
What hardware do I need for a client FAQ?
For Llama 3.3 8B Q4 an RTX 3060 12GB or a small cloud GPU is enough. For a four-person office with 200 queries/day that is oversized. CPU-only on a strong server (32 cores, 64 GB RAM) delivers ~5-10 tokens/second – sufficient for an FAQ bot if user wait time is acceptable. For 70B or Llama 4 you need GPU or a provider API.
Is Llama 4 or Llama 3.3 the better choice?
In May 2026 Llama 3.3 70B is the more robust choice for fiduciary work: more stable provider landscape, more tested quantisations, established MLOps practice. Llama 4 Scout/Maverick is superior in context window (10M tokens for Scout) and code benchmarks but younger and needs more VRAM. Recommendation: 3.3 70B as the self-host default, Llama 4 via provider API for tasks requiring the long context window.
How does Llama relate to the EU AI Act?
Llama 4 Maverick very likely crosses the 10^25 FLOP threshold and is classified as a general-purpose AI with systemic risk. Meta delivers the model cards and training-data summary required by Art. 53. If you as a Swiss fiduciary use Llama only as a deployer (not training or fine-tuning yourself), the bulk of obligations sits with Meta. Your duties: document the process the model runs in, provide transparency notices to clients.
Related topics
Sources
- Meta – Llama 4 Multimodal Intelligence (release post) · 2025-04
- Llama Community License (Llama 4 variant) · 2025-04
- Llama 4 Pricing Across Providers (DeepInfra, Together, Fireworks) · 2026-05
- Llama 3.1 Hardware Requirements: 8B, 70B, 405B (VRAM guide) · 2026-04
- Llama 4 Complete Developer Guide 2026 (Codersera) · 2026-03