TGI · TECH
Text Generation Inference (TGI): production serving from the Hugging Face universe
TGI is Hugging Face's Apache 2.0 inference server for production workloads with continuous batching, FlashAttention and direct Hugging Face Hub integration.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is Text Generation Inference?
Text Generation Inference (TGI) is an open-source inference server for large language models, developed by Hugging Face and available in version 2.4+ under Apache 2.0 as of May 2026. The repository lives at github.com/huggingface/text-generation-inference with over 10,000 GitHub stars. TGI is Hugging Face's official production serving solution – the same infrastructure powers the inference endpoints and inference API on huggingface.co.
Unlike vLLM, which lives as a pure open-source project under UC Berkeley / Linux Foundation, TGI is tightly tied to the Hugging Face Hub: every model on huggingface.co can be loaded directly into TGI without a conversion step. "model = huggingface/Llama-3.3-70B-Instruct" is enough – TGI loads the weights, recognises the architecture automatically, starts inference. This tight hub integration is the most important differentiator from vLLM.
Functionally, TGI and vLLM overlap heavily: both offer continuous batching (TGI calls it "request batching"), both use FlashAttention 2 / FlashAttention 3 for efficient attention compute, both support tensor parallelism and quantisation (AWQ, GPTQ, INT8, FP8, EETQ, Marlin kernel), both offer OpenAI-compatible endpoints. Differences are in the details: TGI is complemented by the sister project text-embeddings-inference for embeddings, TGI behaves more robustly on very rare model architectures, vLLM is typically 10-20 percent ahead on pure throughput.
Version May 2026: TGI 2.4+ with support for Llama 4 Scout and Maverick, Mistral Large 2, Qwen 3, the current DeepSeek-V generation, Apertus 8B and 70B, plus vision-language models (Idefics 3, Qwen2-VL, LLaVA-NeXT). Quantisation via bitsandbytes, GPTQ, AWQ, EETQ, FP8 and Marlin.
The interface is OpenAI-compatible on POST /v1/chat/completions plus native endpoints under /generate and /generate_stream. So existing OpenAI SDKs, LiteLLM and LangChain work without adaptation.
Why TGI matters for Swiss data
TGI has four concrete arguments for Swiss fiduciary and law setups.
First: Hugging Face Hub proximity as an operations advantage. Whoever regularly tests across models (Apertus vs Mistral vs Llama vs DeepSeek) saves model conversion with TGI. A new model is productive in three minutes. With vLLM or llama.cpp, conversion from HF safetensors to GGUF or AWQ quantisation is often a separate step. TGI is operationally more direct here.
Second: stability on rare architectures. Some open-weight models have special architectures – multi-query attention with unusual configurations, MoE variants with custom routing logic, vision-language combinations. TGI handles these well through hub integration: every model that lands on huggingface.co is typically also tested in TGI. vLLM is faster, but on rare architectures small friction points occasionally appear.
Third: embeddings via the sister project. text-embeddings-inference (TEI) comes from the same Hugging Face team, has the same operations logic and serves embedding models like bge-large, e5-large, gte-large or nomic-embed-text. Whoever sets up a RAG pipeline has two compatible building blocks with TGI + TEI – same Docker setup, same monitoring logic, same API conventions.
Fourth: licence and sovereignty. TGI is Apache 2.0. It runs entirely in your own rack or on a Swiss GPU instance at Infomaniak. Hugging Face as a company is US-based, but the software itself has no external dependencies – no data leaves the server. For FINMA SN 08/2024 and EU AI Act Art. 10 compliance this is sufficient as long as models are loaded once initially from the hub (air-gapped setups pre-load models via the huggingface-cli tool).
Fifth, operational detail: TGI has a very mature telemetry pipeline. Prometheus metrics are deep – per layer, per token, per request. Audit logs can be extracted in structured JSON format. Valuable for regulated setups (FINMA, EU AI Act Art. 15).
How TGI works technically
TGI is a hybrid implementation: a Rust router process for request handling and a Python inference worker per GPU or tensor-parallel group.
Setup example. On a Hetzner server with two H100 80GB:
``` docker run --gpus all --shm-size 4g -p 8080:80 \ -v $PWD/data:/data \ ghcr.io/huggingface/text-generation-inference:2.4.1 \ --model-id meta-llama/Llama-4-Scout-17B-Instruct \ --max-input-tokens 128000 \ --max-total-tokens 132000 \ --num-shard 2 \ --quantize awq ```
This command launches Llama 4 Scout on two GPUs with tensor-parallel split, AWQ 4-bit quantisation and 128k input context. The API is available at http://localhost:8080/v1/.
Router-worker architecture. The Rust router accepts requests, performs authentication (--api-key) and puts them into a queue. The worker process (Python with PyTorch) pulls batches from the queue and feeds them to the model. This split allows robustness: if the worker crashes, the router keeps collecting requests and restarts the worker without external clients losing connection.
FlashAttention. TGI uses FlashAttention 3 on Hopper GPUs (H100, H200, B200) as of May 2026 – an attention implementation optimised for tensor cores with 1.5x to 2x speedup over standard attention. On Ampere GPUs (A100, RTX 4090), FlashAttention 2 is used.
Quantisation choice. Recommendation May 2026 for typical setups: AWQ (Activation-aware Weight Quantization) at 4 bits for 70B models. AWQ delivers the best quality at 4 bits because it accounts for activation distribution. FP8 on Hopper GPUs is an alternative with about 2x speed over FP16 but slightly higher quality loss (1-2 percent on MMLU).
Streaming responses. GET /generate_stream delivers server-sent events with token-by-token output. Important for chat UIs with progressive rendering. The OpenAI-compatible streaming interface on /v1/chat/completions with stream=true works identically.
Production pattern Switzerland May 2026. A law firm runs two TGI nodes in Geneva (Infomaniak) and in Zurich (own rack), both with Apertus 70B on two H100s. An Nginx layer in front with active-active routing. LiteLLM in between for fallback logic (if a TGI node fails, the request goes to Mistral La Plateforme). Audit logs from TGI stdout into Loki, metrics into Prometheus, alerts in Grafana on p95 latency and queue depth.
Embeddings via TEI. Complementarily, text-embeddings-inference runs on a smaller server (CPU-only or RTX 4060 suffices). Example: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 --model-id BAAI/bge-large-en-v1.5. The API matches OpenAI /v1/embeddings.
TGI to production in 5 steps
- 01Hardware check: NVIDIA GPU with at least 24 GB VRAM (RTX 4090, L40S, H100), CUDA 12.4+, Linux with Docker.
- 02Model choice: define HF Hub ID (e.g. meta-llama/Llama-4-Scout-17B-Instruct, swiss-ai/Apertus-70B-Instruct), pick quantisation (AWQ for 4 bit, FP8 for Hopper GPUs).
- 03Start the Docker container: ghcr.io/huggingface/text-generation-inference:2.4.1 with --model-id, --max-input-tokens, --num-shard (if tensor-parallel) and --quantize.
- 04LiteLLM wiring: enter TGI as an OpenAI-compatible provider, logical model name in config.yaml, define routing rules.
- 05Add TEI for embeddings: separate container ghcr.io/huggingface/text-embeddings-inference with bge-large or nomic-embed-text. Prometheus on both endpoints, Grafana dashboard with p95 latency and queue depth.
When to use TGI
TGI is the right choice when (a) the team tests many different Hugging Face models, (b) vision-language models or other multimodal architectures should run productively, or (c) embeddings and LLM serving should run on a unified operations logic.
Concrete cases: consulting boutique with an experimental character – four different models (Apertus 70B for sovereignty, Mistral for FR/IT, Llama 4 Scout for long context, Qwen 3 for code) on the same infrastructure. TGI makes switching between models fast. Law firm with a vision pipeline (contract photo scan, signature recognition, OCR pre-processing with vision-language model) – Idefics 3 or Qwen2-VL in TGI. Fiduciary office with a productive RAG setup – TGI for LLM, TEI for embeddings, unified operations pipeline.
For setups with future multi-model needs, TGI is also forward-looking: any new model release on huggingface.co is productively testable within hours. The operations logic remains the same.
When not to use
For pure throughput maximisation on a single, stable model, vLLM is typically 10-20 percent ahead in May 2026. Whoever needs the last drop of GPU efficiency goes to vLLM.
For CPU-only setups, TGI is not the right choice – the architecture is GPU-optimised. Here llama.cpp or Ollama are better.
For Apple Silicon setups, the same applies – TGI uses CUDA and ROCm, not Metal.
For multi-modal setups with Whisper, Stable Diffusion and TTS, LocalAI is more convenient because it bundles all modalities under one API. TGI focuses on language and vision-language, but covers no audio and no image generation.
For small setups with one or two users, TGI is over-engineered – Ollama on the same hardware is faster to set up and delivers comparable results.
Trade-offs
STRENGTHS
- Direct Hugging Face Hub access – model switching without conversion step
- Apache 2.0 licence, fully self-hostable, deployable EU/CH-compliant
- Robust on rare architectures and vision-language models
- Sister project TEI for embeddings yields a coherent RAG operations pipeline
WEAKNESSES
- Pure LLM throughput typically 10-20 percent behind vLLM
- GPU is mandatory – no CPU or Apple Silicon mode
- Configuration effort somewhat higher than Ollama, especially with tensor-parallel setups
- Hugging Face Hub dependency for initial model download (solve in advance for air-gapped setups)
FAQ
When TGI instead of vLLM?
When the team tries many different Hugging Face models, TGI is faster through direct hub integration. On rare architectures (new MoE variants, exotic vision models), TGI is typically more stable. If embeddings should also run locally, TGI + TEI is a coherent family. For maximum throughput on a stable model, vLLM is ahead.
How do vision-language models run in TGI?
Since TGI 2.0 (2024), vision-language models are natively supported. Productive in May 2026: Idefics 3 (HF's own family), Qwen2-VL (Alibaba), LLaVA-NeXT, Pixtral (Mistral). Images are transmitted via multipart/form-data or Base64-encoded JSON fields. The OpenAI-compatible interface accepts image_url inputs per the GPT-4 Vision spec.
Do I need a Hugging Face account?
Only for models behind a licence acceptance gate (Llama family, some Mistral models). Apache 2.0 models like Apertus or Qwen Apache 2.0 variants are downloadable without an account. If needed: pass HF_TOKEN as an environment variable to TGI. Air-gapped setups pre-load models via the huggingface-cli tool and copy them into the isolated network.
What performance does TGI deliver on two H100s?
Example May 2026 with Apertus 70B in AWQ 4-bit on two H100s with tensor-parallel: aggregated 100-160 tokens/s across all parallel requests, p50 latency per request around 3-5 seconds for 200-token responses, p95 latency under 8 seconds at 20 parallel requests. TTFT typically under 300ms. With Llama 4 Scout on one H100, 250-400 tokens/s aggregated are possible.