LOCALAI · TECH

LocalAI: OpenAI-API-compatible all-rounder for LLM, TTS, STT and vision in one box

LocalAI is an MIT-licensed self-hosting server that bundles LLM, image, audio and embeddings under one OpenAI API. Bare-metal or Docker.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is LocalAI?

LocalAI is an open-source self-hosting server for multimodal AI workloads, started in 2023 by Ettore Di Giacinto and available in version 2.x under the MIT licence on github.com/mudler/LocalAI as of May 2026. The project has more than 25,000 GitHub stars and an active community; the official portal lives at localai.io.

The special thing about LocalAI: it is not just an LLM runtime but an orchestrator merging different model types behind a single OpenAI-compatible API. A LocalAI instance answers POST /v1/chat/completions (LLM chat), POST /v1/embeddings (embedding vectors), POST /v1/audio/transcriptions (Whisper speech-to-text), POST /v1/audio/speech (TTS), POST /v1/images/generations (Stable Diffusion / Flux) and POST /v1/rerank (BGE reranker). All endpoints follow the OpenAI specification – i.e. an existing OpenAI client can be redirected to it without adaptation.

Under the hood, LocalAI runs as a Go-based service that orchestrates multiple backend engines: llama.cpp for LLMs, whisper.cpp for speech recognition, Diffusers or stablediffusion.cpp for image generation, eSpeak-NG / Bark / Piper for TTS, sentence-transformers for embeddings. The backend choice per model is configured via YAML; LocalAI loads the respective engine on demand in a subprocess.

Deployment options May 2026: bare-metal installation as a Go binary on Linux, Docker image localai/localai with various tags (cpu, gpu-nvidia, gpu-amd, all-in-one), Kubernetes deployment with Helm chart, Docker Compose setups for edge use cases. The "all-in-one" variant ships with preinstalled models – Llama 3.3, Whisper, Stable Diffusion 3.5 – and is intended for quick demos.

Version May 2026: 2.20+ with the important addition "P2P federation" (several LocalAI nodes can share models and load) and "function calling" (tool use through OpenAI-compatible JSON schemas on any capable model).

Why LocalAI matters for Swiss data

LocalAI targets a specific configuration common in Swiss fiduciary and law offices: several AI modalities in one compliance box.

First: multi-modality without provider sprawl. A law office that wants to transcribe dictations (Whisper), summarise contracts (LLM), find legal citations in the internal archive (embeddings) and enrich client presentations with generated graphics (Stable Diffusion) can do all this with LocalAI on a single machine in a Swiss data centre. The alternative – four different cloud providers with four contracts, four DPAs per FADP Art. 9, four audit trails – is markedly more complexity.

Second: full data sovereignty. LocalAI runs entirely inside the own network perimeter. No external API call, no telemetry, no hidden model synchronisation. For clients under professional secrecy per Art. 321 SCC, this is the strongest form of "data in our rack". FINMA SN 08/2024 Pillar 1 (data classification) and EU AI Act Art. 10 (data governance) are comparatively easy to fulfil here.

Third: OpenAI-compatible API as switch protection. Whoever uses LocalAI as a backend and tomorrow finds that a particular model runs better via Mistral La Plateforme changes a LiteLLM routing rule – not code in twenty microservices. Applications talk to LocalAI the way they talk to OpenAI; the switch is an address, not a migration.

Fourth: bare-metal option without Docker. Some Swiss banks and insurers have not approved Docker as container technology (old security policies). LocalAI installs as a static Go binary directly on a Linux server – no container runtime needed. A practical argument in SecOps conversations with conservative IT departments.

Fifth: multi-tenancy via API keys. LocalAI supports multiple API keys with different model permissions. In a law firm with separated client compartments you can issue an own key per client and filter the audit logs by client key. Important for cleanly documented evidence preservation.

How LocalAI works technically

LocalAI is a Go service that loads a model configuration file per model and starts the matching backend process on call.

Setup example. On a Hetzner server with RTX 4090 (24 GB VRAM):

``` docker run --gpus all -d \ -p 8080:8080 \ -v $PWD/models:/build/models \ -v $PWD/config:/build/config \ --name localai \ localai/localai:v2.20.1-aio-gpu-nvidia-cuda-12 ```

The all-in-one variant loads Llama 3.3 8B, Whisper-Large-v3, Stable Diffusion 3.5 Medium and nomic-embed-text on first start – around 30 GB download. The API is available at http://localhost:8080/v1/.

Model configuration via YAML. Each model has its own YAML file under /build/config/. Example for Apertus 8B:

```yaml name: apertus-8b backend: llama-cpp parameters: model: apertus-8b-q4_k_m.gguf context_size: 8192 threads: 8 f16: true gpu_layers: 32 rope_freq_base: 500000 template: chat: | {{.System}} User: {{.Input}} Assistant: ```

Whisper for dictations. Speech recognition with the large Whisper-Large-v3 model on CPU or GPU:

``` curl http://localhost:8080/v1/audio/transcriptions \ -H "Content-Type: multipart/form-data" \ -F file="@dictation.mp3" \ -F model="whisper-large-v3" \ -F language="de" ```

Result: JSON with transcript text, optional word timestamps and confidence score. For Swiss lawyer dictations, Whisper-Large-v3 is the productive choice – Swiss Standard German is recognised well; with Schwizerdütsch and Walliserdeutsch it gets hard. Apertus Voice (in development as of May 2026) will be the better solution mid-term.

P2P federation. Multiple LocalAI nodes can discover each other and share models / load. Configured via LOCALAI_P2P_TOKEN. Practical for a law firm with two offices (Zurich and Bern) – load stays local, a model update on one node synchronises to the other.

Function calling and tool use. Models with tool-use capability (Llama 3.3+, Mistral Small 3.1+, Qwen 3, Apertus 70B-Instruct) are addressed via the OpenAI function-calling spec. LocalAI parses tool calls and returns them in OpenAI format.

Monitoring. LocalAI exports Prometheus metrics on /metrics: localai_requests_total, localai_request_duration_seconds, localai_model_load_duration_seconds. Logs via stdout into Loki. Audit logs with prompt hash (not plain text) per request can be enabled via the config flag audit: true.

LocalAI to production in 5 steps

01Hardware check: Linux server with at least 16 GB RAM and ideally an NVIDIA GPU (RTX 4090 / L40S / H100), CUDA 12.4+.
02Start LocalAI via Docker: localai/localai:v2.20.1-aio-gpu-nvidia-cuda-12 for the all-in-one variant with preinstalled models, or localai/localai:v2.20.1-gpu-nvidia-cuda-12 for a minimal image with own model selection.
03Configure models: YAML files under /build/config/ per model, defining backend (llama-cpp, whisper, diffusers), parameters (quantisation, context size, GPU layers).
04Set up API keys: LOCALAI_API_KEY per team or per client, restrict rights to specific models.
05Monitoring and audit: Prometheus on /metrics, Loki for logs, optional audit: true for prompt-hash logging – all prerequisites for FINMA SN 08/2024 and EU AI Act compliance.

When to use LocalAI

LocalAI is the right choice when (a) several AI modalities are needed in one setup, (b) an OpenAI-compatible API without provider lock-in is desired, or (c) bare-metal installation without Docker is required.

Concrete cases: law firm with its own dictation system – Whisper transcription, LLM summary, embeddings-based search in earlier dictations, all in one LocalAI instance. Fiduciary firm with receipt processing – OCR pre-stage (via external tool), LLM classification, embeddings for similar receipts, occasional image generation for client presentations. Insurance broker with mixed workload – claims transcription, claims classification, contract Q&A through embeddings RAG.

For pilot phases, LocalAI is well suited too: the architecture "one API, many modalities" matches the OpenAI world and makes the later switch to a cloud strategy easy. SME providers wanting to build an internal "AI suite" without juggling five different cloud contracts are well served by LocalAI.

When not to use

Whoever only needs LLM inference on a single model is better served by Ollama. LocalAI brings an orchestrator that is unnecessary overhead in pure LLM setups.

For the highest throughput requirements on GPU (50+ parallel requests per second), vLLM is superior – LocalAI uses llama.cpp as its LLM backend, which does not reach the same continuous batching level as vLLM.

For setups in which the individual modalities live on different servers anyway (LLM on GPU server, Whisper on CPU server, Stable Diffusion on another GPU), LocalAI bundling is counterproductive. Cleaner to serve each modality with its optimal tool (vLLM for LLM, Whisper server for STT, ComfyUI for image).

For production multi-tenancy with hundreds of clients and strict SLAs, LocalAI is still too young as of May 2026 – the roadmap is evolving but multi-tenant maturity lags behind established enterprise platforms.

For individual users without server admin skills, Ollama or LM Studio are more convenient.

Trade-offs

STRENGTHS

One OpenAI-compatible API for LLM, Whisper, Stable Diffusion, embeddings and TTS
MIT licence with full source code visibility
Bare-metal installation as a Go binary possible, no Docker requirement
P2P federation allows distributed setups without external providers

WEAKNESSES

LLM throughput below vLLM level – unfit for 50+ parallel requests
Configuration effort higher than Ollama, especially with many models
Multi-tenant maturity lags established enterprise platforms
Model maintenance via YAML files requires a maintenance process

FAQ

How does LocalAI differ from Ollama?

Ollama is an LLM runtime for language models – one model per request, one modality. LocalAI is a multimodal orchestrator – LLM, Whisper, Stable Diffusion, embeddings, TTS, rerank models all under one OpenAI API. Whoever needs only LLMs is faster with Ollama; whoever wants several modalities in one setup is better served by LocalAI.

Which TTS models does LocalAI support?

May 2026: Piper (efficient, acceptable quality, many languages), Bark (expressive, slow, English-leaning), eSpeak-NG (very fast, robotic-sounding), Coqui-TTS in the non-commercial variant, plus XTTS-v2 (multilingual, voice-clone capable). For Swiss Standard German speech output, Piper with the de_CH voice pack is the most stable choice. Apertus Voice is not yet productively available as of May 2026.

Can I install LocalAI without Docker?

Yes. LocalAI is a Go service and installs as a static binary directly on a Linux server. Build from the GitHub repository with "make build" or download the pre-built binary from the releases section. Important for compliance setups in banks and insurers where Docker is not approved.

Is LocalAI EU AI Act compliant?

LocalAI itself is open-source software and does not fall into the high-risk category. EU AI Act duties depend on the use case and the model used. LocalAI advantages for compliance: full data sovereignty, MIT licence with source code access, prompt-hash audit logs activatable. Duty to classify the use case, conduct a DPIA and prepare a model card stays with the operator.

Sources

LocalAI – official documentation · 2026-05
mudler/LocalAI – GitHub repository and releases · 2026-05
LocalAI model gallery – config templates · 2026-04
LocalAI 2.x changelog and P2P federation notes · 2026-05

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call