
Piper: the open-source local TTS system for privacy-sensitive applications

Piper is an MIT-licensed, local text-to-speech system built on ONNX. It is free, runs fully locally, and is very fast on CPU. As of May 2026 it offers good German voices at hobby-grade quality, which is good enough for internal tools.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Piper?

Piper is a local text-to-speech system developed by the Rhasspy project (Mike Hansen, Nabu Casa). The repository github.com/rhasspy/piper is MIT-licensed and part of the Home Assistant ecosystem. As of May 2026 the project is active with ongoing releases, is not commercially funded, and has about 6,500 GitHub stars.

The architecture is based on VITS, a variational-inference TTS model, exported to ONNX format for platform-independent execution. ONNX Runtime works on CPU, GPU, and Apple Silicon. This makes Piper extremely portable: a single statically linked binary (about 5 MB) plus a voice model (about 60-120 MB per voice) runs on a Raspberry Pi 5, a laptop, or a server. There is no Python dependency at runtime and no CUDA setup.

As of May 2026 the voice library spans over 100 voices in more than 30 languages. German is well covered with voices such as de_DE-thorsten-medium (male), de_DE-eva_k-x_low (female), and de_DE-karlsson-low. The quality tiers are x_low, low, medium, and high, with high delivering the best sound quality at a larger model size. Swiss German is not available.

Performance is strong for a CPU-only system. On a Raspberry Pi 5, Piper reaches about 0.5x real time (10 seconds of audio take 20 seconds to generate); on a typical laptop (Intel i5 or Apple M2) it reaches 2-5x real time, and with a GPU more than 10x real time. Short replies of 1-3 sentences are quoted at 100-300 ms of generation time, which is fast enough for live telephony on modest hardware.
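To make the real-time figures concrete, here is a minimal sketch of the underlying arithmetic. The function name is illustrative and not part of Piper; the only input taken from the text is the Raspberry Pi 5 example.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time needed to generate a clip at a given real-time factor.

    A factor of 0.5 means half of real-time speed; 2.0 means twice real time.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Raspberry Pi 5 example from the text: 10 s of audio at 0.5x real time.
print(synthesis_seconds(10, 0.5))  # 20.0
```

Measuring the factor on your own hardware with your own sentences is more reliable than any published number, since it depends on the voice tier and the CPU.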

Sound quality in May 2026 sits between the old Festival/eSpeak systems and ElevenLabs. Persistent drawbacks are slightly robotic prosody, breaths that are not fully natural, and sometimes monotone pitch over longer sentences. Piper does not reach the quality of a recorded human speaker; for internal tools, IVR announcements, and reading out technical notes, it is sufficient.

Why it matters

For Swiss fiduciaries and SMEs with strict data-protection requirements, Piper is, as of May 2026, the most practical option for free, local TTS. Three arguments make the difference.

First: data residency. ElevenLabs, Azure TTS, and Google TTS generate audio in US or EU cloud regions, so client text must be sent there for processing. For data under professional secrecy (Art. 321 StGB) this is not permissible without consent. Piper runs on-premise, so client text never leaves your own server. Of the options compared here, only Piper guarantees Swiss data residency.

Second: cost. ElevenLabs Creator costs USD 99/month for 100k characters, enough for 100-120 minutes of audio. A fiduciary with 10 hours of monthly TTS demand (e.g. for a voice assistant answering client FAQs) would need a plan several tiers higher. Piper is free once the hardware is bought (CHF 200 for an ARM mini-PC). For any volume above 100 hours/month, Piper pays off in the first month.
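The cost argument can be sketched as a small calculation. This is a deliberately simple model under stated assumptions: cloud pricing is treated as linear in 100-minute bundles at the quoted USD 99 (real plans are tiered), USD and CHF are treated as roughly equal, and setup and maintenance time are ignored. The function names are illustrative.

```python
import math


def monthly_cloud_cost(minutes_needed: float,
                       minutes_per_bundle: float = 100.0,
                       price_per_bundle: float = 99.0) -> float:
    """Cloud cost if audio is bought in fixed bundles (assumed linear pricing)."""
    bundles = math.ceil(minutes_needed / minutes_per_bundle)
    return bundles * price_per_bundle


def months_to_break_even(hardware_cost: float, cloud_cost_per_month: float) -> int:
    """Number of months until one-off hardware is cheaper than the cloud plan."""
    if cloud_cost_per_month <= 0:
        raise ValueError("cloud_cost_per_month must be positive")
    return math.ceil(hardware_cost / cloud_cost_per_month)


# 10 hours (600 minutes) per month against a CHF 200 mini-PC.
cost = monthly_cloud_cost(600)
print(cost, months_to_break_even(200, cost))  # 594.0 1
```

Plugging in the current price list and an honest estimate of admin hours gives a more realistic break-even point than this sketch.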

Third: availability. Cloud TTS fails when the internet connection drops, the provider has downtime, or the quota is exceeded. Piper runs locally, with no network dependency, no provider risk, and no rate limits. For critical infrastructure (emergency IVR, internal on-call announcer) locality is a security argument.

The trade-off is quality. Piper sounds noticeably synthetic, and clients notice that no human is speaking. This is acceptable for internal tools, technical announcements, and low-frequency applications. For premium voice agents that must build trust, Piper is too audibly synthetic. A pragmatic middle ground is ElevenLabs as primary and Piper as fallback when cloud TTS fails.

How it works

Installation is trivial. On Linux:

wget https://github.com/rhasspy/piper/releases/latest/download/piper_amd64.tar.gz
tar -xzf piper_amd64.tar.gz
cd piper
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/de/de_DE/thorsten/medium/de_DE-thorsten-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/de/de_DE/thorsten/medium/de_DE-thorsten-medium.onnx.json
echo "Hello, welcome to our office." | ./piper --model de_DE-thorsten-medium.onnx --output_file welcome.wav

The binary takes text from STDIN and generates audio as WAV. Options include --output_raw for pipe-to-player, --sentence_silence for pauses between sentences, --length_scale for speech speed, --noise_scale for voice variation.
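For applications that call the binary from code, a thin wrapper helps. The following is a minimal sketch in Python, assuming the piper binary sits in the current directory; the function names are illustrative, and only the flags named above are used.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_piper_command(model: str,
                        output_file: str,
                        length_scale: Optional[float] = None,
                        sentence_silence: Optional[float] = None,
                        binary: str = "./piper") -> List[str]:
    """Assemble the argument list for one Piper invocation."""
    cmd = [binary, "--model", str(model), "--output_file", str(output_file)]
    if length_scale is not None:
        cmd += ["--length_scale", str(length_scale)]
    if sentence_silence is not None:
        cmd += ["--sentence_silence", str(sentence_silence)]
    return cmd


def synthesize(text: str, model: str, output_file: str, **options) -> Path:
    """Feed text to Piper on STDIN and return the path of the written WAV."""
    cmd = build_piper_command(model, output_file, **options)
    # check=True raises CalledProcessError if Piper exits with an error.
    subprocess.run(cmd, input=text.encode("utf-8"), check=True)
    return Path(output_file)
```

A call such as synthesize("Hello, welcome to our office.", "de_DE-thorsten-medium.onnx", "welcome.wav", length_scale=1.1) then mirrors the shell example. Passing the arguments as a list avoids shell quoting problems with user-supplied text.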

For server use, an HTTP wrapper exists. piper-tts-server (Python package) starts a FastAPI server on port 5000:

pip install piper-tts
uvicorn piper_tts.server:app --host 0.0.0.0 --port 5000
curl -X POST http://localhost:5000/synthesize -d "Hello world" -o reply.wav
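The same request can be sent from application code. This is a minimal sketch using only the Python standard library; it assumes the /synthesize endpoint accepts the raw text as the POST body and returns WAV bytes, as the curl call suggests. The Content-Type header and the function names are assumptions, not a documented API.

```python
import urllib.request


def build_synthesize_request(text: str,
                             base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=text.encode("utf-8"),
        method="POST",
        # Assumed header; adjust to whatever the server actually expects.
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def fetch_wav(text: str,
              base_url: str = "http://localhost:5000",
              timeout: float = 10.0) -> bytes:
    """Send the text to the local server and return the response body."""
    request = build_synthesize_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

An explicit timeout matters here: without one, a hung TTS process would block the calling thread indefinitely.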

Alternatively, via Home Assistant: Piper runs as an add-on in the Hassio stack with a web UI and a ready HTTP API. This is the easiest variant for smart-home setups.

For telephony integration with Twilio: a Node server takes LLM replies, calls Piper via HTTP (or by spawning a subprocess), receives WAV, converts it to ULAW (8 kHz) via ffmpeg, and sends it over Twilio Media Streams to the caller. Latency: Piper generation 200-400 ms plus conversion 50 ms plus network 100 ms, for a total under 600 ms. That makes Piper telephony-capable, though without ElevenLabs polish.
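The conversion step and the latency budget can be sketched as follows. The text describes a Node server; this sketch uses Python only to stay consistent with the other examples. The helper names are illustrative, the ffmpeg options shown are standard ones for headerless 8 kHz mono mu-law output, and the default conversion and network values are the estimates from the paragraph above.

```python
from typing import List


def build_ulaw_command(wav_path: str, ulaw_path: str, ffmpeg: str = "ffmpeg") -> List[str]:
    """ffmpeg arguments for converting a WAV file to raw 8 kHz mono mu-law."""
    return [
        ffmpeg, "-y",             # overwrite the output file if it exists
        "-i", wav_path,
        "-ar", "8000",            # resample to 8 kHz
        "-ac", "1",               # downmix to mono
        "-acodec", "pcm_mulaw",   # G.711 mu-law encoding
        "-f", "mulaw",            # raw output without a WAV header
        ulaw_path,
    ]


def latency_budget_ms(generation_ms: int,
                      conversion_ms: int = 50,
                      network_ms: int = 100) -> int:
    """Sum of the three latency components named in the text."""
    return generation_ms + conversion_ms + network_ms


print(latency_budget_ms(200), latency_budget_ms(400))  # 350 550
```

The command list can be passed to subprocess.run. Twilio Media Streams carries the audio as base64-encoded payloads, so the raw bytes still need to be chunked and encoded before sending.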

Voice selection: as of May 2026 the rhasspy/piper-voices repository on Hugging Face covers about 15 German voices. The thorsten voice (medium tier) is the most widely used and sounds the most natural. eva_k is a female alternative; the karlsson voice is deeper and calmer. Piper itself is MIT-licensed and the voices carry no license fee, but individual voices can come with their own dataset licenses, so check each voice's model card before commercial use.

For specialised vocabulary (proper names, technical terms) there is no pronunciation-dictionary function as with ElevenLabs Pro. The workaround is phonetic spelling in the input (e.g. respelling "Müller" as "Myoo-ler" where needed). If many terms are problematic, a pre-processing layer with rule-based replacement pays off.
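Such a pre-processing layer can be very small. The following is a minimal sketch, assuming whole-word replacement is enough; the rule table is hypothetical apart from the "Müller" example above, and the right respellings have to be found by ear for each voice.

```python
import re
from typing import Dict

# Hypothetical rule table: written form -> phonetic respelling.
PRONUNCIATION_RULES: Dict[str, str] = {
    "Müller": "Myoo-ler",
    "IVR": "I V R",
}


def apply_pronunciation_rules(text: str, rules: Dict[str, str]) -> str:
    """Replace whole-word occurrences of each key with its respelling."""
    if not rules:
        return text
    # Longest keys first, so a longer term wins over a shorter one it contains.
    keys = sorted(rules, key=len, reverse=True)
    alternatives = "|".join(re.escape(key) for key in keys)
    # Lookarounds keep "IVR" from matching inside a longer word such as "IVRs".
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)")
    return pattern.sub(lambda match: rules[match.group(1)], text)


print(apply_pronunciation_rules("Frau Müller nutzt das IVR.", PRONUNCIATION_RULES))
# Frau Myoo-ler nutzt das I V R.
```

Keeping the table in a separate file makes it possible for non-developers to maintain it as new problem terms turn up.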

Piper setup in 5 steps

01 Prepare hardware: a Raspberry Pi 5 (CHF 90 + PSU) is enough for light load; an ARM mini-PC (CHF 200) or a Hetzner CPX31 (CHF 18/month) suits server use. A GPU is optional for sub-100 ms latency.
02 Install binary and voice: download the Piper release from GitHub and add a German voice from Hugging Face (de_DE-thorsten-medium as default).
03 Start HTTP server: use the piper-tts-server package or your own FastAPI wrapper on port 5000, with a systemd unit or PM2 for auto-restart on crash.
04 Integrate into application: send an HTTP request to /synthesize and process the WAV response. For telephony, convert to ULAW via ffmpeg. Add pronunciation pre-processing with a rule table for proper names.
05 Set up caching: generate frequent phrases (greeting, standard announcements) once and store them in object storage or a local file cache, for latency below 10 ms on repeated content.
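The caching in step 05 can be sketched as a local file cache keyed by voice and text. This is a minimal sketch, not a production design: the function names are illustrative, and the synthesize argument stands for whatever function actually calls Piper.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: str, voice: str, text: str) -> Path:
    """Derive a stable file name from the voice and the exact text."""
    digest = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.wav"


def cached_synthesize(text: str,
                      voice: str,
                      cache_dir: str,
                      synthesize: Callable[[str, str], bytes]) -> bytes:
    """Return cached audio if present; otherwise synthesize once and store it."""
    path = cache_path(cache_dir, voice, text)
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Including the voice in the key means that switching voices does not serve stale audio. If pronunciation rules are applied, hashing the text after pre-processing keeps the cache consistent with rule changes.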

When to use Piper

Piper is the right choice when (a) data-protection requirements force local processing, (b) volume exceeds 100 hours of TTS per month, or (c) a local fallback for cloud outages is desired.

Concrete cases: a law office builds an internal voice-notes app for dictation – Piper local on a NAS or server, no cloud transfer of client content. A fiduciary builds an IVR intro for incoming calls (greeting, options) – Piper provides the stock announcements, generated once, cached. A Home Assistant setup announces smart-home status – Piper as add-on, no internet dependency.

Piper also works as an emergency fallback: when cloud TTS is down or the quota is exhausted, Piper takes over as the second output path and keeps calls running, if less naturally. Clients still hear a spoken reply instead of silence or an error tone.
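The fallback pattern can be sketched in a few lines. This is a minimal sketch: the two callables stand for a cloud TTS client and a local Piper call, both hypothetical here, and a real implementation would catch narrower exception types and add a timeout to the primary call.

```python
from typing import Callable, Tuple

Synthesizer = Callable[[str], bytes]


def synthesize_with_fallback(text: str,
                             primary: Synthesizer,
                             fallback: Synthesizer) -> Tuple[bytes, str]:
    """Try the primary engine first; use the local engine if it fails.

    Returns the audio plus a label naming the engine that produced it.
    """
    try:
        return primary(text), "primary"
    except Exception:
        # Broad catch for illustration only: outage, quota error, timeout.
        return fallback(text), "fallback"
```

Returning the label makes it easy to log how often the fallback is used, which is a useful signal for provider reliability.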

When not to use

For premium voice agents with high client trust demand, Piper is too recognisably synthetic. Anyone needing client-facing speech close to speaker-recording quality must use ElevenLabs or Azure TTS.

Piper is not designed for voice cloning: the VITS architecture allows it technically, but Piper has no workflow and no voice library for cloning. ElevenLabs is the right choice here, with the legal caveats.

For extremely latency-critical applications (sub-200 ms), Piper is borderline: on CPU it needs 200-400 ms for short replies. With a GPU this drops to 50-100 ms. Setup effort rises, though the data-residency advantage remains as long as the GPU is local rather than in the cloud.

For multilingual applications with code-switching inside a sentence (e.g. a German sentence that contains an English word such as "office") Piper is weak: it uses one language model per request and cannot switch mid-text.

For Swiss German there is no Piper model. Anyone wanting dialect replies must take a different path (e.g. pre-recorded clips for typical phrases).

For high-quality audio production (audiobooks, podcasts, ads) Piper is not first choice – ElevenLabs delivers clearly better quality.

Trade-offs

STRENGTHS

Free and MIT-licensed, fully local, guaranteed Swiss data residency
Very fast on CPU – viable latency on a Raspberry Pi 5
Over 100 voices in 30+ languages, well-maintained High German models
No quotas, no rate limits, no provider risk

WEAKNESSES

Recognisably synthetic – insufficient for premium voice agents
No voice-cloning workflow, no pronunciation dictionary
No Swiss German voice, no in-sentence code-switching
Setup, updates, and voice maintenance are self-service without vendor support

FAQ

How does Piper sound vs. ElevenLabs?

Recognisably synthetic. Piper has usable prosody and good pronunciation for High German, but breaths, pitch variation, and accent variation are noticeably limited. ElevenLabs in May 2026 is almost indistinguishable from speaker recording. Pragmatic listening test: render 3 typical sentences from the use case in both systems and compare.

What hardware does Piper run on?

Practically everything: Raspberry Pi 5, Apple Silicon, Intel/AMD CPUs with AVX2, NVIDIA GPUs. As of May 2026 ARM and Apple Silicon support is stable, and Windows builds are available. Each loaded voice needs 200-400 MB of RAM and 60-120 MB of disk space.

Can I train my own voice?

Yes, but it takes effort. Piper offers a VITS-based training pipeline via piper-train. It needs about 5-20 hours of clean speaker recordings plus several days of GPU training. This is rarely the right investment for an SME, since stock voices usually suffice.

Are there Swiss German voices?

No, not as of May 2026. The Piper voice library has no Swiss German voice. High German voices are the only option for German-language applications. Anyone needing dialect must use pre-recorded clips.

Sources

rhasspy/piper – GitHub repository and releases · 2026-05
Piper voice library on Hugging Face (rhasspy/piper-voices) · 2026-05
Home Assistant – Piper add-on documentation · 2026-04
VITS paper – Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech · 2026-03
