ElevenLabs: the industry reference for natural TTS voices and voice cloning

ElevenLabs offers a proprietary cloud TTS API with the most natural voices available as of May 2026. Pricing starts at USD 5/Mo (Starter) and USD 22/Mo (Creator). The turbo-v2.5 model delivers sub-400 ms latency for telephony; 30+ languages and voice cloning are available.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is ElevenLabs?

ElevenLabs is an audio-AI company founded in 2022, with offices in London and New York, focused on text-to-speech (TTS). Its Series C in January 2025 valued the firm at roughly USD 3B, and as of May 2026 it is the industry reference for natural-sounding synthesised voices. The product is proprietary and only available as a cloud API: there is no open-source counterpart and no affordable self-host tier for SMEs.

The model family covers three models. eleven_multilingual_v2 is the quality model: best voice fidelity, higher latency (1-2 seconds). eleven_turbo_v2_5 (turbo-v2.5) is the low-latency model as of May 2026: sub-400 ms audio generation, suitable for real-time telephony. eleven_flash_v2_5 is the fastest, slightly lower-quality variant. Voice cloning is available as Instant Voice Clone (from a 30-second sample) or Professional Voice Clone (30 minutes to several hours of recordings, higher fidelity).

Language coverage as of May 2026 spans 32 languages, including German, French, Italian, and English in several accents. Swiss German is not offered as a TTS language; Swiss voice agents generate High German answers instead, which is usually acceptable.

Hosting as of May 2026 is spread across a US primary region and several edge regions (Europe, Asia) for lower latency. Audio generation itself runs in the US cloud; the edge regions only handle caching and delivery. For GDPR/revDSG-bound applications this means a data processing agreement (DPA) with ElevenLabs can be signed, but data still leaves the EU/Switzerland. A transfer-impact assessment is mandatory.

The tier structure is graded. Starter: USD 5/Mo for 30,000 characters (about 30 minutes of audio). Creator: USD 22/Mo for 100,000 characters plus professional voice cloning. Pro: USD 99/Mo for 500,000 characters. Scale: USD 330/Mo for 2M characters plus PCM 44.1 kHz and 192 kbps MP3 quality. Enterprise: on request, with guaranteed service levels.
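The arithmetic behind these figures can be sketched in a few lines. The rule of thumb of about 1,000 characters per minute of audio is inferred from the Starter figures (30,000 characters for about 30 minutes); real speaking rates vary by voice, language, and text. The function name is illustrative, and only the Starter and Scale figures are used.

```python
# Rule of thumb taken from the Starter figures: 30,000 characters ~ 30 minutes.
CHARS_PER_MINUTE = 1_000  # assumption; varies by voice, language, and text


def tier_estimate(price_usd: float, characters: int) -> dict:
    """Return approximate audio minutes and cost per 1,000 characters."""
    return {
        "minutes": characters / CHARS_PER_MINUTE,
        "usd_per_1k_chars": round(price_usd / (characters / 1_000), 4),
    }


starter = tier_estimate(5, 30_000)
scale = tier_estimate(330, 2_000_000)

print(starter)  # 30 minutes of audio
print(scale)    # 2,000 minutes, roughly 33 hours
```

On these numbers Starter and Scale cost nearly the same per character, so the higher tier mainly buys volume and features rather than a lower unit price.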

Why it matters

For voice agents the voice is the first thing callers hear and judge. A robotic TTS voice immediately reveals that there is no human on the line, and callers feel they are not being taken seriously. A natural voice with breaths, pitch variation, and clean articulation builds trust.

ElevenLabs delivers the best TTS voice quality on the market as of May 2026. Coqui (open source) and Piper are usable for internal tools but sound noticeably synthetic. Microsoft Azure TTS and Google WaveNet play in the same technical league as ElevenLabs, yet their voices sound 2-3 years behind the state of the art. OpenAI TTS (/v1/audio/speech) has been available since November 2023 and is qualitatively similar but offers less voice variety.

For Swiss applications two properties are critical. First, the High German voice is the best commercially available as of May 2026, almost indistinguishable from a recording of a human speaker. Second, turbo-v2.5 delivers TTS in under 400 ms, which allows a full STT+LLM+TTS chain in under 1.5 seconds: a technically viable voice agent.
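A quick way to sanity-check the 1.5-second chain is a latency budget. The sketch below takes the two figures from the text (1.5 seconds total, 400 ms for TTS) and treats the STT and LLM timings as hypothetical measurements you would replace with your own.

```python
TOTAL_BUDGET_MS = 1500   # end-to-end target from the text
TTS_MS = 400             # upper bound for turbo-v2.5 from the text


def remaining_budget(stt_ms: int, llm_ms: int) -> int:
    """Milliseconds left after STT, LLM, and TTS; negative means over budget."""
    return TOTAL_BUDGET_MS - (stt_ms + llm_ms + TTS_MS)


# Hypothetical measurements, not vendor figures.
print(remaining_budget(stt_ms=300, llm_ms=600))  # 200
print(remaining_budget(stt_ms=500, llm_ms=900))  # -300
```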

Voice cloning is the double-edged feature. With 30 seconds of audio ElevenLabs can reproduce a voice, which is tempting for an office (the manager lends his voice to the bot receptionist). Legally it is a minefield. The revDSG recognises the right to one's own voice as part of personality. The EU AI Act (in force as of May 2026) treats audio that imitates a real person's voice as a deepfake subject to transparency obligations: it must be disclosed as artificially generated. Without explicit, documented consent from the source person, voice cloning is a legal risk.

For professional voice agents we recommend, as of May 2026, only ElevenLabs stock voices (synthetically generated, no real person behind them); Rachel, Antoni, Adam, and Dorothy are tested. Use your own voice clones only with consent and a contract.

How it works

The ElevenLabs API exposes several endpoints: /v1/text-to-speech/{voice_id} for standard generation, /v1/text-to-speech/{voice_id}/stream for streaming audio, /v1/voices for voice inventory and cloning, /v1/history for past generations, /v1/user for quota status.

Example standard cURL:

curl --request POST \
  --url "https://api.elevenlabs.io/v1/text-to-speech/21m00Tcm4TlvDq8ikWAM" \
  --header "xi-api-key: $ELEVEN_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"text":"Hello, how can I help you?","model_id":"eleven_turbo_v2_5","voice_settings":{"stability":0.5,"similarity_boost":0.75}}' \
  --output reply.mp3
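The same request can be sketched in Python with only the standard library. The function names are illustrative; the URL, headers, and JSON body mirror the cURL example, and the API key is read from the same ELEVEN_API_KEY environment variable. Building the request is separated from sending it so the payload can be inspected without a network call.

```python
import json
import os
import urllib.request

API_BASE = "https://api.elevenlabs.io"


def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the cURL example."""
    payload = {
        "text": text,
        "model_id": "eleven_turbo_v2_5",
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
    }
    return urllib.request.Request(
        url=f"{API_BASE}/v1/text-to-speech/{voice_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(voice_id: str, text: str, out_path: str = "reply.mp3") -> None:
    """Send the request and write the returned audio bytes to disk."""
    request = build_tts_request(voice_id, text, os.environ["ELEVEN_API_KEY"])
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(out_path, "wb") as audio_file:
            audio_file.write(response.read())
```

Calling synthesize("21m00Tcm4TlvDq8ikWAM", "Hello, how can I help you?") would write reply.mp3, provided the key is set and the quota allows it.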

Voice settings are critical. stability (0-1) controls how consistent the voice stays across long text: high for news, low for artistic reading. similarity_boost (0-1) controls how closely the output follows the original voice: high for voice-cloning fidelity, low for more variation. style (0-1, multilingual_v2 only) controls expressive intensity. use_speaker_boost (true/false) amplifies the characteristic features of the voice.
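One way to keep these settings consistent across an application is a small validated container. This is a sketch, not part of any ElevenLabs SDK: the class, the default values, and the two presets are assumptions that merely follow the ranges and guidance described above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VoiceSettings:
    stability: float = 0.5
    similarity_boost: float = 0.75
    style: float = 0.0              # per the text, multilingual_v2 only
    use_speaker_boost: bool = True  # assumed default

    def __post_init__(self) -> None:
        # Reject values outside the documented 0-1 range early.
        for name in ("stability", "similarity_boost", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def to_payload(self) -> dict:
        """Dictionary suitable for the voice_settings field of a request."""
        return asdict(self)


# Hypothetical presets following the guidance in the text.
NEWS_READER = VoiceSettings(stability=0.85, similarity_boost=0.75)
EXPRESSIVE = VoiceSettings(stability=0.3, similarity_boost=0.6, style=0.5)
```

The preset values are starting points for a listening test, not recommended constants.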

For streaming against Twilio: /v1/text-to-speech/{voice_id}/stream sends audio chunks as ULAW or MP3 as soon as the first tokens are generated. turbo-v2.5 delivers first bytes in 100-200 ms – that makes real-time telephony viable. Twilio Media Streams receives the audio chunks and plays them over the phone channel.
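The hand-off to Twilio can be sketched as follows. Assumptions: the streaming endpoint accepts an output_format query parameter with the value ulaw_8000 (8 kHz mu-law, the telephony format), and Twilio Media Streams expects each chunk base64-encoded inside a JSON "media" event. The send callable is hypothetical and stands in for whatever WebSocket library carries the Twilio connection.

```python
import base64
import json
from urllib.parse import urlencode

API_BASE = "https://api.elevenlabs.io"


def stream_url(voice_id: str, output_format: str = "ulaw_8000") -> str:
    """URL of the streaming endpoint with the requested audio format."""
    query = urlencode({"output_format": output_format})
    return f"{API_BASE}/v1/text-to-speech/{voice_id}/stream?{query}"


def twilio_media_message(stream_sid: str, audio_chunk: bytes) -> str:
    """Wrap one raw mu-law chunk in a Twilio Media Streams 'media' event."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio_chunk).decode("ascii")},
    })


def relay(chunks, stream_sid: str, send) -> int:
    """Forward audio chunks to Twilio as they arrive; returns chunks sent."""
    sent = 0
    for chunk in chunks:
        if chunk:  # skip empty reads
            send(twilio_media_message(stream_sid, chunk))
            sent += 1
    return sent
```

Forwarding each chunk as soon as it arrives, rather than buffering the full reply, is what keeps the first-byte latency of the model visible to the caller.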

Voice cloning via /v1/voices/add: POST with multipart audio (mp3, wav), together with name, description, and labels. Instant Voice Clone needs 30 seconds to 5 minutes of audio, Professional Voice Clone 30 minutes to several hours. The cloned voice returns with a voice_id and can be used via the same TTS endpoints.
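A minimal sketch of the multipart body for such an upload, using only the standard library. The form field names (name, description, labels as a JSON string, files for the audio samples) are assumptions based on the description above and should be checked against the API reference; the function name is illustrative.

```python
import json
import uuid


def encode_voice_upload(name: str, description: str, labels: dict,
                        samples: dict) -> tuple:
    """Encode the form as multipart/form-data.

    samples maps a file name to its raw audio bytes.
    Returns (body, content_type).
    """
    boundary = uuid.uuid4().hex
    fields = {"name": name, "description": description,
              "labels": json.dumps(labels)}
    parts = []
    for key, value in fields.items():
        text_part = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"'
            f"\r\n\r\n{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))
    for filename, audio in samples.items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="files"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body and content type would be sent as a POST to /v1/voices/add with the xi-api-key header. A production version would also refuse to upload unless a documented consent record for the speaker exists.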

Pronunciation dictionaries (your own lexicon for pronouncing proper names, brands, and technical terms) are available from the Pro tier. They matter for fiduciary voice agents, because client names would otherwise be mispronounced.
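Such a lexicon is commonly written in the W3C Pronunciation Lexicon Specification (PLS) format. The sketch below builds a PLS file with alias rules from a plain mapping; that ElevenLabs accepts exactly this shape is an assumption to verify against its documentation, the upload step is not shown, and the sample entry is invented for illustration.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(aliases: dict, lang: str = "de-DE") -> str:
    """Return a PLS document mapping written forms to spoken aliases."""
    ET.register_namespace("", PLS_NS)  # emit PLS as the default namespace
    lexicon = ET.Element(f"{{{PLS_NS}}}lexicon", {
        "version": "1.0",
        "alphabet": "ipa",
        f"{{{XML_NS}}}lang": lang,
    })
    for written, spoken in aliases.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = written
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = spoken
    declaration = '<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(lexicon, encoding="unicode")


# Hypothetical entry: spell out an abbreviation letter by letter.
document = build_lexicon({"KPMG": "Ka Pe Em Ge"})
```

Alias rules replace the written form with a spelling the voice already pronounces correctly, which is usually easier to maintain than phoneme transcriptions.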

The web UI additionally offers dubbing (auto-translation plus voice-clone for videos), voice library (community voices), and studios for multilingual audio production. Not relevant for API-based voice agents.

ElevenLabs setup in 5 steps

01 Sign up at elevenlabs.io and pick a tier (Starter for tests, Creator for voice cloning, Pro/Scale for production). Review the data processing agreement (DPA) and data-protection docs, and write a transfer-impact assessment.
02 Pick a voice: stock voices from the library (Rachel, Antoni, Adam, and Dorothy are tested for German), or clone your own with speaker consent via /v1/voices/add. Tune the voice settings (stability, similarity_boost) in a pilot test.
03 Pick a model: eleven_turbo_v2_5 for real-time telephony (sub-400 ms), eleven_multilingual_v2 for quality production (1-2 seconds), eleven_flash_v2_5 for fastest responses.
04 Integrate: secure the API key, wire /v1/text-to-speech/{voice_id}/stream to Twilio Media Streams or your own audio player. Maintain a pronunciation dictionary for proper names and technical terms.
05 Monitor: poll the /v1/user endpoint for quota status, send a Telegram alert at 80 percent usage, and keep an audit log with per-application character counts for cost separation.
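The monitoring in step 05 can be sketched as below. Assumptions: the /v1/user response carries a subscription object with character_count and character_limit fields, and the alert goes out through the Telegram Bot API sendMessage method. The bot token, chat id, and function names are placeholders. The alert request is built but not sent, so the logic can be checked offline.

```python
import urllib.parse
import urllib.request

ALERT_THRESHOLD = 0.80  # 80 percent, as in step 05


def quota_usage(user_payload: dict) -> float:
    """Fraction of the character quota already used."""
    subscription = user_payload["subscription"]
    return subscription["character_count"] / subscription["character_limit"]


def build_telegram_alert(bot_token: str, chat_id: str, usage: float):
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    text = f"ElevenLabs quota at {usage:.0%}"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text})
    return urllib.request.Request(
        url=f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data=data.encode("utf-8"),
        method="POST",
    )


def check_quota(user_payload: dict, bot_token: str, chat_id: str):
    """Return an alert request when usage crosses the threshold, else None."""
    usage = quota_usage(user_payload)
    if usage >= ALERT_THRESHOLD:
        return build_telegram_alert(bot_token, chat_id, usage)
    return None
```

A scheduled job would fetch /v1/user, pass the parsed JSON to check_quota, and hand any returned request to urllib.request.urlopen.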

When to use ElevenLabs

ElevenLabs is the right choice when (a) voice quality matters, (b) latency under 500 ms is required and (c) US cloud hosting is acceptable.

Concrete cases: a voice agent for appointment booking at an insurer – High German stock voice via turbo-v2.5, client consent to cloud transfer of speech obtained at call start. A learning platform with audio companions to modules – standard voices via multilingual_v2, higher quality, latency unimportant. An audiobook producer creates synthetic speaker voices for non-fiction – Professional Voice Clone with speaker consent, studios for multi-chapter workflows.

For marketing audio (radio spots, explainer videos, podcast intros) ElevenLabs is a fast solution on the USD 99/Mo Pro tier, with no need for expensive studio bookings.

When not to use

For internal tools with low quality requirements, Piper (local, free) is sufficient – ElevenLabs quality is overkill when only a few notes get read aloud.

For strictly GDPR/revDSG-bound applications (client data without consent), ElevenLabs is problematic. Data is processed in the US even with edge caching closer by. A transfer-impact assessment is mandatory.

Voice cloning without documented consent from the source person is legally risky – personality rights, trademark rights (for famous people), EU AI Act in May 2026. Anyone deploying a cloned voice should (a) have written consent and (b) clearly declare the synthetic nature.

For Swiss German TTS there is no commercial provider in May 2026 – not even ElevenLabs. High German synthesis is the pragmatic path.

For very large audio volumes (10M characters/month and beyond) ElevenLabs gets expensive. Here a switch to Azure TTS (often about half the per-character price) or self-host with Coqui pays off – at lower quality.

For real-time applications with sub-200 ms requirements, even turbo-v2.5 is borderline – for ultra-low-latency voice bots, Piper local with own GPU is faster, at lower quality.

Trade-offs

STRENGTHS

Best TTS voice quality on the market in May 2026 – almost indistinguishable from speaker recording
turbo-v2.5 with sub-400 ms latency for real-time telephony
32 languages, mature High German and English stock voices
Voice cloning and pronunciation dictionaries for professional applications

WEAKNESSES

Proprietary, no open-source counterpart, no SME-affordable self-host
Generation in US cloud – transfer-impact assessment mandatory for client data
Voice cloning without consent legally risky under revDSG and EU AI Act
At high volumes (10M chars/month) more expensive than Azure TTS or self-host

FAQ

Can I clone my own voice?

Yes, from the Creator tier (USD 22/Mo). Instant Voice Clone works with 30 seconds of audio, Professional Voice Clone with several hours of recordings. Legally: only with documented consent from the source person, and the synthetic nature must be declared on use. Otherwise you risk infringing personality rights and breaching the EU AI Act as of May 2026.

How much audio can I generate with USD 99/Mo?

The Pro tier (USD 99/Mo) includes 500,000 characters, about 8 hours of spoken audio. For comparison: the Creator tier (USD 22/Mo) includes 100,000 characters, about 100-120 minutes, and the Scale tier (USD 330/Mo) includes 2M characters, about 33 hours. Beyond that, consider an enterprise contract or a switch to Azure TTS (cheaper per character).

Is ElevenLabs GDPR-compliant for client data?

Conditionally. A data processing agreement (DPA) with ElevenLabs is available, but audio generation runs in the US. For client data under professional secrecy (fiduciary, lawyer, doctor) a transfer-impact assessment is mandatory and the result is often negative. Pragmatic: get client consent to cloud transfer of the spoken reply at call start, or use Piper local as fallback.

turbo-v2.5 or multilingual_v2?

turbo-v2.5 for real-time telephony and voice agents – sub-400 ms latency, acceptable quality. multilingual_v2 for studio production (audiobooks, podcast intros, learning videos) – best quality, higher latency (1-2 seconds). Pragmatic: pilot both per application and run a listening test.
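The choice can be expressed as a simple rule: take the highest-quality model whose latency fits the budget. The latency bounds below are the figures from the text; the quality ranks and the function name are illustrative assumptions, and the listening test remains the final arbiter.

```python
# Upper-bound generation latencies as stated in the text, in milliseconds.
# eleven_flash_v2_5 is left out because the text gives no figure for it.
MODELS = [
    # (model_id, worst-case latency in ms, quality rank: higher is better)
    ("eleven_multilingual_v2", 2000, 2),
    ("eleven_turbo_v2_5", 400, 1),
]


def pick_model(latency_budget_ms: int) -> str:
    """Return the highest-quality model whose latency fits the budget."""
    candidates = [m for m in MODELS if m[1] <= latency_budget_ms]
    if not candidates:
        raise ValueError(f"no model fits a {latency_budget_ms} ms budget")
    return max(candidates, key=lambda m: m[2])[0]


print(pick_model(500))    # eleven_turbo_v2_5
print(pick_model(5000))   # eleven_multilingual_v2
```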

Sources

ElevenLabs – model overview (turbo-v2.5, multilingual-v2, flash) · 2026-05
ElevenLabs pricing – Starter, Creator, Pro, Scale tiers · 2026-05
ElevenLabs API reference – text-to-speech, streaming, voice cloning · 2026-04
ElevenLabs Trust Center – security, GDPR, SOC 2 · 2026-04
Artificial Analysis – Text-to-Speech quality benchmark · 2026-05
