Building blocks for chat and voice bots compared: Whisper, Deepgram, ElevenLabs, Piper, Twilio, Vapi, Retell, WhatsApp, Rasa, Botpress
Ten building blocks for voice and chat bots. STT, TTS, telephony, voice-AI platforms, and chatbot frameworks compared directly. As of May 2026.
Researched & fact-checked by: DuneDive LLC · As of: 2026-05
What is this about?
A modern voice or chat bot is not a single product but a chain of building blocks. For a voice agent on the phone, the chain typically runs: incoming call (Twilio) -> speech-to-text (Whisper or Deepgram) -> language model (Claude or GPT) -> text-to-speech (ElevenLabs or Piper) -> outbound audio (Twilio). Every block is swappable, and every choice has consequences for latency, cost, voice quality, and compliance.
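The chain can be sketched as three swappable callables. This is a minimal illustration, not a production design: `VoicePipeline` and the `fake_*` stages are hypothetical names, and real stages would wrap the vendor SDKs or local models.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable, so any block can be swapped independently.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """Run one caller utterance through the STT -> LLM -> TTS chain."""
        transcript = self.stt(caller_audio)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Hypothetical stand-ins so the sketch runs without any vendor account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"I need an appointment")
```

Swapping Deepgram for Whisper, or ElevenLabs for Piper, then means replacing one callable while the telephony layer stays untouched.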
The ten building blocks here cover five families: speech-to-text (Whisper, Deepgram), text-to-speech (ElevenLabs, Piper/Coqui), telephony and voice platforms (Twilio, Vapi, Retell AI), messaging APIs (WhatsApp Business, Telegram), and chatbot frameworks (Rasa, Botpress). Some are direct replacements, others complement each other.
For a Swiss SME the critical questions are: can the blocks be hosted in EU/CH, is Swiss German understood, does latency stay below 500 ms, and is per-minute cost acceptable? As of May 2026, Whisper (run locally via whisper.cpp) is the only STT option that reliably handles Swiss German. Deepgram is English-dominated; ElevenLabs delivers the best voices but sits in US regions.
Why it matters
Four axes decide fit for the Swiss market: language, latency, telephony, and data residency.
Language: Swiss German is the hardest test for any STT system. Every system handles High German, but Deepgram and Google STT fail on Swiss German because their models were trained predominantly on English data. Whisper (large-v3, plus the turbo variant as of May 2026) handles Swiss German surprisingly reliably, because the training included multilingual YouTube data. For a fiduciary or insurance office in Zurich this is the only practical path.
Latency: in a phone call, every pause feels off. A human expects an answer in under 1 second. An STT+LLM+TTS chain must stay below 2 seconds, ideally below 1.5 seconds. Deepgram is the fastest cloud STT at under 300 ms latency. ElevenLabs turbo-v2.5 delivers TTS under 400 ms. Whisper locally on well-equipped hardware (RTX 4090) hits about 800 ms – borderline for live telephony, fine for recordings.
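A latency budget makes these figures comparable. The sketch below sums per-stage latencies against the 1.5 s target and the 2 s limit. The STT and TTS values are the upper bounds quoted above; the LLM and network values are assumptions chosen purely for illustration and should be replaced with measurements.

```python
# Per-stage latencies in milliseconds.
# "llm" and "network" are illustrative assumptions, not measured values.
CLOUD_CHAIN_MS = {
    "stt_deepgram": 300,
    "llm": 600,
    "tts_elevenlabs_turbo": 400,
    "network": 100,
}
LOCAL_STT_CHAIN_MS = {
    "stt_whisper_local": 800,
    "llm": 600,
    "tts_elevenlabs_turbo": 400,
    "network": 100,
}

TARGET_MS = 1500      # "ideally below 1.5 seconds"
HARD_LIMIT_MS = 2000  # "must stay below 2 seconds"


def assess(chain: dict[str, int]) -> tuple[int, str]:
    """Return the end-to-end latency and a verdict against the budget."""
    total = sum(chain.values())
    if total <= TARGET_MS:
        verdict = "within target"
    elif total <= HARD_LIMIT_MS:
        verdict = "borderline"
    else:
        verdict = "too slow"
    return total, verdict


cloud_total, cloud_verdict = assess(CLOUD_CHAIN_MS)
local_total, local_verdict = assess(LOCAL_STT_CHAIN_MS)
```

Under these assumptions the cloud chain lands at 1400 ms and the local-Whisper chain at 1900 ms, which is why local STT counts as borderline for live calls unless another stage is made faster.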
Telephony: Twilio is the global standard for programmable telephony. Voice-AI platforms like Vapi and Retell are Twilio wrappers with a built-in STT+LLM+TTS chain. They cut setup effort to a few hours, in exchange for a per-minute markup of roughly 30-50 percent and US hosting.
Data residency: WhatsApp Business belongs to Meta, Telegram runs its own infrastructure outside the EU. Whoever processes client data via WhatsApp must use the Cloud API directly from Meta and sign a data-processing agreement. Telegram is generally not acceptable for Swiss professional-secrecy data.
The ten building blocks in detail
Whisper (STT): OpenAI model from 2022, improved multiple times since. The model weights are MIT-licensed; the hosted inference API costs USD 0.006/min. Runs locally via whisper.cpp (CPU-friendly, with optional GPU acceleration) or faster-whisper (CPU or GPU). In May 2026, large-v3 and the turbo variant are standard – turbo is about 8x faster at only marginally lower accuracy. The only system here with reliable Swiss German recognition.
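A local transcription call with faster-whisper might look like the following sketch. It assumes the `faster-whisper` package and a CUDA GPU are available. `transcribe_file` and `join_segments` are hypothetical helper names. Passing `language="de"` reflects the assumption that Whisper has no separate Swiss German language code and writes dialect input as Standard German text.

```python
from types import SimpleNamespace
from typing import Iterable, Protocol


class Segment(Protocol):
    text: str


def join_segments(segments: Iterable[Segment]) -> str:
    """Concatenate the text of transcription segments into one string."""
    parts = (s.text.strip() for s in segments)
    return " ".join(p for p in parts if p)


def transcribe_file(path: str, model_size: str = "large-v3") -> str:
    """Transcribe an audio file locally (requires faster-whisper and a GPU)."""
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # Assumption: German is forced so dialect audio is not misdetected.
    segments, _info = model.transcribe(path, language="de", beam_size=5)
    return join_segments(segments)


# Demo with stand-in segments, so the helper can be checked without a GPU.
demo_segments = [
    SimpleNamespace(text=" Guten Tag,"),
    SimpleNamespace(text=" ich brauche einen Termin. "),
]
demo_text = join_segments(demo_segments)
```

On a machine without a GPU, `device="cpu"` with `compute_type="int8"` is the usual fallback, at noticeably higher latency.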
Deepgram: proprietary STT API from the US. Best latency in the market (under 300 ms), USD 0.0043/min with volume discount. Very strong for English and High German, weak for Swiss German and dialects. No EU tier in May 2026 – data flows through US servers. First choice for English-language voice agents.
ElevenLabs (TTS): US vendor (with multi-region hosting). As of May 2026 the industry reference for natural voices. The Starter plan costs USD 5/month for 30 minutes of generated audio; higher plans add voice cloning. turbo-v2.5 delivers TTS under 400 ms latency, making real-time telephony viable. German voices are excellent; voice cloning is legally tricky.
Coqui / Piper (local TTS): open-source alternatives. Piper (MIT) is slimmer and runs on a Raspberry Pi 5. Coqui (MPL-2) brings more voice variants. Both are solid for DE-TTS but the voice quality stays behind ElevenLabs – good for internal tools, less for customer contact. Free and fully local – an important advantage for professional-secrecy data.
Twilio: global telephony standard. A CH landline minute costs about USD 0.0085, SMS about USD 0.075. Programmable via TwiML or Voice SDK. Stable, well documented, integrated everywhere. First choice for serious voice agents – but you must orchestrate the chain (STT+LLM+TTS) yourself.
Vapi: US voice-AI platform. As of May 2026, a wrapper around Twilio with a built-in STT+LLM+TTS chain. A voice agent can be set up in under an hour. The price is roughly USD 0.05/min plus the underlying costs (LLM, STT, TTS). US hosting makes it tricky for CH client data.
Retell AI: similar to Vapi, US vendor with focus on voice agents for sales and support. In May 2026 in the same league as Vapi, slightly different tooling depth. Both aim at "voice agent in 30 minutes" – good for prototypes, less for strict-compliance production.
Telegram / WhatsApp Business API: WhatsApp Cloud API direct from Meta is standard for business messaging – content is not end-to-end encrypted between bot and customer but lands at Meta. CH fiduciaries should sign a DPA with Meta and avoid sending client data via WhatsApp. Telegram is free, runs its own cloud, but compliance-wise unclear – not the first choice for professional bots.
Rasa: open-source chatbot framework from Berlin (now German-American). Pre-LLM architecture with intents, entities, and stories. As of May 2026 it is also LLM-capable via Rasa Pro, but the classic setup is relatively involved. Worth it if you already have a Rasa system or need strict rule-based flow – otherwise the concepts feel outdated in 2026.
Botpress: modern chatbot framework with LLM integration at its core. AGPL-3 for self-hosting, cloud variant with pay-as-you-go. In May 2026 a good middle ground between Rasa (too classic) and a raw LLM wrapper. Visual flow builder, integrations to WhatsApp/Telegram/Slack/SMS, multi-channel bots within days.
Selection workflow in 6 steps
- 01 Choose modality: phone (voice) vs chat (text). Phone needs STT+TTS+telephony, chat only bot logic.
- 02 Language requirement: Swiss German expected? If yes, deploy Whisper local for STT, drop Deepgram.
- 03 Clarify data residency: client data in CH/EU? If yes, drop Vapi/Retell, build a custom Twilio pipeline on Hetzner.
- 04 Measure latency budget: target under 1.5 s end-to-end. Whisper turbo instead of large-v3, ElevenLabs turbo-v2.5 instead of v2 for fastest reply.
- 05 Estimate volume: calls per day x minutes per call = monthly cost baseline. Below 30 calls/day, a voice agent is often unnecessary.
- 06 PoC on real client cases: one week in shadow mode, measured against human triage. Only go production after the comparison.
Recommendation by use-case
Phone voice agent for Swiss SMEs, Swiss German expected: Twilio + Whisper local (faster-whisper on GPU) + Claude/GPT + ElevenLabs turbo-v2.5. Local Whisper handles the Swiss German part, ElevenLabs delivers natural voice. Latency under 1.5 seconds achievable. Setup effort 5-10 days.
English-language voice agent without CH ties: Vapi or Retell. Fast start (one hour), standard pipeline, good voices. Worth it when no strict data residency is needed.
Client phone reception with IVR dispatch: Twilio directly with your own logic. Incoming call -> Whisper STT -> category classification via LLM -> route to the right department. No off-the-shelf voice platform but Express/Node logic in-house.
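The routing step can be sketched as a function that turns a transcript into TwiML. The text above suggests Express/Node for the in-house logic; Python is used here only to keep the examples in one language. The phone numbers, department names, and the keyword-based `classify` stand-in for the LLM call are all hypothetical. `<Response>`, `<Say>`, and `<Dial>` are standard TwiML verbs.

```python
import xml.etree.ElementTree as ET

# Hypothetical department numbers; replace with real extensions.
DEPARTMENTS = {
    "billing": "+41440000001",
    "claims": "+41440000002",
    "reception": "+41440000000",
}


def classify(transcript: str) -> str:
    """Stand-in for the LLM classification; keyword rules keep it runnable."""
    text = transcript.lower()
    if "invoice" in text or "rechnung" in text:
        return "billing"
    if "claim" in text or "schaden" in text:
        return "claims"
    return "reception"  # safe default: a human takes the call


def twiml_for(transcript: str) -> str:
    """Build a TwiML document that announces and dials the department."""
    department = classify(transcript)
    response = ET.Element("Response")
    ET.SubElement(response, "Say").text = f"Connecting you to {department}."
    ET.SubElement(response, "Dial").text = DEPARTMENTS[department]
    return ET.tostring(response, encoding="unicode")


twiml = twiml_for("Ich habe eine Frage zur Rechnung")
```

Falling back to reception for anything unclassified keeps misroutes cheap: the caller reaches a person instead of the wrong department.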
WhatsApp bot for appointments, intake, status queries: WhatsApp Business Cloud API direct from Meta + Botpress as bot logic + multi-LLM gateway behind. DPA with Meta required, keep client data separated.
Pure voice-notes app for lawyers, local processing: Whisper local (large-v3 or turbo) on a workstation. Recordings stay on premise, no cloud upload. Coqui/Piper if TTS reply is needed.
Telegram bot for internal tooling or hobby projects: Telegram Bot API directly, no Botpress needed. Free, fast iteration, not suitable for client data.
Rule-based chatbot without LLM (e.g. FAQ bot without hallucination risk): Rasa or Botpress with the LLM layer disabled. Controllable, predictable, audit-safe. Rarely the first reflex in May 2026, but still relevant for regulated industries.
When these tools are wrong
If you expect fewer than 5 calls per day, a voice agent is usually the wrong investment – a human-staffed phone plus email triage is set up faster and cheaper. Voice agents pay off from about 30-50 calls per day, where staff effort becomes noticeable.
Deepgram is the wrong choice when Swiss German or dialectal German is expected – recognition drops sharply. Also when data must stay in the EU, Deepgram is not first choice in May 2026.
ElevenLabs voice cloning is legally tricky: cloning a real voice without consent violates personality rights and can become problematic under new EU AI Act rules. Pure synthetic voices (e.g. ElevenLabs in-house stock voices) carry no such conflict.
Vapi and Retell are the wrong choice for production systems with strict data residency – both are US-hosted without clear EU options in May 2026. Also limited for complex multi-step voice flows with tool use (e.g. book appointment + send mail + update DB); there a custom Twilio build pays off.
Rasa is the wrong choice for new projects in 2026 – the classic intent-and-stories architecture is barely competitive with an LLM-based bot in build speed and flexibility. Only justified when Rasa is already running or when the deterministic rule flow is strictly required.
WhatsApp Business Cloud API is the wrong choice when you are not willing to sign a data-processing agreement with Meta and separate client data. For professional-secrecy content (lawyer, doctor, fiduciary), WhatsApp is only suitable for organisational messages – case content does not belong in the WhatsApp channel.
Trade-offs
STRENGTHS
- Whisper local: the only practical solution for Swiss German, MIT-licensed models with no license fee
- ElevenLabs turbo-v2.5: most natural voices on the market, under 400 ms latency
- Twilio: global telephony standard, stable, well documented
- Botpress: modern chatbot builder with LLM integration, visual flow
- Vapi/Retell: voice agent set up in an hour for English without compliance pressure
WEAKNESSES
- Deepgram: weak at Swiss German, no EU tier in May 2026
- ElevenLabs: US hosting, voice cloning legally tricky
- Vapi/Retell: US-hosted, difficult for CH professional-secrecy data
- Rasa: pre-LLM classic, barely competitive for new builds in 2026
- WhatsApp: data at Meta, only suitable for organisational messages
FAQ
Does Whisper really understand Swiss German well?
Surprisingly well, yes. large-v3 reliably recognises mild and moderate dialects (Zurich, Bern, Basel). Very strong dialects (Wallis, deep Bernese Oberland) stay difficult. High German recordings are virtually flawless. We have several CH clients running Whisper in production – as of May 2026 it is the only practical solution for dialectal voice input.
What does a voice agent cost per call?
For a 3-minute average, around USD 0.10-0.25 per call in the standard cloud chain: Twilio USD 0.025, Whisper API USD 0.018, LLM (Claude Haiku or GPT-4o-mini) USD 0.02-0.05, ElevenLabs turbo USD 0.05-0.15. With Vapi or Retell add 30-50 percent. Local Whisper drops STT cost to zero but needs a GPU machine for roughly CHF 80-150/month.
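The arithmetic behind that range, as a small sketch. It uses the per-minute Twilio and Whisper API prices listed earlier and the per-call LLM and TTS ranges from this answer; the constant and function names are illustrative.

```python
MINUTES_PER_CALL = 3
TWILIO_PER_MIN = 0.0085       # CH landline minute
WHISPER_API_PER_MIN = 0.006
LLM_PER_CALL = (0.02, 0.05)   # Claude Haiku or GPT-4o-mini
TTS_PER_CALL = (0.05, 0.15)   # ElevenLabs turbo
PLATFORM_MARKUP = (0.30, 0.50)  # Vapi or Retell, as quoted above


def cost_range(local_whisper: bool = False) -> tuple[float, float]:
    """Return (low, high) cost in USD for one call in the cloud chain."""
    stt_per_min = 0.0 if local_whisper else WHISPER_API_PER_MIN
    per_minute = MINUTES_PER_CALL * (TWILIO_PER_MIN + stt_per_min)
    low = per_minute + LLM_PER_CALL[0] + TTS_PER_CALL[0]
    high = per_minute + LLM_PER_CALL[1] + TTS_PER_CALL[1]
    return low, high


low, high = cost_range()
platform_low = low * (1 + PLATFORM_MARKUP[0])
platform_high = high * (1 + PLATFORM_MARKUP[1])
local_low, local_high = cost_range(local_whisper=True)
```

This gives roughly USD 0.11 to 0.24 per call for the self-built chain and about USD 0.15 to 0.37 with a platform markup. Local Whisper removes USD 0.018 per call, which has to be weighed against the fixed monthly cost of the GPU machine.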
Do I still need Rasa?
Rarely. By May 2026 Rasa is barely competitive for new projects – an LLM-based bot with a clear system prompt is built in a day, while Rasa with intents, stories, and NLU training takes weeks. Rasa has niche value when deterministic answers are mandatory and hallucinations must be absolutely excluded – e.g. regulated FAQ bots in finance. Otherwise: Botpress or a direct LLM wrapper.
May I use WhatsApp for client contact?
Only for organisational messages (appointment confirmation, reminder, general question). Case content does not belong in WhatsApp – messages are not end-to-end encrypted between bot and client but flow through Meta servers. Required: DPA with Meta, note in the mandate contract, clear separation "WhatsApp = organisation, email/portal = content". Professional secrecy (Art. 321 SCC) demands this.
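The separation rule can be enforced in code with a default-deny allowlist. The sketch below is one possible shape; the message type names and channel labels are hypothetical and would come from your own bot logic.

```python
# Hypothetical message types that count as purely organisational.
ORGANISATIONAL_TYPES = frozenset(
    {"appointment_confirmation", "reminder", "general_question"}
)


def channel_for(message_type: str) -> str:
    """Pick the outbound channel; anything not allowlisted avoids WhatsApp."""
    if message_type in ORGANISATIONAL_TYPES:
        return "whatsapp"
    # Default-deny: case content and unknown types go to email or the portal.
    return "email_or_portal"


reminder_channel = channel_for("reminder")
case_channel = channel_for("case_document")
```

Because unknown types fall through to email or the portal, a new message type added later cannot leak into WhatsApp by accident.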
Sources
- OpenAI Whisper – large-v3 and turbo model card · 2026-04
- Deepgram – Pricing and latency benchmarks · 2026-04
- ElevenLabs – turbo-v2.5 announcement and pricing · 2026-05
- Twilio Programmable Voice – pricing CH/EU · 2026-04
- Vapi documentation – voice AI platform · 2026-05
- Botpress – open-source chatbot framework · 2026-04
- WhatsApp Business Cloud API – overview · 2026-03