VOICE · SERVICE

Voice agent on the phone: AI that calls and is called

Phone agent with Whisper STT, LLM, and ElevenLabs/Cartesia TTS. Call answering, appointment booking, pre-qualification. Latency budget under 800 ms. Flat fee CHF 3,500.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is a voice agent?

A voice agent is a conversational partner on the phone: it understands, thinks, and answers. Three components work together: speech-to-text (STT) converts speech to text, a language model handles the request, and text-to-speech (TTS) returns the answer as a natural voice. Telephony itself runs over classic SIP/PSTN providers like Twilio or Vonage, which since early 2024 offer programmable voice-AI API endpoints with native streaming-audio support.

As of May 2026 the tech is production-ready for well-bounded applications. Voices are no longer robotic – ElevenLabs Flash and Cartesia Sonic deliver human-sounding speech with first-audio-frame latency under 250 ms. Whisper Large v3 transcribes Swiss German into High-German form with high accuracy; deep dialects (Bärndütsch, Walliser) suffer in quality, but standard speech is clean.

As a service from us: you name the use case (out-of-hours call answering, pre-qualification at an insurer, appointment booking in a practice), we build the voice agent in 3–4 weeks, test it with sample calls, and connect it to a dedicated number or your existing one. Flat fee: CHF 3,500.

Why it matters

The phone is not dead in 2026 – in many industries it is the first contact. A fiduciary practice gets calls, a doctor gets calls, a tradesperson certainly does. And phone calls have three properties that make them expensive. First, they are synchronous: whoever does not pick up loses the call. Second, they tie up a person: whoever picks up cannot do anything else during that time. Third, they are unevenly distributed: peaks often fall during the workday, when staff are already overloaded.

A voice agent does not solve the phone problem entirely – it handles the peaks. Out-of-hours calls are taken (instead of lost), standard matters (appointment, status, first information) are handled directly (instead of callback ping-pong), complex cases are pre-qualified and routed to the right person (instead of being transferred four times).

Numbers from real implementations 2025–2026: 30–50 % of inbound calls in a fiduciary practice can be fully closed by the voice agent. Another 30 % are prepared so that the human callback takes half the time. The remaining 20–40 % go straight to humans – faster than today, because the bot already captured name, client number and matter.

The latency budget is decisive for acceptance. If reply latency exceeds one second, the conversation feels unnatural. Our target budget: under 800 ms from end of caller speech to start of bot reply. Streaming STT plus streaming TTS plus a fast model (GPT-4o or Claude-Haiku) make this reliably achievable as of May 2026.

How we build it

The voice agent has five stations: telephony, STT, orchestration, LLM, TTS. Each is swappable – we recommend two stable options per station.

Telephony: Twilio or Vonage. Both offer Programmable Voice with media streams: the call is routed to our server as a bidirectional WebSocket audio stream. You keep your Swiss number (porting) or get a new one from us.
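As a rough illustration of the media-stream routing described above, the following sketch builds the TwiML that asks Twilio to connect a call to a bidirectional WebSocket stream via `<Connect><Stream>`. The function name and the server address are hypothetical, and the check for a `wss://` scheme reflects our understanding that Twilio expects a secure WebSocket URL.

```python
import xml.etree.ElementTree as ET


def media_stream_twiml(websocket_url: str) -> str:
    """Build TwiML that connects the call to a bidirectional media stream."""
    # Assumption: the stream endpoint must be a secure WebSocket URL.
    if not websocket_url.startswith("wss://"):
        raise ValueError("media stream URL should use the wss:// scheme")
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": websocket_url})
    return ET.tostring(response, encoding="unicode")


# Hypothetical server address, for illustration only.
twiml = media_stream_twiml("wss://voice.example.com/media")
print(twiml)
```

The webhook configured on the phone number would return this document when a call comes in; everything after that happens on the WebSocket.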

STT: Deepgram Nova-2 as default (cloud, about 200 ms latency, good DE quality) or Whisper Large v3 locally via faster-whisper on GPU for privacy-sensitive cases. Streaming mode always on: partial transcripts arrive roughly every 100 ms, not only after the sentence ends.

Orchestration: n8n or a lean Python asyncio server (LiveKit Agents, Pipecat). Here lives the conversation state machine: what has been asked, what still needs clarifying, when to hand over to a human. The system holds no long context – one defined script with branches per call.
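The conversation state machine can be pictured roughly as follows. This is a minimal sketch, not the production script: the states, the event names ("ok", "unclear", "human") and the transition table are hypothetical and stand in for whatever the use-case workshop defines.

```python
from enum import Enum


class State(Enum):
    GREETING = "greeting"
    ASK_NAME = "ask_name"
    ASK_MATTER = "ask_matter"
    CONFIRM = "confirm"
    HANDOVER = "handover"
    DONE = "done"


# Hypothetical script for one call type: each state maps an event to the next state.
TRANSITIONS = {
    State.GREETING: {"ok": State.ASK_NAME},
    State.ASK_NAME: {"ok": State.ASK_MATTER, "unclear": State.ASK_NAME},
    State.ASK_MATTER: {"ok": State.CONFIRM, "unclear": State.ASK_MATTER},
    State.CONFIRM: {"ok": State.DONE, "unclear": State.ASK_MATTER},
}


def next_state(state: State, event: str) -> State:
    """Return the next state; 'human' or any unknown event hands over to a person."""
    if event == "human":
        return State.HANDOVER
    return TRANSITIONS.get(state, {}).get(event, State.HANDOVER)


def run(events) -> State:
    """Replay a sequence of events and return the state the call ends in."""
    state = State.GREETING
    for event in events:
        state = next_state(state, event)
        if state in (State.HANDOVER, State.DONE):
            break
    return state


print(run(["ok", "ok", "ok", "ok"]))      # State.DONE
print(run(["ok", "unclear", "human"]))    # State.HANDOVER
```

Defaulting unknown events to a handover keeps the script bounded: anything the table does not anticipate ends with a person rather than with an improvised answer.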

LLM: Through the LiteLLM gateway. For pure routing decisions Claude-Haiku or GPT-4o-mini (50–100 ms), for more complex cases (light advice, client FAQ via RAG) GPT-4o or Claude-Sonnet (200–400 ms). The choice is set per node.

TTS: ElevenLabs Flash v2.5 or Cartesia Sonic. Both deliver first audio bytes in under 250 ms and stream on while the language model is still generating. The voice is chosen once (May 2026: about 30 German voices with good Swiss timbre available) and stays constant.
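To see how the per-station figures relate to the 800 ms target, here is a back-of-the-envelope check. It is a deliberate simplification: it adds the upper-bound figures quoted above serially and ignores network hops, whereas in practice streaming lets the stations overlap. The tier names are hypothetical labels.

```python
BUDGET_MS = 800  # target: under 800 ms from end of caller speech to start of reply

# Figures quoted in the text (upper bounds where a range is given).
STT_MS = 200                 # Deepgram Nova-2
TTS_FIRST_AUDIO_MS = 250     # ElevenLabs Flash v2.5 / Cartesia Sonic
LLM_MS = {"routing": 100, "complex": 400}  # hypothetical tier labels


def reply_latency_ms(llm_tier: str) -> int:
    """Naive serial sum of STT, LLM and TTS first-audio latency."""
    return STT_MS + LLM_MS[llm_tier] + TTS_FIRST_AUDIO_MS


def within_budget(llm_tier: str) -> bool:
    return reply_latency_ms(llm_tier) < BUDGET_MS


for tier in LLM_MS:
    status = "within budget" if within_budget(tier) else "over budget"
    print(f"{tier}: {reply_latency_ms(tier)} ms ({status})")
```

Under this worst-case sum the routing tier lands at 550 ms, while the larger models at their upper bound reach 850 ms, which is why the slower model is reserved for the nodes that need it.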

Escalation and handover: as soon as the state-machine branch says "human" – be it low confidence, sensitive topic, or caller request – the call is forwarded smoothly to a configured number. Twilio can do this as blind or warm transfer. We recommend warm: the human receives a short summary before being connected.
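The three escalation triggers and the warm-transfer summary could be expressed along these lines. This is a sketch under assumptions: the `Turn` structure, the confidence score and its threshold of 0.6 are hypothetical, and the summary wording is only an example.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.6  # assumed value; tune against real transcripts


@dataclass
class Turn:
    confidence: float            # hypothetical score from 0.0 to 1.0
    sensitive_topic: bool
    caller_asked_for_human: bool


def escalation_reason(turn: Turn) -> Optional[str]:
    """Return why the call should go to a human, or None to let the bot continue."""
    if turn.caller_asked_for_human:
        return "caller request"
    if turn.sensitive_topic:
        return "sensitive topic"
    if turn.confidence < CONFIDENCE_THRESHOLD:
        return "low confidence"
    return None


def warm_transfer_summary(name: str, client_number: str, matter: str, reason: str) -> str:
    """Short briefing for the human before the caller is connected."""
    return (
        f"Caller {name}, client {client_number}. "
        f"Matter: {matter}. Handover reason: {reason}."
    )


turn = Turn(confidence=0.3, sensitive_topic=False, caller_asked_for_human=False)
reason = escalation_reason(turn)
if reason is not None:
    print(warm_transfer_summary("A. Muster", "4711", "tax deadline", reason))
```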

Everything is transcribed and logged in Postgres – seven days raw for quality review, then pseudonymised. Telegram alert on failures, latency spikes, or escalation share above threshold.
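The seven-day retention rule can be sketched as below. How pseudonymisation is actually performed is not specified here; a keyed hash (HMAC-SHA256) over the caller number is one common approach and is used purely as an assumption. Function names and the 16-character truncation are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=7)  # raw transcripts are kept this long for quality review


def is_due_for_pseudonymisation(created_at: datetime, now: datetime) -> bool:
    """True once a record is older than the raw retention window."""
    return now - created_at > RAW_RETENTION


def pseudonymise_number(phone_number: str, secret_key: bytes) -> str:
    """Replace a caller number with a stable keyed hash (assumed scheme)."""
    digest = hmac.new(secret_key, phone_number.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


now = datetime(2026, 5, 10, tzinfo=timezone.utc)
created = now - timedelta(days=8)
if is_due_for_pseudonymisation(created, now):
    # Hypothetical key and number, for illustration only.
    print(pseudonymise_number("+41441234567", b"replace-with-a-real-secret"))
```

A keyed hash keeps repeat callers linkable for statistics without storing the number itself, as long as the key is kept separate from the database.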

From use case to go-live

01 Use-case workshop (half a day): list of matters, draw state machine, define escalation points, write greeting and farewell.
02 Telephony setup: Twilio or Vonage account, number (new or port your Swiss number), media-stream TwiML pointed at our server.
03 STT/TTS choice: Deepgram or Whisper, ElevenLabs Flash or Cartesia Sonic. Pick the voice from test samples, record probe texts.
04 Build orchestration: n8n or Pipecat server with state machine, LLM nodes via LiteLLM gateway, CRM and calendar integration, warm-transfer logic.
05 Latency tuning: end-to-end measurement, streaming at every station, fast model for routing, slower only when needed. Target under 800 ms.
06 Test operation 3 weeks: 30–50 probe calls from us plus real calls in a soft launch. Walk through transcripts, fold in edge cases.
07 Go-live: flip the switch, inform management, 30-day guarantee active. 90 minutes training for 2–3 internal staff.

When to use

A voice agent is the right choice when (a) inbound calls arrive at high frequency on a manageable set of topics, (b) the matter can be clarified in 60–120 seconds, and (c) you need to be reachable outside business hours.

Concrete use cases: out-of-hours call answering at a fiduciary practice (bot captures matter, sends email summary to the responsible person, calls back automatically the next business day with an appointment proposal); appointment booking at a medical practice (bot asks name, date of birth, category, checks availability, books, sends SMS confirmation); pre-qualification at an insurer (bot takes claim first-notice, enriches in CRM, routes to the right claims unit); first-line information at an SME (bot answers standard questions like opening hours, address, order status; everything else is pre-qualified and forwarded).

The flat CHF 3,500 covers: one use case, setup on Twilio or Vonage, LiteLLM integration, Whisper/Deepgram STT, ElevenLabs/Cartesia TTS, n8n orchestration, integration with CRM or calendar system, 3 weeks test operation with logs, 90 minutes training. Multilingualism (DE/FR/IT/EN) is an add-on module.

When not to use

A voice agent is the wrong choice when conversations are emotional or require explanation. Bereavement calls at an undertaker, crisis intervention, first oncology consultations – here a bot annoys more than helps, even with a natural voice.

It is also wrong at very low volumes. If your practice gets 5 calls a day, setup is not worth it – the staffer spends about 30 minutes for 5 calls, and the bot imposes a logic designed for 50.

Be careful with deep Swiss German. Whisper Large v3 handles standard High German and light Swiss tinting well but reaches limits with deep Bärndütsch or Valais dialect. Anyone working in a region where callers speak deep dialect must test with real samples or use a greeting that nudges toward standard German.

The voice agent is not suited to regulated advice. Investment advice on the phone falls under the FinIA/FinSA (FINIG/FIDLEG) rules and cannot be delivered by a bot; only a pre-qualification mode without advisory character is possible. The same applies to medical professions and legal advice. We deliberately position the bot on information and triage, not advice – and say so in the greeting.

Trade-offs

STRENGTHS

Reachability outside business hours without staff
Latency under 800 ms – the conversation does not feel artificial
Escalation to humans with context (name, client, matter already captured)
Audit trail via transcripts – what the bot said is verifiable

WEAKNESSES

STT quality drops on deep Swiss German – pre-test recommended
Emotional conversations are unsuitable for bots – keep scope tight
Ongoing cloud cost EUR 50–90 per month at moderate volume
EU AI Act Art. 50 requires transparency: callers must know it is a bot

FAQ

What does a call cost ongoing?

Twilio telephony in CH: about EUR 0.03 per minute inbound. Deepgram STT: EUR 0.004 per minute. LLM (Claude-Haiku mix): EUR 0.005 per call. ElevenLabs Flash TTS: EUR 0.06 per 1,000 characters – typical call ~500 characters output, so EUR 0.03. Total per 2-minute call: EUR 0.10–0.18. At 500 calls/month that is EUR 50–90 in running cloud cost.
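The per-call arithmetic can be reproduced with a few lines. The sketch uses only the unit prices listed above; the constant and function names are hypothetical. With exactly these inputs it lands at the lower end of the quoted range, so the upper end presumably reflects longer replies or a heavier model mix, which the listed figures alone do not pin down.

```python
CALL_MINUTES = 2
TELEPHONY_PER_MIN = 0.03      # EUR, Twilio inbound in CH
STT_PER_MIN = 0.004           # EUR, Deepgram
LLM_PER_CALL = 0.005          # EUR, Claude-Haiku mix
TTS_PER_1000_CHARS = 0.06     # EUR, ElevenLabs Flash
TTS_CHARS_PER_CALL = 500      # typical output per call


def cost_per_call(minutes: float = CALL_MINUTES, tts_chars: int = TTS_CHARS_PER_CALL) -> float:
    """Sum telephony, STT, LLM and TTS cost for one call, in EUR."""
    telephony = TELEPHONY_PER_MIN * minutes
    stt = STT_PER_MIN * minutes
    tts = TTS_PER_1000_CHARS * tts_chars / 1000
    return telephony + stt + LLM_PER_CALL + tts


per_call = cost_per_call()
print(f"per call: EUR {per_call:.3f}")
print(f"500 calls/month: EUR {per_call * 500:.2f}")
```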

How does the voice sound – like a robot?

No, not anymore. ElevenLabs Flash v2.5 and Cartesia Sonic produce natural-sounding German voices with intonation and pauses. We let you pick from 20–30 sample voices. Listeners often cannot immediately tell whether it is human or bot – we still recommend declaring it clearly in the greeting: "You are speaking with an automated assistant." Beyond fairness, this is mandated by Art. 50 EU AI Act (transparency obligations, applicable from August 2026).

What happens with deep dialect?

Whisper recognises deep Swiss German unreliably. Three strategies: (1) The greeting explicitly nudges to High German ("Guten Tag, ich bin der digitale Assistent der Praxis X. Bitte sprechen Sie etwas langsamer und in Hochdeutsch, falls möglich", in English: "Hello, I am the digital assistant of practice X. Please speak a little more slowly and in High German if possible"). (2) On repeated STT failure the bot escalates to a human – no spiral of misunderstandings. (3) In regions with high dialect frequency we recommend a voice agent only after a 1-week recording phase in which we transcribe real calls and decide whether the use case holds.
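Strategy (2) amounts to counting consecutive failed recognitions. A minimal sketch, assuming the STT layer reports a transcript plus a confidence score; the class name, the limit of two failures and the 0.5 confidence floor are hypothetical defaults, not measured values.

```python
class SttFailureGuard:
    """Hand over to a human after several failed recognitions in a row."""

    def __init__(self, max_consecutive_failures: int = 2, min_confidence: float = 0.5):
        self.max_consecutive_failures = max_consecutive_failures
        self.min_confidence = min_confidence
        self.consecutive_failures = 0

    def observe(self, transcript: str, confidence: float) -> bool:
        """Record one caller turn; return True when the call should be escalated."""
        failed = not transcript.strip() or confidence < self.min_confidence
        # A successful turn resets the counter, so only streaks of failures escalate.
        self.consecutive_failures = self.consecutive_failures + 1 if failed else 0
        return self.consecutive_failures >= self.max_consecutive_failures


guard = SttFailureGuard()
print(guard.observe("", 0.0))         # first failure: keep going
print(guard.observe("hm", 0.2))       # second failure in a row: escalate
```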

Who is liable if the bot says something wrong?

The bot operator – that is you. That is why scope should stay narrow: information and triage, no advice. We build the answers so that the bot explicitly refers the caller to a human when it is uncertain, and we transcribe everything it says. That gives you an audit trail. In FINMA, legal-profession or medical-profession contexts we recommend legal review of the greeting text upfront.

Sources

OpenAI – Whisper documentation (Large v3, multilingual) · 2026-04
ElevenLabs – Pricing & Flash v2.5 model (May 2026) · 2026-05
Cartesia – Sonic streaming TTS model · 2026-04
Twilio – Programmable Voice & Media Streams · 2026-03

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.
