DEEPGRAM · TECH

Deepgram: proprietary STT API with the lowest latency in the market

Deepgram offers speech-to-text as a US cloud API: Nova-2 costs USD 0.0043/min and delivers sub-300 ms streaming latency. Strong on English, weak on Swiss German, no EU tier as of May 2026.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Deepgram?

Deepgram is a speech-AI platform founded in San Francisco in 2015, offering speech recognition as a cloud API. The company is proprietary, has raised several funding rounds (Series C in June 2024 for USD 72M, valuation USD 1B+) and specialises in low latency. The current main model is Nova-2 (in GA as of May 2026); the premium model Nova-3 is rolling out. Both run only in the Deepgram cloud – there is no downloadable model and no open-source counterpart.

The unique selling point is speed. Nova-2 transcribes pre-recorded audio in under a third of its duration (e.g. a 30-second clip in 8-10 seconds) and returns streaming transcripts with sub-300 ms latency, measured from the last spoken word to transcript output. That makes Deepgram the fastest cloud STT in May 2026 – faster than the Whisper API (2-5 seconds) and faster than Google STT (around 800 ms).

Pricing is usage-based. Nova-2 costs USD 0.0043/min for pre-recorded and USD 0.0058/min for streaming. Volume discounts start at USD 1000/month. The free tier is a USD 200 starting credit, with no credit card needed. Extras like diarisation, smart format (date recognition, phone number formatting), topic detection, summarisation and redaction (auto-blanking of PII) are query-parameter toggles and usually carry a 10-30 percent surcharge.

As of May 2026, hosting is exclusively US cloud (AWS us-east-1 as primary region). No EU region, no Frankfurt, no Dublin. For GDPR-bound Swiss data this is a clear no-go without explicit client consent. The offered "on-prem" tier (Deepgram running in your own infrastructure) starts at six-figure annual contracts and is unrealistic for SMEs.

Why it matters

For voice agents, latency is the deciding experience factor. A human expects a phone reply in under one second. An STT+LLM+TTS chain must split that second – typical split: 300 ms STT, 400 ms LLM, 300 ms TTS. The Whisper API with 2-5 seconds of STT latency drops out of this maths. Deepgram with 250-300 ms keeps the whole chain within the one-second budget.
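The budget arithmetic can be written down as a minimal sketch. The function names are hypothetical, and the figures are the ones quoted above rather than measurements.

```python
# Latency budget for one conversational turn, using the figures from the text.
BUDGET_MS = 1000  # a phone reply is expected within about one second


def turn_latency_ms(stt_ms: int, llm_ms: int, tts_ms: int) -> int:
    """Time from the end of the caller's speech to the start of the spoken reply."""
    return stt_ms + llm_ms + tts_ms


def fits_budget(stt_ms: int, llm_ms: int, tts_ms: int, budget_ms: int = BUDGET_MS) -> bool:
    """True if the three stages together stay within the budget."""
    return turn_latency_ms(stt_ms, llm_ms, tts_ms) <= budget_ms


print(fits_budget(300, 400, 300))   # typical split with a 300 ms STT stage
print(fits_budget(2000, 400, 300))  # STT stage at 2 s, the Whisper API best case
```

With a 2-second STT stage the chain needs 2700 ms, so no amount of tuning in the LLM or TTS stage can bring it back under one second.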

For voice bots speaking predominantly English – international customer-service hotlines, SaaS onboarding bots, sales cold outreach – Deepgram is the technically best choice in May 2026. Word error rate for US English below 5 percent, British English under 7 percent, Indian English under 12 percent. Most competitors (including Whisper) hit comparable numbers, but Deepgram is faster.

For Swiss applications the picture flips. Swiss German fails on Deepgram – WER sits at 60-80 percent, the model was not trained on Alemannic data. High German works (about 8-12 percent WER), Swiss German does not. Anyone building a voice agent for a Zurich fiduciary must use Whisper (local) instead of Deepgram.

The data argument is equally critical. Deepgram runs only on US AWS. Every request physically goes to Virginia. For client calls at a CH fiduciary or law office this is not legal without explicit consent. The Deepgram trust center sheet covers SOC-2 and ISO-27001, but no Swiss data-protection certification. A BDPA with Deepgram is signable, but the revDSG demands a transfer-impact assessment (TIA) for US transfers.

How it works

Deepgram exposes two interfaces: pre-recorded (REST POST /v1/listen) and streaming (WebSocket wss://api.deepgram.com/v1/listen). Both use the same model, but streaming sends partial transcripts every 100-300 ms, with the final transcript only after a pause or explicit close.
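A minimal sketch of how a client might separate partial from final transcripts. It assumes each streaming message is JSON with an `is_final` flag and the text under `channel.alternatives[0].transcript`; verify the field names against the current API reference. The class name is hypothetical.

```python
import json


class TranscriptAssembler:
    """Keeps the latest partial transcript and commits segments marked final."""

    def __init__(self) -> None:
        self.final_segments: list = []
        self.partial = ""

    def feed(self, raw_message: str) -> None:
        msg = json.loads(raw_message)
        # Assumed message shape: {"is_final": bool, "channel": {"alternatives": [...]}}
        alternatives = msg.get("channel", {}).get("alternatives", [])
        if not alternatives:
            return  # no transcript payload in this message
        text = alternatives[0].get("transcript", "")
        if msg.get("is_final"):
            if text:
                self.final_segments.append(text)
            self.partial = ""  # the partial is superseded by the final segment
        else:
            self.partial = text  # each partial replaces the previous one

    def text(self) -> str:
        """Committed segments plus the current partial, if any."""
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

A voice agent would typically show or log `text()` continuously but only pass committed segments to the LLM, since partials can still change.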

Example pre-recorded cURL:

curl --request POST \
  --url "https://api.deepgram.com/v1/listen?model=nova-2&language=en&smart_format=true&diarize=true" \
  --header "Authorization: Token $DEEPGRAM_API_KEY" \
  --header "Content-Type: audio/mp3" \
  --data-binary @audio.mp3

The response is JSON with word timestamps, confidence, speaker labels, a punctuated_word list for smart format and, where enabled, topic detection. Streaming works similarly over WebSocket: audio chunks are sent as binary frames, and transcripts are received as JSON.
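A short sketch of turning such a response into speaker turns. It assumes the word list sits under `results.channels[0].alternatives[0].words` and that each word carries `word`, `punctuated_word` and `speaker` keys. The function name is hypothetical, and the layout should be checked against a real response.

```python
def speaker_turns(response: dict) -> list:
    """Group consecutive words by speaker into (speaker, text) tuples."""
    # Assumed path to the word list in a pre-recorded response.
    words = response["results"]["channels"][0]["alternatives"][0]["words"]
    turns = []
    for w in words:
        speaker = w.get("speaker", 0)  # absent when diarisation is off
        token = w.get("punctuated_word", w["word"])  # prefer smart-format output
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(token)
        else:
            turns.append((speaker, [token]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]
```

The result is a readable dialogue transcript, for example for call-centre QA review.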

For Twilio integration: Twilio Media Streams provides a WebSocket stream with u-law PCM audio. A Node server takes the Twilio stream, forwards it to Deepgram, receives transcripts, sends them to the LLM, which generates a reply, converts via TTS to audio, and sends back to Twilio. Deepgram offers ready-made voice-agent SDKs in JavaScript and Python that pre-build this loop.
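The first hop of that loop, sketched without any network code. It assumes Twilio sends JSON text messages with an `event` field and base64 audio under `media.payload`, and that Deepgram accepts `encoding=mulaw` and `sample_rate=8000` as streaming query parameters. Both assumptions and the function names are illustrative and should be checked against the vendors' documentation.

```python
import base64
import json
from typing import Optional
from urllib.parse import urlencode


def deepgram_stream_url(model: str = "nova-2-phonecall") -> str:
    """Streaming URL for 8 kHz u-law telephony audio (parameter names assumed)."""
    params = {"model": model, "encoding": "mulaw", "sample_rate": 8000, "channels": 1}
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)


def audio_from_twilio(raw_message: str) -> Optional[bytes]:
    """Return raw u-law bytes from a Twilio 'media' message, else None."""
    msg = json.loads(raw_message)
    if msg.get("event") != "media":
        return None  # e.g. start/stop events carry no audio
    return base64.b64decode(msg["media"]["payload"])
```

A relay server would call `audio_from_twilio` on every incoming Twilio frame and forward non-empty results as binary frames to the socket opened with `deepgram_stream_url()`.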

The Nova-2 family covers specialised variants: nova-2-general (standard), nova-2-meeting (conference audio), nova-2-phonecall (telephonic compressed 8 kHz audio), nova-2-medical (clinical terms), nova-2-finance (stock and bank jargon). The phonecall variant is the right choice for voice agents over Twilio – high-frequency content is compressed away and general models do not handle that well.

Extra endpoints: /v1/auth for token refresh, /v1/projects for multi-tenant setups, /v1/keys for sub-keys with limits, /v1/usage for consumption data. Deepgram also offers Aura (TTS) and Eve (voice-agent wrapper) as separate products – both less mature in May 2026 than ElevenLabs (TTS) or Vapi (voice wrapper).

Deepgram setup in 5 steps

01 Sign up at deepgram.com, get USD 200 starting credit, generate first API key. Review BDPA and trust-center docs and write a transfer-impact assessment.
02 Choose a model variant: nova-2-general for standard, nova-2-phonecall for Twilio integration (8 kHz telephony audio), nova-2-meeting for conference recordings.
03 Pre-recorded pilot: process 20-30 real audio samples from the target domain via REST API, measure WER and answer quality. Toggle smart format, diarisation and redaction as needed.
04 Streaming integration: implement WebSocket client or use official SDK (Node, Python, Go). Wire Twilio Media Streams as audio source, route transcript stream to the LLM.
05 Monitoring and cost: poll /v1/usage regularly, set budget alarms (e.g. USD 100/day limit), latency metrics in Grafana, sub-keys per client or application for cost separation.
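For the pilot in step 03, WER can be measured with a few lines of code. The sketch below uses the standard word-level edit distance (substitutions, deletions and insertions divided by the number of reference words). Lower-casing is the only normalisation applied, which is a simplification; punctuation should be stripped from both strings beforehand.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
```

Averaging this over the 20-30 pilot samples, ideally weighted by reference length, gives a figure that can be compared with the WER ranges quoted in this article.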

When to use Deepgram

Deepgram is the right choice when (a) audio is predominantly English, (b) latency under 400 ms is required and (c) US cloud hosting is acceptable.

Concrete cases: a SaaS vendor builds an English-language voice agent for demo bookings – Deepgram streaming delivers sub-second latency. A call centre analyses calls after the fact for QA and topic trends – Deepgram pre-recorded with smart format and diarisation. An international podcast vendor transcribes hundreds of episodes for SEO and search – Deepgram pre-recorded with topic detection.

Deepgram also works for multilingual setups that include High German, as long as Swiss German is not in play: for example, a DE-AT helpdesk line without CH clients.

When not to use

When Swiss German must be understood, Deepgram is the wrong choice in May 2026. WER of 60-80 percent is unusable – Whisper local is 3-4x better.

When client data (fiduciary, legal, medical) is in play with no explicit consent: Deepgram as US cloud is data-protection-sensitive. A transfer-impact assessment is mandatory and the result for client audio is usually negative.

When latency is not critical (batch transcription, recordings, asynchronous analysis), the OpenAI Whisper API is in a similar price range (USD 0.006/min vs. USD 0.0043/min for Deepgram) and delivers comparable English quality. Whisper has the extra advantage that the model is open and runs locally – no lock-in if you switch providers.

For self-host requirements Deepgram is not practical – the on-prem tier is six figures per year and reserved for enterprise. SMEs should use Whisper locally.

For multilingual auto-detection across > 5 languages, Whisper is stronger – Deepgram covers 30+ languages but auto-detect on code-switching is less robust.

Trade-offs

STRENGTHS

Lowest latency in the market – sub-300 ms streaming for real-time voice agents
English WER under 5 percent, very good smart format and diarisation
Scalable cloud API without hardware overhead
Nova-2-phonecall variant specifically tuned for telephony audio

WEAKNESSES

Swiss German and dialects unusable at 60-80 percent WER
No EU tier in May 2026 – all data through US AWS
Proprietary, no affordable self-host option for SMEs
STT only, additional building blocks needed for a full voice-agent pipeline

FAQ

Does Deepgram handle Swiss German?

No. Swiss German in May 2026 sits at 60-80 percent WER on Deepgram – practically unusable. High German works (8-12 percent WER), Swiss German does not. For CH applications with dialect content, Whisper local is the only solution.

Is there an EU region?

No, not in May 2026. Deepgram runs only in AWS us-east-1. An EU region has been announced for years but is still not available. For GDPR/revDSG-bound data, a transfer-impact assessment and client consent are mandatory.

How much does one hour of audio cost?

Nova-2 pre-recorded: USD 0.0043 * 60 = USD 0.26 per hour. Nova-2 streaming: USD 0.0058 * 60 = USD 0.35 per hour. Diarisation, smart format and topic detection add 10-30 percent. At 500 hours/month the base price comes to about USD 130 (pre-recorded) to USD 175 (streaming) before surcharges – cheaper than the Whisper API (USD 180/month) and faster.
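The same calculation as a small sketch, using only the per-minute prices quoted in this article. The function name and the `surcharge` parameter are illustrative.

```python
# Per-minute prices in USD as quoted in this article (May 2026).
PRICE_PER_MIN = {"pre-recorded": 0.0043, "streaming": 0.0058}
WHISPER_API_PER_MIN = 0.006


def monthly_cost_usd(hours: float, mode: str, surcharge: float = 0.0) -> float:
    """Monthly cost for a given audio volume; surcharge is a fraction, e.g. 0.3."""
    return hours * 60 * PRICE_PER_MIN[mode] * (1 + surcharge)


for mode in PRICE_PER_MIN:
    base = monthly_cost_usd(500, mode)
    with_extras = monthly_cost_usd(500, mode, surcharge=0.3)
    print(f"{mode}: USD {base:.2f} base, USD {with_extras:.2f} with 30% extras")
print(f"Whisper API: USD {500 * 60 * WHISPER_API_PER_MIN:.2f}")
```

Note that streaming with the full 30 percent surcharge lands above the Whisper API figure, so the price advantage depends on which extras are switched on.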

How does Deepgram cope with poor audio quality?

Nova-2-phonecall is trained specifically on compressed telephony audio (8 kHz) and delivers better results than the general model. Under heavy background noise (restaurant, street) WER rises on all models – Whisper large-v3 is more robust there because it was trained on broader audio data.

Sources

Deepgram – Nova-2 model overview and benchmarks · 2026-05
Deepgram pricing – pre-recorded and streaming tiers · 2026-05
Deepgram API reference – listen endpoint, WebSocket streaming · 2026-04
Artificial Analysis – Speech-to-Text benchmark (latency leaderboard) · 2026-05
Deepgram Trust Center – SOC 2, ISO 27001 attestations · 2026-04
