WHISPER · TECH

Whisper: OpenAI open-source STT model for multilingual transcription

Whisper is OpenAI MIT-licensed speech-recognition model. Runs locally via whisper.cpp, faster-whisper or WhisperX, or via API at USD 0.006/min. As of May 2026 with large-v3 and turbo-v3.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Whisper?

Whisper is a speech-to-text model OpenAI released in September 2022 under MIT license. It was trained on 680,000 hours of multilingual audio from the open web. It transcribes 99 languages, translates audio directly into English, and marks speaker turns via timestamp markers.

The unusual thing about Whisper: the model is simultaneously the reference for cloud STT (OpenAI offers it as an API at USD 0.006 per minute) and the industry standard for local speech recognition. Three variants dominate the field in May 2026. whisper.cpp (Github ggerganov/whisper.cpp) is the C++ port, runs on CPU or Apple Silicon without CUDA. faster-whisper (SYSTRAN) uses CTranslate2 for 4x speedup on GPU. WhisperX (Max Bain, Oxford) adds speaker diarisation, word-level timestamps and forced alignment.

The model family covers tiny, base, small, medium and large. In May 2026, large-v3 is the recall champion (235M encoder parameters, 1.5B total), turbo-v3 is the fast variant (8x speedup at around 95 percent recall). On an RTX 4090, turbo-v3 transcribes one hour of audio in under 30 seconds. On a MacBook Pro M3, whisper.cpp with small.en hits about 5x real-time.

For Swiss applications, Whisper is the only cloud or local STT system with reliable Swiss German recognition. Word error rate (WER) sits below 5 percent on clean High German, between 18 and 45 percent on dialect depending on region. Bernese Oberland and Wallis are the hardest cases, Zurich German and Bern city dialect work well.

Why it matters

Three properties of Whisper matter for Swiss fiduciaries and SMEs.

First: data sovereignty. Whisper runs fully locally. A client call recording or confidential lawyer dictation never needs to touch a US cloud service. whisper.cpp on a workstation or a Hetzner server in Falkenstein processes the audio on-premise, the transcript stays in-house. That makes Whisper the only STT system compatible with Art. 321 StGB (professional secrecy for lawyers, doctors, fiduciaries) when no client consent for cloud transfer exists.

Second: Swiss German. Deepgram, Google Speech-to-Text, and Azure Speech are trained on English-heavy data. Swiss German hits 60-80 percent WER on them – unusable. Whisper large-v3 reaches 18-45 percent WER by dialect – not perfect, but enough for pre-sorting and downstream LLM analysis. For voice agents in Zurich, Basel, and Bern, Whisper is the only practical path in May 2026.

Third: cost control. The OpenAI API costs USD 0.006 per minute of audio. At 100 hours per month that is USD 36. Local with faster-whisper on an RTX 4060 (CHF 350 hardware) the unit cost is practically zero. At larger volumes (1000+ hours/month) self-host is cheaper than the cloud API after 2-3 months. For fiduciary firms with heavy dictation load, a clear argument for local.

A fourth argument: latency. Cloud Whisper returns a transcript in 2-5 seconds for a 30-second audio file. Locally on GPU it is under 1 second. For live telephony with a voice bot, faster-whisper on a GPU is the only practical option.

How it works

Whisper is an encoder-decoder transformer. The encoder turns 30-second audio chunks into Mel spectrograms and encodes them into 1500-dimensional vectors. The decoder generates tokens autoregressively, with language tokens, task tokens (transcribe/translate), and timestamp tokens. Long audio is processed in 30-second windows with overlap.

API usage is simple. Example cURL against OpenAI:

curl https://api.openai.com/v1/audio/transcriptions \ -H "Authorization: Bearer $OPENAI_API_KEY" \ -H "Content-Type: multipart/form-data" \ -F file="@audio.m4a" \ -F model="whisper-1" \ -F language="de" \ -F response_format="verbose_json"

The response holds text, word timestamps, and confidence. Maximum file size is 25 MB – longer audio must be split.

Local with faster-whisper:

from faster_whisper import WhisperModel model = WhisperModel("large-v3", device="cuda", compute_type="float16") segments, info = model.transcribe("audio.m4a", language="de", beam_size=5) for segment in segments: print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")

The model loads around 3 GB VRAM at float16. An RTX 4060 (8 GB) is enough, an RTX 4090 can run several streams in parallel.

For live telephony with Twilio: incoming call -> Twilio Media Streams sends PCM audio via WebSocket -> faster-whisper transcribes in 1-2-second chunks -> LLM request -> TTS reply back. End-to-end latency caller speech to bot answer lands at 1.5-2 seconds with a good setup.

WhisperX adds two critical features: pyannote speaker diarisation (who spoke when) and Wav2Vec2 forced alignment for exact word timestamps. For legal transcripts and multi-speaker client calls, WhisperX is the right variant.

Whisper setup in 5 steps

01Pick a variant: OpenAI API (no setup, USD 0.006/min, US cloud) or local (faster-whisper on GPU, whisper.cpp on CPU/Apple Silicon, WhisperX for diarisation).
02Provide hardware: RTX 4060 (8 GB VRAM) is enough for one large-v3 stream, RTX 4090 for several parallel streams or turbo-v3 batch.
03Load the model: install the faster-whisper Python package, download large-v3 or turbo-v3 (about 3 GB), beam_size=5 for quality, beam_size=1 for speed.
04Build the pipeline: audio source (file, mic, Twilio stream) -> Whisper -> post-processing (punctuation, capitalisation) -> output (database, LLM, file).
05Evaluate quality: run 30 real audio samples from the target domain (Swiss German, High German, speaker variety), measure WER, optionally set language hint and initial prompt.

When to use Whisper

Whisper is the right choice as soon as Swiss German or multilingual audio is in play, data sovereignty becomes a topic, or volume goes above 50 hours/month.

Concrete cases: a fiduciary office wants to transcribe client calls for internal notes – local with faster-whisper, on-premise. A law firm records dictations – Whisper large-v3 locally with WhisperX for speaker separation. A voice agent for an insurance hotline must understand Swiss German – faster-whisper on GPU in live stream. A consultant wants automatic video-call summaries – Whisper API for recordings, then LLM summary.

For English audio (e.g. international calls, YouTube sources, podcast transcription) Whisper is competitive – English WER is below 5 percent.

When not to use

For pure English with tight latency requirements (< 300 ms), Deepgram is faster. Whisper cloud API returns in 2-5 seconds, locally in 0.5-1 second – too slow for some real-time use cases.

For Wallis or Bernese Oberland dialect with high accuracy demand, Whisper is also limited (35-45 percent WER). A human in the loop or fine-tuning on own data helps more.

Without GPU hardware budget and no willingness to use the cloud API, look at Whisper-cpp on CPU carefully – smaller models (small.de) are usable, large-v3 on CPU is very slow (1-2x real-time).

For extremely clean studio recordings with a single speaker and High German, even the small base.de variant is enough – large-v3 is overkill.

Trade-offs

STRENGTHS

Only STT system with reliable Swiss German recognition in May 2026
MIT license, fully runnable locally, no cloud lock-in
Three mature implementations (whisper.cpp, faster-whisper, WhisperX) for every hardware tier
99 languages, robust noise handling, word-level timestamps

WEAKNESSES

Higher latency than Deepgram for pure English (local 0.5-1s, cloud 2-5s vs. Deepgram 0.3s)
large-v3 needs GPU or Apple Silicon, very slow on CPU
Dialects outside urban Swiss German (Wallis, Oberland) remain hard
Hallucinations on very quiet or empty audio – clean VAD pre-filtering required

FAQ

How well does Whisper really handle Swiss German?

With large-v3 in May 2026, between 18 and 45 percent WER depending on region. Zurich, Basel, Bern city: 18-25 percent – usable for pre-sorting and LLM post-processing. Wallis, Bernese Oberland: 35-45 percent – poor, but better than any competitor. Tip: set language to "de" not "gsw" (Whisper knows the Swiss German tag but then transcribes into High German, which is usually what you want).

Cloud API or self-host: what pays off?

Below 50 hours of audio per month: OpenAI API. No setup, USD 0.006/min, scales up. Between 50-500 hours/month: weigh both, depending on data-protection needs. Above 500 hours or under professional secrecy: self-host with faster-whisper on your own GPU. Break-even for an RTX 4090 (CHF 1800) sits at about 250 hours/month of Whisper API usage.

large-v3 or turbo-v3?

For live telephony and bulk transcription: turbo-v3. 8x faster, recall drops only 2-3 percent – acceptable for most cases. For legal transcripts, dictation with high accuracy demand, or hard audio (background noise, multiple speakers): large-v3. Both can run in parallel – turbo for a fast first pass, large-v3 for re-run on low-confidence segments.

What does a self-hosted Whisper service cost per month?

On Hetzner with a GPU server (e.g. GEX130 with RTX 4000 Ada, around CHF 280/month), a single-stream live transcription runs 24/7. Batch transcription processes around 5000 hours of audio per month. A dedicated workstation (RTX 4090, one-time CHF 1800) is cheaper after 6-12 months, but without redundancy.

Sources

OpenAI – Introducing Whisper (model card, large-v3, turbo) · 2026-05
ggerganov/whisper.cpp – GitHub repository and benchmarks · 2026-05
SYSTRAN/faster-whisper – GitHub repository (CTranslate2 inference) · 2026-05
WhisperX (m-bain) – diarisation and forced alignment · 2026-04
Artificial Analysis – Speech-to-Text benchmark leaderboard (German, multilingual) · 2026-05

FITS YOUR STACK?

What this looks like in your business – a 30-minute intro call.

Book a call