Word Error Rate (WER)

Word Error Rate (WER) is the standard metric for measuring the accuracy of an automatic speech recognition (ASR) system. It is the number of word-level errors in a transcription divided by the number of words in the reference transcript, usually expressed as a percentage. A WER of 0% means perfect transcription; higher values indicate more errors. WER is widely used because it tends to track downstream task quality: a system with 15% WER typically produces noticeably garbled text that can break intent detection, while errors from a system below 5% often go unnoticed by users.

Consider a reference transcript: "Cancel my subscription starting next month." If an ASR system produces "Cancel my description starting next month," that is one substitution error out of six reference words, giving a WER of 16.7%. The same system might also drop words entirely (deletions) or hallucinate words that were never spoken (insertions). All three error types count.

The WER formula

WER = (S + D + I) / N

Where S is the number of substitutions (wrong word), D is the number of deletions (missed word), I is the number of insertions (extra word added), and N is the total number of words in the reference transcript. The alignment between hypothesis and reference is computed using dynamic programming: a word-level minimum edit distance (Levenshtein) algorithm finds the lowest-cost sequence of substitutions, deletions, and insertions that transforms one word sequence into the other.
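The formula and alignment described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `WerResult` and `word_error_rate` are hypothetical, and the normalization (lowercasing and splitting on whitespace) is an assumption; production scoring tools usually apply more careful text normalization first.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N
        return (self.substitutions + self.deletions + self.insertions) / self.ref_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    # Assumed normalization: lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (total_edits, S, D, I) for aligning ref[:i] with hyp[:j].
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # match, no cost
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare by total edits first, so this picks a
            # minimum-cost alignment. The S/D/I split can differ between
            # equally cheap alignments; the total cannot.
            cost[i][j] = min(diagonal, deletion, insertion)

    _, s, d, n = cost[len(ref)][len(hyp)]
    return WerResult(s, d, n, len(ref))


result = word_error_rate(
    "cancel my subscription starting next month",
    "cancel my description starting next month",
)
print(result.substitutions, result.deletions, result.insertions)  # 1 0 0
print(f"{result.wer:.1%}")  # 16.7%
```

The table has (N + 1) × (M + 1) cells for a reference of N words and a hypothesis of M words, so the cost grows with the product of the two lengths.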

WER can technically exceed 100% when insertions are numerous. This happens when the model hallucinates long strings of text from short utterances, so a raw WER above 100% is a signal of a fundamentally broken system rather than a mildly poor one.
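As a worked illustration with made-up data, suppose the reference is the single word "yes" (N = 1) and the hypothesis is "yes I would like to". The first word matches and the remaining four are insertions:

```latex
\mathrm{WER} = \frac{S + D + I}{N} = \frac{0 + 0 + 4}{1} = 4.0 = 400\%
```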

Benchmarks and what the numbers mean

Interpreting WER requires context about acoustic conditions, vocabulary size, and speaker demographics. The following bands are commonly cited rules of thumb rather than formal thresholds:

Below 5%: Consumer-grade quality. Acceptable for most transcription and voice assistant applications where the acoustic environment is clean.
2-5%: Typical for frontier cloud ASR systems on standard read speech benchmarks such as LibriSpeech test-clean.
Below 2%: Approaches human parity in controlled conditions. Achieved on some benchmark sets by leading ASR models, though human transcriptionists generally still outperform machines on conversational speech with overlapping speakers.
10-20%+: Degraded performance. Common when models encounter heavy accents, domain-specific jargon, noisy call-center audio, or languages underrepresented in training data.

WER on read speech benchmarks does not translate directly to production performance. A model scoring 2% WER on a clean academic corpus might reach 12-18% on contact center audio recorded over VoIP, where background noise, codec compression, and spontaneous speech patterns all degrade accuracy.

What drives WER up

Several factors push WER higher in production deployments. Acoustic noise (HVAC, keyboards, hold music bleeding through headsets) masks phonetic distinctions the model relies on. Telephony codecs add their own losses: G.711 audio is sampled at 8 kHz, which discards everything above 4 kHz, including high-frequency consonant energy that helps distinguish similar-sounding words like "shift" and "sift." Speaker diversity also matters: models trained predominantly on one dialect or accent cluster tend to generalize poorly to other speaker populations.

Vocabulary mismatch is another common cause. A general-purpose ASR model may lack pronunciation dictionary entries or training examples for product names, medical terminology, or company-specific jargon. When the model encounters an out-of-vocabulary word, it typically outputs the phonetically nearest in-vocabulary word, adding a substitution error, and sometimes splits the word into several, adding insertions as well. Domain adaptation, either fine-tuning the acoustic model or updating the language model with domain text, is the standard remediation.

WER in voice AI pipelines

In a voice AI pipeline, where ASR output feeds into intent classification, entity extraction, and ultimately AI voice agent logic, transcription errors compound downstream. A substitution error that changes "return" to "retain" can cause the intent classifier to route the call incorrectly even if every downstream component is perfect. For this reason, production voice AI teams often track both WER and task completion rate as paired metrics: WER captures transcription quality; task completion captures whether errors matter in practice.
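Tracking the two metrics side by side might look like the sketch below. The `CallRecord` fields, the `paired_metrics` function, and the sample numbers are all hypothetical. Pooling errors across calls (total errors over total reference words), rather than averaging per-call WER, is an assumption chosen here so that short calls do not dominate the result.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    word_errors: int      # S + D + I for this call
    ref_words: int        # N for this call
    task_completed: bool  # did the caller's task succeed?


def paired_metrics(calls: list[CallRecord]) -> tuple[float, float]:
    """Return (pooled WER, task completion rate) for a batch of calls."""
    total_words = sum(c.ref_words for c in calls)
    if not calls or total_words == 0:
        raise ValueError("need at least one call with reference words")
    pooled_wer = sum(c.word_errors for c in calls) / total_words
    completion_rate = sum(c.task_completed for c in calls) / len(calls)
    return pooled_wer, completion_rate


# Made-up sample data for illustration only.
calls = [
    CallRecord(word_errors=2, ref_words=40, task_completed=True),
    CallRecord(word_errors=1, ref_words=20, task_completed=False),
    CallRecord(word_errors=9, ref_words=40, task_completed=True),
]
wer, completion = paired_metrics(calls)
print(f"WER {wer:.1%}, task completion {completion:.1%}")
```

In this made-up batch the call with the fewest errors is the one that failed, which is the kind of case the paired view is meant to surface: a single error on a critical word can matter more than many errors on filler words.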

Endpointing accuracy also interacts with WER. When an endpointing model cuts the audio stream too early, the ASR never receives the final words of an utterance, creating systematic deletion errors that inflate WER on long sentences. Teams optimizing WER in a production voice agent typically need to tune both the ASR model and the endpointing thresholds together.

Mean Opinion Score (MOS) is a complementary metric used on the synthesis side of voice AI: while WER measures how well the system understands speech, MOS measures how natural synthesized speech sounds. Monitoring both gives a fuller picture of voice channel quality. Voice activity detection (VAD) is a prerequisite stage that determines which audio segments contain speech before ASR even runs. VAD errors that pass noise segments to the ASR model also inflate WER, typically as insertion errors when the model transcribes words from non-speech audio.