Today I ran a simple test to see whether OpenAI's Whisper could handle McDonald's drive-thru orders. Recorded three audio files: clear speech, fast speech, casual conversation → ran them through the model, checked the WER.
The first result came back with an average WER of 8%. Decent for a base/small model.
But I was just curious where it was making errors, so I broke it down by condition:
| Condition | WER |
|---|---|
| Clear speech (each word pronounced clearly) | 0% |
| Fast speech (a little fast with fillers) | 20% |
| Casual conversation (how we usually talk, fewer fillers) | 5% |
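For reference, WER is just word-level edit distance divided by the reference length. In practice you'd use a library like jiwer, but a minimal self-contained sketch looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

Note how brutally this scores the drive-thru errors: `wer("i want a mcflurry and a coke", "i want a neck laddie and a cook")` counts two substitutions and one insertion against a seven-word reference, roughly 43% WER for a single garbled order.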
One test case was quietly destroying the aggregate. And it happened to be the most important one: fast speech. That's how people actually talk in real life. They aren't dictating their order clearly the way a model expects.
"McFlurry" was being transcribed as "neck laddie." "Coke" became "cook." These weren't small errors.
What did I do next?
But WER was still high, because WER only measures whether the transcribed words match the reference words. The intent was right but the words were still wrong, and the metric had no way of knowing the difference.
The model didn't fail because it was weak. It failed because I was optimizing for the wrong success definition.
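One way to see the gap is to score intent instead of words. This is a hypothetical sketch, not what I actually ran: `MENU` and the fuzzy-matching threshold are illustrative assumptions, using Python's built-in `difflib`:

```python
import difflib

# Hypothetical menu vocabulary; a real system would load the full item list
MENU = ["mcflurry", "coke", "big mac", "fries"]

def extract_items(transcript: str) -> list[str]:
    """Fuzzy-match transcribed words to menu items (order kept, no duplicates)."""
    items = []
    for word in transcript.lower().split():
        match = difflib.get_close_matches(word, MENU, n=1, cutoff=0.6)
        if match and match[0] not in items:
            items.append(match[0])
    return items

# "cook" is close enough to "coke" that the intent survives the transcription
# error, even though word-level WER counts it as a full substitution.
```

An intent-level metric built on top of this (did the extracted items match the order?) would score the "cook" case correct while WER penalizes it.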
Wispr Flow is an AI transcription tool. Their founder said something along these lines (paraphrased):
“The industry-standard metric for speech-to-text, WER, is misleading for user experience. Even if an AI reaches a 99% accuracy rate (1% WER), a typical 100-word email will still contain at least one error. For a user, that single error breaks the 'flow' and forces them to move their hands back to the keyboard to edit, destroying the trust required for a voice-first interface.”
They stopped optimizing for WER and started measuring zero-editing rate: what percentage of transcriptions do users send without changing anything?
Even at 2% WER, users were still editing 40% of outputs, because spoken language isn't written language. A raw transcription of "Um, so like, can you send me that report?" is technically accurate but not something you'd actually send.
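Closing that gap means normalizing speech into writing, not just transcribing it. A toy sketch of the idea (the filler list and regex are my assumptions, far simpler than what a production system would do):

```python
import re

# Assumed filler words/phrases; real systems use much richer disfluency models
FILLER_RE = re.compile(r"\b(um|uh|so|like|you know)\b,?\s*", re.IGNORECASE)

def clean(text: str) -> str:
    """Strip filler words, collapse whitespace, and re-capitalize the result."""
    cleaned = FILLER_RE.sub("", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" ,")
    return cleaned[:1].upper() + cleaned[1:]

# clean("Um, so like, can you send me that report?")
# -> "Can you send me that report?"
```

A zero-editing-rate eval would then ask: after this cleanup, what fraction of outputs does a user send untouched? Note the obvious limitation of a naive word list: "so" and "like" aren't always fillers, which is exactly why this is a sketch and not a solution.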
Note: this isn't universal. In medical transcription or legal settings, word-level accuracy isn't just a metric, it's a compliance requirement. The point isn't that WER is always the wrong metric. It's that the right metric depends entirely on what failure actually costs in your specific context.
Here's the thing: writing evals is increasingly something AI can do. Tools already generate test cases automatically. Models grade other models. Entire eval pipelines can get scaffolded in hours.
The bottleneck will no longer be building the eval.