There are features you build, ship, and feel proud of. Then there are features you spend two weeks on, demo to yourself in the mirror, realize they are completely broken at a fundamental level, and then delete entirely. Speaker identification for VORA was firmly in the second category. This is the honest post-mortem.
The Original Vision
The idea sounded obvious, even necessary, for a meeting assistant: when multiple people are talking, the transcript should know who said what. "Alice said X, Bob said Y." Every enterprise meeting recorder has this. Google Meet has it. Otter.ai has it. How hard could it be?
We committed to it. We built a SpeakerDetector class. We added UI elements: a "Set Primary Speaker" button in the app header, a speaker profiling card on the landing page, and speaker columns in the transcript display. The settings modal had a dedicated speaker detection section. We wrote the pitch copy: "Speaker Profiling — identify and separate voice tracks in real time."
It was in the product. It was in the marketing. It was a lie.
The Fundamental Problem We Ignored Too Long
VORA is built on the Web Speech API — the browser's built-in speech recognition interface. It is the reason VORA requires zero server infrastructure for transcription, works entirely in the browser, and is genuinely free to run. The Web Speech API is elegant and powerful for what it does.
What it absolutely does not do is speaker diarization. The API returns a single stream of text. It has no concept of multiple speakers, no audio channel separation, no voice embeddings, nothing. When recognition.onresult fires, you get a SpeechRecognitionResultList — just text. That's it.
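To make that concrete, here is standard Web Speech API usage; the result handler below is the entire per-result surface the API exposes.

// Standard Web Speech API usage. onresult hands you text plus a confidence
// score per alternative. No speaker IDs, no channels, no audio buffers.
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = true;
recognition.interimResults = true;

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const { transcript, confidence } = event.results[i][0];
    console.log(transcript, confidence); // ...and that is all you get
  }
};
recognition.start();

There is no event, property, or flag anywhere in that object graph that distinguishes one voice from another.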
We knew this going in. But we convinced ourselves we could work around it through clever heuristics:
- Volume thresholding: Maybe the "primary speaker" is consistently louder? We could ask the user to set a reference volume level.
- Pause-based segmentation: Long pauses might indicate speaker changes. We could insert synthetic speaker markers.
- AudioContext analysis: We were already using AudioContext for the visualizer. Could we do real-time voice fingerprinting from the microphone stream?
Each of these ideas sounds plausible for about five minutes. Then you implement them.
What Actually Happened in Implementation
The volume approach immediately hit a wall: in a meeting room, everyone is roughly the same distance from the microphone. Our "primary speaker" threshold produced random false positives. In a quiet one-on-one, it sort of worked. In an actual meeting? Useless.
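For concreteness, the heuristic was roughly this shape (a minimal sketch rather than the deleted code; isPrimarySpeaker and referenceRms are illustrative names):

// Rough shape of the volume heuristic (illustrative sketch, not the deleted code).
// Assumes `analyser` is the AnalyserNode we already had wired up for the visualizer.
const samples = new Float32Array(analyser.fftSize);

function isPrimarySpeaker(referenceRms) {
  analyser.getFloatTimeDomainData(samples);
  // Root-mean-square amplitude as a crude loudness measure
  const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
  // Label the chunk "primary speaker" if it's louder than the user-set reference
  return rms >= referenceRms;
}

When everyone sits the same distance from one laptop microphone, the RMS values for different speakers overlap almost completely, so the comparison is effectively a coin flip.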
Pause-based segmentation was worse. We inserted speaker break markers whenever silence exceeded 2 seconds. This produced transcripts that looked like speaker changes but were just... pauses within a single person's speech. "Alice: I think we should con—" break "Bob: sider the budget implications." Those were the same sentence from the same person. The transcript was now actively misleading.
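The logic behind it was essentially this simple (a sketch based on the description above, not the deleted code; onTranscriptChunk and appendToTranscript are hypothetical helpers):

// Pause-based segmentation sketch: flip the speaker label whenever silence
// exceeds 2 seconds. The flawed assumption is that silence means a new speaker.
const PAUSE_THRESHOLD_MS = 2000;
let lastResultAt = Date.now();
let currentSpeaker = 1;

function onTranscriptChunk(text) {
  const now = Date.now();
  if (now - lastResultAt > PAUSE_THRESHOLD_MS) {
    // People pause mid-sentence all the time, so this flips on false positives.
    currentSpeaker = currentSpeaker === 1 ? 2 : 1;
  }
  lastResultAt = now;
  appendToTranscript(`Speaker ${currentSpeaker}: ${text}`); // hypothetical helper
}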
The real-time voice fingerprinting idea was the most technically interesting and the most catastrophically flawed. Using the AnalyserNode from the Web Audio API, we could extract frequency data from the microphone stream. We built a rudimentary voice embedding system using spectral centroid and MFCC-like features computed in JavaScript. It worked great on a single speaker. With two speakers who had significantly different voices, it got it right about 60% of the time. That sounds okay until you realize 40% wrong on a meeting transcript is completely unusable.
// What we tried - voice fingerprinting via AudioContext
const audioCtx = new AudioContext();
const source = audioCtx.createMediaStreamSource(micStream); // micStream from getUserMedia()
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;
source.connect(analyser);

const dataArray = new Float32Array(analyser.frequencyBinCount);
analyser.getFloatFrequencyData(dataArray); // values are in decibels

// Convert dB to linear magnitude, then compute the spectral centroid
// (the spectrum's "center of mass" in Hz) as a crude voice ID feature
const mags = dataArray.map(db => Math.pow(10, db / 20));
const binHz = audioCtx.sampleRate / analyser.fftSize;
const centroid = mags.reduce((sum, mag, i) => sum + mag * i * binHz, 0)
               / mags.reduce((sum, mag) => sum + mag, 0);

// Match centroid (plus MFCC-like features) to known speaker profiles...
// This worked ~60% of the time. Not good enough.
The Real Cost: User Trust and UI Clutter
Here's the thing about shipping a feature that doesn't work well: it actively erodes trust. A user who sees a "Set Primary Speaker" button assumes it does something meaningful. When the speaker labels are wrong — which they were, constantly — the user doesn't think "oh, this is technically difficult." They think "this product is broken."
We had speaker profiling cards on the landing page. We had it in the app settings. We had it in the Korean version of the product description. Every time a user saw a speaker attribution that was wrong, VORA's credibility took a hit. Shipping a broken feature is worse than not shipping the feature at all.
The Decision to Delete Everything
The commit message reads: "Remove speaker identification UI and logic (Web Speech API limitation)". What it doesn't capture is the conversation that led to it.
We spent an afternoon seriously evaluating alternatives. Could we move to a server-side architecture and use a proper diarization model? Yes, but that would mean VORA is no longer fully browser-based and free, which is core to what we're building. Could we integrate Whisper in the browser and run a speaker diarization model locally? In theory yes, but the WASM performance was far too slow for real-time use (more on that in a separate post). Could we just make the heuristic-based system better? We tried. It wasn't a solvable problem at the quality bar we needed.
It was a lot of code to delete. It felt bad to delete it. But the codebase was cleaner, the product was more honest, and — critically — users who tested the cleaned-up version gave us consistently better feedback because the transcript just showed what was said without false speaker labels cluttering it.
What We Shipped Instead
With speaker detection gone, we had development capacity freed up. We invested it into the things that actually made the transcript more useful: the AI correction pipeline (which became the Gemini integration), the N-Best candidate reranking system, and the meeting context injection that helps correct domain-specific terminology. These features don't tell you who said something, but they make sure what was said is transcribed accurately. That turned out to be far more valuable.
The Lesson We Keep Coming Back To
There's a category of engineering decisions where the honest answer is "we cannot do this at acceptable quality given our architectural constraints, and pretending otherwise makes the product worse." Speaker diarization in a purely browser-based, Web Speech API-backed application is that kind of problem. No amount of JavaScript cleverness fixes a missing API capability.
The lesson isn't "don't be ambitious." It's: understand your platform's hard limits before building user-facing features that depend on those limits being soft. We should have built a quick prototype, seen the 60% accuracy, and made the cut immediately. Instead we polished the broken UI for two weeks.
For future reference — and for anyone else building on the Web Speech API — the hard limits are: no speaker diarization, no raw audio access, Chrome-only in practice for production quality, and dependent on a network connection to Google's servers. Build beautiful things within those constraints rather than ugly hacks around them.
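If you do build on it, plan for those limits explicitly: detect the API before showing any recording UI, and treat network errors as expected rather than exceptional. A minimal sketch:

// Minimal defensive setup for the Web Speech API's hard limits: the
// constructor is prefixed in Chrome, and recognition fails whenever the
// recognition service is unreachable.
const SpeechRecognitionImpl = window.SpeechRecognition || window.webkitSpeechRecognition;

if (!SpeechRecognitionImpl) {
  // No support (e.g., many non-Chromium browsers): disable the transcription UI.
  console.warn("Speech recognition is not available in this browser.");
} else {
  const recognition = new SpeechRecognitionImpl();
  recognition.onerror = (event) => {
    if (event.error === "network") {
      // Offline or the recognition service is unreachable: tell the user.
      console.warn("Speech recognition needs a network connection.");
    }
  };
}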
VORA is better without the feature. That's the most honest thing we can say about it.