Building the N-Best Candidate Reranking System for Better Korean STT

One of the least-discussed capabilities of the Web Speech API is that it returns not just the top-1 transcription but an N-Best list: multiple alternative hypotheses ranked by confidence. Most applications ignore all but the first result. We built a local reranking system that selects better candidates using domain knowledge — achieving accuracy improvements without any additional API calls. Here's the detailed explanation of how and why it works.

What the Web Speech API Actually Returns

When the Web Speech API fires an onresult event with a final result, the event contains a SpeechRecognitionResultList. Each list item is a SpeechRecognitionResult with a length property indicating how many alternative hypotheses are available. Each alternative has a transcript (the text) and a confidence score (0.0 to 1.0).

In practice, on Chrome with Korean speech, we typically see 2-3 alternatives per final result. The confidence scores are often very close — the top candidate might score 0.85 and the second candidate 0.83. These small confidence differences don't necessarily reflect the actual quality difference between candidates. The acoustic model gives similar scores to phonetically similar alternatives.

// What we get from Web Speech API:
event.results[0][0] = { transcript: "API 호출이 실패했어요", confidence: 0.87 }
event.results[0][1] = { transcript: "에이피아이 호출이 실패했어요", confidence: 0.85 }
event.results[0][2] = { transcript: "API 호출이 실패해요", confidence: 0.83 }

// The first result has the correct form "API" (3 letters)
// The second has the phonetic Korean expansion "에이피아이"
// We want result [0] but confidence alone might sometimes prefer [1]

The Insight: Domain Knowledge Beats Confidence Scores for Technical Terms

For general conversational speech, the Web Speech API's confidence scores are reliable. For domain-specific speech — technical meetings discussing API architectures, pharmaceutical research discussing molecular compounds, biotech labs discussing CRISPR protocols — the acoustic confidence is systematically unreliable for specialized terms.

The reason is phonetic ambiguity. "API" spoken in Korean is either the English letters (A-P-I, 3 syllables: "에이-피-아이") or the Korean phonetic expansion (9+ syllables). Both are equally valid transcriptions of the same phoneme sequence. The acoustic model assigns similar scores to both. But in a software development meeting, "API" (the English letters) is almost certainly what the speaker meant.

Our selectBestCandidateLocal() method in TextCorrector resolves this using the domain dictionary and priority terms. For each candidate, we:

Apply local dictionary corrections to see which candidate best matches known technical terms.
Check how many priority terms (user-defined) appear in the candidate after correction.
Score technical term presence: each recognized technical pattern adds 0.1 to the score.
Score priority term presence: each user priority term adds 0.3 to the score.
Add small bonus if correction was applied (suggests acoustic model was wrong but dictionary knew better).

// N-Best reranking: domain-aware scoring
for (const candidate of candidates) {
    let score = candidate.confidence || 0;
    const corrected = this.quickCorrect(candidate.transcript);

    // Technical term bonus (AI, API, SDK patterns etc.)
    const technicalTerms = corrected.match(
        /[A-Z][A-Za-z0-9]+|LogP|pKa|IC50|Cmax|PCR|ELISA/g
    );
    if (technicalTerms) score += technicalTerms.length * 0.1;

    // User priority term bonus
    for (const term of this.priorityTerms) {
        if (corrected.includes(term)) score += 0.3;
    }

    // Correction happened = dictionary matched = likely correct
    if (corrected !== candidate.transcript) score += 0.05;

    if (score > bestScore) { bestScore = score; bestCandidate = candidate; }
}

The Session Dictionary: Learning During a Meeting

The most technically interesting component of the TextCorrector is the session dictionary — a dynamically built vocabulary that grows during a meeting session. The intuition: if the AI correction changes "씨알이에스피알" to "CRISPR" in one utterance, that correction should inform future utterances in the same session.

The learning mechanism works through the learnFromCorrection() method:

When AI correction changes word A to word B, the system records this as potential evidence.
A confidence counter tracks how many times this correction pattern has been seen.
Once the confidence exceeds a threshold (0.8), the correction is added to the session dictionary and applied to future utterances without any API call.
Technical term patterns (ALL_CAPS sequences, Korean pharmaceutical suffixes -닙, -맙) are extracted and added as priority terms automatically.

The practical effect: in a pharmaceutical meeting discussing a drug called "베르테포르핀" (Verteporfin), the first time the word appears it might be mis-transcribed and corrected by Gemini. From that point on, every subsequent mention is corrected locally without an API call, with zero latency.

Why Not Just Always Use the AI?

The question "why build this complex local system instead of just calling Gemini every time" has a simple answer: rate limits and latency. The free Gemini API tier allows roughly 15 requests per minute (depending on current quota). A typical meeting at 30 words per minute generates approximately 10 final speech results per minute. Each result would need correction. That's potentially 10 API calls per minute just for correction — before accounting for QA and SUMMARY.

The local system (N-Best reranking + dictionary correction + session learning) handles roughly 50% of correction cases locally with zero API calls and zero latency. The AI correction handles the remaining 50% that require semantic understanding. Total API calls drop by half, and the half that does use the API is the half where the AI genuinely adds value over a lookup table.

Persona Prompts: Different Domains Need Different Context

Beyond the local dictionary, the AI correction prompt is domain-specific. The system supports six personas: pharma, chemistry, biotech, food science, IT/software, and general. Each persona injects domain-specific vocabulary and context into the system prompt sent to Gemini or Groq.

This matters because ASR ambiguities are domain-specific. In a pharma meeting, "Phase 2" is a clinical trial phase. In a software meeting, "Phase 2" might be a project phase. "IC50" means inhibitory concentration in pharma but is meaningless in IT. The persona system ensures the AI correction is working with the right frame of reference.

We also inject the last 3 corrected utterances as context: "Recent context: [sentence 1], [sentence 2], [sentence 3]." This helps with pronouns and references — when the speaker says "그 값" (that value), the recent context tells the AI what "that value" referred to and whether it needs to be corrected or preserved.

The Numbers: What Local Reranking Actually Achieves

In internal testing with simulated pharmaceutical meeting audio (a mix of Korean and English technical terms), the full correction pipeline achieved approximately:

N-Best reranking alone: 8% reduction in word error rate versus taking only the top-1 hypothesis.
Local dictionary correction: additional 15% reduction in domain-specific term errors.
AI correction (Gemini): additional 12% reduction in semantic errors and complex mis-transcriptions.
Session learning (after first 5 minutes): additional 6% reduction as the session dictionary accumulates corrections.

The N-Best reranking result (8% WER improvement with zero API calls) was higher than expected. We initially expected it to be marginal. The improvement comes specifically from technical meetings: in general conversational speech testing, N-Best reranking contributed only 2-3% improvement, which is within noise. For domain-specific meetings, the improvement is real and worth the complexity.