Building a real-time AI-powered meeting assistant sounds thrilling on paper. In practice, it's a relentless war against latency, API rate limits, and the cruel reality that AI models were never designed to handle the firehose of real-time speech data. This is the story of how VORA's engineering team nearly broke Google's Gemini API—and how we fixed it.

The Dream: Real-Time AI Speech Correction

VORA's core promise is simple: speak naturally during a meeting, and AI will transcribe, correct technical jargon, detect questions, and generate summaries—all in real-time. The Web Speech API handles raw transcription, but its output is often messy. Words like "API" become "에이피아이" (Korean phonetic), "LogP" turns into "로그피", and domain-specific terms get butchered.

Enter Gemini Flash. We designed a pipeline where every finalized sentence from the Web Speech API would be sent to Gemini for intelligent context-aware correction. The first version looked elegant in our architecture diagram.

[Microphone] → [Web Speech API] → [Raw Text] → [Gemini API: Correct + Detect Questions] → [Polished UI]
                                ↓
                      [Every 60s: Gemini Summary Task]

The Nightmare: 429 Errors Everywhere

The moment we tested with real meeting audio, disaster struck. Here's what our console looked like after 2 minutes:

// Console output after 2 minutes of real-time meeting
[GeminiAPI] CORRECTION started (gemini-flash-latest) - Queue: 0
[GeminiAPI] CORRECTION completed
[GeminiAPI] CORRECTION started (gemini-flash-latest) - Queue: 3
[GeminiAPI] CORRECTION started (gemini-flash-latest) - Queue: 5
ERROR [GeminiAPI] CORRECTION error: HTTP 429 - Resource has been exhausted
ERROR [GeminiAPI] QA error: HTTP 429 - Resource has been exhausted
ERROR [GeminiAPI] SUMMARY error: HTTP 429 - Resource has been exhausted
// ...the entire system goes silent. All AI features paralyzed.

What Happened?

A person speaks roughly 150-180 words per minute, and the Web Speech API finalizes results every 2-5 seconds. That means we were firing 12-30 Gemini API requests per minute just for text correction. Add question detection, AI answers, and periodic summaries, and we were easily hitting 40-50+ requests per minute—far exceeding Gemini's free-tier limit of 15 RPM (requests per minute).

The worst part? Once 429 errors start, they cascade. The correction queue backs up, summary requests pile on, and the retry logic floods even more requests. Gemini didn't just slow down—it went completely catatonic.

The Fix: A 4-Layer Defense Architecture

We didn't solve this with a single trick. It took a systematic, multi-layered approach that we refined over weeks of real-world testing.

Layer 1: Local-First Dictionary Correction (Zero API calls)

The first breakthrough was realizing that most corrections don't need AI at all. We built a local phonetic dictionary that maps common misrecognitions to correct terms:

// text-corrector.js - Local dictionary (no API needed)
const dictionary = {
  '에이피아이': 'API',
  '유아이': 'UI',
  '에스디케이': 'SDK',
  '지피티': 'GPT',
  // Dynamically learned during session...
};

// Quick local correction runs instantly, no network call
quickCorrect(text) {
  let corrected = text;
  for (const [wrong, correct] of Object.entries(dictionary)) {
    if (corrected.includes(wrong)) {
      corrected = corrected.split(wrong).join(correct);
    }
  }
  return corrected;
}

This alone eliminated ~40% of correction requests from ever hitting the API. The local dictionary also learns during a session: when Gemini successfully corrects a term, the mapping is stored locally so the same correction never requires an API call again.
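
A minimal sketch of how that session learning might look; the learnTerm name and the guard conditions are illustrative, not VORA's actual code:

// Hypothetical sketch: when a background Gemini correction comes back,
// cache the mapping so future occurrences are fixed locally with zero API calls
function learnTerm(wrong, correct) {
  // Skip empty, trivial, or already-known mappings
  if (!wrong || wrong === correct || dictionary[wrong]) return;
  dictionary[wrong] = correct;   // e.g. dictionary['로그피'] = 'LogP'
}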

Layer 2: Priority Queue with Type-Based Rate Limiting

Not all API tasks are equally important. A user's direct question should never be blocked by a routine text correction. We implemented a priority queue:

// gemini-api.js - Priority-based task execution
const priorityMap = {
  'QA': 0,          // Highest: User questions answered immediately
  'CORRECTION': 1,  // Medium: Text correction can wait a few seconds
  'SUMMARY': 2      // Lowest: Summaries run every 60 seconds anyway
};

// Per-type minimum intervals (RPM management)
this.minIntervals = {
  CORRECTION: 5000, // Max 12 corrections/min
  QA: 500,          // Questions processed almost instantly
  SUMMARY: 10000,   // Max 6 summaries/min
  GLOBAL: 1500      // Hard floor: never more than 40 req/min total
};

The Key Insight

By introducing per-type intervals, we ensured that a burst of speech corrections could never starve the QA or summary pipeline. The global 1.5-second interval puts a hard ceiling on total throughput (at most 40 requests per minute), and together with the per-type intervals and the triage described below it keeps our real-world rate at 8-12 requests per minute, comfortably under Gemini's limits even during heated discussions.
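
A minimal sketch of the scheduling step, assuming hypothetical field and method names (queue, lastRun, callGemini); the real gemini-api.js differs in detail. A task only runs once both its per-type interval and the global interval have elapsed, and higher-priority tasks are always considered first:

// Sketch of the dequeue step (names are illustrative)
async processQueue() {
  this.queue.sort((a, b) => priorityMap[a.type] - priorityMap[b.type]);
  const now = Date.now();
  const task = this.queue.find(t =>
    now - (this.lastRun[t.type] || 0) >= this.minIntervals[t.type] &&
    now - (this.lastRun.GLOBAL || 0) >= this.minIntervals.GLOBAL
  );
  if (!task) return;                            // nothing eligible yet; retry on the next tick
  this.queue.splice(this.queue.indexOf(task), 1);
  this.lastRun[task.type] = this.lastRun.GLOBAL = Date.now();
  await this.callGemini(task);                  // the actual request happens here
}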

Layer 3: Intelligent Skip Logic (AI Triage)

Here's the counterintuitive part: not every sentence needs AI correction. We built a triage function that decides whether a sentence is "interesting enough" to send to Gemini:

// Only send to Gemini if the text matches patterns that suggest
// it might contain misrecognized technical terms
mightNeedAICorrection(text) {
  const indicators = [
    /[가-힣]{2,}(닙|맙|빕)/,        // Drug name patterns (-nib, -mab)
    /(값|농도|수치|결과)/,           // Measurement-related words
    /[가-힣]\s?[가-힣]\s?[가-힣]/,   // 3 consecutive syllables (possible acronym)
    /(로그|피|씨|케이|에이)/,        // English letter pronunciations in Korean
    /(그\s?값|그것|이것|저것)/,      // Demonstrative pronouns (need context)
  ];
  return indicators.some(pattern => pattern.test(text));
}

Simple sentences like "Okay, let's continue" or "I agree with that" skip AI entirely. This cut another ~30% of API calls.
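
For example, with illustrative utterances (not real session data):

// Illustrative triage results
mightNeedAICorrection("Okay, let's continue");   // false → no indicators, never hits Gemini
mightNeedAICorrection('그 값이 로그 피였어요');     // true  → matches 그 값, 값, 로그, 피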

Layer 4: Background Non-Blocking Correction

The final piece: corrections happen asynchronously in the background. The user sees the local-corrected text immediately. When Gemini's correction arrives (sometimes 2-5 seconds later), the UI updates smoothly with a sparkle badge (✨).

// Non-blocking flow:
// 1. User sees locally-corrected text INSTANTLY
// 2. Background: Gemini processes the same text
// 3. When Gemini responds, UI updates via callback
async correct(text, candidates, id) {
  // Step 1: Local correction (instant, 0ms)
  const quickCorrected = this.quickCorrect(text);

  // Step 2: Fire-and-forget AI correction in the background (only if triage says so)
  const shouldUseAI = this.mightNeedAICorrection(text);
  if (shouldUseAI) {
    this.aiCorrectInBackground(text, quickCorrected, id);
    // Result delivered via onCorrected callback later
  }

  // Step 3: Return immediately so the UI isn't blocked
  return { text: quickCorrected, wasCorrected: true };
}
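
The receiving side is a small callback; a hedged sketch of what the UI hook might look like (onCorrected and the DOM details are illustrative):

// Illustrative UI hook: swap in the AI-polished text when it arrives
corrector.onCorrected = ({ id, text }) => {
  const entry = document.querySelector(`[data-entry-id="${id}"]`);
  if (!entry || entry.textContent === text) return;   // nothing to update
  entry.textContent = text;
  entry.insertAdjacentText('beforeend', ' ✨');        // sparkle badge: AI-polished
};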

The Results: Before vs. After

Performance Comparison

Metric                         | Before (Naive)        | After (4-Layer)
API Calls/min                  | 40-50+                | 8-12
429 Errors                     | Constant after 2 min  | Zero in 60+ min sessions
Correction Latency (Perceived) | 3-8 sec (blocking)    | 0ms local + 2s AI polish
System Stability               | Crashes after 5 min   | Stable 2+ hour sessions

The Next Evolution: Dual-AI with Groq

Even with the 4-layer defense, there's an inherent tension: real-time correction wants speed, but Gemini's strength is deep contextual understanding. We're now running a dual-AI architecture where Groq's Llama 3.3 70B handles the latency-critical first-pass correction (hardware-accelerated LPU inference keeps it consistently sub-1-second), while Gemini focuses on what it does best—contextual understanding, summarization, and intelligent Q&A.

[Speech] → [Local Dict] → [Groq/Llama: Fast Correction (low latency)] → [Display]
                       ↓ (parallel)
            [Gemini: Deep Context (summary, Q&A, learning)]

This dual-model approach lets each AI play to its strengths while staying well within both APIs' rate limits.
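
In code, the split looks roughly like the sketch below; groqCorrect, geminiAnalyze, and display are placeholder names for this illustration:

// Sketch of the dual-AI fan-out (placeholder function names)
async function handleFinalizedText(text, id) {
  const local = corrector.quickCorrect(text);     // Layer 1: instant local pass
  display(id, local);                             // the user sees something immediately

  // Fast path: Groq/Llama first-pass correction, typically sub-second
  groqCorrect(local).then(fast => display(id, fast)).catch(() => {});

  // Deep path: Gemini context work runs in parallel and never blocks the display
  geminiAnalyze(text, id).catch(() => {});        // summary, Q&A, session learning
}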

Lessons Learned

Week 1: "Just send everything to Gemini"

Naive approach. System died within 2 minutes. Every sentence triggered an API call. 429 errors cascaded into total system paralysis.

Week 2: "Add exponential backoff"

Helped prevent crashes but introduced 10-30 second correction delays. Users saw stale text. Unacceptable for real-time use.
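
For reference, the Week 2 retry wrapper looked roughly like this (a simplified sketch, not the production code):

// Simplified sketch: retry 429s with exponential backoff (1s, 2s, 4s, 8s...)
async function callWithBackoff(fn, maxRetries = 4) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err.status !== 429 || attempt === maxRetries) throw err;
      await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
}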

Week 3: "Local dictionary + interval throttling"

Major breakthrough. 60% fewer API calls. But priority conflicts remained—summaries would block corrections during peak speech.

Week 4: "Priority queue + AI triage + background correction"

The final architecture. Zero 429 errors. Smooth real-time experience. Users couldn't tell the difference between local and AI corrections.

Advice for Developers Building Real-Time AI Applications

  1. Never trust "unlimited" API tiers. Even paid plans have burst limits. Design for the worst case from day one. If your app can generate more than 10 requests/minute, you need a queue system.
  2. Local-first is not a compromise—it's a feature. A 0ms local correction that's 80% accurate is infinitely better than a 5-second AI correction that's 99% accurate. Users notice latency more than perfection.
  3. Prioritize ruthlessly. Not all AI tasks are equal. A user's direct question should never wait behind a routine background task. Implement priority levels and per-type rate limits.
  4. Make AI corrections non-blocking. Show the best available result immediately, then polish in the background. The "shimmer" effect when AI improves text is actually delightful, not jarring.
  5. Build a triage system. 30-50% of your inputs probably don't need AI at all. A simple regex or heuristic check can save enormous API costs and improve perceived speed.
  6. Monitor relentlessly. We built an admin dashboard that shows real-time queue depth, API processing status, and error rates. Without it, we'd be debugging blind. You should too.
  7. Consider multi-model architectures. Different AI models have different strengths and rate limits. Using a fast model for latency-critical tasks and a smart model for deep analysis is often better than trying to make one model do everything.
  8. Session learning compounds. Every AI correction that gets cached locally means one less future API call. After 10 minutes, our session dictionary handles 60%+ of corrections without any network request.

Try It Yourself

VORA is a free, no-signup AI meeting assistant. Experience the architecture described in this post firsthand: Launch VORA. All processing happens in your browser—your meeting data never touches our servers.