Dev Log

Building the Priority Queue: How We Stopped Gemini API Chaos — and Why the First Two Designs Both Failed

2026.02.10 · VORA Team · 13 min read

The commit history for VORA's Gemini API integration is a graveyard of approaches that seemed correct and weren't. You can read the arc just from the commit messages: "integrate text correction," "stabilize priority queue," "resolve conflicts between parallel and queue," "stable priority queue edition." Each one represents a full redesign of the system. By version 4 (which is what ships today), we had learned enough to get it right. Here is the detailed account of what broke and why.

The Parallel Architecture That Created Race Conditions

The original Gemini API integration used a parallel architecture. Three types of tasks needed Gemini API access: CORRECTION (fix speech-to-text errors in real time), QA (answer user questions about the meeting), and SUMMARY (periodically summarize what's been discussed). The first implementation fired all three concurrently — each type had its own fetch call, its own rate limit tracking, and they ran independently.

This was a disaster. The sequence of events in a typical 10-minute session:

  1. User starts recording. Web Speech API starts returning results every 2-3 seconds.
  2. Each result triggers a CORRECTION request. 3 CORRECTION requests arrive in 10 seconds.
  3. User asks a question. QA request fires immediately.
  4. 20-second timer fires. SUMMARY request fires.
  5. Gemini API returns 429 on the 4th request within the rate limit window.
  6. All three concurrent retry loops start backing off independently. They all retry at slightly different times and create another burst.
  7. The backoff intervals are now incoherent — CORRECTION thinks it can retry in 2s, QA thinks in 5s, SUMMARY thinks in 3s. Another burst. Another 429.

Within 2 minutes, the system was stuck in a 429 loop that cleared only after several minutes of complete inactivity. CORRECTION output was frozen. Question answers were delayed 3-4 minutes. The meeting summary was perpetually pending.

// The broken parallel design (v1):
// Three independent fetch calls, no coordination
async correctText(text) {
    const res = await fetch(GEMINI_URL, { method: 'POST', body: JSON.stringify(correctionPayload) });
    // Rate limit tracked separately from QA and SUMMARY
    return res.json();
}
async generateAnswer(question) {
    const res = await fetch(GEMINI_URL, { method: 'POST', body: JSON.stringify(qaPayload) });
    // Completely unaware of CORRECTION requests
    return res.json();
}
async generateSummary() {
    const res = await fetch(GEMINI_URL, { method: 'POST', body: JSON.stringify(summaryPayload) });
    // Also completely unaware
    return res.json();
}

The Mutex Approach: Better, But Still Wrong

Version 2 introduced a global mutex: a JavaScript boolean flag, isApiLocked, that each request checked before proceeding. If the flag was set, the request waited. This prevented concurrent requests and eliminated the burst problem, but it introduced a new failure mode: priority inversion.
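
Here is a minimal sketch of that v2 approach, reconstructed from the description above. The wrapper name withApiLock is hypothetical; only the isApiLocked flag comes from the actual code.

// v2 sketch (illustrative, not the shipped code): a boolean "mutex"
// that serializes every Gemini call.
let isApiLocked = false;

async function withApiLock(taskFn) {
    // Poll until the flag is free; every caller waits its turn,
    // no matter how urgent the request is.
    while (isApiLocked) {
        await new Promise(resolve => setTimeout(resolve, 100));
    }
    isApiLocked = true;
    try {
        return await taskFn();  // correctText, generateAnswer, or generateSummary
    } finally {
        isApiLocked = false;
    }
}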

Consider this scenario: A CORRECTION task starts executing (acquires the lock). While it's running (Gemini response time: ~1-2 seconds), the user asks an urgent question. QA needs to respond immediately for good UX. But QA is waiting behind the CORRECTION lock. Then a SUMMARY fires and queues up behind QA. Then more CORRECTION tasks arrive.

The user asks a question and waits 8 seconds for an answer because three speech correction tasks and a summary all queued ahead of it. From the user's perspective, the AI assistant just ignored their question for 8 seconds. Terrible experience.

The commit "fix: integrate text correction and stabilize priority queue" was supposed to fix this. It helped with the 429 issue but made the priority inversion worse because now everything was serialized.

Version 3: Priority Queue with a Bug We Didn't Catch for Days

The right answer was clearly a priority queue: tasks with higher urgency should preempt lower-priority tasks. We defined three priorities: QA (0, highest), CORRECTION (1), SUMMARY (2, lowest). A proper heap-based priority queue would ensure QA always processes before CORRECTION, CORRECTION before SUMMARY.

We implemented it. It worked in happy-path testing. Then we caught a subtle bug during stress testing: the queue was sorted on insertion, which was correct. But when a new high-priority task arrived while the queue processor was mid-execution of a lower-priority task, it joined the sorted queue — but the current task was not preempted. So a QA task could still wait behind a CORRECTION that was already in-flight at the moment the QA arrived.
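
A sketch of the v3 behavior, with hypothetical names, shows why an in-flight task was never preempted: enqueue keeps the queue correctly sorted, but processNext always lets the current request finish.

// v3 sketch (illustrative): insertion keeps the queue sorted by
// priority, then timestamp, but nothing interrupts the task that is
// already awaiting its fetch.
const queue = [];
let processing = false;

function enqueue(task) {          // task = { priority, timestamp, run }
    queue.push(task);
    queue.sort((a, b) => a.priority - b.priority || a.timestamp - b.timestamp);
    processNext();
}

async function processNext() {
    if (processing || queue.length === 0) return;
    processing = true;
    const task = queue.shift();   // highest-priority waiting task
    try {
        await task.run();         // a QA arriving now still waits for this await
    } finally {
        processing = false;
        processNext();
    }
}

Note that nothing here ever ages or reserves a slot for SUMMARY, which is why it starves under constant CORRECTION and QA traffic.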

More problematically: the SUMMARY task was being starved. With constant speech (lots of CORRECTION tasks) and frequent user questions (QA tasks), SUMMARY never reached the front of the queue. The meeting would run for 20 minutes with no summary generated because CORRECTION and QA tasks kept arriving faster than SUMMARY could be processed.

The Solution: Separate Throttles per Type + Single Queue

The final design, which is what GeminiAPI v4.6 implements today, combines the queue with per-type throttle intervals and a global minimum gap. Here's the architecture that actually works:

// GeminiAPI v4.6 - The design that stuck
this.minIntervals = {
    CORRECTION: 5000,  // Correct at most once per 5s per utterance
    QA: 500,           // Near-instant for user questions
    SUMMARY: 10000,    // Rate-limit summary API calls
    GLOBAL: 1500       // Never fire two requests within 1.5s total
};

// Priority: QA=0 > CORRECTION=1 > SUMMARY=2
// Single queue, sorted by priority then timestamp
// Each task checks both global AND type-specific cooldown before executing

The per-type throttle solves the SUMMARY starvation problem differently: SUMMARY doesn't need to win queue priority battles; it just needs guaranteed execution every 60 seconds (the summary timer). Because the type-specific minimum interval stays at 10s (to protect RPM), SUMMARY requests don't spam the queue, but they do eventually get their turn in the brief windows between CORRECTION tasks when no higher-priority items are pending.

The 1.5-second global minimum interval was a key insight from the 429 debugging. Even if CORRECTION and QA are technically within their own type-limits, firing two requests 100ms apart triggers rate limiting in practice because Gemini's free tier RPM counter is measured over short rolling windows. The global gap creates enough spacing that burst effects stop occurring.
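
A sketch of what that combined check might look like, assuming the queue processor tracks a last-call timestamp per type plus one global timestamp. The function names canExecute and markExecuted are hypothetical; the interval values come from the v4.6 config above.

// Cooldown check sketch (illustrative standalone version of the idea).
const minIntervals = { CORRECTION: 5000, QA: 500, SUMMARY: 10000, GLOBAL: 1500 };
const lastCallAt   = { CORRECTION: 0, QA: 0, SUMMARY: 0, GLOBAL: 0 };

function canExecute(type, now = Date.now()) {
    const typeReady   = now - lastCallAt[type]  >= minIntervals[type];
    const globalReady = now - lastCallAt.GLOBAL >= minIntervals.GLOBAL;
    return typeReady && globalReady;   // both cooldowns must have elapsed
}

function markExecuted(type, now = Date.now()) {
    lastCallAt[type]  = now;   // per-type cooldown (5s / 0.5s / 10s)
    lastCallAt.GLOBAL = now;   // the 1.5s gap shared by every request type
}

// The queue processor only dequeues the front task when canExecute(task.type)
// returns true; otherwise it re-checks after a short delay.

In this shape, a 60-second timer can enqueue SUMMARY tasks unconditionally; the 10s type interval and the 1.5s global gap then decide when each one actually fires.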

The Non-Blocking Background Correction Pattern

There's one more piece that made the UX acceptable: decoupling AI correction from the transcript display. In earlier versions, we waited for the Gemini correction before adding text to the transcript. This meant the user saw a 1-2 second gap between speaking and seeing their words appear. In a meeting context, that gap is deeply uncomfortable — it suggests the system is slow or broken.

The breakthrough was separating local correction (instant, from a local dictionary lookup) from AI correction (asynchronous, from Gemini). The flow became:

  1. Web Speech API returns text. Immediately apply local dictionary corrections (0ms).
  2. Display text immediately in transcript.
  3. Queue AI correction in background.
  4. When Gemini responds (1-3 seconds later), quietly update the transcript entry with the improved text, adding a ✨ badge.

The user sees their speech appear instantly (local correction), and a few seconds later the text may quietly improve. This is dramatically better UX than waiting. The onCorrected callback pattern and the data-id attribute on transcript items enable this targeted DOM update without re-rendering the whole transcript.
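
A sketch of that flow, assuming a transcript renderer that tags each entry with data-id. The helpers localDictionaryFix, addTranscriptEntry, and geminiQueue.enqueue are hypothetical stand-ins; the onCorrected callback is the real hook described above.

// Background-correction sketch (helper names are illustrative).
function handleSpeechResult(rawText) {
    const entryId  = crypto.randomUUID();          // browser API for a unique id
    const quickText = localDictionaryFix(rawText);  // 0ms, local dictionary lookup

    // Steps 1-2: show the locally corrected text immediately.
    addTranscriptEntry(entryId, quickText);

    // Step 3: queue the AI correction in the background (priority CORRECTION=1).
    geminiQueue.enqueue({
        type: 'CORRECTION',
        run: () => correctText(quickText),
        onCorrected: (improvedText) => {
            // Step 4: quietly swap in the improved text when Gemini responds.
            const el = document.querySelector(`[data-id="${entryId}"]`);
            if (el) {
                el.textContent = improvedText;
                el.dataset.aiCorrected = 'true';   // drives the ✨ badge
            }
        }
    });
}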

Measuring the Improvement

Before the priority queue: 40-50+ Gemini API calls per minute in a typical 30-minute session, resulting in consistent 429 errors after the first 2 minutes. After the priority queue: 8-12 API calls per minute, zero 429 errors in sessions up to 2 hours. The number of calls dropped by 75% not because we're correcting less, but because we eliminated the redundant burst requests and the retry storms they created.

The version history tells the story in its own way. GeminiAPI.js went from "v1.0" to "v4.6 (Stable Priority Queue Edition)" across those iterations. Each major version bump was a full redesign of the request management architecture; the patch versions leading up to .6 were spent tuning the specific interval values against real session data. The fact that it's stable enough to carry "Stable" in its name is the outcome of a lot of breaking it first.