Adding a second AI provider to an application that already has one isn't an obvious decision. It adds complexity: two API keys to manage, two rate limit systems to track, two sets of API-specific quirks to work around. We did it anyway, because Gemini — excellent as it is for summarization and Q&A — has a characteristic latency profile that doesn't suit real-time speech correction. Here's the full story of the Groq integration, including the 400 error that broke it on day one and what we did about it.
The Problem with Using Gemini for Everything
After solving the 429 rate limit problem (see our dedicated post on the priority queue), VORA's Gemini integration was stable. But stability revealed a new issue: latency asymmetry. Different task types have different latency requirements, and a single model can't serve them all optimally:
- CORRECTION: Needs to complete within 1-2 seconds to feel real-time. Gemini Flash achieves this about 70% of the time, but occasionally takes 3-5 seconds under load. For real-time correction, occasional 5-second delays are very noticeable.
- QA: Users can tolerate 2-3 seconds for a question answer. No problem here.
- SUMMARY: Background operation, latency doesn't matter to the user at all.
The correction use case specifically needs a model optimized for speed over depth. Gemini Flash is a great generalist. But Groq's Llama 3.3 70B, running on Groq's custom LPU (Language Processing Unit) hardware, delivers consistently faster inference for short prompts: in our measurements, correction responses typically come back in well under a second.
The Architecture Decision: Optional Second Model
We didn't want to force users into managing two API keys if they didn't need to. The integration was designed as opt-in: a "Fast Correction Mode" toggle in settings that, when enabled, routes CORRECTION tasks to Groq and leaves QA and SUMMARY on Gemini. If Groq fails or isn't configured, everything transparently falls back to Gemini.
The TextCorrector module already had an abstraction layer: the aiCorrect() method, which calls either _grokCorrect() or _geminiCorrectOriginal() based on the useGrokForCorrection flag. Adding the second model was mostly a matter of writing a clean GrokAPI class that mirrored the interface the corrector already used for Gemini, and wiring up the UI toggle.
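In rough form, the dispatch looks something like this (a simplified sketch; the exact guard on the Groq client, shown here as an isConfigured check, is our paraphrase rather than the literal code):

// Inside TextCorrector: route CORRECTION to Groq only when Fast Correction
// Mode is on and a Groq key is configured; everything else stays on Gemini.
async aiCorrect(text) {
  if (this.useGrokForCorrection && this.grokAPI && this.grokAPI.isConfigured) {
    return this._grokCorrect(text);           // fast path: Groq Llama 3.3 70B
  }
  return this._geminiCorrectOriginal(text);   // default path: Gemini Flash
}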
Day One: The 400 Error
We deployed the Groq integration. Within minutes of the first real test session: HTTP 400 from api.groq.com/openai/v1/chat/completions. Consistent, every request. The error message from the API wasn't helpful: just "Bad Request."
The GrokAPI class was using model ID llama-3.3-70b. This is where we made a critical assumption error: we inferred the model name from Groq's documentation summary and assumed the API model identifier would be a straightforward version number without a variant suffix.
It doesn't work that way. The Groq API requires the full model identifier including the deployment variant: llama-3.3-70b-versatile for the general-purpose 128K-context deployment. Omitting -versatile causes a 400 because the identifier doesn't match any model Groq actually serves. Checking the actual Groq model list at console.groq.com (which we should have done before hardcoding the name) confirms the correct identifiers.
// The bug:
this.model = 'llama-3.3-70b'; // Does not exist as a Groq model ID
// The fix:
this.model = 'llama-3.3-70b-versatile'; // Correct Groq model identifier (128K ctx)
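A cheap way to avoid this class of bug is to validate the hardcoded identifier against the models the API actually serves. Groq's API is OpenAI-compatible, so a model-listing call along these lines should work (treat the exact endpoint path and response shape as assumptions based on the OpenAI convention):

// Sanity-check a hardcoded model ID against Groq's live model list.
// Assumes the OpenAI-style GET /openai/v1/models endpoint and response shape.
async function assertGroqModelExists(apiKey, modelId) {
  const res = await fetch('https://api.groq.com/openai/v1/models', {
    headers: { 'Authorization': `Bearer ${apiKey}` }
  });
  if (!res.ok) throw new Error(`Model list request failed: HTTP ${res.status}`);
  const { data } = await res.json(); // OpenAI-style: { data: [{ id, ... }, ...] }
  if (!data.some(m => m.id === modelId)) {
    throw new Error(`"${modelId}" is not in Groq's model list`);
  }
}

// e.g. await assertGroqModelExists(key, 'llama-3.3-70b-versatile');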
There was a secondary discovery while debugging: Groq's API follows the OpenAI-compatible format, which means temperature, max_tokens, and standard message roles all work exactly as documented. Unlike some other LLM APIs that restrict parameters for certain model types, Groq's Llama endpoints accept the full standard parameter set, which simplified the integration considerably.
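For illustration, here is roughly what a correction request to Groq looks like in that format. The system prompt, temperature, and token limit are placeholders rather than VORA's actual values; the endpoint and body structure are the standard OpenAI-compatible ones:

// Minimal OpenAI-compatible chat completion against Groq.
// Prompt wording and parameter values below are illustrative only.
async function groqCorrect(apiKey, text) {
  const res = await fetch('https://api.groq.com/openai/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${apiKey}`
    },
    body: JSON.stringify({
      model: 'llama-3.3-70b-versatile',
      messages: [
        { role: 'system', content: 'Fix transcription errors in the user text. Return only the corrected text.' },
        { role: 'user', content: text }
      ],
      temperature: 0.2,  // low temperature: corrections should be deterministic
      max_tokens: 512
    })
  });
  if (!res.ok) throw new Error(`Groq request failed: HTTP ${res.status}`);
  const json = await res.json();
  return json.choices[0].message.content;
}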
The Fallback Pattern That Made This Safe to Ship
Even after fixing the model name issue, we had a concern: what happens when a user has Groq configured but the Groq API has issues (rate limits, service disruption, network problems)? The correction pipeline silently breaks if the fallback isn't solid.
The implementation in TextCorrector._grokCorrect() has explicit fallback handling:
async _grokCorrect(text) {
  try {
    const result = await this.grokAPI.correctText(text, prompt);
    if (result) { /* process and return */ }
    return { text: text, isQuestion: false }; // empty result fallback
  } catch (error) {
    console.warn('[Groq] Correction failed, falling back to Gemini:', error.message);
    // Fallback to Gemini if Groq fails
    if (this.geminiAPI && this.geminiAPI.isConfigured) {
      return this._geminiCorrectOriginal(text);
    }
    return { text: text, isQuestion: false }; // worst case: no correction
  }
}
The fallback chain is: Groq → Gemini (if configured) → original text (no correction). If Groq is down, users might see slower corrections, or in the worst case uncorrected text, but they never see a broken state. The admin monitoring dashboard shows Groq stats (total requests, errors, avg latency) so we can detect when the fallback is activating frequently.
What Dual-AI Actually Unlocks
With Groq handling correction and Gemini handling depth tasks, the priority queue in GeminiAPI is much less congested. CORRECTION tasks, which were the majority of queue items, are now routed to Groq's own request queue. Gemini's queue sees mostly QA and SUMMARY tasks. The practical effect: question answers arrive faster because they're not waiting behind correction tasks.
The Groq correction latency in practice: median ~650ms, 95th percentile ~1.2s. Gemini correction latency: median ~900ms, 95th percentile ~2.8s. For real-time correction of continuous speech, the Groq path gives users a meaningfully more responsive experience.
The Admin Dashboard: Monitoring Dual-AI in Production
One of the more interesting features that emerged from the dual-AI architecture was the expanded admin dashboard. Originally the admin page monitored Gemini queue state. With Groq added, we extended the heartbeat system (which broadcasts to a BroadcastChannel from the app tab) to include Groq stats: grokEnabled, grokStats.totalRequests, grokStats.totalErrors, grokStats.avgLatency.
The admin page can open in a separate browser tab and see the live state of both AI systems. In a long meeting session, you can watch the Groq correction count climb, the Gemini QA count climb, and see whether the error rates are acceptable. This was built for internal monitoring but turns out to be genuinely useful for debugging user-reported issues.
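The transport is the standard BroadcastChannel API. A trimmed sketch of both ends follows; the channel name, interval, and helper/object names (textCorrector, getStats(), renderDashboard()) are illustrative, while the grok* fields mirror the ones listed above:

// App tab: periodically broadcast a heartbeat with stats from both AI clients.
const heartbeat = new BroadcastChannel('vora-admin');
setInterval(() => {
  heartbeat.postMessage({
    timestamp: Date.now(),
    grokEnabled: textCorrector.useGrokForCorrection,
    grokStats: grokAPI.getStats(),     // { totalRequests, totalErrors, avgLatency }
    geminiStats: geminiAPI.getStats()  // queue length, request counts, etc.
  });
}, 2000);

// Admin tab: subscribe to the same channel and render whatever arrives.
new BroadcastChannel('vora-admin').onmessage = (event) => renderDashboard(event.data);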
Where the Dual-AI Approach Falls Short
Honest assessment: for most users on a free Gemini API key, Fast Correction Mode with Groq is probably not worth the friction of obtaining and configuring a second API key. The improvement in correction latency is real but subtle — most users won't notice the difference between 900ms and 650ms correction time. The feature is most valuable for power users doing long multi-hour meetings where the performance difference compounds.
The Groq API also has its own rate limits on the free tier (~30 requests/min), which we're still profiling for long sessions. The current implementation doesn't coordinate rate limiting between Groq and Gemini — if Groq gets rate limited, it falls back to Gemini, which means Gemini's CORRECTION quota suddenly gets used up. A more sophisticated implementation would track the combined correction rate and choose between models based on which has more headroom. That's a future enhancement.
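To make the idea concrete, a headroom-aware router might look something like the sketch below. This is purely hypothetical future work, not shipped code, and the per-minute limits are illustrative (only Groq's ~30 requests/min figure comes from above):

// Hypothetical headroom-aware correction router (not implemented in VORA).
// Tracks per-provider request timestamps over the last minute and routes each
// CORRECTION task to whichever provider has more remaining quota.
class CorrectionRouter {
  constructor(limitsPerMinute = { groq: 30, gemini: 15 }) { // gemini limit is illustrative
    this.limits = limitsPerMinute;
    this.history = { groq: [], gemini: [] };
  }

  _headroom(provider) {
    const cutoff = Date.now() - 60_000;
    this.history[provider] = this.history[provider].filter(t => t > cutoff);
    return this.limits[provider] - this.history[provider].length;
  }

  pick() {
    const provider = this._headroom('groq') >= this._headroom('gemini') ? 'groq' : 'gemini';
    this.history[provider].push(Date.now());
    return provider;
  }
}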
The broader lesson from this integration: the complexity of a two-model architecture is only justified when the two models have genuinely different strengths that map to different tasks in your application. Hardware-accelerated speed for short prompts (Groq/Llama) versus deep reasoning with long context (Gemini) is a meaningful differentiation. If you're just using two models as hot-standbys for each other, the complexity isn't worth it — use a single model with good retry logic instead.