
The Whisper WASM Experiment: Why Running AI in the Browser Is Harder Than It Looks

2026. 02. 12 · VORA Team · 14 min read

The dream of running a large neural network model entirely in the browser — no server, no API call, complete privacy, full offline capability — is genuinely compelling. It's the kind of thing that looks clean on a conference slide. Then you actually try to implement it, and you discover that the browser is a hostile environment for heavy computation in ways that don't become apparent until you're deep in the weeds.

This is the documented story of our Whisper browser integration experiments: what we tried, what the specific failure modes were, what actually worked for different use cases, and why the beta feature in our Labs section is labeled "experimental" rather than "production."

Why Whisper in the Browser Seems Reasonable

OpenAI Whisper is a transformer-based speech recognition model. The architecture is relatively straightforward: an audio encoder processes mel spectrogram features, and a text decoder generates tokens autoregressively. The model weights come in several sizes: Tiny (39M params), Base (74M), Small (244M), Medium (769M), Large (1.5B), and the more recent Large-v3-Turbo (809M, optimized).

Running neural networks in the browser has become increasingly feasible. WebAssembly (WASM) provides near-native execution speed for compute-intensive tasks. The ONNX Runtime Web library can run exported ONNX models in a browser. Projects like Whisper.cpp have optimized CPU inference for Whisper specifically. Transformers.js from Hugging Face runs entire inference pipelines in-browser. The theoretical foundation is solid.

We had Whisper running in the browser within a few days of starting the experiment. But "running" and "production-ready" turned out to be very different things.
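To show why the happy path looks so easy, here is roughly what that first in-browser transcription can look like, as a minimal sketch using Transformers.js. The model id (Xenova/whisper-tiny) and the options are illustrative, not our exact experimental configuration.

```typescript
// Minimal happy-path sketch with Transformers.js. Model id and options are
// illustrative placeholders, not the exact setup from our experiment.
import { pipeline } from "@xenova/transformers";

async function transcribeDemo(audioUrl: string): Promise<string> {
  // First call downloads the model weights and caches them in the browser.
  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-tiny"
  );

  // Accepts a URL (or a Float32Array of 16 kHz mono samples).
  const output = await transcriber(audioUrl, {
    language: "korean",
    task: "transcribe",
  });

  return (output as { text: string }).text;
}
```

A few dozen lines like this really do produce transcripts; everything that follows is about why that demo doesn't survive contact with production constraints.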

Experiment 1: ONNX Runtime Web with Whisper Turbo

Our first attempt used ONNX Runtime Web with the Whisper Large-v3-Turbo model exported to ONNX format. The model file is approximately 1.6GB. Right there, problem number one: a 1.6GB download on first load is completely unacceptable for a web application. Even with caching (using the Cache API or IndexedDB), first-time users face a multi-minute wait before the feature is usable.
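The caching itself is not the hard part. A minimal sketch of the Cache API approach looks something like this; the model path and cache name are placeholders, not our actual asset layout.

```typescript
// Sketch: fetch the ONNX model once and keep it in the Cache API so repeat
// visits skip the huge download. URL and cache name are hypothetical.
const MODEL_URL = "/models/whisper-base-encoder.onnx"; // placeholder path
const CACHE_NAME = "whisper-models-v1";

async function loadModelBytes(): Promise<ArrayBuffer> {
  const cache = await caches.open(CACHE_NAME);

  let response = await cache.match(MODEL_URL);
  if (!response) {
    // First visit: download, then store a copy before returning it.
    response = await fetch(MODEL_URL);
    if (!response.ok) throw new Error(`Model download failed: ${response.status}`);
    await cache.put(MODEL_URL, response.clone());
  }
  return response.arrayBuffer();
}

// The resulting bytes can then be handed to ONNX Runtime Web, e.g.
// ort.InferenceSession.create(new Uint8Array(modelBytes)).
```

Caching solves the second visit. It does nothing for the first one, which is where users decide whether the feature is worth waiting for.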

We tried the Tiny and Base models instead. Download sizes: Tiny is ~40MB (acceptable), Base is ~80MB (borderline). Model accuracy with Tiny for Korean: roughly comparable to a mediocre Web Speech API result on a good day, and significantly worse than Web Speech API on a bad day. The whole point of using Whisper was higher accuracy. The accurate models are too large. The fast models aren't accurate enough. This is the fundamental tension.

The CORS rabbit hole: ONNX Runtime Web with threading enabled requires SharedArrayBuffer. SharedArrayBuffer requires the page to be served with specific HTTP headers: Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp. The catch is that require-corp blocks any cross-origin resource that doesn't explicitly opt in via CORS or a Cross-Origin-Resource-Policy header. Our existing CDN-sourced fonts, external scripts, and ad network code all broke. We had to either vendor everything locally or disable the threading that made WASM inference viable.
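Before touching headers at all, it helps to detect at runtime whether the page is actually cross-origin isolated, because without COOP/COEP the SharedArrayBuffer global simply isn't exposed. A small sketch of the kind of gating we ended up with (illustrative, not our exact code):

```typescript
// Sketch: detect whether the page is cross-origin isolated before enabling
// the multi-threaded WASM backend; otherwise fall back to a single thread.
export function threadedWasmAvailable(): boolean {
  const isolated =
    typeof crossOriginIsolated !== "undefined" && crossOriginIsolated;
  const hasSAB = typeof SharedArrayBuffer !== "undefined";
  return isolated && hasSAB;
}

// Usage with ONNX Runtime Web's thread-count knob:
// ort.env.wasm.numThreads = threadedWasmAvailable()
//   ? navigator.hardwareConcurrency
//   : 1;
```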

The CORS issue consumed an entire week. The commit "Fix CORS issues for ONNX Runtime pages" resulted in us vendoring the entire ONNX Runtime Web library locally (adding ~15MB to the repository), creating a custom local server.py that injects the required headers, and updating all affected pages to use local library paths. This worked for local development but added a deployment wrinkle: Cloudflare Pages serves static files without custom HTTP headers by default, so we needed to configure _headers files specifically for the WASM pages.
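For completeness, the Cloudflare Pages _headers entry ends up looking roughly like this; the path pattern is illustrative, not our actual route:

```
/labs/whisper/*
  Cross-Origin-Opener-Policy: same-origin
  Cross-Origin-Embedder-Policy: require-corp
```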

Experiment 2: Sherpa-ONNX for Streaming ASR

Sherpa-ONNX is a production-quality speech recognition framework from the Next-gen Kaldi project. It compiles to WASM and supports streaming inference — meaning you don't need to wait for the user to finish speaking before processing begins. This addresses one of Whisper's core limitations for real-time use: Whisper is fundamentally a batch processor, designed to transcribe complete audio segments.
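The streaming pattern itself is easy to sketch: feed small PCM chunks to the recognizer as the microphone produces them, instead of waiting for the utterance to end. The recognizer interface below is a hypothetical stand-in, not the actual sherpa-onnx WASM API.

```typescript
// Sketch of the streaming pattern. `StreamingRecognizer` is a hypothetical
// interface, not the real sherpa-onnx bindings.
interface StreamingRecognizer {
  acceptWaveform(sampleRate: number, samples: Float32Array): void;
  partialResult(): string;
}

async function startStreaming(recognizer: StreamingRecognizer): Promise<void> {
  const media = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(media);

  // ScriptProcessorNode is deprecated but keeps the sketch self-contained;
  // an AudioWorklet is the production-grade equivalent.
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0);
    recognizer.acceptWaveform(ctx.sampleRate, new Float32Array(samples));
    console.log("partial:", recognizer.partialResult());
  };

  source.connect(processor);
  processor.connect(ctx.destination);
}
```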

We got a Sherpa-ONNX streaming model running. The streaming capability was real and genuinely impressive for latency. But Sherpa-ONNX models for Korean are less mature than the English models, and the WASM bundle size again became an issue. The sherpa-onnx WASM binary is ~8MB, plus the model files. And we were back to the SharedArrayBuffer headers requirement.

Performance Benchmarks: The Numbers That Killed the Dream

After weeks of testing, here are the honest numbers on a mid-range laptop (2021 MacBook Pro M1) and a typical Android phone (Samsung Galaxy S21):

| Model | Platform | First-token latency | Real-time factor | Korean WER (approx.) |
|---|---|---|---|---|
| Whisper Tiny (WASM) | MacBook M1 | 1.2s | 0.4x (faster than audio) | ~32% |
| Whisper Base (WASM) | MacBook M1 | 2.1s | 0.8x | ~24% |
| Whisper Large-v3-Turbo (WASM) | MacBook M1 | 11s | 4.2x (slower than audio) | ~12% |
| Whisper Tiny (WASM) | Galaxy S21 | 3.8s | 1.9x (slower than audio) | ~35% |
| Web Speech API | Any | <0.2s (streaming) | Real-time | ~15-20% |

The table tells the story clearly. On desktop, Whisper Base achieves accuracy comparable to Web Speech API with acceptable latency — but only on desktop, only with threading enabled (the SharedArrayBuffer requirement), and only after the model download. On mobile, even Whisper Tiny is slower than real-time, meaning you'd have to chunk audio and process in batches. The latency for anything accurate enough to be useful is multiple seconds per segment. For real-time transcription in a live meeting, that's not workable.
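The chunk-and-batch fallback looks roughly like the sketch below: record fixed-length segments with MediaRecorder and queue each one for transcription. The transcribeBlob function is a placeholder for whatever backend processes the audio; note also that with most containers only the first chunk carries the header, so in practice the recorder usually has to be restarted per segment or the chunks concatenated before decoding.

```typescript
// Sketch of chunked, batched capture for devices where inference is slower
// than real time. `transcribeBlob` is a placeholder for the actual backend
// (e.g. a Whisper WASM worker). Caveat: raw MediaRecorder chunks after the
// first may not be independently decodable; see the note above.
async function startChunkedCapture(
  transcribeBlob: (chunk: Blob) => Promise<string>,
  chunkMs = 5000
): Promise<MediaRecorder> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);

  let queue: Promise<void> = Promise.resolve();
  recorder.ondataavailable = (event) => {
    if (event.data.size === 0) return;
    // Serialize transcription so segments come back in order.
    queue = queue.then(async () => {
      const text = await transcribeBlob(event.data);
      console.log("chunk transcript:", text);
    });
  };

  recorder.start(chunkMs); // emit a Blob every `chunkMs` milliseconds
  return recorder;
}
```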

The Specific Issue with Korean

Korean transcription in Whisper deserves its own section because the model behavior is counterintuitive. Whisper was trained on internet audio, and Korean audio on the internet skews toward certain types of content: YouTube videos, podcast-style speech, professional narration. Casual meeting speech, with overlapping sentences, filler words, and topic switches mid-sentence, looks very different from that training data.

More critically, Korean technical and business vocabulary is essentially out-of-vocabulary for the default Whisper models. A meeting discussion about "GCP Kubernetes 클러스터 배포" (GCP Kubernetes cluster deployment) would come back as a hallucinated mix of similar-sounding Korean phrases. The Web Speech API, backed by Google's massive Korean speech dataset, handles this dramatically better for technical contexts.

This is also why our TextCorrector module exists: even with Web Speech API, domain-specific terminology needs correction. With Whisper, the problem is more fundamental: it's not just mis-transcribed technical terms but entire sentences that need to be regenerated.
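The simplest form of that correction is a dictionary substitution pass. Here is a minimal sketch of the idea with illustrative mappings; the real TextCorrector module is considerably more involved.

```typescript
// Minimal sketch of dictionary-based post-correction for domain terms.
// The mappings are illustrative examples, not the real correction table.
const TERM_CORRECTIONS: Record<string, string> = {
  "쿠버네티스 클러스타": "Kubernetes 클러스터",
  "지씨피": "GCP",
};

export function correctTerms(transcript: string): string {
  let corrected = transcript;
  for (const [wrong, right] of Object.entries(TERM_CORRECTIONS)) {
    corrected = corrected.split(wrong).join(right); // replace all occurrences
  }
  return corrected;
}
```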

What Whisper in the Browser Is Actually Good For

We didn't conclude that browser-based Whisper is useless. We concluded it's a different tool for different use cases:

- After-the-fact transcription of recorded audio, where a few seconds of latency per segment doesn't matter and the larger models' accuracy pays off.
- Fully offline and maximum-privacy scenarios, where audio must never leave the device and no server or API call is acceptable.
- A background high-accuracy pass that refines a streaming transcript, which is exactly what the hybrid approach described below does.

Where We Landed: Labs as the Honest Experiments Section

The Labs section of VORA exists specifically to contain these experiments in a way that's honest about their status. The Whisper Beta page is clearly labeled experimental. Users who need offline processing or maximum privacy can try it and give us feedback. The main VORA app continues to use Web Speech API because it's the right tool for real-time meeting transcription in 2026.

We built the Hybrid ASR test page to combine approaches: Web Speech API for streaming real-time output, Whisper processing the same audio in the background for a high-accuracy delayed transcript. This combination has interesting properties — you get immediacy from Web Speech API and accuracy from Whisper, with the accurate version replacing the streaming transcript after a few seconds. We're still evaluating whether this UX model works for real users.
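A minimal sketch of that hybrid wiring, assuming a Whisper worker that posts back per-segment transcripts (the message shape is a placeholder, not our actual protocol):

```typescript
// Sketch of the hybrid pattern: stream interim text from the Web Speech API
// immediately, then overwrite each segment once the background Whisper
// worker returns its higher-accuracy transcript.
type SegmentUpdate = { segmentId: number; text: string; final: boolean };

function startHybrid(
  whisperWorker: Worker,
  onUpdate: (update: SegmentUpdate) => void
): void {
  const SpeechRecognitionImpl =
    (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
  const recognition = new SpeechRecognitionImpl();
  recognition.lang = "ko-KR";
  recognition.interimResults = true;
  recognition.continuous = true;

  let segmentId = 0;
  recognition.onresult = (event: any) => {
    const result = event.results[event.results.length - 1];
    // Fast path: show the streaming transcript right away.
    onUpdate({ segmentId, text: result[0].transcript, final: false });
    if (result.isFinal) segmentId++;
  };

  // Slow path: the worker posts back the Whisper transcript for a segment,
  // and we replace the streaming text with it.
  whisperWorker.onmessage = (event: MessageEvent<SegmentUpdate>) => {
    onUpdate({ ...event.data, final: true });
  };

  recognition.start();
}
```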

The honest summary: browser-based large model inference is a technology that's improving rapidly and will likely transform what's possible in 1-2 years, especially as WebGPU matures. Today, for real-time speech recognition, it's not there yet for mobile users or for anyone who needs immediate results. But it's worth keeping in Labs, improving incrementally, and being ready when the hardware and browser capabilities catch up.