Offline Voice Chatbot in the Browser
Building a fully offline voice interface for LLMs using WebAssembly — VAD, speech-to-text, and text-to-speech all running client-side.
I wanted to talk to an LLM with my voice, but I didn’t want to send my audio to anyone’s server. The result is a browser-based voice chatbot where all the speech processing — detection, transcription, and synthesis — runs entirely in your browser via WebAssembly. The only thing that leaves your machine is the text sent to whatever LLM endpoint you point it at.
The Pipeline
The system chains three client-side models (plus a round-trip to the LLM) into a continuous voice loop:
- Silero VAD (via vad-web) listens to the microphone and detects when you’re actually speaking, filtering out silence and background noise.
- Moonshine takes the detected speech and transcribes it to text. There are two model sizes — Tiny for speed, Base for accuracy — both running as ONNX models in the browser.
- The transcribed text gets sent to any OpenAI-compatible API endpoint (local or remote), and the response comes back as text.
- Piper synthesizes the LLM’s response into speech, again entirely client-side using a modified vits-web inference engine.
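The loop above can be sketched as a small orchestrator that chains the stages together. This is an illustrative shape, not the project's actual code; `transcribe`, `complete`, and `synthesize` are hypothetical stand-ins for the Moonshine, LLM, and Piper calls:

```javascript
// One conversational turn: captured speech in, reply audio out.
// Each stage is injected, so the voice layer stays backend-agnostic:
//   transcribe(audio)  -> text   (Moonshine, in-browser)
//   complete(text)     -> text   (any OpenAI-compatible endpoint)
//   synthesize(text)   -> audio  (Piper, in-browser)
async function runTurn(audio, { transcribe, complete, synthesize }) {
  const userText = await transcribe(audio);       // speech -> text
  const replyText = await complete(userText);     // text -> LLM reply
  const replyAudio = await synthesize(replyText); // reply -> speech
  return { userText, replyText, replyAudio };
}
```

In the real pipeline, the VAD's "speech ended" callback is what would invoke a turn like this with the captured audio segment, then play the resulting audio and go back to listening.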
The whole thing runs as a static site served from a simple HTTP server. No build step, no bundler, just HTML and JavaScript.
Why Browser-Side?
Partly principle, partly practicality. I wanted the voice processing to stay local because microphone audio is about as personal as data gets. But there’s also a practical angle — by keeping VAD and STT in the browser, the system works with any LLM backend. Swap in a local llama.cpp server, a cloud API, or anything that speaks the OpenAI chat format. The voice layer doesn’t care.
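"Speaks the OpenAI chat format" boils down to POSTing a JSON body with a `model` name and a `messages` array. A minimal sketch, assuming nothing about the project's internals; the endpoint URL and model name below are placeholders:

```javascript
// Build a chat-completions request body from the running transcript.
// `history` is an array of { role, content } turns; the new user
// utterance (fresh from the STT stage) is appended last.
function buildChatRequest(history, userText, model = 'local-model') {
  return {
    model,
    messages: [...history, { role: 'user', content: userText }],
  };
}

// The voice layer doesn't care where this goes: a local llama.cpp
// server or a cloud API, as long as it accepts this shape, e.g.:
//   fetch('http://localhost:8080/v1/chat/completions', {
//     method: 'POST',
//     headers: { 'Content-Type': 'application/json' },
//     body: JSON.stringify(buildChatRequest(history, text)),
//   });
```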
It also makes for a surprisingly good testbed. Adding a new TTS voice is just dropping a Piper model file into a folder. Switching STT models is a dropdown. The whole thing is designed to be pulled apart and reassembled for whatever voice-enabled application you want to build next.
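"Dropping a Piper model file into a folder" can be as simple as a static manifest the app reads at startup. The voice IDs and filenames below are made up for illustration, not the project's real layout:

```javascript
// Hypothetical voice manifest: each entry points at a Piper model
// file served alongside the rest of the static site.
const VOICES = {
  'en-default': { model: 'voices/en_US-example-medium.onnx' },
  'en-alt':     { model: 'voices/en_US-other-low.onnx' },
};

// Resolve a dropdown selection to a model path, with a fallback
// so an unknown or stale selection still produces speech.
function resolveVoice(id, voices = VOICES, fallback = 'en-default') {
  return (voices[id] ?? voices[fallback]).model;
}
```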
Rough Edges
Moonshine’s ONNX models aren’t tiny — the browser has to download and initialize them on first load. And WebAssembly inference is never going to match native speed, though on a modern machine it’s fast enough to feel responsive. The real bottleneck in practice is the LLM response time, not the voice processing.
Piper’s voice quality is decent but noticeably synthetic compared to something like Fish Audio S2-Pro. That’s the tradeoff for running TTS in a browser tab with no GPU.
What It’s For
This was always meant as a launching pad more than a finished product. The core question was: can you build a complete voice-to-voice AI interface that runs offline in a browser? The answer is yes, and the latency is surprisingly tolerable. From there it becomes a platform for experimenting with voice-controlled applications, accessibility tools, or just having a hands-free conversation with a local model.