A Fully Local, In-Browser Voice Assistant
Building a private, browser-based voice assistant using WebAssembly, Moonshine STT, Piper TTS, and local LLMs.
A while back, I demonstrated a proof of concept for a voice assistant that used a Voice Activity Detection (VAD) model to detect human speech, paired with Groq’s hosted Whisper and LLM endpoints. I used Groq because it was incredibly fast, and if you aren’t running Whisper locally, you save a ton of compute resources.
However, following the recent work I’ve been doing with my Minecraft bot, I found myself getting increasingly unhappy with the idea of having to send not just my voice, but all of my system’s outputs over to a third party.
So, I decided to cook up something completely local. This is essentially “ChatGPT Voice at Home”—but running almost entirely inside a single web page without any external dependencies.
How It Works
The architecture relies heavily on WebAssembly to push what are normally backend AI tasks directly into the client’s browser. There is practically no backend server here—the server is literally just serving up a static index.html file.
Here is how the pipeline breaks down:
- Voice Activity Detection (VAD): The browser listens to the microphone but doesn’t start recording until it detects actual human speech. Once it stops hearing speech, it cuts the recording. (I still need to tweak the thresholds so they better accommodate my lazy-tongue accent, but it works well.)
- Speech-to-Text: The audio is then transcribed directly in the browser using Moonshine. Moonshine is a super lightweight speech-to-text model designed for low-power devices, and because it’s running in WebAssembly, the audio never leaves the browser page.
- The Brains: The transcribed text is sent to a local LLM. I’m currently using LM Studio to host a small 3-billion parameter model locally on my machine.
- Text-to-Speech: Finally, the LLM’s text response is converted back into audio using Piper. I previously used Piper for my Minecraft bot, but it turns out someone compiled it into WebAssembly! So, just like the transcription, the speech generation happens entirely locally in the browser.
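The VAD step in the pipeline above is essentially a little state machine: open a segment when the model's speech probability rises past an onset threshold, and close it only after a run of silent frames. Here's a minimal sketch of that segmenting logic; the frame probabilities would come from a VAD model, and the threshold and frame-count values are illustrative assumptions, not the ones this project actually uses.

```javascript
// Minimal hangover-style VAD segmenter (illustrative sketch).
// frameProbs: per-frame speech probabilities from a VAD model.
// The threshold/frame-count defaults below are assumptions for illustration.
function segmentSpeech(frameProbs, {
  onsetThreshold = 0.6,   // probability above which a frame starts a segment
  offsetThreshold = 0.4,  // probability below which a frame counts as silence
  minSilenceFrames = 8,   // consecutive silent frames that end a segment
} = {}) {
  const segments = [];
  let start = null;   // frame index where the current segment began
  let silentRun = 0;  // consecutive silent frames seen inside a segment

  frameProbs.forEach((p, i) => {
    if (start === null) {
      if (p >= onsetThreshold) { start = i; silentRun = 0; }
    } else if (p < offsetThreshold) {
      silentRun += 1;
      if (silentRun >= minSilenceFrames) {
        segments.push([start, i - silentRun + 1]); // trim trailing silence
        start = null;
      }
    } else {
      silentRun = 0; // speech resumed before the hangover expired
    }
  });
  if (start !== null) segments.push([start, frameProbs.length]);
  return segments;
}
```

Raising `minSilenceFrames` is what makes the recorder more forgiving of mid-sentence pauses, which is exactly the kind of threshold tweaking mentioned above.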
The Results
I’m incredibly happy with how this turned out. To recap: the VAD, the speech-to-text transcription, and the text-to-speech generation are all running entirely inside the browser via WebAssembly.
Because it’s just a static HTML file, I could theoretically throw this up on GitHub Pages or any static hosting site right now and serve it out to the world. The only thing you’d have to change is the API endpoint, pointing it at your own local LLM instance.
There is a bit of overhead—running Piper’s text-to-speech inside the browser takes about five and a half seconds to generate an audio response, whereas running it natively on my machine takes less than a second. But the trade-off for a completely zero-dependency, private, browser-based voice assistant is pretty amazing.
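Those latency numbers are easy to collect with a tiny timing wrapper, since `performance.now()` is available both in browsers and in Node. The `synthesize` argument here stands in for a call into the Piper WASM module; that call is hypothetical, as the exact API depends on which Piper WebAssembly build you use.

```javascript
// Tiny helper for comparing TTS latency between environments
// (e.g. Piper-in-WASM in the browser vs. Piper running natively).
// `synthesize` is a placeholder for the actual TTS call, which is
// hypothetical here and depends on the specific Piper build.
async function timeIt(label, synthesize) {
  const start = performance.now();
  const result = await synthesize();
  const elapsedMs = performance.now() - start;
  console.log(`${label}: ${elapsedMs.toFixed(1)} ms`);
  return { result, elapsedMs };
}
```

Wrapping the in-browser and native paths with the same helper makes the overhead comparison apples-to-apples, and gives a concrete number to watch while optimizing the WebAssembly path.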
Looking forward to tinkering with this further and seeing how much I can optimize that WebAssembly performance.