Everyone kept saying 2025 was supposed to be the “year of agents.” But honestly, I have absolutely no interest in building an agent that goes off and tries to file my taxes. I just wanted to build natural-language voice control for my desktop.

Why? Because sometimes I’m sitting at my computer with my dog in my arms, and I don’t feel like trying to mess with my split keyboard just to pause a video or launch Firefox. That’s the actual niche use case I am trying to fulfill here.

So for the last week, I’ve been working on this local agent platform.

The UI and Core Stack

The application itself is built in Qt6, giving me a nice system tray icon. But the actual user interface is a web page rendered through Qt WebEngine, the framework’s embedded Chromium browser component. I do this because iterating on a web-based UI is infinitely faster and easier than building native Qt interfaces.

For the pipeline, I’m doing a mix of WebAssembly (Wasm) and native execution:

  • Voice Activity Detection (VAD): Running in WebAssembly, because setting it up in Wasm is actually a lot easier than dealing with Python dependencies.
  • Speech-to-Text (STT): Still using Moonshine, which is incredibly fast (around 90ms for a quick command).
  • The LLM: Hosted via LM Studio, which gives me an OpenAI-compatible API backend. I am deliberately using a super tiny 1.7B-parameter model.
  • Text-to-Speech (TTS): Running natively using Kokoro.

Fixing STT with an LLM

One of the problems with using super small transcription models is that they can be inaccurate, especially depending on background noise or my own lazy pronunciation. If I say “Computer, stop,” it might transcribe it as “Computer, sub.”

To fix this, I added an optional “Speech-to-Text Cleanup” step. Before doing anything else, it passes the raw transcription to the LLM and basically says, “Clean this up and fix any out-of-context words.” It easily corrects “sub” to “stop,” saving a ton of headaches down the line.
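Sketched against LM Studio’s OpenAI-compatible endpoint (it serves one at http://localhost:1234/v1 by default), the cleanup pass looks roughly like this — the model name and exact prompt wording here are placeholders, not my real config:

```python
import json
import urllib.request

CLEANUP_SYSTEM_PROMPT = (
    "You fix speech-to-text errors. Rewrite the transcript, replacing "
    "out-of-context words with what the speaker most likely said. "
    "Return only the corrected text."
)

def build_cleanup_request(raw_transcript: str, model: str = "local-model") -> dict:
    """Build the chat-completion payload for the cleanup step."""
    return {
        "model": model,
        "temperature": 0,  # deterministic: we want correction, not creativity
        "messages": [
            {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
            {"role": "user", "content": raw_transcript},
        ],
    }

def clean_transcript(raw: str, base_url: str = "http://localhost:1234/v1") -> str:
    """POST the raw transcription to the local LLM, return the corrected text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_cleanup_request(raw)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

So `clean_transcript("Computer, sub.")` goes out to the local model and comes back as the corrected command before anything downstream sees it.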

The Magic: Tool Calling on a 1.7B Model

Here is the really interesting part. Standard LLM function/tool calling is basically useless on models this small. If you just give a 1.7B model a massive list of functions and say “figure it out,” it’s going to fail. I really wanted to use these tiny models to keep things blazing fast, so I had to offload the heavy lifting.

Instead of traditional function calling, I’m using Vector Embeddings.

  1. I define my tools (Name, Parameters, Description).
  2. I pass those definitions into a vector embedding model.
  3. When I issue a voice command like “Stop music”, I query the vector database to see which tool most closely matches my intent.
  4. The database returns the media_control tool with a high confidence score.
  5. Then, I pass just that specific tool to the 1.7B LLM with a strict prompt: “Here is the tool. Fill in the blanks for these parameters based on the user’s request.” The LLM easily figures out that the action parameter should be stop.
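The five steps above can be sketched roughly like this. Everything here is illustrative: the tool names and descriptions are made up, and `embed()` is a toy bag-of-words stand-in so the routing logic runs on its own — the real pipeline uses an actual embedding model and a vector database.

```python
import math
import re
from collections import Counter

# Step 1: tool definitions (real ones carry parameter schemas too).
TOOLS = {
    "media_control": "Play, pause, stop, or skip music and videos.",
    "launch_app":    "Open or launch a desktop application by name.",
    "get_weather":   "Report the current weather or forecast.",
}

# Step 2: embed the definitions. Toy stand-in: lowercase bag of words.
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 3-4: score the spoken command against every tool, keep the winner.
def route(command: str) -> tuple[str, float]:
    query = embed(command)
    scores = {name: cosine(query, embed(desc)) for name, desc in TOOLS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Step 5: only the winning tool ever reaches the 1.7B model.
def build_fill_prompt(tool_name: str, command: str) -> str:
    return (
        f"Here is the tool: {tool_name} -- {TOOLS[tool_name]}\n"
        f"User request: {command}\n"
        "Fill in the tool's parameters based on the request."
    )
```

With this, `route("stop the music")` picks `media_control`, and that single definition — not the whole catalog — is what the tiny model has to reason about.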

I even added the ability to include negative prompts for tools, so I can explicitly tell the system when not to match a specific command.
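One plausible way to wire those negative prompts in — this is a sketch of the mechanic, not my exact implementation, and a toy bag-of-words `embed()` stands in for the real embedding model:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def matches(command: str, description: str, negative_prompts: list[str]) -> bool:
    """Accept a tool only if the command sits closer to its description
    than to any of its negative prompts."""
    query = embed(command)
    pos = cosine(query, embed(description))
    neg = max((cosine(query, embed(n)) for n in negative_prompts), default=0.0)
    return pos > neg
```

The negative prompts act as anti-descriptions: a command that resembles one of them more than the tool’s own description simply doesn’t match, no matter how high the positive score is.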

Tools and Hard Commands

Right now I’ve implemented a few basic tools, like media controls, weather, and an app launcher. The app launcher is pretty neat—it dynamically reads all the standard Linux .desktop files on my system, parses them into a dictionary, and allows the LLM to launch any app by matching its desktop shortcut name.

However, sometimes you don’t need the AI at all. I built in a set of hardcoded commands like “Computer mute” or “Computer unmute”. If the system hears one of those exact phrases, it completely bypasses the LLM and executes the system command immediately. There’s no point burning an LLM round trip on a task that only requires a simple boolean flip.
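The fast path is just an exact-match lookup that runs before anything else. The `pactl` invocations below are stand-in examples for a PulseAudio/PipeWire desktop, not necessarily the commands I actually wired up:

```python
import subprocess

# Exact phrases that skip the LLM entirely (commands are illustrative).
HARD_COMMANDS = {
    "computer mute":   ["pactl", "set-sink-mute", "@DEFAULT_SINK@", "1"],
    "computer unmute": ["pactl", "set-sink-mute", "@DEFAULT_SINK@", "0"],
}

def try_hard_command(transcript: str) -> bool:
    """Run a hardcoded command if the transcript matches exactly.
    Returns True when the LLM pipeline can be skipped."""
    cmd = HARD_COMMANDS.get(transcript.strip().lower().rstrip("."))
    if cmd is None:
        return False
    subprocess.run(cmd, check=False)
    return True
```

Anything that falls through this lookup proceeds to the normal cleanup-plus-routing pipeline.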

I’m really enjoying working on this. It’s surprisingly powerful for how small the models are, and it finally gives me a way to control my machine when my hands are tied up.