Benchmarking OmniVoice on Strix Halo
Running a 600+ language zero-shot TTS model on an AMD integrated GPU — voice cloning benchmarks, ROCm compatibility adventures, and the container workaround that actually worked.
OmniVoice is a new zero-shot text-to-speech model from k2-fsa that supports over 600 languages, voice cloning from short audio references, and voice design via text descriptions. It uses a diffusion language model architecture — iterative unmasking across 8 audio codebook layers — and claims RTFs as low as 0.025 on NVIDIA hardware. Naturally, I wanted to see what it does on my Strix Halo machine with no NVIDIA anything.
Getting It Running
The repo targets CUDA by default. The pyproject.toml pins PyTorch wheels to PyTorch's cu128 index, which means uv sync fails immediately on an AMD system. It also fails on Python 3.14, since PyTorch doesn't ship wheels for that version yet.
The fix is straightforward:
```shell
uv sync --python 3.13 --no-sources
```
Skipping the CUDA source mapping lets uv pull vanilla PyTorch wheels, and pinning Python 3.13 gets us compatible Torch builds. Model weights download from Hugging Face on first run — about 2.6 GB for the main model plus the HiggsAudio V2 tokenizer.
CPU Benchmarks
With no GPU acceleration available natively, I started with CPU inference. OmniVoice has a num_step parameter that controls diffusion quality — fewer steps means faster but rougher output.
Voice design mode (model picks a voice from a text description, no reference audio):
| Steps | Avg Generation | Avg RTF |
|---|---|---|
| 8 | 2.35s | 0.56 |
| 16 | 8.36s | 2.01 |
| 32 | 12.11s | 3.00 |
At 8 steps, it’s actually faster than real-time on CPU alone. The Ryzen AI MAX+ 395’s 16 cores handle the iterative decoding reasonably well in float32.
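For reference, RTF (real-time factor) here is generation time divided by the duration of the audio produced; values below 1.0 are faster than real time. A quick sanity check against the 8-step row, where the ~4.2 s audio duration is back-computed from the reported RTF rather than measured:

```python
# RTF (real-time factor) = generation time / duration of generated audio.
# Values below 1.0 mean faster than real time.
def rtf(gen_seconds: float, audio_seconds: float) -> float:
    return gen_seconds / audio_seconds

# 8-step voice design from the table: ~2.35 s to generate ~4.2 s of audio
print(round(rtf(2.35, 4.2), 2))  # → 0.56
```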
Voice cloning mode (using a 10-second reference audio clip):
| Steps | Avg Generation | Avg RTF |
|---|---|---|
| 8 | 6.28s | 1.52 |
| 16 | 11.72s | 2.82 |
| 32 | 22.84s | 5.48 |
Cloning is heavier — the model has to encode the reference audio through the HiggsAudio tokenizer and condition on it during generation. At 8 steps the quality is surprisingly decent for the speed. At 32 steps it’s better but takes over 20 seconds for a few seconds of audio.
The ROCm Saga
This is where it gets interesting. The Radeon 8060S in Strix Halo reports as gfx1151 — RDNA 3.5 silicon. PyTorch’s ROCm wheels have historically not included this target.
ROCm 6.3 and 6.4 stable wheels: The GPU is detected, torch.cuda.is_available() returns True, and device_count shows 1. But the first kernel launch dies with HIP error: invalid device function. The compiled architecture list in these wheels runs up through gfx1201, but gfx1151 is not among the targets.
I tried every HSA_OVERRIDE_GFX_VERSION value I could think of. None worked. The ISA difference between RDNA 3.5 and what these wheels target is too large to fake.
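For the record, the attempts looked like this; HSA_OVERRIDE_GFX_VERSION=11.0.0 tells the HSA runtime to report gfx1100, one representative value among several tried:

```shell
# Claim to be gfx1100 (RDNA 3) — the kernels still fault on RDNA 3.5 silicon
HSA_OVERRIDE_GFX_VERSION=11.0.0 python -c "import torch; print(torch.randn(8, device='cuda'))"
```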
ROCm 7.0 nightly wheels: These actually include gfx1150 and gfx1151 in their compiled targets. Progress. But the first tensor allocation segfaults. Not a kernel mismatch this time — a hard crash in the HIP runtime. I traced the /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory warning, used bubblewrap to bind-mount the system’s copy into the expected path, and confirmed it’s cosmetic — the segfault persists regardless.
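The bind-mount experiment, for anyone chasing the same warning. The source path is where Fedora's libdrm package installs amdgpu.ids, an assumption worth verifying locally with rpm -ql libdrm:

```shell
# Map the system amdgpu.ids into the path the bundled HIP runtime expects.
# Silences the warning, but it turns out to be cosmetic: the segfault remains.
bwrap --dev-bind / / \
  --ro-bind /usr/share/libdrm/amdgpu.ids /opt/amdgpu/share/libdrm/amdgpu.ids \
  python -c "import torch; torch.zeros(1, device='cuda')"
```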
The issue is a userspace mismatch. Fedora 43’s kernel-side KFD and the nightly wheel’s bundled HIP libraries aren’t ABI-compatible.
The Container Fix
The rocm/pytorch:latest Docker image ships a complete, matched ROCm 7.2.1 userspace stack. With /dev/kfd and /dev/dri passed through:
```shell
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  rocm/pytorch:latest \
  python -c "import torch; x=torch.randn(64,64,device='cuda'); print((x@x).mean())"
```
It works. GPU matmul, no segfault, correct results. The hardware is perfectly capable — it’s purely a host runtime packaging problem.
OmniVoice on the iGPU
Running OmniVoice inside the ROCm container with GPU acceleration, the first run was disappointing: an RTF of 5.97. But most of that was JIT compilation overhead. MIOpen needs to find and compile convolution kernels, and OmniVoice's HiggsAudio decoder triggers a cascade of workspace allocation warnings on first pass.
After adding a single warmup generation to prime the kernel cache, the numbers changed dramatically:
| Config | Load Time | Avg Generation | Avg RTF |
|---|---|---|---|
| No warmup | 29.9s | 22.5s | 5.97 |
| + warmup, fp16 | 31.6s | 6.86s | 1.45 |
| + warmup, fp16, HF cache mount | 2.6s | 6.90s | 1.47 |
With cached models and warmed kernels, the iGPU matches CPU performance on this workload. The generation time is nearly identical — around 6.9 seconds average for 8-step voice cloning. The load time drops from 30 seconds to under 3 when you mount ~/.cache/huggingface into the container.
I also tested bf16 (RTF 1.52, slightly worse) and AOTriton experimental attention (no meaningful difference). fp16 is the sweet spot.
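The warmup-then-measure pattern generalizes beyond this model. A minimal sketch of the harness used for the numbers above, where generate stands in for any callable that runs one inference pass:

```python
import time

def bench(generate, runs=3, warmup=1):
    """Average wall-clock time of a generation callable, discarding
    warmup calls that pay one-time costs (MIOpen kernel search,
    workspace allocation) so they don't pollute the steady-state number."""
    for _ in range(warmup):
        generate()                       # primes kernel/JIT caches, untimed
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```

On this workload a single warmup call was enough to move the steady-state RTF from 5.97 to 1.45.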
Why GPU Doesn’t Win Here
The fact that GPU roughly matches CPU rather than beating it is worth explaining. OmniVoice’s iterative generation loop runs 8 diffusion steps sequentially, each involving a full forward pass through a transformer backbone with classifier-free guidance (so 2x batch size). On a discrete GPU with dedicated VRAM and high memory bandwidth, this parallelizes well. On Strix Halo’s integrated GPU sharing unified memory with the CPU, the memory bandwidth advantage over CPU inference is modest — the same 218 GB/s LPDDR5X bus serves both.
The real GPU advantage would show at larger batch sizes or longer sequences where the compute-to-memory ratio shifts. For single-utterance generation of short sentences, the CPU’s wider execution resources and lack of PCIe/fabric overhead keep it competitive.
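A rough back-of-envelope makes the shared-bus point concrete. Assume the ~2.6 GB of weights are streamed once per diffusion step (classifier-free guidance doubles the batch but reuses the same weights) and the full 218 GB/s bus is available; activations, KV cache, and the audio decoder are ignored. The weight-streaming floor is then identical whichever device runs the loop:

```python
weights_gb = 2.6       # approximate model size (assumed streamed once per step)
steps = 8              # diffusion steps per utterance
bandwidth_gbs = 218    # shared LPDDR5X bandwidth, CPU and iGPU alike

floor_s = steps * weights_gb / bandwidth_gbs
print(round(floor_s, 3))  # → 0.095
```

That floor is far below the observed ~6.9 s, so neither device is purely bandwidth-bound on short utterances; the point is that the bus offers the iGPU no headroom the CPU lacks.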
Practical Takeaway
OmniVoice at 8 diffusion steps produces usable voice-cloned speech in roughly 6–7 seconds on this hardware, either CPU natively or GPU via Docker. That's not interactive-speed TTS, but it's fast enough for batch generation, content creation, or anywhere you can tolerate a few seconds of latency.
The voice cloning quality from a 10-second reference clip is impressive for a model this fast. It captures vocal timbre and cadence well enough that the output is recognizably “that voice” rather than a generic approximation.
For my Strix Halo setup, the practical recommendation is CPU mode with num_step=8 for quick iteration, and the Docker ROCm path when I want to keep the CPU free for other workloads. The model load time penalty in containers is solved by mounting the HF cache — a trick that should be standard practice for any containerized ML workflow on this machine.
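The cache-mount incantation, for completeness. The container-side path assumes the image runs as root with HF_HOME unset, and the entry script name is a placeholder:

```shell
docker run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  rocm/pytorch:latest \
  python generate.py   # placeholder for whatever OmniVoice entry point you use
```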