VoxCPM2 is a tokenizer-free text-to-speech system from OpenBMB that skips the usual discrete audio tokenization step entirely. Instead, it generates continuous speech representations via an end-to-end diffusion autoregressive architecture built on a MiniCPM-4 backbone. It’s a 2B parameter model trained on over 2 million hours of multilingual data, supporting 30 languages with 48kHz studio-quality output. It can also design entirely new voices from text descriptions alone, no reference audio needed. That last part is what caught my attention.

The official benchmarks show an RTF of ~0.3 on an RTX 4090. I wanted to know what this thing does on my Strix Halo machine with no discrete GPU at all — just the Zen 5 CPU and an integrated Radeon 8060S that can’t currently run PyTorch.

Two Paths to Inference

VoxCPM has an official Python package that wraps the full PyTorch model, and a community C++ implementation called VoxCPM.cpp that builds on ggml (same foundation as llama.cpp and whisper.cpp) and supports quantized GGUF weights. I set up both.

The Python path is straightforward: uv venv --python 3.13, uv pip install -e ".", and you’re running. System Python 3.14 doesn’t work — Torch wheels don’t exist for it yet. Same story as OmniVoice.

The C++ path required a bit more work. The upstream CMakeLists.txt has a dependency bug — nlohmann/json is only fetched when building tests, but server_common.cpp (part of the core library) requires it unconditionally. Without fixing this, both CPU and Vulkan builds fail. Two changes: move the FetchContent block outside the test guard, and link the json target to the library. After that, cmake --build produces the voxcpm_tts binary cleanly with AVX-512, OpenMP, and native Zen 5 optimizations.
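The shape of that fix looks roughly like the following. This is an illustrative CMake sketch, not the exact upstream file — the library target name (`voxcpm` here) and the json version tag are assumptions:

```cmake
# Fetch nlohmann/json unconditionally, i.e. outside the if(BUILD_TESTS) guard,
# because server_common.cpp in the core library includes it regardless.
include(FetchContent)
FetchContent_Declare(nlohmann_json
  GIT_REPOSITORY https://github.com/nlohmann/json
  GIT_TAG        v3.11.3)   # version tag is illustrative
FetchContent_MakeAvailable(nlohmann_json)

# Link the json target to the core library (target name assumed here),
# so both the CPU and Vulkan builds resolve the header.
target_link_libraries(voxcpm PRIVATE nlohmann_json::nlohmann_json)
```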

A Python Bug Along the Way

The first run of VoxCPM2 in Python crashed during the model’s warmup step with an IndexError: Dimension out of range inside scaled_dot_product_attention. The model uses enable_gqa=True (grouped-query attention) with a 1D attention mask, which worked on whatever PyTorch version the authors tested but breaks on PyTorch 2.11 CPU. The fix is one line — reshape the mask from 1D to 4D:

attn_mask = (torch.arange(key_cache.size(2), device=key_cache.device) <= position_id).view(1, 1, 1, -1)

After that, everything runs.
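Why the 4D reshape works: `scaled_dot_product_attention` expects `attn_mask` to be broadcastable against the attention scores, which have shape `(batch, heads, q_len, kv_len)`. A NumPy sketch of the same broadcasting logic, with shapes chosen for illustration:

```python
import numpy as np

# Illustrative shapes: single-token decode against a 16-entry KV cache
batch, heads, q_len, kv_len = 1, 8, 1, 16
position_id = 9  # attend only to cache positions 0..9

# The fix: a 1D boolean mask over cache positions, reshaped to 4D so it
# broadcasts against scores of shape (batch, heads, q_len, kv_len)
mask_1d = np.arange(kv_len) <= position_id   # shape (16,)
mask_4d = mask_1d.reshape(1, 1, 1, -1)       # shape (1, 1, 1, 16)

scores = np.random.randn(batch, heads, q_len, kv_len)
masked = np.where(mask_4d, scores, -np.inf)  # broadcasts cleanly

print(mask_4d.shape)     # (1, 1, 1, 16)
print(int(mask_1d.sum()))  # 10 visible positions
```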

CPU Benchmarks: VoxCPM.cpp

I downloaded three GGUF models from the community weights repo: VoxCPM1.5 Q8_0 (984 MB), VoxCPM2 Q4_K (1.5 GB), and VoxCPM2 Q8_0 (2.5 GB), all with AudioVAE in F16 precision.

Running on 16 threads with 10 inference timesteps and CFG 2.0, across 3 runs each:

| Model | Quant | Size | Decode Speed | Full Pipeline RTF |
|---|---|---|---|---|
| VoxCPM1.5 | Q8_0 + AVAE-F16 | 984 MB | 18.9 it/s | 1.23 |
| VoxCPM2 | Q4_K + AVAE-F16 | 1.5 GB | 11.0 it/s | 1.55 |
| VoxCPM2 | Q8_0 + AVAE-F16 | 2.5 GB | 9.0 it/s | 1.66 |

VoxCPM1.5 Q8_0 is the fastest — generating about 9 seconds of audio in 11 seconds of wall time. The 2B VoxCPM2 model at Q4_K quantization is surprisingly close despite being over 3x the parameter count, because the aggressive quantization keeps it memory-bandwidth-friendly.
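A back-of-envelope check on the memory-bandwidth argument, using ggml's nominal averages of roughly 4.5 bits per weight for Q4_K and 8.5 for Q8_0 (these figures are my assumption; AudioVAE weights excluded):

```python
# Bytes the CPU must stream through the weights per decode step
params = 2.0e9  # VoxCPM2 parameter count

def weight_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

q4k = weight_gb(params, 4.5)  # ~1.1 GB of weights at Q4_K
q80 = weight_gb(params, 8.5)  # ~2.1 GB of weights at Q8_0

print(f"Q4_K: {q4k:.2f} GB, Q8_0: {q80:.2f} GB")
# Roughly half the bytes per step -- on a bandwidth-bound decode, that is
# why the 2B Q4_K model stays close to the much smaller Q8_0 model in it/s.
```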

One thing worth noting: AudioVAE decode takes a constant 3.0–3.5 seconds regardless of model size. For short utterances, that’s a significant chunk of the pipeline. Model-only RTF (before AudioVAE) is around 0.80 for VoxCPM1.5 Q8_0 — solidly faster than real-time for the autoregressive part alone.
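Treating the AudioVAE decode as a fixed per-utterance cost, the numbers above hang together. A quick arithmetic sketch using the measured figures:

```python
# Reconstructing full-pipeline RTF from the measured components
# (VoxCPM1.5 Q8_0 figures from the runs above).
audio_s = 9.0      # generated audio duration
model_rtf = 0.80   # model-only RTF, before AudioVAE
vae_s = 3.2        # roughly constant AudioVAE decode time

model_s = model_rtf * audio_s            # ~7.2 s for the AR model
full_rtf = (model_s + vae_s) / audio_s   # ~1.16, close to the measured 1.23

print(f"estimated full-pipeline RTF: {full_rtf:.2f}")
# Longer utterances amortize vae_s, pulling full RTF back toward 0.80.
```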

The author’s published benchmarks were run on an i5-12600K with 8 threads, where VoxCPM1.5 Q8_0 gets an RTF of 5.25. This Strix Halo machine does the same workload at RTF 1.23 — 4.3x faster. It’s within roughly 2x of their RTX 4060 Ti CUDA numbers (0.56 RTF), which is remarkable for CPU-only inference.

CPU Benchmarks: VoxCPM2 Python

The full PyTorch model is heavier but gives you everything VoxCPM2 offers: 30 languages, voice design, controllable cloning, and native 48kHz output.

| Config | Text Length | Avg RTF | Avg Time |
|---|---|---|---|
| timesteps=10 | short (53 chars) | 1.93 | 7.13s |
| timesteps=10 | medium (248 chars) | 1.68 | 26.48s |
| timesteps=10 | long (365 chars) | 1.58 | 35.10s |
| timesteps=5 | short (53 chars) | 1.06 | 3.23s |
| timesteps=5 | medium (248 chars) | 1.25 | 19.81s |
| timesteps=5 | long (365 chars) | 1.25 | 28.07s |

At 5 timesteps, VoxCPM2 achieves near-real-time on this CPU — an RTF of 1.06 for short text means the audio generates almost as fast as it plays. Longer text amortizes fixed overhead better, converging around 1.25 RTF. At 10 timesteps the quality is noticeably better but the RTF creeps up to 1.6–1.9, which is still perfectly usable for non-interactive generation.

The model runs in bf16 internally and generates at about 6 iterations per second through the diffusion loop. It’s not going to win any speed records against the RTX 4090’s 0.3 RTF, but it’s practical.
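For reference, setting the timesteps knob looks roughly like this. The sketch follows the VoxCPM 1.x Python API as published in its README; the model repo id and the exact argument names for VoxCPM2 are assumptions, so treat this as illustrative rather than copy-paste ready:

```python
# Hypothetical usage sketch -- repo id and argument names assumed from the
# VoxCPM 1.x API and may differ for VoxCPM2.
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM2")  # assumed repo id

wav = model.generate(
    text="Benchmarking tokenizer-free TTS on a Strix Halo CPU.",
    inference_timesteps=5,  # 5 for near-real-time here; 10 for better quality
    cfg_value=2.0,          # classifier-free guidance strength
)
sf.write("out.wav", wav, 48000)  # VoxCPM2 outputs 48 kHz audio
```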

GPU Acceleration: Blocked on Both Fronts

I built VoxCPM.cpp with Vulkan support. The backend detected and initialized the Radeon 8060S perfectly — cooperative matrix support, FP16, the works. But the moment it hits the decode loop, it crashes:

ggml_vk_glu: GGML_ASSERT(ggml_is_contiguous(src0)) failed

The GLU (Gated Linear Unit) operation in ggml’s Vulkan backend assumes contiguous tensors, but VoxCPM’s architecture produces non-contiguous ones at this point. This isn’t a hardware issue — it would crash the same way on any Vulkan GPU. It’s an upstream ggml limitation that will presumably be fixed eventually.
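"Contiguous" here means the tensor's elements are laid out in one unbroken row-major run of memory. ggml ops often return views that merely reinterpret strides, and the Vulkan GLU kernel can't handle those. The same concept in NumPy terms:

```python
import numpy as np

# A ggml-style "view": transposing moves no data, it just swaps strides,
# so the result is no longer contiguous in memory.
x = np.ones((4, 8), dtype=np.float32)
view = x.T  # same buffer, swapped strides

print(x.flags["C_CONTIGUOUS"])     # True
print(view.flags["C_CONTIGUOUS"])  # False -- the condition that trips
                                   # GGML_ASSERT(ggml_is_contiguous(src0))

fixed = np.ascontiguousarray(view)  # materialize a contiguous copy
print(fixed.flags["C_CONTIGUOUS"])  # True
```

The usual upstream remedies are inserting an explicit copy (ggml's `ggml_cont`) before the op, or teaching the kernel to respect strides.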

ROCm I didn’t attempt for VoxCPM specifically. My OmniVoice and Fish Audio experiments already documented the full state of play: ROCm 6.x stable wheels lack gfx1151, ROCm 7.0 nightlies have the target but segfault, and the Docker ROCm container works but puts the iGPU roughly on par with CPU for these TTS workloads due to the shared memory architecture. There’s no reason to expect VoxCPM2 would behave differently.

VoxCPM vs OmniVoice on This Machine

Since I’ve now benchmarked OmniVoice and VoxCPM on the same hardware, here’s how they compare:

| Model | Config | RTF | Languages | Output Rate |
|---|---|---|---|---|
| OmniVoice | 8 steps, voice design | 0.56 | 600+ | 16 kHz |
| OmniVoice | 8 steps, voice clone | 1.52 | 600+ | 16 kHz |
| VoxCPM.cpp (1.5) | Q8_0, 10 timesteps | 1.23 | 2 | 44.1 kHz |
| VoxCPM2 Python | 5 timesteps | 1.06–1.25 | 30 | 48 kHz |
| VoxCPM2 Python | 10 timesteps | 1.58–1.93 | 30 | 48 kHz |

OmniVoice’s voice design mode is unbeatable for raw speed — faster than real-time at 0.56 RTF. But VoxCPM2 at reduced timesteps is competitive with OmniVoice’s clone mode while offering voice design from text descriptions, 30-language support, and dramatically higher output quality at 48kHz.

Practical Takeaway

VoxCPM2 is genuinely usable for near-real-time TTS on a high-end CPU without any GPU acceleration. The Python package with 5 diffusion timesteps is the sweet spot for most use cases — you get the full feature set (multilingual, voice design, cloning) with an RTF barely above 1.0. For raw speed on English and Chinese, VoxCPM.cpp with the VoxCPM1.5 Q8_0 GGUF model at under 1 GB is hard to beat.

The Vulkan path is tantalizingly close — the GPU initializes, the device is detected, cooperative matrices are ready — but until ggml fixes the GLU contiguity assertion, CPU is the only option. I’ll be watching that repo.