The MOSS-TTS Family from MOSI.AI and the OpenMOSS team is an open-source speech generation suite that takes a different approach than the other TTS models I’ve benchmarked on this machine. Instead of a single architecture, MOSS ships five production models designed for different use cases, from long-form speech generation to real-time voice agents. The flagship model is an 8B parameter transformer using multi-head parallel RVQ prediction with delay-pattern scheduling, and they also offer a 100M parameter Nano variant designed to run on CPU without any GPU at all.

What makes MOSS interesting is the inference flexibility. You can run the full 8B model through HuggingFace Transformers, through a llama.cpp GGUF backend (torch-free), or through SGLang for accelerated serving. The Nano variant additionally ships an ONNX Runtime path that strips out the PyTorch dependency entirely. I tested the llama.cpp GGUF backend for the 8B model and the ONNX Runtime path for Nano on my Strix Halo machine, same hardware as the previous TTS rounds.

#The Models

MOSS-TTS 1.0 is an 8B parameter decoder-only transformer that generates audio tokens via a delay-pattern scheduling mechanism over multiple RVQ codebooks. The audio is decoded through a separate audio tokenizer at 24 kHz output. The model supports zero-shot voice cloning from a reference audio clip and covers 20 languages.

MOSS-TTS-Nano is a 100M parameter model that uses a lightweight architecture with ONNX-exportable operators. It outputs 24 kHz audio with voice cloning and streaming decode support, and is designed to run on a single CPU core. The entire model with its audio tokenizer is about 730 MB on disk.

I also attempted the HuggingFace Transformers path for MOSS-TTS-v1.5 (the latest release with 31 languages and improved prosody control), but the 8B model in full PyTorch requires about 16 GB of RAM just for the weights in fp16/bf16 (about 32 GB in fp32), and the download was substantial. I’ll revisit that path another time.

#Setup

The llama.cpp backend needed weights from two HuggingFace repos. The MOSS GGUF quantized weights came from the MOSS-TTS-GGUF collection, and the ONNX audio tokenizer from MOSS-Audio-Tokenizer-ONNX. I downloaded five quantization levels of the backbone GGUF: Q4_K_M (5 GB), Q5_K_M (5.9 GB), Q6_K (6.8 GB), Q8_0 (8.7 GB), and the full F16 (16 GB). All benchmarks used Q4_K_M.
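As a rough sanity check on those file sizes, the sketch below converts each one into effective bits per weight. It assumes a round 8 billion parameters and decimal gigabytes, neither of which is exact, and it ignores the embedding and head files that ship alongside the backbone.

```python
# Rough bits-per-weight estimate from the GGUF file sizes listed above.
# Assumptions (not from the MOSS docs): exactly 8e9 parameters, 1 GB = 1e9 bytes.
PARAMS = 8e9

gguf_sizes_gb = {
    "Q4_K_M": 5.0,
    "Q5_K_M": 5.9,
    "Q6_K": 6.8,
    "Q8_0": 8.7,
    "F16": 16.0,
}


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average number of bits stored per parameter for a file of size_gb."""
    return size_gb * 1e9 * 8 / params


bpw = {name: round(bits_per_weight(size), 2) for name, size in gguf_sizes_gb.items()}

for name, bits in bpw.items():
    print(f"{name:7s} {gguf_sizes_gb[name]:5.1f} GB  ~{bits:.1f} bits/weight")
```

The F16 file works out to 16 bits per weight, which suggests the round 8B figure is a fair approximation for this kind of estimate.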

Environment setup followed the same uv pattern:

git clone https://github.com/OpenMOSS/MOSS-TTS.git ~/Code/moss-tts
cd ~/Code/moss-tts
uv venv --python 3.13 .venv
source .venv/bin/activate
uv pip install -e ".[llama-cpp]"
uv pip install onnxruntime

The pipeline uses a YAML config that declares model paths, audio tokenizer ONNX files, sampling parameters, and runtime settings:

backbone_gguf: weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf
embedding_dir: weights/MOSS-TTS-GGUF/embeddings
lm_head_dir: weights/MOSS-TTS-GGUF/lm_heads
tokenizer_dir: weights/MOSS-TTS-GGUF/tokenizer

audio_backend: onnx
audio_encoder_onnx: weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx
audio_decoder_onnx: weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx

heads_backend: numpy
n_ctx: 4096
n_batch: 256
n_threads: 8
n_gpu_layers: 0
max_new_tokens: 3072

For Nano, the setup was even simpler since it’s pure ONNX with no PyTorch dependency:

git clone https://github.com/OpenMOSS/MOSS-TTS-Nano.git ~/Code/moss-tts-nano
cd ~/Code/moss-tts-nano
uv venv --python 3.13 .venv
source .venv/bin/activate
uv pip install onnxruntime soundfile

The ONNX model weights (about 730 MB total) are loaded directly into the OnnxTtsRuntime class with configurable thread count and execution provider.

#Benchmark Setup

I wrote two benchmark scripts following the same pattern as my previous TTS benchmarks: three text lengths, configurable warmup and run counts, JSON output with WAV artifacts. Both use a reference audio clip for voice cloning (the repo’s default English reference).

The llama.cpp benchmarks used 768 max_new_tokens with the default sampling parameters (text_temperature 1.5, audio_temperature 1.7, audio_top_k 25). The Nano benchmarks used 375 max_new_frames with streaming decode enabled and fixed sampling mode. Both were run with 1 warmup pass and 3 benchmark runs per configuration.
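The warmup-then-measure pattern described above can be sketched roughly as follows. This is not the actual benchmark script: `synthesize` is a hypothetical callable returning `(samples, sample_rate)`, and `fake_synth` is a stand-in so the example runs without any model installed.

```python
import time
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical signature: a synthesize(text) callable that returns
# (samples, sample_rate). The real MOSS entry points are not shown here.
Synth = Callable[[str], Tuple[Sequence[float], int]]


def benchmark(synthesize: Synth, text: str, warmup: int = 1, runs: int = 3) -> dict:
    """Time synthesize(text) and report mean generation time, audio length, RTF."""
    for _ in range(warmup):
        synthesize(text)  # warmup passes are executed but not timed

    gen_times, durations = [], []
    for _ in range(runs):
        start = time.perf_counter()
        samples, sample_rate = synthesize(text)
        gen_times.append(time.perf_counter() - start)
        durations.append(len(samples) / sample_rate)

    gen, dur = mean(gen_times), mean(durations)
    # RTF = generation time / audio duration; below 1.0 is faster than real time.
    return {"chars": len(text), "gen_s": gen, "audio_s": dur, "rtf": gen / dur}


def fake_synth(text: str):
    """Stand-in model: sleeps briefly, returns 0.1 s of silence at 24 kHz."""
    time.sleep(0.01)
    return [0.0] * 2400, 24000


result = benchmark(fake_synth, "Hello from the benchmark harness.")
print(result)
```

Averaging generation time and audio duration separately, then dividing, keeps one unusually short or long sample from skewing the reported RTF.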

I ran both models at multiple thread counts to measure CPU scaling behavior on the 16-core, 32-thread Zen 5 CPU.

#MOSS-TTS 8B via llama.cpp GGUF

In this setup the 8B model runs entirely on CPU. With n_gpu_layers set to 0, the GGUF backend runs the backbone through llama.cpp’s quantized inference on CPU, while the audio tokenizer runs via ONNX Runtime, also on CPU. No GPU involvement at all.

#4 Threads

| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short (58 chars) | 16.17s | 4.51s | 3.59 |
| Medium (136 chars) | 26.52s | 9.41s | 2.82 |
| Long (296 chars) | 53.03s | 22.48s | 2.36 |

#8 Threads

| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short | 14.35s | 4.40s | 3.26 |
| Medium | 23.99s | 9.17s | 2.61 |
| Long | 48.97s | 22.27s | 2.20 |

#16 Threads

| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short | 13.73s | 4.69s | 2.93 |
| Medium | 21.62s | 9.17s | 2.36 |
| Long | 45.02s | 21.92s | 2.05 |

#Thread Scaling

Scaling from 4 to 16 threads cut generation time by roughly 15-18% across the three text lengths, which is modest for a 4x thread increase. That pattern suggests the 8B model at Q4_K_M is memory-bandwidth-bound on CPU rather than compute-bound. Each forward pass has to read about 5 GB of model weights from memory, and throwing more CPU cores at it mostly adds contention for the same memory bus.
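To put numbers on how far this is from linear scaling, here is a small sketch that computes speedup and parallel efficiency from the generation times in the tables above. The helper names are mine; only the timings come from the benchmark.

```python
# Generation times (seconds) copied from the 8B tables above.
gen_time_s = {
    4: {"short": 16.17, "medium": 26.52, "long": 53.03},
    8: {"short": 14.35, "medium": 23.99, "long": 48.97},
    16: {"short": 13.73, "medium": 21.62, "long": 45.02},
}


def speedup(base_threads: int, threads: int, length: str) -> float:
    """How many times faster `threads` is than `base_threads` for one text length."""
    return gen_time_s[base_threads][length] / gen_time_s[threads][length]


def parallel_efficiency(base_threads: int, threads: int, length: str) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    return speedup(base_threads, threads, length) / (threads / base_threads)


for length in ("short", "medium", "long"):
    s = speedup(4, 16, length)
    e = parallel_efficiency(4, 16, length)
    saved = 1 - 1 / s  # fraction of wall-clock time saved
    print(f"{length:6s} speedup {s:.2f}x  time saved {saved:.0%}  efficiency {e:.0%}")
```

Parallel efficiency comes out at around 30% for the 4x thread increase, which is consistent with something other than core count being the limit.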

Generation time is roughly proportional to output length, which makes sense for an autoregressive model. Each new token requires a full forward pass through the transformer, so longer text means more tokens means more time. This is different from diffusion-based models like Echo-TTS where generation time is constant regardless of text length.
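One way to check the "roughly proportional" claim is a straight-line fit of generation time against audio duration. The sketch below uses the three 16-thread data points from the table above. With only three points it is indicative at best, and the split into a per-second cost and a fixed cost is my interpretation rather than something measured directly.

```python
# (audio seconds, generation seconds) for the 16-thread 8B runs above.
points = [(4.69, 13.73), (9.17, 21.62), (21.92, 45.02)]


def fit_line(pts):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(points)
print(f"~{slope:.2f} s of compute per second of audio, ~{intercept:.1f} s fixed cost")
```

A non-zero intercept would explain why RTF improves as the text gets longer: a fixed startup cost (plausibly prompt and reference-audio processing, though that is a guess, not a profiled result) gets amortized over more audio.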

#MOSS-TTS-Nano via ONNX Runtime

Now this is where things get interesting. The 100M parameter Nano model running entirely through ONNX Runtime on CPU produces the fastest TTS results I’ve measured on this machine.

#4 Threads

| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short | 1.35s | 4.40s | 0.31 |
| Medium | 2.76s | 10.13s | 0.27 |
| Long | 3.71s | 14.16s | 0.26 |

#8 Threads

| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short | 1.22s | 4.40s | 0.28 |
| Medium | 2.46s | 10.13s | 0.24 |
| Long | 3.30s | 14.16s | 0.23 |

#16 Threads

| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short | 2.28s | 4.40s | 0.52 |
| Medium | 4.07s | 10.13s | 0.40 |
| Long | 5.51s | 14.16s | 0.39 |

#Thread Scaling

Nano’s scaling behavior is inverted compared to the 8B model. Performance peaks at 8 threads and degrades at 16 threads. The sweet spot is 4-8 threads on this CPU.

At 8 threads, long text generates at RTF 0.23, which means it produces audio about 4.3x faster than real-time. A 14-second voice clip takes 3.3 seconds to generate. Short text has a slightly higher RTF of 0.28, presumably because fixed per-request overhead is spread over less audio, but that still means 4.4 seconds of speech in just 1.2 seconds.
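For anyone less used to reading RTF, the conversions used in that paragraph are simple enough to sketch. The function names are mine, not part of any MOSS tooling.

```python
def realtime_speed(rtf: float) -> float:
    """Seconds of audio produced per second of compute (inverse of RTF)."""
    return 1.0 / rtf


def generation_time(audio_seconds: float, rtf: float) -> float:
    """Expected wall-clock seconds to synthesize a clip of the given length."""
    return audio_seconds * rtf


# Nano at 8 threads, long text (RTF 0.23 from the table above).
print(f"{realtime_speed(0.23):.1f}x real time")
print(f"{generation_time(14.16, 0.23):.1f} s for a 14.16 s clip")
```

Anything with an RTF below 1.0 can in principle keep up with playback, which is the threshold that matters for streaming voice agents.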

The regression at 16 threads suggests ONNX Runtime’s internal threading model doesn’t benefit from additional parallelism beyond 8 threads on this workload. A plausible explanation is that a 100M parameter model gives each operator too little work to split further, so extra threads add synchronization and scheduler overhead without any compute benefit.
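Since the best thread count is not the largest one here, it is worth picking it from a sweep rather than assuming. A minimal sketch, using the Nano RTF values from the tables above and a hypothetical helper name:

```python
# RTF per thread count for Nano, copied from the tables above.
nano_rtf = {
    4: {"short": 0.31, "medium": 0.27, "long": 0.26},
    8: {"short": 0.28, "medium": 0.24, "long": 0.23},
    16: {"short": 0.52, "medium": 0.40, "long": 0.39},
}


def best_thread_count(sweep: dict) -> int:
    """Thread count with the lowest mean RTF across text lengths."""

    def mean_rtf(threads: int) -> float:
        values = sweep[threads].values()
        return sum(values) / len(values)

    return min(sweep, key=mean_rtf)


best = best_thread_count(nano_rtf)
print(f"best thread count: {best}")
```

Using the mean across text lengths is a design choice; a latency-sensitive voice agent might weight the short-text case more heavily instead.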

#Comparison with All Previous TTS Models

Here’s how everything stacks up, sorted by best RTF:

| Model | Config | RTF | Output Rate | Notes |
|---|---|---|---|---|
| MOSS-TTS-Nano | 8 threads, ONNX CPU | 0.23 | 24 kHz | Fastest by far, voice clone |
| Echo-TTS | 10 steps, long, GPU hybrid | 0.52 | 44.1 kHz | Requires ROCm container |
| Echo-TTS | 5 steps, long, CPU | 0.55 | 44.1 kHz | No reference audio |
| OmniVoice | 8 steps, voice design, CPU | 0.56 | 22 kHz | No reference audio |
| VoxCPM2 Python | 5 timesteps, short | 1.06 | 48 kHz | No reference audio |
| VoxCPM.cpp | VoxCPM1.5 Q8_0, CPU | 1.23 | 44.1 kHz | Voice clone, GGUF |
| OmniVoice | 8 steps, voice clone, CPU | 1.52 | 22 kHz | With reference audio |
| MOSS-TTS 8B | 16 threads, llama.cpp CPU | 2.05 | 24 kHz | Voice clone enabled |

MOSS-TTS-Nano at 8 threads on CPU is more than 2x faster than the next best model, and it achieves this with voice cloning enabled (reference audio was used for all Nano benchmarks). The caveat is output quality. Nano is a tiny model and its 24 kHz output isn’t going to match the fidelity of Echo-TTS at 44.1 kHz or VoxCPM2 at 48 kHz. But for conversational TTS, real-time voice agents, or any use case where latency matters more than studio-grade fidelity, Nano on CPU is absurdly fast with no GPU required.

The 8B MOSS-TTS model via llama.cpp is slower than I expected. At RTF 2.05 for long text with 16 threads, it’s about 9x slower than Nano on the same hardware. This is the cost of running an 8B autoregressive decoder on CPU without any GPU acceleration. The model has a first-class llama.cpp implementation with GPU offloading support, which I didn’t test, but based on my Echo-TTS ROCm experience, getting the MIOpen stack working for another model on gfx1151 is a non-trivial investment.

#Practical Takeaways

MOSS-TTS-Nano is the fastest local TTS I’ve tested on Strix Halo, by a wide margin. RTF 0.23 with voice cloning enabled, pure CPU, no GPU required, 730 MB on disk. If you need low-latency speech synthesis for a voice agent, local reading assistant, or any application where generation speed matters more than absolute audio quality, this is the model to use.

The 8B MOSS-TTS model via llama.cpp is usable but not real-time on CPU alone. RTF 2.05 means a 20-second utterance takes about 41 seconds to generate. If you have a GPU that can run the llama.cpp backend with GPU offloading, the story would be different. On CPU-only, it’s a non-starter for latency-sensitive applications.

Thread scaling matters more for small models than large ones. Nano peaks at 8 threads and then degrades, while the 8B model scales modestly up to 16 threads with diminishing returns. The likely bottleneck shifts from threading overhead (small model, little work per operator) to memory bandwidth (large model, cache-miss-heavy) as model size increases.

The ONNX Runtime path is worth paying attention to. Nano’s pure ONNX pipeline means no PyTorch, no CUDA, no container setup. A clean uv venv, pip install onnxruntime soundfile, and you’re generating speech. For deployment scenarios where you want a minimal dependency footprint, this is hard to beat.

#What’s Next

I want to revisit the HuggingFace Transformers path for MOSS-TTS-v1.5, which expands language coverage to 31 languages and improves prosody control. The full 8B PyTorch model is a heavy download but would give a more direct comparison with the Transformers-based pipelines I’ve tested for Echo-TTS and OmniVoice. I also want to see if Nano can be pushed even further with the ONNX Runtime CUDA execution provider, though at RTF 0.23 there’s not much headroom left to gain.