Benchmarking dots.tts on Strix Halo

dots.tts dropped last week from RedNote’s HiLab. It’s a 2B-parameter, fully continuous autoregressive TTS system with no discrete audio tokens anywhere in the pipeline. The backbone is a Qwen2.5-1.5B LLM paired with a semantic encoder and an autoregressive flow-matching head over a 48 kHz AudioVAE. On their published benchmarks it posts the best average Seed-TTS-Eval scores among open-source models and the highest speaker similarity on the MiniMax 24-language test set.

Three checkpoints are available under Apache-2.0: a pretrained base, a Self-Corrective Alignment (SCA) variant called dots.tts-soar optimized for zero-shot fidelity, and a MeanFlow-distilled variant called dots.tts-mf that fuses classifier-free guidance into the student network for low-latency inference at 2-4 network function evaluations (NFE). I benchmarked the latter two on the same Strix Halo machine I’ve been using for my local TTS comparison series.

#Setup

The repo provides a standard pyproject.toml with a constraints file for pinned versions. The install is clean:

git clone https://github.com/rednote-hilab/dots.tts.git ~/Code/dots.tts
cd ~/Code/dots.tts
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install -e . -c constraints/recommended.txt

This pulls PyTorch 2.8, transformers 4.57, and the full dependency tree including librosa, WeTextProcessing, and the lingua language detector. Total install is about 1.2 GB of wheels.

The runtime auto-detects CUDA if available. On this machine it landed on the RTX 4070 Super (11.6 GB VRAM), which matters because dots.tts in float32 with voice conditioning eats roughly 10-11 GB during generation. The Radeon 8060S was not used for these runs.

Model weights download from HuggingFace on first run. Each checkpoint is about 8.8 GB (the full 2B model in safetensors format, plus the AudioVAE, speaker encoder, and tokenizer). Subsequent runs load from the HF cache in about 15 seconds.
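As a rough sanity check on those sizes, the weight footprint can be estimated from the parameter count and the dtype width. The sketch below assumes a round 2.0B parameters (the headline figure, not an exact count) and ignores activations, the KV cache, and the auxiliary modules.

```python
# Rough weight-memory estimate. PARAMS is the headline "2B" figure, an assumption.
PARAMS = 2.0e9
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gb(params: float, dtype: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gb(PARAMS, dtype):.1f} GB")
```

The float32 estimate of about 8 GB is in the same range as the 8.8 GB checkpoints, which also bundle the AudioVAE, speaker encoder, and tokenizer.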

#Benchmark Method

Same methodology as the MOSS-TTS, Echo-TTS, and DramaBox benchmarks: three text lengths, 1 warmup run per prompt, 3 measured runs, JSON + WAV artifacts. Voice cloning enabled using a reference audio clip (x-vector-only mode, no prompt transcript).

I wrote a benchmark script (scripts/benchmark_dots_tts.py) that calls the DotsTtsRuntime Python API directly, matching the CLI’s default parameters: guidance_scale=1.2, speaker_scale=1.5, ode_method=euler.

Two configurations:

dots.tts-soar at num_steps=10 (quality-oriented, standard flow-matching)
dots.tts-mf at num_steps=4 (MeanFlow distilled, guidance fused into student)

Both ran with 16 threads, float32 precision, on the RTX 4070 Super.
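The warmup-then-measure loop looks roughly like the sketch below. It is not the actual benchmark script: `synthesize` is a hypothetical callable standing in for whatever wraps the runtime, and it is assumed to return the generated audio duration in seconds. RTF here is mean generation time divided by mean audio duration, which is one of several reasonable ways to aggregate.

```python
import statistics
import time
from typing import Callable


def benchmark(synthesize: Callable[[str], float], text: str,
              warmup: int = 1, runs: int = 3) -> dict:
    """Time synthesize(text), which must return the audio duration in seconds."""
    for _ in range(warmup):
        synthesize(text)  # warmup runs are executed but not recorded

    gen_times, durations = [], []
    for _ in range(runs):
        start = time.perf_counter()
        audio_s = synthesize(text)
        gen_times.append(time.perf_counter() - start)
        durations.append(audio_s)

    gen = statistics.mean(gen_times)
    dur = statistics.mean(durations)
    # RTF below 1.0 means faster than real time.
    return {"chars": len(text), "gen_s": gen, "audio_s": dur, "rtf": gen / dur}
```

Passing the synthesis step in as a callable keeps the timing logic independent of any particular model API, so the same harness can be reused across the models in this series.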

#dots.tts-soar (10 Steps)

Text	Gen Time	Audio Duration	RTF
Short (58 chars)	2.40s	3.63s	0.66
Medium (211 chars)	10.52s	11.04s	0.95
Long (338 chars)	25.82s	18.67s	1.38

Short text is well under real-time. Medium is on the edge at RTF 0.95. Long text crosses over at RTF 1.38, taking about 26 seconds to generate 19 seconds of audio.

The generation time scaling is interesting. dots.tts is autoregressive, so each audio patch requires a full LLM forward pass plus a multi-step flow-matching decode. Longer text produces more audio patches, so unlike diffusion-based models, where generation time is roughly constant for a given output window, generation time here grows with output length. The growth is faster than linear: from the short to the long prompt, audio duration rises about 5.1x while generation time rises about 10.8x. The per-patch cost increases as the KV cache grows, which explains why RTF gets worse on longer prompts rather than improving.
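A toy cost model shows why a growing per-patch cost makes RTF worse with length. Each patch is assumed to pay a fixed cost plus a term proportional to the number of patches already in context. The constants are arbitrary illustrative values, not measurements from dots.tts.

```python
def total_cost(num_patches: int, fixed: float = 1.0, per_context: float = 0.01) -> float:
    """Sum per-patch costs where patch i pays extra for the i patches before it."""
    return sum(fixed + per_context * i for i in range(num_patches))


for n in (25, 50, 100):
    cost = total_cost(n)
    # Average cost per patch rises with n, so cost per second of audio rises too.
    print(f"{n} patches: total={cost:.2f}, per patch={cost / n:.3f}")
```

Because audio duration is proportional to the patch count, a rising average cost per patch translates directly into a rising RTF.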

#dots.tts-mf (4 Steps, MeanFlow)

Text	Gen Time	Audio Duration	RTF
Short (58 chars)	1.24s	3.52s	0.35
Medium (211 chars)	4.74s	11.63s	0.41
Long (338 chars)	8.27s	17.81s	0.46

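MeanFlow changes the picture entirely. By fusing CFG into the student network, each audio patch needs a single model evaluation per step instead of the conditional + unconditional pair. Combined with 4 steps instead of 10, the flow head goes from 20 evaluations per patch to 4, a drop of roughly 5x. The end-to-end speedup is smaller than that, since the LLM forward pass for each patch should be unaffected by the distillation.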
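The evaluation-count arithmetic, next to the speedup actually measured, can be worked out as follows. The generation times are copied from the two tables above. The assumption that soar runs a conditional + unconditional pair at every step follows the description in this section. The audio durations differ slightly between the two runs, so the ratios are approximate.

```python
def evals_per_patch(steps: int, cfg_pair: bool) -> int:
    """Flow-head evaluations per patch: steps x 2 when CFG runs a cond + uncond pair."""
    return steps * (2 if cfg_pair else 1)


soar_evals = evals_per_patch(10, cfg_pair=True)
mf_evals = evals_per_patch(4, cfg_pair=False)
print(f"flow-head evaluations: {soar_evals} vs {mf_evals} ({soar_evals / mf_evals:.1f}x)")

# Measured generation times in seconds, from the soar and mf tables.
soar_gen = {"short": 2.40, "medium": 10.52, "long": 25.82}
mf_gen = {"short": 1.24, "medium": 4.74, "long": 8.27}
speedup = {k: round(soar_gen[k] / mf_gen[k], 2) for k in soar_gen}
print(speedup)
```

The measured ratios sit well below 5x and grow with prompt length.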

Short text at RTF 0.35 means 3.5 seconds of audio in 1.2 seconds. Long text at RTF 0.46 generates 18 seconds of speech in about 8 seconds. Every prompt length stays well under real-time.

The model load time for MeanFlow was notably higher (90 seconds vs 15 seconds for soar). This appears to be a one-time cost on first load from the HF cache, likely related to the distilled checkpoint’s initialization. Subsequent generations are fast.

#bf16 Doesn’t Work (Yet)

The README recommends bfloat16 precision, and the Python API defaults to it. On CPU this triggers a dtype mismatch in the speaker conditioning path: the CAM++ x-vector encoder outputs float32 embeddings that hit a bfloat16 linear projection, and PyTorch raises RuntimeError: mat1 and mat2 must have the same dtype. This is an upstream bug in the model’s dtype handling: the speaker encoder’s output isn’t cast to match the core model’s precision.

On GPU the runtime converts everything to CUDA and the mismatch doesn’t surface, but I ran in float32 for consistency and because bf16 GEMM performance on the 4070 Super isn’t meaningfully different from fp32 for a model this size.
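The CPU failure mode is easy to reproduce in isolation. The sketch below is not the dots.tts code: the layer, the embedding, and their sizes are hypothetical stand-ins for the speaker projection and the x-vector. It shows the mismatch and the kind of cast that would avoid it.

```python
import torch

# Hypothetical stand-ins: a bfloat16 projection layer and a float32 embedding.
proj = torch.nn.Linear(192, 64).to(torch.bfloat16)
xvector = torch.randn(1, 192, dtype=torch.float32)

try:
    proj(xvector)  # float32 input into bfloat16 weights
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print(f"RuntimeError: {err}")

# Casting the embedding to the layer's dtype before the projection avoids the error.
out = proj(xvector.to(proj.weight.dtype))
print(out.dtype, tuple(out.shape))
```

Casting to `proj.weight.dtype` rather than hard-coding bfloat16 keeps the same line valid if the model is later loaded in float32.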

#VRAM Limits

The 4070 Super’s 12 GB constrains generation length. At 190 audio patches (roughly 30 seconds of speech), the BigVGAN vocoder decode OOM’d during the upsampling convolutions. The SnakeBeta activations in the residual blocks allocate large intermediate tensors that don’t fit alongside the ~10.7 GB already consumed by the model weights and KV cache.

Keeping prompts under about 350 characters (which produce 18-20 seconds of audio at these speaking rates) avoids the boundary. For longer generation you’d need to either chunk the vocoder decode, move the vocoder to CPU, or use a GPU with more VRAM.
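Of those options, chunked decoding is the easiest to sketch. The helper below is a generic pattern, not something the dots.tts repo provides: `decode` is a hypothetical callable that returns exactly `hop` samples per input frame, and `chunk` and `context` are illustrative values. Each window is decoded with a few extra frames on either side, and the samples belonging to those extra frames are discarded.

```python
from typing import Callable, List, Sequence


def chunked_decode(frames: Sequence, decode: Callable[[Sequence], List[float]],
                   hop: int, chunk: int = 64, context: int = 4) -> List[float]:
    """Decode frames in windows of `chunk`, padded with `context` frames per side."""
    samples: List[float] = []
    for start in range(0, len(frames), chunk):
        end = min(start + chunk, len(frames))
        lo = max(0, start - context)
        hi = min(len(frames), end + context)
        audio = decode(frames[lo:hi])
        # Keep only the samples that correspond to frames[start:end].
        left = (start - lo) * hop
        right = left + (end - start) * hop
        samples.extend(audio[left:right])
    return samples
```

Peak decoder input is bounded by `chunk + 2 * context` frames regardless of total length. With a real convolutional vocoder the result only approximates a full-length decode, and `context` would need to cover the decoder's receptive field to avoid audible seams.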

#Where dots.tts Fits

Updated comparison across all TTS models benchmarked on this machine, sorted by best RTF:

Model	Config	RTF	Output Rate	Notes
MOSS-TTS-Nano	8 threads, ONNX CPU	0.23	24 kHz	100M params, voice clone
dots.tts-mf	4 steps, GPU, short	0.35	48 kHz	Voice clone, 2B params
Echo-TTS	CPU optimized, 10 steps, long	0.38	44.1 kHz	Fastest high-quality
dots.tts-mf	4 steps, GPU, long	0.46	48 kHz	Voice clone, 2B params
Echo-TTS	GPU hybrid fp16, 10 steps, long	0.52	44.1 kHz	ROCm container
OmniVoice	8 steps, voice design, CPU	0.56	22 kHz	No reference audio
dots.tts-soar	10 steps, GPU, short	0.66	48 kHz	Voice clone, 2B params
dots.tts-soar	10 steps, GPU, medium	0.95	48 kHz	Voice clone, 2B params
VoxCPM.cpp	VoxCPM1.5 Q8_0, CPU	1.23	44.1 kHz	GGUF quantized
OmniVoice	8 steps, voice clone, CPU	1.52	22 kHz	With reference audio
DramaBox	10 steps + compile, long	1.75	48 kHz	Most expressive
MOSS-TTS 8B	16 threads, llama.cpp CPU	2.05	24 kHz	CPU-only

dots.tts-mf lands just behind MOSS-TTS-Nano and roughly level with the optimized Echo-TTS CPU path: RTF 0.35 on short text and 0.46 on long, against Echo-TTS’s 0.38 on long text. The key differentiator is that dots.tts runs at 48 kHz (the highest output rate in this comparison), uses a 2B parameter model (so quality should be meaningfully better than Nano’s 100M), and includes voice cloning in every benchmark run.

The catch is that it requires a GPU. MOSS-TTS-Nano and Echo-TTS CPU achieve comparable or better RTF without any GPU involvement. Judging by an earlier CPU run that was interrupted, dots.tts on CPU with these prompts would land at an RTF somewhere between 8 and 30, firmly outside real-time territory.

#Practical Takeaway

If you have an NVIDIA GPU with 12+ GB of VRAM and want the best combination of voice quality, speaker similarity, and latency, dots.tts-mf at 4 steps is a strong choice. RTF 0.35-0.46 with voice cloning at 48 kHz, under Apache-2.0, with a Python API that’s about 6 lines of code to use.

For CPU-only setups, MOSS-TTS-Nano (RTF 0.23) and optimized Echo-TTS (RTF 0.38) remain the better options. dots.tts is not designed for CPU inference and the model’s autoregressive architecture means every audio patch needs a full transformer forward pass, which is expensive without GPU acceleration.

The soar checkpoint is worth testing if you care more about output fidelity than latency. The SCA post-training tightens speaker similarity and text adherence compared to the base model, and their published Seed-TTS-Eval numbers put it at the top of the open-source leaderboard. RTF 0.66-0.95 for short-to-medium text is still usable for many applications.

Next up: testing with longer reference audio clips for continuation-mode cloning (which should improve speaker fidelity), and checking whether the vocoder OOM can be worked around with chunked decode or a hybrid GPU/CPU strategy similar to what worked for Echo-TTS.

#Links

GitHub: rednote-hilab/dots.tts
Technical report: arXiv 2606.07080
Checkpoints: dots.tts-soar, dots.tts-mf, dots.tts-base
Demo page: rednote-hilab.github.io/dots.tts-demo
Previous: Running MOSS-TTS on Strix Halo
Previous: Optimizing Echo-TTS: CPU Beats GPU
Previous: Benchmarking DramaBox on Strix Halo