Benchmarking dots.tts on Strix Halo
RedNote's 2B continuous autoregressive TTS hits RTF 0.35 on the NVIDIA 4070 Super with the MeanFlow-distilled checkpoint, putting it among the fastest voice-cloning-capable models I've tested locally.
dots.tts dropped last week from RedNote’s HiLab. It’s a 2B-parameter, fully continuous autoregressive TTS system with no discrete audio tokens anywhere in the pipeline. The backbone is a Qwen2.5-1.5B LLM paired with a semantic encoder and an autoregressive flow-matching head over a 48 kHz AudioVAE. On their published benchmarks it posts the best average Seed-TTS-Eval scores among open-source models and the highest speaker similarity on the MiniMax 24-language test set.
Three checkpoints are available under Apache-2.0: a pretrained base, a Self-Corrective Alignment (SCA) variant called dots.tts-soar optimized for zero-shot fidelity, and a MeanFlow-distilled variant called dots.tts-mf that fuses classifier-free guidance into the student network for low-latency inference at 2-4 NFE. I benchmarked the latter two on the same Strix Halo machine I’ve been using for my local TTS comparison series.
Setup
The repo provides a standard pyproject.toml with a constraints file for pinned versions. The install is clean:
git clone https://github.com/rednote-hilab/dots.tts.git ~/Code/dots.tts
cd ~/Code/dots.tts
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install -e . -c constraints/recommended.txt
This pulls PyTorch 2.8, transformers 4.57, and the full dependency tree including librosa, WeTextProcessing, and the lingua language detector. Total install is about 1.2 GB of wheels.
The runtime auto-detects CUDA if available. On this machine it landed on the RTX 4070 Super (11.6 GB VRAM), which matters because dots.tts in float32 with voice conditioning eats roughly 10-11 GB during generation. The Radeon 8060S was not used for these runs.
Model weights download from HuggingFace on first run. Each checkpoint is about 8.8 GB (the full 2B model in safetensors format, plus the AudioVAE, speaker encoder, and tokenizer). Subsequent runs load from the HF cache in about 15 seconds.
Benchmark Method
Same methodology as the MOSS-TTS, Echo-TTS, and DramaBox benchmarks: three text lengths, 1 warmup run per prompt, 3 measured runs, JSON + WAV artifacts. Voice cloning enabled using a reference audio clip (x-vector-only mode, no prompt transcript).
I wrote a benchmark script (scripts/benchmark_dots_tts.py) that calls the DotsTtsRuntime Python API directly, matching the CLI’s default parameters: guidance_scale=1.2, speaker_scale=1.5, ode_method=euler.
Two configurations:
- dots.tts-soar at num_steps=10 (quality-oriented, standard flow-matching)
- dots.tts-mf at num_steps=4 (MeanFlow distilled, guidance fused into student)
Both ran with 16 threads, float32 precision, on the RTX 4070 Super.
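For readers who want to reproduce the method, here is a minimal sketch of the warmup-plus-measured-runs loop and the RTF calculation (generation time divided by audio duration). It is not the actual scripts/benchmark_dots_tts.py: the `synthesize` callable, its return shape, and `fake_synthesize` are hypothetical stand-ins, and the real DotsTtsRuntime API is deliberately not shown.

```python
import statistics
import time
from typing import Callable, Dict, Tuple

# Hypothetical signature (an assumption, not the dots.tts API):
# takes the prompt text, returns (audio samples, sample rate in Hz).
SynthFn = Callable[[str], Tuple[list, int]]


def benchmark(synthesize: SynthFn, prompts: Dict[str, str],
              warmup: int = 1, runs: int = 3) -> Dict[str, dict]:
    """Time `synthesize` for each prompt and report mean time, duration, RTF."""
    results = {}
    for name, text in prompts.items():
        for _ in range(warmup):
            synthesize(text)  # discarded: absorbs lazy init and cache warmup
        gen_times, durations = [], []
        for _ in range(runs):
            start = time.perf_counter()
            samples, sample_rate = synthesize(text)
            gen_times.append(time.perf_counter() - start)
            durations.append(len(samples) / sample_rate)
        gen = statistics.mean(gen_times)
        dur = statistics.mean(durations)
        results[name] = {
            "chars": len(text),
            "gen_time_s": gen,
            "audio_s": dur,
            "rtf": gen / dur,  # below 1.0 means faster than real time
        }
    return results


def fake_synthesize(text: str) -> Tuple[list, int]:
    """Stand-in for a real TTS call: ~0.05 s of silence per character."""
    sample_rate = 48_000
    return [0.0] * int(len(text) * 0.05 * sample_rate), sample_rate
```

Calling `benchmark(fake_synthesize, {"short": "hello world"})` returns one entry per prompt with the same four columns as the tables below. On a GPU, a real harness would also need to wait for queued kernels to finish (for example with `torch.cuda.synchronize()`) before stopping the timer, otherwise the measured time can come out too low.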
dots.tts-soar (10 Steps)
| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short (58 chars) | 2.40s | 3.63s | 0.66 |
| Medium (211 chars) | 10.52s | 11.04s | 0.95 |
| Long (338 chars) | 25.82s | 18.67s | 1.38 |
Short text is well under real-time. Medium is on the edge at RTF 0.95. Long text crosses over at RTF 1.38, taking about 26 seconds to generate 19 seconds of audio.
The generation time scaling is interesting. dots.tts is autoregressive, so each audio patch requires a full LLM forward pass plus a multi-step flow-matching decode. Longer text produces more audio patches, so unlike non-autoregressive diffusion models, whose step count is fixed regardless of output length, generation time here grows at least linearly with output length. The per-patch cost also increases as the KV cache grows, which pushes the total past linear and explains why RTF gets worse on longer prompts instead of holding steady.
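A toy cost model makes the scaling argument concrete. This is a sketch under stated assumptions, not a profile of dots.tts: `base_cost` and `kv_slope` are hand-picked illustrative constants, and the patch duration is derived from the figure of roughly 190 patches for 30 seconds of speech quoted later in this post.

```python
# Seconds of audio per patch, from "190 patches is roughly 30 seconds".
PATCH_SECONDS = 30.0 / 190


def generation_time(n_patches: int, base_cost: float = 0.08,
                    kv_slope: float = 0.0024) -> float:
    """Total seconds to generate n patches.

    Patch i costs base_cost plus kv_slope * i, where i stands in for the
    number of earlier positions held in the KV cache. Both constants are
    illustrative assumptions, not measurements.
    """
    return sum(base_cost + kv_slope * i for i in range(n_patches))


def rtf(n_patches: int, **cost_kwargs: float) -> float:
    """Real-time factor: generation time divided by audio duration."""
    return generation_time(n_patches, **cost_kwargs) / (n_patches * PATCH_SECONDS)


for n in (20, 70, 120):
    print(f"{n:4d} patches: RTF {rtf(n):.2f} "
          f"(flat per-patch cost: {rtf(n, kv_slope=0.0):.2f})")
```

With a flat per-patch cost the RTF is the same at every length, because time and audio duration both scale with the patch count. Any per-patch cost that grows with position makes the total superlinear, so RTF climbs with length, which is the pattern in the table above.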
dots.tts-mf (4 Steps, MeanFlow)
| Text | Gen Time | Audio Duration | RTF |
|---|---|---|---|
| Short (58 chars) | 1.24s | 3.52s | 0.35 |
| Medium (211 chars) | 4.74s | 11.63s | 0.41 |
| Long (338 chars) | 8.27s | 17.81s | 0.46 |
MeanFlow changes the picture entirely. By fusing CFG into the student network, each audio patch needs a single model evaluation per step instead of the conditional + unconditional pair. Combined with 4 steps instead of 10, that cuts flow-head evaluations per patch from 20 to 4, roughly a 5x reduction in decode cost.
Short text at RTF 0.35 means 3.5 seconds of audio in 1.2 seconds. Long text at RTF 0.46 generates 18 seconds of speech in about 8 seconds. Every prompt length stays well under real-time.
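The evaluation count can be checked against the measured numbers with a few lines of arithmetic. This sketch assumes, as described above, that the soar checkpoint runs a conditional and an unconditional evaluation at every step while the MeanFlow student runs one; the RTF values are copied from the two tables.

```python
def evals_per_patch(num_steps: int, cfg_fused: bool) -> int:
    """Flow-head evaluations per audio patch.

    Classic classifier-free guidance evaluates the network twice per step
    (conditional and unconditional); a student with guidance fused in
    evaluates it once.
    """
    return num_steps * (1 if cfg_fused else 2)


soar_evals = evals_per_patch(10, cfg_fused=False)
mf_evals = evals_per_patch(4, cfg_fused=True)
reduction = soar_evals / mf_evals

# Measured RTF from the tables in this post: (soar, mf).
measured_rtf = {
    "short": (0.66, 0.35),
    "medium": (0.95, 0.41),
    "long": (1.38, 0.46),
}
speedups = {name: soar / mf for name, (soar, mf) in measured_rtf.items()}

print(f"flow-head evaluations: {soar_evals} vs {mf_evals} ({reduction:.0f}x fewer)")
for name, speedup in speedups.items():
    print(f"{name:>6}: end-to-end RTF improves {speedup:.1f}x")
```

The end-to-end improvement works out to roughly 1.9x on the short prompt and 3.0x on the long one, well short of 5x. That is what you would expect if a good share of each patch's cost sits outside the flow head, presumably in the LLM forward pass, which distillation does not shrink.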
The model load time for MeanFlow was notably higher (90 seconds vs 15 seconds for soar). This appears to be a one-time cost on first load from the HF cache, likely related to the distilled checkpoint’s initialization. Subsequent generations are fast.
bf16 Doesn’t Work (Yet)
The README recommends bfloat16 precision, and the Python API defaults to it. On CPU this triggers a dtype mismatch in the speaker conditioning path: the CAM++ x-vector encoder outputs float32 embeddings that hit a bfloat16 linear projection, and PyTorch raises a RuntimeError: mat1 and mat2 must have the same dtype. This is an upstream bug in the model’s dtype handling when the speaker encoder’s output isn’t cast to match the core model’s precision.
On GPU the runtime converts everything to CUDA and the mismatch doesn’t surface, but I ran in float32 for consistency and because bf16 GEMM performance on the 4070 Super isn’t meaningfully different from fp32 for a model this size.
VRAM Limits
The 4070 Super’s 12 GB constrains generation length. At 190 audio patches (roughly 30 seconds of speech), the BigVGAN vocoder decode OOM’d during the upsampling convolutions. The SnakeBeta activations in the residual blocks allocate large intermediate tensors that don’t fit alongside the ~10.7 GB already consumed by the model weights and KV cache.
Keeping prompts under about 350 characters (which produce 18-20 seconds of audio at these speaking rates) avoids the boundary. For longer generation you’d need to either chunk the vocoder decode, move the vocoder to CPU, or use a GPU with more VRAM.
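One low-effort way to stay under that boundary is to split long input into sentence-aligned chunks before synthesis. The sketch below is my own workaround rather than anything in the dots.tts repo: `chunk_text` is a hypothetical helper, and the 350-character default simply mirrors the limit observed above. It splits the text prompt, not the vocoder decode, so each chunk still has to be synthesized separately and the audio concatenated afterwards.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 350) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit: cut at the last space
        # that fits, or hard-cut if there is no space at all.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the synthesizer on its own. Prosody may not carry cleanly across chunk boundaries, so this trades some naturalness for predictable memory use.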
Where dots.tts Fits
Updated comparison across all TTS models benchmarked on this machine, sorted by best RTF:
| Model | Config | RTF | Output Rate | Notes |
|---|---|---|---|---|
| MOSS-TTS-Nano | 8 threads, ONNX CPU | 0.23 | 24 kHz | 100M params, voice clone |
| dots.tts-mf | 4 steps, GPU, short | 0.35 | 48 kHz | Voice clone, 2B params |
| Echo-TTS | CPU optimized, 10 steps, long | 0.38 | 44.1 kHz | Fastest high-quality |
| dots.tts-mf | 4 steps, GPU, long | 0.46 | 48 kHz | Voice clone, 2B params |
| Echo-TTS | GPU hybrid fp16, 10 steps, long | 0.52 | 44.1 kHz | ROCm container |
| OmniVoice | 8 steps, voice design, CPU | 0.56 | 22 kHz | No reference audio |
| dots.tts-soar | 10 steps, GPU, short | 0.66 | 48 kHz | Voice clone, 2B params |
| dots.tts-soar | 10 steps, GPU, medium | 0.95 | 48 kHz | Voice clone, 2B params |
| VoxCPM.cpp | VoxCPM1.5 Q8_0, CPU | 1.23 | 44.1 kHz | GGUF quantized |
| OmniVoice | 8 steps, voice clone, CPU | 1.52 | 22 kHz | With reference audio |
| DramaBox | 10 steps + compile, long | 1.75 | 48 kHz | Most expressive |
| MOSS-TTS 8B | 16 threads, llama.cpp CPU | 2.05 | 24 kHz | CPU-only |
dots.tts-mf slots in just behind MOSS-TTS-Nano and on either side of the optimized Echo-TTS CPU path: its short-prompt RTF of 0.35 edges out Echo-TTS’s 0.38, while its long-prompt 0.46 trails it. The key differentiator is that dots.tts runs at 48 kHz (tied with DramaBox for the highest output rate in this comparison), uses a 2B parameter model (so quality should be meaningfully better than Nano’s 100M), and includes voice cloning in every benchmark run.
The catch is that it requires a GPU. MOSS-TTS-Nano and Echo-TTS on CPU achieve comparable or better RTF without any GPU involvement. Based on an earlier CPU run that I interrupted, dots.tts on CPU with these prompts would land at an RTF of roughly 8-30, firmly outside real-time territory.
Practical Takeaway
If you have an NVIDIA GPU with 12+ GB of VRAM and want the best combination of voice quality, speaker similarity, and latency, dots.tts-mf at 4 steps is a strong choice. You get RTF 0.35-0.46 with voice cloning at 48 kHz, under Apache-2.0, with a Python API that’s about 6 lines of code to use.
For CPU-only setups, MOSS-TTS-Nano (RTF 0.23) and optimized Echo-TTS (RTF 0.38) remain the better options. dots.tts is not designed for CPU inference and the model’s autoregressive architecture means every audio patch needs a full transformer forward pass, which is expensive without GPU acceleration.
The soar checkpoint is worth testing if you care more about output fidelity than latency. The SCA post-training tightens speaker similarity and text adherence compared to the base model, and their published Seed-TTS-Eval numbers put it at the top of the open-source leaderboard. RTF 0.66-0.95 for short-to-medium text is still usable for many applications.
Next up: testing with longer reference audio clips for continuation-mode cloning (which should improve speaker fidelity), and checking whether the vocoder OOM can be worked around with chunked decode or a hybrid GPU/CPU strategy similar to what worked for Echo-TTS.
Links
- GitHub: rednote-hilab/dots.tts
- Technical report: arXiv 2606.07080
- Checkpoints: dots.tts-soar, dots.tts-mf, dots.tts-base
- Demo page: rednote-hilab.github.io/dots.tts-demo
- Previous: Running MOSS-TTS on Strix Halo
- Previous: Optimizing Echo-TTS: CPU Beats GPU
- Previous: Benchmarking DramaBox on Strix Halo