In my previous Echo-TTS post, I got the model running on Strix Halo and benchmarked both CPU-only and GPU hybrid paths. The GPU hybrid (fp16 DiT on the Radeon 8060S, fp32 autoencoder on CPU) hit RTF 0.52 for long text, the fastest TTS result on this machine. But the autoencoder’s 9-second CPU decode was 60% of the pipeline, and short text was unusable due to the fixed 640-frame latent sequence.

I wanted to see how far I could push the CPU-only path. The answer: far enough that it now outperforms the GPU hybrid.

# The Starting Point

Echo-TTS at 10 Euler steps on CPU (bf16, 16 threads):

| Text | Total Time | Audio Duration | RTF |
|------|-----------|----------------|-----|
| short (63 chars) | 17.4s | 4.3s | 3.70 |
| medium (207 chars) | 17.6s | 12.9s | 1.37 |
| long (545 chars) | 19.2s | 28.5s | 0.67 |

The pipeline has two phases: DiT sampling (~11s) and autoencoder decode (~8s). Sampling takes the larger share, and both costs are effectively constant regardless of text length, which is why short text has the worst RTF. Both phases needed work.

# What I Tried

Eight optimizations, tested independently against the baseline. Three worked well for sampling, one transformed the decoder, one fixed the short-text problem, and three did nothing useful.

## Joint CFG: 3 forward passes down to 2

Echo-TTS uses classifier-free guidance (CFG) during sampling. The original implementation runs three forward passes per CFG-active step: one conditional, one with text masked, one with speaker masked. This lets you scale text and speaker guidance independently.

I wrote a new sampler that masks both text and speaker in a single unconditional pass, dropping the batch from 3x to 2x. You lose independent guidance control, but each CFG-active step now runs two forward passes instead of three, which with the default 50% CFG window cuts total forward passes by roughly 25%.
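A minimal sketch of the idea, assuming a velocity-predicting DiT with a hypothetical signature `model(x, t, text, speaker)` and precomputed null embeddings (the actual Echo-TTS sampler differs in its details):

```python
import torch

def joint_cfg_velocity(model, x_t, t, text, speaker,
                       null_text, null_speaker, scale):
    # Conditional pass plus one joint-unconditional pass (text AND
    # speaker masked together), batched as 2x instead of 3x.
    x   = torch.cat([x_t, x_t], dim=0)
    tt  = torch.cat([t, t], dim=0)
    txt = torch.cat([text, null_text], dim=0)
    spk = torch.cat([speaker, null_speaker], dim=0)
    v_cond, v_uncond = model(x, tt, txt, spk).chunk(2, dim=0)
    # One shared guidance scale; the independent text/speaker knobs are gone.
    return v_uncond + scale * (v_cond - v_uncond)
```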

Long text sampling went from 11.4s to 8.5s. For a ~20 line code change, that’s a good trade.

## AE bfloat16: halving memory bandwidth

The Fish S1-DAC autoencoder defaults to float32. It’s a convolutional network, so its bottleneck is memory bandwidth, not compute. Switching to bfloat16 halves the data moved through the LPDDR5X bus on every decode.
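The change itself is essentially a dtype cast; here is a sketch, with `decode()` standing in for the actual S1-DAC decode entry point:

```python
import torch

def decode_bf16(ae: torch.nn.Module, latents: torch.Tensor) -> torch.Tensor:
    # Cast the conv-heavy decoder to bf16: half the bytes move across
    # the memory bus per decode, and bandwidth is the bottleneck.
    ae = ae.to(torch.bfloat16).eval()
    with torch.inference_mode():
        audio = ae.decode(latents.to(torch.bfloat16))  # hypothetical decode()
    return audio.float()  # back to fp32 before writing the wav
```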

Decode time dropped from 7.8s to 4.7s, a 40% reduction. This was the single biggest individual win.

## Narrow CFG window

CFG is only applied when the timestep falls within a configurable range. The default is cfg_min_t=0.5, meaning half the steps use the expensive 3x batch. Raising that to 0.7 means CFG only applies for roughly 30% of steps. Late-schedule steps (low t values) refine details rather than establish structure, so they don’t need guidance as much.
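In the sampler this amounts to one comparison per step; a sketch, assuming t runs from 1 down to 0 over the Euler schedule:

```python
def cfg_active(t: float, cfg_min_t: float = 0.7, cfg_max_t: float = 1.0) -> bool:
    # With the default cfg_min_t=0.5, guidance covers half the schedule;
    # raising it to 0.7 shrinks coverage to the early ~30% of steps.
    return cfg_min_t <= t <= cfg_max_t
```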

Sampling dropped ~21%. No code changes needed, just a parameter tweak.

## Dynamic sequence length

The DiT always generates 640 latent frames (~30 seconds of audio capacity), regardless of how much text you feed it. Short text produces a few seconds of speech followed by silence that gets trimmed. All that silence computation is wasted.

I added a function that estimates an appropriate frame count from the text byte length: roughly bytes * 3, clamped between 128 and 640, rounded to multiples of 16. Short text now generates 192 frames instead of 640, cutting compute proportionally.
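The heuristic, as described, with the rounding step doubling as shape alignment for torch.compile later:

```python
def estimate_frames(text: str, min_frames: int = 128,
                    max_frames: int = 640, align: int = 16) -> int:
    # ~3 latent frames per UTF-8 byte, clamped to the model's range,
    # then rounded up to a multiple of 16 to keep shapes aligned.
    n = len(text.encode("utf-8")) * 3
    n = max(min_frames, min(max_frames, n))
    return ((n + align - 1) // align) * align
```

For the 63-character short prompt this yields 63 × 3 = 189, rounded up to 192 frames.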

Short text total time went from 17.4s to 5.2s. Medium and long text are unchanged since they need most of the sequence length anyway.

## torch.compile

The compilation infrastructure was already in the codebase. Enabling it gives 12-20% faster sampling after a ~60 second warmup. It requires fixed tensor shapes to avoid recompilation, which the dynamic sequence length feature accommodates by rounding frame counts to aligned sizes.
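Enabling it is a one-liner on the DiT module (`dit` here is a hypothetical handle to the transformer):

```python
import torch

# First call pays the ~60s compile warmup; dynamic=False assumes static
# shapes, which the frame-count rounding above keeps to a small set.
dit = torch.compile(dit, dynamic=False)
```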

## What didn’t work

INT8 dynamic quantization was a net negative. PyTorch’s torch.ao.quantization API requires fp32 activations (bf16 isn’t supported), and the overhead of running fp32 throughout the model negates any INT8 GEMM speedup. Quality also degraded.

Thread count tuning did nothing because PyTorch already defaults to 16 threads on the 16-core Ryzen.

RoPE frequency caching was negligible. The computation being cached is trivially fast compared to the attention and MLP layers.

# Combined Results

Stacking all five effective optimizations:

| Text | Baseline | Optimized | Speedup | Baseline RTF | Optimized RTF |
|------|----------|-----------|---------|--------------|---------------|
| short | 17.4s | 3.8s | 4.6x | 3.70 | 0.88 |
| medium | 17.6s | 10.1s | 1.7x | 1.37 | 0.79 |
| long | 19.2s | 10.9s | 1.8x | 0.67 | 0.38 |

The long text RTF of 0.38 means the model generates audio 2.6x faster than real-time playback. This is on CPU alone, no GPU involved.

For context, the GPU hybrid path from my previous benchmarks hit RTF 0.52; the optimized CPU path’s RTF of 0.38 is 27% lower. The reason is straightforward: the GPU hybrid’s autoencoder still runs at fp32 on CPU (9.1s), while the optimized path uses bf16 (4.7s), and the joint CFG and narrow CFG reductions bring CPU sampling time down to roughly GPU-level performance.

# Quality Evaluation

Performance is meaningless if the audio sounds worse. I generated comparison samples across all configurations using a fixed seed, so any differences come purely from the optimizations.

I used five prompts (short sentence, multi-sentence reasoning, long narrative, tongue twisters, casual conversation), each generated under the baseline, each individual optimization, and the combined stack, and listened to all of them back-to-back.
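A sketch of the comparison harness; `synthesize`, `save_wav`, and the override keys are hypothetical stand-ins, and the point is resetting the seed per prompt so every configuration starts from the same noise:

```python
import torch

PROMPTS = ["short sentence", "multi-sentence reasoning", "long narrative",
           "tongue twisters", "casual conversation"]  # stand-ins for the real prompts

CONFIGS = {
    "baseline":   {},
    "joint_cfg":  {"joint_cfg": True},
    "ae_bf16":    {"ae_dtype": torch.bfloat16},
    "narrow_cfg": {"cfg_min_t": 0.7},
    "combined":   {"joint_cfg": True, "ae_dtype": torch.bfloat16, "cfg_min_t": 0.7},
}

for name, overrides in CONFIGS.items():
    for i, prompt in enumerate(PROMPTS):
        torch.manual_seed(1234)                  # identical noise across configs
        audio = synthesize(prompt, **overrides)  # hypothetical entry point
        save_wav(f"samples/{name}_{i:02d}.wav", audio)
```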

Zero-speaker mode (no voice cloning): No perceptible quality difference across any combination. Joint CFG, bf16 AE, and narrow CFG all produce output indistinguishable from baseline. The combined configuration sounds identical.

Voice cloning mode: I tested with a ~2 minute reference audio clip, generating the same five prompts under baseline, each optimization individually, and combined.

| Config | Long RTF | Voice Quality |
|--------|----------|---------------|
| baseline | 0.73 | Full fidelity |
| joint_cfg only | 0.64 | No degradation |
| ae_bf16 only | 0.61 | No degradation |
| narrow_cfg only | 0.68 | Degradation on short clips |
| combined | 0.49 | Slight degradation |

The narrow CFG window is the culprit. When guidance coverage drops from 50% to 30% of steps, shorter clips get proportionally less conditioning on the speaker identity. Joint CFG and bf16 AE are clean: they don’t affect voice cloning quality at all.

The practical split: for zero-speaker generation, use everything including narrow CFG. For voice cloning where speaker fidelity matters, drop narrow CFG and use joint CFG + bf16 AE only (RTF 0.61, no quality loss).

# Updated TTS Comparison

Where this lands relative to other models on the same Strix Halo hardware:

| Model | Config | RTF | Output Rate |
|-------|--------|-----|-------------|
| Echo-TTS | CPU optimized, long | 0.38 | 44.1 kHz |
| Echo-TTS | GPU hybrid, long | 0.52 | 44.1 kHz |
| OmniVoice | 8 steps, voice design | 0.56 | 22 kHz |
| Echo-TTS | CPU optimized, voice clone | 0.61 | 44.1 kHz |
| Echo-TTS | CPU baseline, long | 0.67 | 44.1 kHz |
| VoxCPM2 Python | 5 timesteps, short | 1.06 | 48 kHz |

Echo-TTS now dominates the long-form TTS category by a wide margin, and it does so on CPU without needing ROCm containers or GPU setup.

# Practical Takeaway

For long-form generation on Strix Halo, the optimized CPU path is now the best option: RTF 0.38, 44.1 kHz output, no ROCm dependencies, no GPU hang workarounds. Short text also became viable with dynamic sequence length (RTF 0.88 vs the previous 3.70).

For voice cloning, use joint CFG + bf16 AE without narrow CFG (RTF 0.61). The slight quality loss from narrow CFG on shorter clips isn’t worth the marginal speed improvement when speaker fidelity is the goal.

The optimizations themselves are straightforward. Joint CFG is a ~20 line sampler variant. Dynamic sequence length is a one-line heuristic. AE bf16 is a dtype flag. None of these required model retraining or architectural changes, just better utilization of what was already there.