Echo-TTS is a diffusion-based text-to-speech model by Jordan Darefsky that uses a DiT (Diffusion Transformer) backbone with joint cross-attention over text and speaker conditioning. It generates up to 30 seconds of 44.1kHz audio per pass, can clone any voice from a short reference clip, and uses a Fish Speech S1-DAC autoencoder to convert latents back into waveforms. I've been running through TTS models on my Strix Halo machine for the past few weeks (OmniVoice, VoxCPM2, Fish Audio), and Echo-TTS is the latest to get the full treatment.

# Getting It Running

The repository is CUDA-only out of the box. Every model loading function defaults to device="cuda", the audio loader depends on torchcodec (which needs specific FFmpeg shared libraries), and there’s no device auto-detection. On a machine with no NVIDIA hardware, none of this works.

I patched inference.py to auto-detect the device:

DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

I then changed the load_model_from_hf, load_fish_ae_from_hf, and load_pca_state_from_hf functions to default to this instead of hardcoded "cuda". I also made the torchcodec import conditional with a torchaudio.load() fallback, since torchcodec's FFmpeg ABI requirements are a pain in containers. The Gradio app needed the same .cuda() to .to(DEFAULT_DEVICE) treatment.
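The torchcodec fallback is roughly the following sketch; load_audio is a hypothetical helper name rather than the repo's actual function, and the torchcodec call is from memory:

```python
import torch
import torchaudio

# Prefer torchcodec when it imports cleanly; fall back to torchaudio otherwise.
try:
    from torchcodec.decoders import AudioDecoder
except ImportError:
    AudioDecoder = None


def load_audio(path: str) -> tuple[torch.Tensor, int]:
    """Return (waveform, sample_rate), working whether or not torchcodec is usable."""
    if AudioDecoder is not None:
        samples = AudioDecoder(path).get_all_samples()
        return samples.data, samples.sample_rate
    # torchaudio.load avoids torchcodec's FFmpeg shared-library requirements
    return torchaudio.load(path)
```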

For the Python environment, I used the same pattern that worked for OmniVoice and VoxCPM:

```bash
uv venv --python 3.13 .venv
source .venv/bin/activate
uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
uv pip install torchcodec huggingface-hub numpy safetensors einops "gradio==5.49.1" soundfile
```

Model weights are about 4.5 GB total across three HuggingFace repos (echo-tts-base, fish-s1-dac-min, and the PCA state). After the first download they cache normally, and subsequent loads take under a second.

# How Echo-TTS Works (and Why It Matters for Benchmarking)

Echo-TTS is architecturally different from the other models I’ve tested. It’s a pure diffusion model, not autoregressive. The DiT backbone runs a fixed number of Euler steps over a latent sequence of 640 positions (roughly 30 seconds of audio). Each step is a full forward pass through a 24-layer transformer with 2048-dimensional hidden states and joint cross-attention over separate text and speaker KV caches.
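To make the fixed-cost structure concrete, here is a generic fixed-step Euler sampling loop over the full latent sequence. Only the 640-position window and the one-full-forward-pass-per-step behavior come from above; the function signature, latent_dim, and the velocity parameterization are schematic stand-ins, not the repo's actual code.

```python
import torch

def euler_sample(model, text_kv, speaker_kv, num_steps=10, latent_dim=64):
    # Always the full 640-position (~30 s) latent window, regardless of text length.
    x = torch.randn(1, 640, latent_dim)
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        # One complete 24-layer transformer forward pass per step,
        # cross-attending to the cached text and speaker KVs.
        v = model(x, t, text_kv, speaker_kv)
        x = x + (t_next - t) * v  # Euler update along the predicted velocity
    return x
```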

The important implication: generation time is almost entirely determined by step count and sequence length, not text length. Whether you feed it 10 words or 100, the model still processes all 640 latent positions for the same number of steps. Short text just produces more silence/padding that gets trimmed. This means RTF improves dramatically with longer text, since the audio output gets longer while the compute stays constant.
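A quick sanity check on that claim, using the 5-step CPU numbers: the ~16 s generation time and ~4.6 s of short-text audio come from the short-text discussion below, and the ~29 s long clip is my assumption that long text nearly fills the 30-second window.

```python
# RTF = generation_time / audio_duration, with generation time roughly fixed.
generation_time = 16.0  # seconds at 5 steps, approximately constant across text lengths

for label, audio_seconds in [("short", 4.6), ("long", 29.0)]:
    print(f"{label}: RTF ≈ {generation_time / audio_seconds:.2f}")
# short: RTF ≈ 3.48  (close to the measured 3.46)
# long:  RTF ≈ 0.55  (matches the measured 0.55)
```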

# CPU Benchmarks

I wrote a benchmark script that tests three text lengths (short at 63 characters, medium at 207, long at 545) across four step counts, with 3 runs each after a warmup pass. All runs used 16 threads, bf16 model precision, and the full 640-position sequence length.
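The timing harness is nothing exotic; a simplified sketch follows, where generate stands in for the actual Echo-TTS generation call and is assumed to return a (waveform, sample_rate) pair:

```python
import statistics
import time
import torch

torch.set_num_threads(16)  # match the 16-thread CPU runs

def bench(generate, text, n_runs=3):
    """Time a generation call and report median wall time and RTF."""
    generate(text)  # warmup pass, not timed
    times, durations = [], []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        audio, sample_rate = generate(text)
        times.append(time.perf_counter() - t0)
        durations.append(audio.shape[-1] / sample_rate)
    gen_time = statistics.median(times)
    rtf = gen_time / statistics.median(durations)
    return gen_time, rtf
```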

| Steps | Short RTF | Medium RTF | Long RTF | Gen Time/Step |
|-------|-----------|------------|----------|---------------|
| 5     | 3.46      | 1.21       | 0.55     | ~3.2s         |
| 10    | 4.68      | 1.75       | 0.78     | ~2.2s         |
| 20    | 7.91      | 2.66       | 1.19     | ~1.7s         |
| 40    | 13.57     | 4.56       | 2.02     | ~1.4s         |

Long text is faster than real-time at 10 steps and below. At 5 steps it hits RTF 0.55, which is comparable to OmniVoice's voice design mode (0.56) on this same machine. Medium text approaches real-time at 5 steps with an RTF of 1.21.

Per-step time also decreases as step count increases, which I attribute to better CPU cache utilization across the repetitive forward passes. Each step takes about 1.4-3.2 seconds depending on the total workload.

Short text is impractical at any step count. The model generates maybe 4-5 seconds of audio but takes 16+ seconds even at 5 steps, because the fixed latent sequence (640 positions) gets processed regardless.

# The ROCm Journey: bf16 Hangs the GPU

This is where Echo-TTS diverges from my previous ROCm experiences. The rocm/pytorch:latest container (PyTorch 2.10.0+rocm7.2.2, HIP 7.2) detects the GPU fine and basic matmuls work. But loading Echo-TTS in its default bf16 precision and running inference causes a hard GPU hang:

```
MIOpen(HIP): Warning [IsEnoughWorkspace] Solver <GemmFwdRest>,
  workspace required: 440401920, provided ptr: 0 size: 0
HW Exception by GPU node-1 reason: GPU Hang
```

This happened every single time, regardless of SDPA backend configuration. I tried flash attention, efficient attention, math-only, and all three disabled. I tried MIOPEN_DEBUG_DISABLE_FIND_DB=1 and PYTORCH_HIP_ALLOC_CONF=expandable_segments:True. Same crash.
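For anyone retracing this, the backend cycling was roughly the following, using the standard PyTorch SDPA context manager; run_inference is a placeholder for the Echo-TTS sampling call, and none of these combinations avoided the hang:

```python
from torch.nn.attention import SDPBackend, sdpa_kernel

# Try each scaled-dot-product-attention backend in isolation.
for backend in (SDPBackend.FLASH_ATTENTION,
                SDPBackend.EFFICIENT_ATTENTION,
                SDPBackend.MATH):
    with sdpa_kernel(backend):
        run_inference()  # placeholder for the bf16 Echo-TTS sampling call
```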

This is a new finding. OmniVoice's bf16 worked fine in the ROCm container after warmup. The difference is likely the model's linear layer dimensions: Echo-TTS has 2048-wide hidden states with 5888-wide intermediate layers across 24 transformer blocks, and the MIOpen GemmFwd solver can't allocate the ~420 MB workspace it needs for these layers in bf16 on gfx1151.

# The fp16 Fix and Hybrid Strategy

Switching the DiT model to fp16 eliminates the hang completely. I validated this incrementally: model loading, the text encoder KV cache, the speaker encoder, and the full diffusion loop are all stable in fp16.

But the Fish S1-DAC autoencoder still hangs on GPU in any precision. Its convolutional layers hit the same MIOpen workspace issue. The solution: keep the DiT on GPU and the autoencoder on CPU.

```python
model = load_model_from_hf(device="cuda", dtype=torch.float16)
fish_ae = load_fish_ae_from_hf(device="cpu", dtype=torch.float32)
pca_state = load_pca_state_from_hf(device="cuda")
```

The sampling loop runs on the GPU, then the latents transfer to the CPU for the autoencoder decode. This adds a data-transfer step, but the decode was already the bottleneck.
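End to end, the hybrid path looks roughly like this; sample_latents and fish_ae.decode are paraphrases of the repo's entry points rather than exact signatures:

```python
import torch

with torch.no_grad():
    # DiT sampling on the GPU in fp16 ("cuda" maps to ROCm/HIP here).
    latents = sample_latents(model, pca_state, text, num_steps=10)

    # Move latents to the CPU before the Fish S1-DAC decode, since the
    # autoencoder's conv layers hit the same MIOpen workspace failure on GPU.
    latents = latents.to("cpu", dtype=torch.float32)
    waveform = fish_ae.decode(latents)  # 44.1 kHz audio, decoded on CPU
```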

# GPU Hybrid Benchmarks

Running the DiT in fp16 on the Radeon 8060S with the autoencoder decoding on CPU:

| Steps | Short RTF | Medium RTF | Long RTF |
|-------|-----------|------------|----------|
| 10    | 2.57      | 1.06       | 0.52     |
| 40    | 4.32      | 2.16       | 1.15     |

Long text at 10 steps hits RTF 0.52, the fastest result I’ve measured on any TTS model on this machine, beating OmniVoice’s 0.56 voice design mode.

The breakdown tells the story clearly:

| Steps | Component  | Short | Medium | Long  |
|-------|------------|-------|--------|-------|
| 10    | GPU sample | 4.5s  | 4.9s   | 6.3s  |
| 10    | CPU decode | 8.9s  | 9.2s   | 9.1s  |
| 40    | GPU sample | 17.7s | 19.6s  | 24.7s |
| 40    | CPU decode | 9.4s  | 9.1s   | 9.3s  |

GPU sampling is 2.2-3.1x faster than CPU for the diffusion part. But the autoencoder decode is a constant ~9 second overhead regardless of text length or step count. At 10 steps, it's 60% of the total pipeline time. That's the optimization target: if the S1-DAC autoencoder could run on GPU, the 10-step long-text RTF would drop from 0.52 to roughly 0.22.
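That 0.22 projection is straightforward arithmetic from the breakdown table, under the optimistic assumption that a GPU decode would add negligible time:

```python
# 10-step, long-text breakdown from the table above
gpu_sample = 6.3   # seconds
cpu_decode = 9.1   # seconds
rtf_measured = 0.52

audio_duration = (gpu_sample + cpu_decode) / rtf_measured  # ≈ 29.6 s of audio
rtf_projected = gpu_sample / audio_duration                # ≈ 0.21, i.e. roughly 0.22
```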

# The Full TTS Comparison

With four models now benchmarked on the same hardware, here’s where everything stands:

| Model          | Config                       | RTF  | Output Rate |
|----------------|------------------------------|------|-------------|
| Echo-TTS       | 10 steps, long, GPU hybrid   | 0.52 | 44.1 kHz    |
| Echo-TTS       | 5 steps, long, CPU           | 0.55 | 44.1 kHz    |
| OmniVoice      | 8 steps, voice design, CPU   | 0.56 | 22 kHz      |
| Echo-TTS       | 10 steps, medium, GPU hybrid | 1.06 | 44.1 kHz    |
| VoxCPM2 Python | 5 timesteps, short           | 1.06 | 48 kHz      |
| VoxCPM.cpp     | VoxCPM1.5 Q8_0, CPU          | 1.23 | 44.1 kHz    |
| OmniVoice      | 8 steps, voice clone, CPU    | 1.52 | 22 kHz      |

Echo-TTS takes the top spot for long-form generation. The caveat is that these benchmarks used no reference audio, so this is comparable to OmniVoice's voice design mode rather than cloning. Voice cloning adds reference audio encoding overhead that I haven't benchmarked yet. The other caveat is that Echo-TTS's advantage with long text comes from its architecture: the fixed-sequence diffusion approach amortizes generation cost over longer outputs in a way that autoregressive models can't.

# Practical Takeaway

For long-form TTS generation on Strix Halo, Echo-TTS with the GPU hybrid strategy is the fastest option I've found: RTF 0.52 at 10 diffusion steps with 44.1kHz output. If you don't want to deal with ROCm containers, CPU-only at 5-10 steps gets you to 0.55-0.78 RTF for long text, which is still faster than real-time with zero container setup.

Short text is where Echo-TTS falls down. The fixed 640-position latent sequence means you’re always paying for 30 seconds worth of compute regardless of how much speech you actually need. For short utterances, VoxCPM2 at 5 timesteps (RTF 1.06) or OmniVoice voice design (RTF 0.56) are better choices.

The bf16 GPU hang on gfx1151 is worth documenting for anyone else trying ROCm on RDNA 3.5. Echo-TTS's large linear layers (2048x5888) trigger MIOpen workspace allocation failures in bf16 that cause hard GPU hangs, not recoverable errors. Switching to fp16 fixes it completely. This is model-specific: OmniVoice bf16 works fine in the same container.