I’ve been on a bit of a local inference kick lately, and text-to-speech was the next frontier. Fish Audio S2-Pro is a 4-billion-parameter TTS model that produces remarkably natural speech, but it’s designed to run on beefy NVIDIA GPUs. I wanted to run it on my AMD Strix Halo machine’s integrated Radeon 8060S — a 40-CU iGPU sharing 128 GB of unified LPDDR5X with the CPU. No discrete GPU, no CUDA, no problem. Mostly.

The ROCm Problem

The first wall I hit was ROCm compatibility. The standard rocm/dev-ubuntu-24.04 Docker image segfaults immediately on Fedora’s kernel 6.18: HIP kernel dispatch fails with “no kernel image is available for execution on the device”. The Ubuntu ROCm userspace simply doesn’t have the patches for the KFD interface changes in kernel 6.18 and later.

The solution was building on top of a Fedora 43 base image with backported ROCm 7.2 packages that actually support modern kernels. From there it was a matter of wiring up Python 3.12, fish-speech, and replacing the CPU PyTorch wheels with AMD’s ROCm-enabled torch 2.8.0+rocm7.2 builds.
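The build boils down to a handful of steps. Here’s a condensed sketch of what my build script does inside the Fedora 43 base container — package names, the fish-speech install method, and the rocm7.2 wheel index are from my setup and may need adjusting for yours:

```shell
# Inside a fedora:43 base container: backported ROCm 7.2 userspace
# that understands the 6.18+ KFD interface (package names may vary)
dnf install -y rocm-hip-runtime rocminfo git python3.12 python3-pip

# fish-speech itself, installed from source
git clone https://github.com/fishaudio/fish-speech
pip3 install -e ./fish-speech

# The install above pulls CPU torch wheels; force-swap them for AMD's
# ROCm builds (index URL pattern per pytorch.org; rocm7.2 index assumed)
pip3 install --force-reinstall torch==2.8.0 \
  --index-url https://download.pytorch.org/whl/rocm7.2
```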

Making It Fast(er)

Out of the box, inference was 52.6x slower than real-time. The LLM phase was crawling at 0.9 tokens/sec. Not great.

Enabling torch.compile with AOTriton’s experimental scaled dot-product attention (TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1) was transformative: an 8.5x improvement in LLM throughput, jumping to 7.2–7.6 tokens/sec and pushing memory bandwidth from 4.2 GB/s to ~34 GB/s. That brought the overall real-time factor (RTF) down to about 30x.
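Since the AOTriton switch is just an environment variable, enabling it is a one-liner in the run script. The server entry point and flag name below are from my setup, not something to copy blindly:

```shell
# Route PyTorch's scaled dot-product attention through AOTriton's
# experimental Triton kernels -- this is where the 8.5x LLM speedup came from
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# Then start the server with compilation enabled. The entry point and
# flag here are from my setup; check your fish-speech version's --help.
python -m tools.api_server --compile
```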

I went down a rabbit hole testing additional optimizations: hipBLASLt, TunableOp, MIOpen exhaustive kernel search, torch.compile’s max-autotune mode. None of them moved the needle. torch.compile with Triton already generates fused kernels that bypass the BLAS library entirely, TunableOp found zero tunable operations, and MIOpen’s better convolution kernels require workspace memory that the model doesn’t allocate. The current configuration is already near-optimal for this hardware.

The Bottleneck

The remaining ~30x RTF is dominated by the VQ decoder, which converts tokens into audio waveforms and accounts for roughly 90% of wall time. The LLM phase itself is already near the practical memory bandwidth ceiling of this integrated GPU. Further improvement would require architectural changes to fish-speech’s VQ decoder — not something I can tune from the outside.
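That 90% figure puts a hard ceiling on what any further LLM-side tuning can buy, which is easy to see with Amdahl’s law. A quick back-of-the-envelope using the numbers above:

```shell
# Amdahl's law: with the VQ decoder at ~90% of wall time, even an
# infinitely fast LLM phase only shaves off the remaining ~10%.
rtf=30        # current overall real-time factor
vq_frac=0.90  # fraction of wall time spent in the VQ decoder
echo "$rtf $vq_frac" | \
  awk '{printf "best-case RTF with a free LLM phase: %.0fx\n", $1 * $2}'
# prints: best-case RTF with a free LLM phase: 27x
```

In other words, a perfect LLM phase would still leave me at roughly 27x real-time, which is why the VQ decoder is the only lever left.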

The Setup

Everything runs in a rootless Podman container with GPU passthrough. There’s a build script, a run script, and a systemd user service for running it persistently. First startup takes 5–7 minutes while the model loads and torch.compile warms up, but subsequent requests use cached compiled kernels.
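Wiring the container into a systemd user service is mostly standard systemd plumbing. The unit name here is from my setup; the commands themselves are stock systemd/loginctl:

```shell
# Enable and start the user service (unit name is from my setup)
systemctl --user enable --now fish-speech.service

# User units normally stop at logout; linger keeps them running
loginctl enable-linger "$USER"

# Watch the 5-7 minute model load / torch.compile warm-up on first start
journalctl --user -u fish-speech.service -f
```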

One quirk worth noting: if you’re sharing the iGPU with other workloads (I run llama.cpp via Vulkan alongside this), you’ll want HSA_ENABLE_SDMA=0 to avoid memory conflicts. It forces shader-based copies instead of the system DMA engine, with no measurable performance penalty.
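In practice that’s just one more `-e` flag on the container invocation. A minimal sketch of the run command — the image name and port are from my setup, while /dev/kfd and /dev/dri are the standard ROCm compute and render device nodes:

```shell
# Rootless GPU passthrough: /dev/kfd (compute) and /dev/dri (render nodes)
podman run --rm \
  --device /dev/kfd \
  --device /dev/dri \
  -e HSA_ENABLE_SDMA=0 \
  -p 8080:8080 \
  fish-speech-rocm:latest
```

HSA_ENABLE_SDMA=0 is set here rather than baked into the image, so the same image can run with SDMA enabled on a machine where the iGPU isn’t shared.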

What’s Next

30x real-time isn’t usable for interactive TTS, but it works fine for batch generation — pre-generating audio for content, building voice datasets, or just experimenting with voice cloning using S2-Pro’s reference audio support. I’m planning to wire this into my other projects where latency isn’t critical.

I’m also keeping an eye on the upstream fish-speech project for VQ decoder improvements, and on AMD’s ROCm stack as it matures on consumer integrated GPUs. This hardware has the memory capacity for serious models — it just needs the software to catch up.