Benchmarking dots.tts on Strix Halo
RedNote's 2B continuous autoregressive TTS hits RTF 0.35 on the NVIDIA 4070 Super with the MeanFlow-distilled checkpoint, putting it among the fastest voice-cloning-capable models I've tested locally.
BeeLlama.cpp's DFlash speculative decoding nearly triples dense model throughput on AMD Strix Halo, but in a strict head-to-head against my existing MTP setups, MTP still wins by 23-67% depending on the configuration.
Benchmarking OpenMOSS's 8B llama.cpp GGUF backend and 100M ONNX Nano model on AMD's Ryzen AI MAX+ 395, with thread scaling analysis and a surprising Nano result that beats everything else I've tested.
A community PR optimizing CUDA kernels for GFX1151 delivers +24% prefill throughput on MoE models, but combining those same kernel changes with MTP speculative decoding makes inference slower. Not every optimization stacks.
AMD released ROCm 7.13 with Strix Halo optimizations. I benchmarked kyuz0's latest toolbox images against my current ROCm 6.4.4 production baseline to see if upgrading my llama-swap stack is worth it. The answer is complicated.
Getting Resemble AI's expressive TTS model running on AMD Strix Halo with no NVIDIA hardware. TheRock gfx1151 nightlies, bitsandbytes preview for ROCm, reduced step counts, and torch.compile bringing the 3.3B DiT from RTF 4.0 down to 1.75.
Google's official Gemma 4 MTP assistant heads bring speculative decoding to MoE models that couldn't benefit before, and nearly quadruple dense model throughput on AMD Strix Halo's bandwidth-limited unified memory.
Eight optimization attempts on Echo-TTS CPU inference, the five that worked, quality evaluation with voice cloning, and how the optimized CPU path ended up faster than the GPU hybrid.
Multi-Token Prediction turns Qwen 3.6 27B from 6 t/s to 30 t/s on AMD Strix Halo, succeeding where draft models and ngram decoding failed, by using prediction heads baked into the model itself.
Running a diffusion-based TTS model on AMD's Strix Halo, patching CUDA-only code for CPU, discovering a bf16 GPU hang on gfx1151, and a hybrid GPU/CPU trick that beats every other TTS model I've tested.
Porting Tencent's CUDA-only 3D world model to AMD's Radeon 8060S via ROCm Docker, flash-attention CK kernels, a fully compiled gsplat with wave32 patches, and complete 3D reconstruction output including Gaussian splats.
Benchmarking speculative decoding with Gemma 4 E2B as a draft model for Gemma 4 31B on AMD Strix Halo, a bandwidth-bound setup where the optimal draft-max differs from discrete GPUs.
Setting up AMD's Lemonade Server on Strix Halo to run LLM and Whisper inference on the XDNA 2 NPU, driver builds, architecture decisions, and benchmarks against the integrated GPU.
Running a 600+ language zero-shot TTS model on an AMD integrated GPU, voice cloning benchmarks, ROCm compatibility adventures, and the container workaround that actually worked.
Running a 2B parameter tokenizer-free TTS model in both Python and C++ on AMD's integrated GPU, near-real-time speech synthesis on CPU, and the Vulkan crash that stopped GPU acceleration in its tracks.
Running Fish Audio's 4B parameter S2-Pro text-to-speech model locally on an AMD Strix Halo integrated GPU via ROCm and Podman.