DramaBox is Resemble AI’s prompt-driven TTS model, built as an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only DiT. Unlike the other TTS models I’ve been benchmarking, DramaBox is designed for expressive speech. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions. You write prompts like a screenplay:

A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha,
it is so good to see you!"

An optional 10-second voice reference clones the target timbre on top of whatever emotion and delivery the prompt describes. The model uses Euler flow matching with 30 denoising steps and Gemma 3 12B (4-bit quantized via bitsandbytes) as the text encoder.

This is a substantially bigger model than the others I’ve tested. The DiT transformer alone is 6.6 GB (3.3B parameters), the audio VAE and vocoder are another 1.9 GB, and Gemma adds ~8 GB in 4-bit. Total inference footprint is around 16.5 GB before activations.

# The Setup Challenge

DramaBox is CUDA-only. The codebase hardcodes device="cuda" in several places, uses torch.cuda.empty_cache() and torch.cuda.synchronize() without guards throughout the LTX-2 framework code, and depends on bitsandbytes for Gemma’s 4-bit quantization, which historically hasn’t worked on AMD.

My Strix Halo machine has no NVIDIA GPU. The Radeon 8060S (gfx1151) runs through the kernel’s amdgpu module, and I access GPU compute either through ROCm Docker containers or through TheRock’s pip-packaged nightlies. For this project I used the nightlies, the same approach that worked for Unsloth Studio.

# Getting It Running

I cloned the repo and set up two separate venvs: one for the CPU baseline and one for the GPU path with ROCm. Both use Python 3.13, since PyTorch doesn’t yet ship wheels for 3.14.

# CPU Path

Straightforward: CPU-only torch, standard pip installs for the dependencies, and bitsandbytes loading the pre-quantized Gemma checkpoint on CPU with its own backend (though the kernels package for optimized CPU GEMM conflicts with huggingface_hub<1.0, so I left it out).

The bigger issue was the LTX-2 framework. It calls torch.cuda.synchronize() and torch.cuda.empty_cache() in at least four files across the pipeline, including the model lifecycle context manager, the memory cleanup helper, the layer streaming wrapper, and the prompt encoder’s audio-only mode teardown. I patched all of them to guard with torch.cuda.is_available().
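
The patch is mechanical. A minimal sketch of the pattern applied at each call site (the helper name here is mine, not from the LTX-2 code):

```python
import torch

def free_accelerator_memory() -> None:
    # torch.cuda.synchronize() raises on a CPU-only build, so both calls
    # go behind the guard; on ROCm, torch.cuda.* maps to HIP and works as-is.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
```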

I also hit a dtype mismatch in the vocoder. It calls mel_spec.float() to convert the input to fp32, but when the model weights are loaded in bf16 (the default), the conv layers still have bf16 bias tensors. On CUDA this would be handled by autocast, but CPU autocast doesn’t support fp32 as a target dtype. The simplest fix: load everything in fp32 for the CPU path. With 128 GB of RAM, memory isn’t a constraint.
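
For illustration, here’s a toy reproduction of that kind of mismatch on CPU, not the actual DramaBox vocoder path:

```python
import torch

# bf16-loaded conv layer, standing in for the vocoder's conv1d stack
conv = torch.nn.Conv1d(80, 80, kernel_size=3, bias=True).to(torch.bfloat16)
mel = torch.randn(1, 80, 100, dtype=torch.bfloat16)

try:
    conv(mel.float())  # mel_spec.float() against bf16 parameters, no autocast on CPU
except RuntimeError as err:
    print(f"dtype mismatch: {err}")

# The CPU-path fix: keep the whole module in fp32 so nothing needs autocast.
out = conv.float()(mel.float())
print(out.dtype)  # torch.float32
```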

# ROCm GPU Path

This is where the Unsloth work paid off directly. The Unsloth Studio setup had already validated the full stack: TheRock gfx1151 nightlies for PyTorch + ROCm runtime as pip packages, and the bitsandbytes preview build for 4-bit quantization on AMD.

IDX="https://rocm.nightlies.amd.com/v2/gfx1151/"
uv pip install --index-url "$IDX" "rocm[libraries,devel]"
uv pip install --index-url "$IDX" torch torchaudio

For the rest of the Python dependencies, I installed them from PyPI separately, because pointing uv or pip at the nightlies index for packages like pydantic or scipy causes build failures (the nightly index doesn’t host those wheels). The critical thing is to verify that torch wasn’t replaced with a PyPI CUDA wheel after installing the other packages: uv’s resolver will happily swap the ROCm torch out for torch==2.12.0+cu130 if it sees a matching version constraint from PyPI.
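
A quick way to confirm the ROCm build survived the rest of the install (nothing DramaBox-specific here):

```python
import torch

print(torch.__version__)          # expect a +rocm build (e.g. 2.11.0+rocm7.13), not +cuXXX
print(torch.version.hip)          # non-None only on a ROCm/HIP build
print(torch.cuda.is_available())  # ROCm devices surface through the torch.cuda API
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should name the Radeon iGPU (gfx1151)
```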

bitsandbytes required the preview wheel from the continuous-release channel:

UV_SKIP_WHEEL_FILENAME_CHECK=1 uv pip install --reinstall --no-deps \
  "https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/\
continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl"

Standard bitsandbytes (≤0.49.2) has a 4-bit decode NaN bug on AMD. The preview build fixes this. You also need BNB_ROCM_VERSION=71 because HIP 7.13 looks for libbitsandbytes_rocm713.so but the preview ships libbitsandbytes_rocm71.so.
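
The override has to be visible before bitsandbytes loads its native library; setting it at the top of the entry script works. Roughly:

```python
import os

# The preview wheel ships libbitsandbytes_rocm71.so, but under HIP 7.13
# bitsandbytes derives "rocm713" and looks for a library that isn't there.
os.environ["BNB_ROCM_VERSION"] = "71"

import bitsandbytes as bnb  # noqa: E402  -- must come after the env var is set

print(bnb.__version__)  # the preview build reports 0.50.0.dev0
```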

With all of that in place, the model loads on the Radeon 8060S with Gemma in 4-bit taking 7.8 GB of VRAM. The DiT transformer and audio components load alongside it in fp16.

# Benchmarks

I wrote a benchmark script that loads all models warm (single load, reused across runs) and measures each pipeline stage: prompt encoding (Gemma), voice reference encoding (if used), denoising (30 Euler steps), and audio decoding (VAE + vocoder). Three text lengths, following the DramaBox prompt format, and consistent with how I’ve benchmarked the other models.
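
The harness itself is simple: one discarded warmup generation per configuration, then timed runs. A trimmed sketch of the per-stage timing, with placeholder bodies where the real script calls the DramaBox pipeline:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def timed_stage(name: str, results: dict):
    # Flush queued GPU work before and after, so the wall-clock window
    # covers this stage rather than a previous one still running.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    results[name] = time.perf_counter() - start

results: dict[str, float] = {}

# Placeholder stage bodies; the real script calls the DramaBox pipeline here.
with timed_stage("prompt_encode", results):
    ...  # Gemma over the screenplay-style prompt
with timed_stage("denoise", results):
    ...  # 30 Euler steps through the DiT
with timed_stage("decode", results):
    ...  # audio VAE + vocoder
print(results)
```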

# CPU (fp32, 16 threads, PyTorch 2.8.0+cpu)

| Text | Prompt | Denoise | Decode | Total | Audio length | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| short (50 chars) | 18.6s | 115.0s | 2.5s | 136.0s | 5.8s | 23.6 |
| medium (183 chars) | 14.5s | 199.1s | 11.6s | 225.2s | 14.4s | 15.6 |
| long (450 chars) | 15.4s | 639.4s | 55.7s | 710.5s | 35.5s | 20.0 |

This is not practical: 136 seconds to generate 5.8 seconds of audio. The 3.3B-parameter DiT running 30 denoising steps in fp32 on CPU just doesn’t have the throughput. A single denoising step takes 3.8-21 seconds depending on sequence length.

# ROCm GPU (fp16, PyTorch 2.11.0+rocm7.13, bitsandbytes 0.50.0.dev0)

1 warmup run + 3-run averages:

| Text | Prompt | Denoise | Decode | Total | Audio length | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| short (50 chars) | 5.0s | 29.1s | 4.9s | 39.3s | 5.8s | 6.8 |
| medium (183 chars) | 5.0s | 44.6s | 12.3s | 61.9s | 14.4s | 4.3 |
| long (450 chars) | 5.0s | 105.8s | 30.4s | 141.2s | 35.5s | 4.0 |

The GPU path is 3.5-5.0x faster end-to-end, and the denoising speedup is even better at 3.9-6.0x. The improvement scales with sequence length because longer latent sequences give the GPU more parallelism. Per-step denoising time is remarkably stable across runs: 0.97s for short text, 1.49s for medium, and 3.53s for long.

# Warmup Tax

The first generation on GPU is expensive. MIOpen needs to discover kernel configurations for the model’s specific tensor dimensions, and the GemmFwdRest solver churns through workspace allocation checks. First-run times: 256s for short text, 605s for long text. After that, every subsequent run is stable. This is consistent with what I’ve seen on every ROCm model: warmup is not optional if you want realistic benchmarks.

# The Decode Bottleneck

The audio decode (VAE + vocoder) is a significant chunk of the total pipeline: 30.4 seconds out of 141.2 for long text, about 22%. The vocoder uses conv1d layers that trigger MIOpen workspace warnings, the same pattern that caused GPU hangs with Echo-TTS’s S1-DAC autoencoder. In this case the ops complete successfully in fp16, but the workspace warnings suggest suboptimal kernel selection.

# Optimization: Fewer Steps

DramaBox defaults to 30 Euler steps. That’s a lot of DiT forward passes through a 3.3B model. The obvious question: how far can you reduce step count before quality degrades?

I benchmarked 10, 15, and 20 steps on GPU (same fp16 config, 1 warmup + 3-run averages):

| Steps | Short RTF | Medium RTF | Long RTF | Long denoise |
| --- | --- | --- | --- | --- |
| 30 | 6.80 | 4.30 | 3.97 | 105.8s |
| 20 | 4.77 | 3.24 | 2.97 | 70.3s |
| 15 | 4.01 | 2.73 | 2.48 | 52.8s |
| 10 | 3.23 | 2.22 | 1.98 | 35.0s |

The per-step time is essentially constant regardless of total step count, so the relationship is linear: 10 steps cuts the denoise time almost exactly 3x relative to 30 steps (35.0s vs 105.8s for long text). The prompt encode and decode times stay flat since they’re independent of step count.
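
A quick sanity check of that linearity against the measured numbers:

```python
# Long-text per-step latency measured at 30 steps: 105.8 s / 30 ≈ 3.53 s/step.
per_step_long = 105.8 / 30

for steps, measured in [(30, 105.8), (20, 70.3), (15, 52.8), (10, 35.0)]:
    predicted = steps * per_step_long
    print(f"{steps:2d} steps: predicted {predicted:6.1f}s  measured {measured:6.1f}s")
```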

At 10 steps the long-text RTF drops from 3.97 to 1.98, essentially halved. I listened to the outputs at each step count. Ten steps sounds noticeably less refined than 30: the prosody is a bit flat and there’s a subtle metallic artifact on certain consonants, but the words are clear and the emotional direction from the prompt still comes through. For batch generation or draft previews, 10-15 steps is very usable.

# Optimization: torch.compile

torch.compile with mode="reduce-overhead" on the velocity model gives a clean 22-24% speedup on the denoising loop, consistent across all sequence lengths:

| Config | Short RTF | Medium RTF | Long RTF | Long denoise |
| --- | --- | --- | --- | --- |
| 10 steps | 3.23 | 2.22 | 1.98 | 35.0s |
| 10 steps + compile | 2.89 | 1.97 | 1.75 | 26.7s |

Per-step latency drops from 0.88s to 0.68s for short text, and 3.50s to 2.67s for long. The compilation warmup happens during the first generation (absorbed into the MIOpen warmup pass that’s already required), so the effective cost is zero if you’re doing multiple generations.

RTF 1.75 for long text means DramaBox now takes 1.75 seconds to generate 1 second of audio. That’s a 2.3x improvement over the 30-step baseline of 3.97.
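
For reference, the compile wrapper is a one-liner. A sketch under the assumption that the pipeline exposes the velocity model as an attribute (the attribute name here is illustrative, not DramaBox’s documented API):

```python
import torch

def compile_velocity_model(pipeline):
    """Wrap the DiT ("velocity model") that the Euler loop calls once per step.

    `pipeline.velocity_model` is an assumed attribute name; adapt it to
    wherever the denoising network actually lives in the codebase.
    """
    pipeline.velocity_model = torch.compile(
        pipeline.velocity_model,
        mode="reduce-overhead",  # lower per-call overhead for repeated fixed-shape calls
    )
    # Graph compilation happens on the first generation, overlapping the MIOpen warmup.
    return pipeline
```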

# Optimization: Hybrid GPU/CPU Decode (Didn’t Help)

I tested the hybrid strategy that worked well for Echo-TTS, keeping the DiT on GPU for denoising but moving the audio decoder to CPU (fp32) to avoid MIOpen workspace pressure. Results at 10 steps:

| Config | Short RTF | Medium RTF | Long RTF |
| --- | --- | --- | --- |
| GPU decode (fp16) | 3.23 | 2.22 | 1.98 |
| CPU decode (fp32) | 3.00 | 2.59 | 2.51 |

CPU decode is marginally faster for short text (3.8s vs 4.8s) but significantly slower for medium and long (17.5s vs 12.3s, 48.8s vs 30.4s). Unlike Echo-TTS’s S1-DAC autoencoder, which crashes on GPU entirely, DramaBox’s vocoder runs fine on the Radeon 8060S in fp16. The GPU’s parallelism wins on the larger sequences.
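
For completeness, the hybrid variant amounts to moving the decode stage to CPU after denoising. A sketch with placeholder names rather than DramaBox’s actual objects:

```python
import torch

def decode_on_cpu(decoder: torch.nn.Module, latents: torch.Tensor) -> torch.Tensor:
    """Run the audio VAE + vocoder on CPU in fp32 after GPU denoising.

    Mirrors the Echo-TTS hybrid strategy; `decoder` and `latents` stand in
    for whatever objects the DramaBox pipeline actually produces.
    """
    decoder = decoder.to("cpu").float()  # sidestep MIOpen workspace pressure entirely
    with torch.no_grad():
        return decoder(latents.to("cpu", torch.float32))
```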

# Where DramaBox Fits

With the optimizations applied, here’s the updated comparison across all TTS models I’ve benchmarked on this machine:

| Model | Config | RTF | Notes |
| --- | --- | --- | --- |
| Echo-TTS | CPU optimized, 10 steps, long | 0.38 | Fastest overall |
| Echo-TTS | GPU hybrid fp16, 10 steps, long | 0.52 | ROCm container |
| OmniVoice | 8 steps, voice design, CPU | 0.56 | Simplest setup |
| VoxCPM.cpp | VoxCPM1.5 Q8_0, CPU | 1.23 | GGUF quantized |
| OmniVoice | 8 steps, voice clone, CPU | 1.52 | With reference audio |
| DramaBox | 10 steps + compile, long | 1.75 | Most expressive |
| DramaBox | 30 steps baseline, long | 4.0 | Default config |

DramaBox is still the slowest, but RTF 1.75 puts it much closer to practical territory than the initial 4.0. It’s also doing fundamentally more than the others: a 3.3B-parameter model with explicit emotional control through prompting (laughs, sighs, pauses, tone shifts), which none of the other models support. On NVIDIA hardware (H100), Resemble claims ~2.5 seconds per generation, which would put it solidly under real-time.

The optimization stack is additive: reduced steps gives ~2x, torch.compile adds another ~1.2x on top. Together they turn a 141-second generation into a 62-second one for 35 seconds of audio.

# Practical Notes

A few things worth flagging if you try this yourself:

bitsandbytes versioning is a mess. The preview wheel filename says 1.33.7.preview, but the installed package reports version 0.50.0.dev0. uv will complain about the mismatch unless you set UV_SKIP_WHEEL_FILENAME_CHECK=1. The same mismatch applies to the other bitsandbytes continuous-release builds.

The kernels package conflicts with huggingface_hub<1.0. DramaBox pins huggingface_hub below 1.0, but the kernels package (which bitsandbytes tries to use for its optimized CPU 4-bit forward) requires newer huggingface_hub internals. Don’t install it; bitsandbytes works fine without it.

fp16 works; bf16 is untested on the full model. Basic bf16 matmul passes on the TheRock nightlies, which is an improvement over older ROCm versions where bf16 caused hard GPU hangs. But I didn’t test full DramaBox inference in bf16. Given what I saw with bf16 on gfx1151 during the Echo-TTS work, fp16 is the safe choice.

Model downloads total ~16.5 GB. The auto-downloader in src/model_downloader.py handles everything, but budget for the disk space.

torch.compile warmup overlaps with MIOpen warmup. The first generation is already slow due to MIOpen kernel discovery. torch.compile’s graph compilation happens during that same warmup pass, so it doesn’t add a separate penalty. After warmup, every run benefits from the compiled graph.