DramaBox is Resemble AI’s prompt-driven TTS model, built as an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only DiT. Unlike the other TTS models I’ve been benchmarking, DramaBox is designed for expressive speech. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions. You write prompts like a screenplay:

A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha,
it is so good to see you!"

An optional 10-second voice reference clones the target timbre on top of whatever emotion and delivery the prompt describes. The model uses Euler flow matching with 30 denoising steps and Gemma 3 12B (4-bit quantized via bitsandbytes) as the text encoder.

This is a substantially bigger model than the others I’ve tested. The DiT transformer alone is 6.6 GB (3.3B parameters), the audio VAE and vocoder are another 1.9 GB, and Gemma adds ~8 GB in 4-bit. Total inference footprint is around 16.5 GB before activations.

# The Setup Challenge

DramaBox is CUDA-only. The codebase hardcodes device="cuda" in several places, uses torch.cuda.empty_cache() and torch.cuda.synchronize() without guards throughout the LTX-2 framework code, and depends on bitsandbytes for Gemma’s 4-bit quantization, which historically hasn’t worked on AMD.

My Strix Halo machine has no NVIDIA GPU. The Radeon 8060S (gfx1151) runs through the kernel’s amdgpu module, and I access GPU compute either through ROCm Docker containers or through TheRock’s pip-packaged nightlies. For this project I used the nightlies, the same approach that worked for Unsloth Studio.

# Getting It Running

I cloned the repo and set up two separate venvs: one for the CPU baseline and one for the GPU path with ROCm. Both use Python 3.13, since PyTorch doesn’t yet ship wheels for 3.14.

# CPU Path

Straightforward: CPU-only torch, standard pip installs for the dependencies, and bitsandbytes loading the pre-quantized Gemma checkpoint on CPU with its own backend (though the kernels package for optimized CPU GEMM conflicts with huggingface_hub<1.0, so I left it out).

The bigger issue was the LTX-2 framework. It calls torch.cuda.synchronize() and torch.cuda.empty_cache() in at least four files across the pipeline, including the model lifecycle context manager, the memory cleanup helper, the layer streaming wrapper, and the prompt encoder’s audio-only mode teardown. I patched all of them to guard with torch.cuda.is_available().
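
The patch is mechanical. A minimal sketch of the pattern applied at each call site (the helper name here is mine, not from the LTX-2 code):

```python
import torch

def free_accelerator_memory() -> None:
    # torch.cuda.synchronize() raises on a CPU-only build, so both calls
    # go behind the guard; on ROCm, torch.cuda.* maps to HIP and works as-is.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
```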

I also hit a dtype mismatch in the vocoder. It calls mel_spec.float() to convert the input to fp32, but when the model weights are loaded in bf16 (the default), the conv layers still have bf16 bias tensors. On CUDA this would be handled by autocast, but CPU autocast doesn’t support fp32 as a target dtype. The simplest fix: load everything in fp32 for the CPU path. With 128 GB of RAM, memory isn’t a constraint.
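
For illustration, here’s a toy reproduction of that kind of mismatch on CPU, not the actual DramaBox vocoder path:

```python
import torch

# bf16-loaded conv layer, standing in for the vocoder's conv1d stack
conv = torch.nn.Conv1d(80, 80, kernel_size=3, bias=True).to(torch.bfloat16)
mel = torch.randn(1, 80, 100, dtype=torch.bfloat16)

try:
    conv(mel.float())  # mel_spec.float() against bf16 parameters, no autocast on CPU
except RuntimeError as err:
    print(f"dtype mismatch: {err}")

# The CPU-path fix: keep the whole module in fp32 so nothing needs autocast.
out = conv.float()(mel.float())
print(out.dtype)  # torch.float32
```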

# ROCm GPU Path

This is where the Unsloth work paid off directly. The Unsloth Studio setup had already validated the full stack: TheRock gfx1151 nightlies for PyTorch + ROCm runtime as pip packages, and the bitsandbytes preview build for 4-bit quantization on AMD.

IDX="https://rocm.nightlies.amd.com/v2/gfx1151/"
uv pip install --index-url "$IDX" "rocm[libraries,devel]"
uv pip install --index-url "$IDX" torch torchaudio

For the rest of the Python dependencies, I installed them from PyPI separately, because pointing uv or pip at the nightlies index for packages like pydantic or scipy causes build failures (the nightly index doesn’t host those wheels). The critical thing is to verify that torch wasn’t replaced with a PyPI CUDA wheel after installing the other packages: uv’s resolver will happily swap the ROCm torch out for torch==2.12.0+cu130 if it sees a matching version constraint from PyPI.
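
A quick way to confirm the ROCm build survived the rest of the install (nothing DramaBox-specific here):

```python
import torch

print(torch.__version__)          # expect a +rocm build (e.g. 2.11.0+rocm7.13), not +cuXXX
print(torch.version.hip)          # non-None only on a ROCm/HIP build
print(torch.cuda.is_available())  # ROCm devices surface through the torch.cuda API
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should name the Radeon iGPU (gfx1151)
```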

bitsandbytes required the preview wheel from the continuous-release channel:

UV_SKIP_WHEEL_FILENAME_CHECK=1 uv pip install --reinstall --no-deps \
  "https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/\
continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl"

Standard bitsandbytes (≤0.49.2) has a 4-bit decode NaN bug on AMD. The preview build fixes this. You also need BNB_ROCM_VERSION=71 because HIP 7.13 looks for libbitsandbytes_rocm713.so but the preview ships libbitsandbytes_rocm71.so.
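
The override has to be visible before bitsandbytes loads its native library; setting it at the top of the entry script works. Roughly:

```python
import os

# The preview wheel ships libbitsandbytes_rocm71.so, but under HIP 7.13
# bitsandbytes derives "rocm713" and looks for a library that isn't there.
os.environ["BNB_ROCM_VERSION"] = "71"

import bitsandbytes as bnb  # noqa: E402  -- must come after the env var is set

print(bnb.__version__)  # the preview build reports 0.50.0.dev0
```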

With all of that in place, the model loads on the Radeon 8060S with Gemma in 4-bit taking 7.8 GB of VRAM. The DiT transformer and audio components load alongside it in fp16.

# Benchmarks

I wrote a benchmark script that loads all models warm (single load, reused across runs) and measures each pipeline stage: prompt encoding (Gemma), voice reference encoding (if used), denoising (30 Euler steps), and audio decoding (VAE + vocoder). Three text lengths, following the DramaBox prompt format, and consistent with how I’ve benchmarked the other models.
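
The harness itself is simple: one discarded warmup generation per configuration, then timed runs. A trimmed sketch of the per-stage timing, with placeholder bodies where the real script calls the DramaBox pipeline:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def timed_stage(name: str, results: dict):
    # Flush queued GPU work before and after, so the wall-clock window
    # covers this stage rather than a previous one still running.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    results[name] = time.perf_counter() - start

results: dict[str, float] = {}

# Placeholder stage bodies; the real script calls the DramaBox pipeline here.
with timed_stage("prompt_encode", results):
    ...  # Gemma over the screenplay-style prompt
with timed_stage("denoise", results):
    ...  # 30 Euler steps through the DiT
with timed_stage("decode", results):
    ...  # audio VAE + vocoder
print(results)
```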

# CPU (fp32, 16 threads, PyTorch 2.8.0+cpu)

| Text | Prompt | Denoise | Decode | Total | Audio length | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| short (50 chars) | 18.6s | 115.0s | 2.5s | 136.0s | 5.8s | 23.6 |
| medium (183 chars) | 14.5s | 199.1s | 11.6s | 225.2s | 14.4s | 15.6 |
| long (450 chars) | 15.4s | 639.4s | 55.7s | 710.5s | 35.5s | 20.0 |

This is not practical: 136 seconds to generate 5.8 seconds of audio. The 3.3B-parameter DiT running 30 denoising steps in fp32 on CPU just doesn’t have the throughput. A single denoising step takes 3.8-21 seconds depending on sequence length.

# ROCm GPU (fp16, PyTorch 2.11.0+rocm7.13, bitsandbytes 0.50.0.dev0)

1 warmup run + 3-run averages:

| Text | Prompt | Denoise | Decode | Total | Audio length | RTF |
| --- | --- | --- | --- | --- | --- | --- |
| short (50 chars) | 5.0s | 29.1s | 4.9s | 39.3s | 5.8s | 6.8 |
| medium (183 chars) | 5.0s | 44.6s | 12.3s | 61.9s | 14.4s | 4.3 |
| long (450 chars) | 5.0s | 105.8s | 30.4s | 141.2s | 35.5s | 4.0 |

The GPU path is 3.5-5.0x faster end-to-end, and the denoising speedup is even better at 3.9-6.0x. The improvement scales with sequence length because longer latent sequences give the GPU more parallelism. Per-step denoising time is remarkably stable across runs: 0.97s for short text, 1.49s for medium, and 3.53s for long.

# Warmup Tax

The first generation on GPU is expensive. MIOpen needs to discover kernel configurations for the model’s specific tensor dimensions, and the GemmFwdRest solver churns through workspace allocation checks. First-run times: 256s for short text, 605s for long text. After that, every subsequent run is stable. This is consistent with what I’ve seen on every ROCm model: warmup is not optional if you want realistic benchmarks.

# The Decode Bottleneck

The audio decode (VAE + vocoder) is a significant chunk of the total pipeline: 30.4 seconds out of 141.2 for long text, about 22%. The vocoder uses conv1d layers that trigger MIOpen workspace warnings, the same pattern that caused GPU hangs with Echo-TTS’s S1-DAC autoencoder. In this case the ops complete successfully in fp16, but the workspace warnings suggest suboptimal kernel selection.

# Optimization: Fewer Steps

DramaBox defaults to 30 Euler steps. That’s a lot of DiT forward passes through a 3.3B model. The obvious question: how far can you reduce step count before quality degrades?

I benchmarked 10, 15, and 20 steps on GPU (same fp16 config, 1 warmup + 3-run averages):

| Steps | Short RTF | Medium RTF | Long RTF | Long denoise |
| --- | --- | --- | --- | --- |
| 30 | 6.80 | 4.30 | 3.97 | 105.8s |
| 20 | 4.77 | 3.24 | 2.97 | 70.3s |
| 15 | 4.01 | 2.73 | 2.48 | 52.8s |
| 10 | 3.23 | 2.22 | 1.98 | 35.0s |

The per-step time is essentially constant regardless of total step count, so the relationship is linear: 10 steps cuts the denoise time almost exactly 3x relative to 30 steps (35.0s vs 105.8s for long text). The prompt encode and decode times stay flat since they’re independent of step count.
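
A quick sanity check of that linearity against the measured numbers:

```python
# Long-text per-step latency measured at 30 steps: 105.8 s / 30 ≈ 3.53 s/step.
per_step_long = 105.8 / 30

for steps, measured in [(30, 105.8), (20, 70.3), (15, 52.8), (10, 35.0)]:
    predicted = steps * per_step_long
    print(f"{steps:2d} steps: predicted {predicted:6.1f}s  measured {measured:6.1f}s")
```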

At 10 steps the long-text RTF drops from 3.97 to 1.98, essentially halved. I listened to the outputs at each step count. Ten steps sounds noticeably less refined than 30: the prosody is a bit flat and there’s a subtle metallic artifact on certain consonants, but the words are clear and the emotional direction from the prompt still comes through. For batch generation or draft previews, 10-15 steps is very usable.

# Optimization: torch.compile

torch.compile with mode="reduce-overhead" on the velocity model gives a clean 22-24% speedup on the denoising loop, consistent across all sequence lengths:

| Config | Short RTF | Medium RTF | Long RTF | Long denoise |
| --- | --- | --- | --- | --- |
| 10 steps | 3.23 | 2.22 | 1.98 | 35.0s |
| 10 steps + compile | 2.89 | 1.97 | 1.75 | 26.7s |

Per-step latency drops from 0.88s to 0.68s for short text, and 3.50s to 2.67s for long. The compilation warmup happens during the first generation (absorbed into the MIOpen warmup pass that’s already required), so the effective cost is zero if you’re doing multiple generations.

RTF 1.75 for long text means DramaBox now takes 1.75 seconds to generate 1 second of audio. That’s a 2.3x improvement over the 30-step baseline of 3.97.
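
For reference, the compile wrapper is a one-liner. A sketch under the assumption that the pipeline exposes the velocity model as an attribute (the attribute name here is illustrative, not DramaBox’s documented API):

```python
import torch

def compile_velocity_model(pipeline):
    """Wrap the DiT ("velocity model") that the Euler loop calls once per step.

    `pipeline.velocity_model` is an assumed attribute name; adapt it to
    wherever the denoising network actually lives in the codebase.
    """
    pipeline.velocity_model = torch.compile(
        pipeline.velocity_model,
        mode="reduce-overhead",  # lower per-call overhead for repeated fixed-shape calls
    )
    # Graph compilation happens on the first generation, overlapping the MIOpen warmup.
    return pipeline
```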

# Optimization: Hybrid GPU/CPU Decode (Didn’t Help)

I tested the hybrid strategy that worked well for Echo-TTS, keeping the DiT on GPU for denoising but moving the audio decoder to CPU (fp32) to avoid MIOpen workspace pressure. Results at 10 steps:

| Config | Short RTF | Medium RTF | Long RTF |
| --- | --- | --- | --- |
| GPU decode (fp16) | 3.23 | 2.22 | 1.98 |
| CPU decode (fp32) | 3.00 | 2.59 | 2.51 |

CPU decode is marginally faster for short text (3.8s vs 4.8s) but significantly slower for medium and long (17.5s vs 12.3s, 48.8s vs 30.4s). Unlike Echo-TTS’s S1-DAC autoencoder, which crashes on GPU entirely, DramaBox’s vocoder runs fine on the Radeon 8060S in fp16. The GPU’s parallelism wins on the larger sequences.
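
For completeness, the hybrid variant amounts to moving the decode stage to CPU after denoising. A sketch with placeholder names rather than DramaBox’s actual objects:

```python
import torch

def decode_on_cpu(decoder: torch.nn.Module, latents: torch.Tensor) -> torch.Tensor:
    """Run the audio VAE + vocoder on CPU in fp32 after GPU denoising.

    Mirrors the Echo-TTS hybrid strategy; `decoder` and `latents` stand in
    for whatever objects the DramaBox pipeline actually produces.
    """
    decoder = decoder.to("cpu").float()  # sidestep MIOpen workspace pressure entirely
    with torch.no_grad():
        return decoder(latents.to("cpu", torch.float32))
```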

# Where DramaBox Fits

With the optimizations applied, here’s the updated comparison across all TTS models I’ve benchmarked on this machine:

| Model | Config | RTF | Notes |
| --- | --- | --- | --- |
| Echo-TTS | CPU optimized, 10 steps, long | 0.38 | Fastest overall |
| Echo-TTS | GPU hybrid fp16, 10 steps, long | 0.52 | ROCm container |
| OmniVoice | 8 steps, voice design, CPU | 0.56 | Simplest setup |
| VoxCPM.cpp | VoxCPM1.5 Q8_0, CPU | 1.23 | GGUF quantized |
| OmniVoice | 8 steps, voice clone, CPU | 1.52 | With reference audio |
| DramaBox | 10 steps + compile, long | 1.75 | Most expressive |
| DramaBox | 30 steps baseline, long | 4.0 | Default config |

DramaBox is still the slowest, but RTF 1.75 puts it much closer to practical territory than the initial 4.0. It’s also doing fundamentally more than the others: a 3.3B-parameter model with explicit emotional control through prompting (laughs, sighs, pauses, tone shifts), which none of the other models support. On NVIDIA hardware (H100), Resemble claims ~2.5 seconds per generation, which would put it solidly under real-time.

The optimization stack is additive: reduced steps gives ~2x, torch.compile adds another ~1.2x on top. Together they turn a 141-second generation into a 62-second one for 35 seconds of audio.

# Practical Notes

A few things worth flagging if you try this yourself:

bitsandbytes versioning is a mess. The preview wheel filename says 1.33.7.preview, but the installed package reports version 0.50.0.dev0. uv will complain about the mismatch unless you set UV_SKIP_WHEEL_FILENAME_CHECK=1. The same mismatch applies to the other bitsandbytes continuous-release builds.

The kernels package conflicts with huggingface_hub<1.0. DramaBox pins huggingface_hub below 1.0, but the kernels package (which bitsandbytes tries to use for its optimized CPU 4-bit forward) requires newer huggingface_hub internals. Don’t install it; bitsandbytes works fine without it.

fp16 works; bf16 is untested on the full model. Basic bf16 matmul passes on the TheRock nightlies, which is an improvement over older ROCm versions where bf16 caused hard GPU hangs. But I didn’t test full DramaBox inference in bf16. Given what I saw with bf16 on gfx1151 during the Echo-TTS work, fp16 is the safe choice.

Model downloads total ~16.5 GB. The auto-downloader in src/model_downloader.py handles everything, but budget for the disk space.

torch.compile warmup overlaps with MIOpen warmup. The first generation is already slow due to MIOpen kernel discovery. torch.compile’s graph compilation happens during that same warmup pass, so it doesn’t add a separate penalty. After warmup, every run benefits from the compiled graph.