ROCm 7 on Strix Halo: Benchmarking the New Toolbox Images

AMD released ROCm 7.13 as a tech preview today. The Phoronix coverage caught my eye because the release notes specifically mention “new optimizations for Ryzen AI Max 300 Strix Halo” alongside Instinct MI350P support, expanded APU coverage, and the ROCprof Trace Decoder going open source.

My entire local inference stack runs through toolbox containers from kyuz0/amd-strix-halo-toolboxes. The current production container is llama-rocm-6.4.4, which I migrated to in early May after it showed +17% to +47% prompt processing gains over the Vulkan RADV backend. It’s been stable since. The question now is whether a newer ROCm version moves the needle further.

#What I Tested

kyuz0 doesn’t publish a rocm-7.13 tag, but the refresh script I maintain has two newer targets:

- llama-rocm-7.2.3, using rocm-core-7.2.3 from the ROCm 7.2 production stream
- llama-rocm7-nightlies, built from TheRock nightly builds

I refreshed both candidates without touching my active llama-rocm-6.4.4 container, then ran llama-bench side-by-side with identical flags.

One thing worth noting upfront: this isn’t a pure ROCm comparison. The candidate containers also ship a newer llama.cpp build (9199 vs 9067 on the baseline), so the results reflect both the ROCm runtime change and 132 commits’ worth of llama.cpp changes to the HIP backend, GEMM kernels, and graph scheduling. Separating the two variables would require building the same llama.cpp revision against both ROCm versions, which I haven’t done yet.

#Version Snapshot

| Container | llama.cpp | ROCm |
| --- | --- | --- |
| `llama-rocm-6.4.4` | build 9067 (`44dbe8c52`) | `rocm-core-6.4.4` (Fedora 43 packages) |
| `llama-rocm-7.2.3` | build 9199 (`39cf5d619`) | `rocm-core-7.2.3` (RHEL 10 packages) |
| `llama-rocm7-nightlies` | build 9199 (`39cf5d619`) | TheRock, `/opt/rocm-7.0`, version file says 7.14.0 |

That last line surprised me. The nightlies container self-reports as ROCm 7.14.0, not 7.13. TheRock nightly builds roll forward continuously, so by the time kyuz0’s CI rebuilt the image today it had already moved past the 7.13 tag. Something to keep in mind if you’re trying to pin to a specific release.

#Benchmark Setup

Same flags I use in production, matched across all three containers:

llama-bench \
  -m <model.gguf> \
  -fa 1 -ngl 999 \
  -b 512 -ub 256 \
  -ctk q8_0 -ctv q8_0 \
  -mmp 0 \
  -p 512 -n 128 \
  -r 3 -o md

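I tested three models that fit in the llama-bench allocation path: Qwen3.5-9B, Qwen3.5-4B, and Gemma 4 E4B, all Q8_K_XL quantizations. PP is prompt processing throughput on the 512-token prompt (`-p 512`) and TG is token generation throughput over 128 tokens (`-n 128`), both in tokens per second.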
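To keep the flags matched across containers, the invocation can be driven from one script. This is a minimal sketch, not the refresh script mentioned earlier: it assumes the containers are reachable with `toolbox run -c <name>` and that `llama-bench` is on the PATH inside each one. The helper names `build_cmd` and `run_all` are made up for this example.

```python
import shlex
import subprocess

# Toolbox names from the version snapshot table.
CONTAINERS = ["llama-rocm-6.4.4", "llama-rocm-7.2.3", "llama-rocm7-nightlies"]

# The llama-bench flags from this post, kept in one place so every
# container gets exactly the same invocation.
BENCH_FLAGS = [
    "-fa", "1", "-ngl", "999",
    "-b", "512", "-ub", "256",
    "-ctk", "q8_0", "-ctv", "q8_0",
    "-mmp", "0",
    "-p", "512", "-n", "128",
    "-r", "3", "-o", "md",
]


def build_cmd(container: str, model_path: str) -> list[str]:
    """Return the argv that runs llama-bench inside one toolbox.

    Assumption: llama-bench is on PATH inside the container.
    """
    return ["toolbox", "run", "-c", container,
            "llama-bench", "-m", model_path, *BENCH_FLAGS]


def run_all(model_path: str) -> dict[str, str]:
    """Run the benchmark in each container and collect its markdown output."""
    results = {}
    for container in CONTAINERS:
        cmd = build_cmd(container, model_path)
        print("running:", shlex.join(cmd))
        # check=True stops the sweep on the first failing container.
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results[container] = proc.stdout
    return results
```

Calling `run_all("/path/to/model.gguf")` (a placeholder path) returns one markdown table per container. Running the containers one after another, rather than in parallel, keeps them from competing for the same GPU.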

#Results

| Container | Model | PP (t/s) | TG (t/s) |
| --- | --- | --- | --- |
| rocm-6.4.4 | Qwen3.5-9B | 992.75 | 16.58 |
| rocm-7.2.3 | Qwen3.5-9B | 903.23 | 15.87 |
| rocm7-nightlies | Qwen3.5-9B | 928.09 | 17.95 |
| rocm-6.4.4 | Qwen3.5-4B | 1729.50 | 29.49 |
| rocm-7.2.3 | Qwen3.5-4B | 1837.20 | 31.39 |
| rocm7-nightlies | Qwen3.5-4B | 1579.35 | 30.48 |
| rocm-6.4.4 | Gemma 4 E4B | 1512.61 | 30.28 |
| rocm-7.2.3 | Gemma 4 E4B | 1536.94 | 28.59 |
| rocm7-nightlies | Gemma 4 E4B | 1801.36 | 30.81 |

#Deltas vs Baseline

| Container | Model | PP delta | TG delta |
| --- | --- | --- | --- |
| rocm-7.2.3 | Qwen3.5-9B | -9.0% | -4.3% |
| rocm7-nightlies | Qwen3.5-9B | -6.5% | +8.3% |
| rocm-7.2.3 | Qwen3.5-4B | +6.2% | +6.4% |
| rocm7-nightlies | Qwen3.5-4B | -8.7% | +3.4% |
| rocm-7.2.3 | Gemma 4 E4B | +1.6% | -5.6% |
| rocm7-nightlies | Gemma 4 E4B | +19.1% | +1.8% |

No clear winner. Each backend wins on some model/metric combinations and loses on others. The nightlies container posted the single best result in the set, +19.1% prompt processing on Gemma 4 E4B, but also a -8.7% prompt processing regression on Qwen3.5-4B, second only to the worst result in the set: -9.0% for rocm-7.2.3 on Qwen3.5-9B prompt processing.
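The deltas are plain percent changes against the rocm-6.4.4 row for the same model. A short sketch of that arithmetic, using the throughput numbers from the results table (the names `pct_change` and `deltas` are just for this example):

```python
# Throughput in tokens/s, copied from the results table: (PP, TG).
RESULTS = {
    "Qwen3.5-9B": {
        "rocm-6.4.4": (992.75, 16.58),
        "rocm-7.2.3": (903.23, 15.87),
        "rocm7-nightlies": (928.09, 17.95),
    },
    "Qwen3.5-4B": {
        "rocm-6.4.4": (1729.50, 29.49),
        "rocm-7.2.3": (1837.20, 31.39),
        "rocm7-nightlies": (1579.35, 30.48),
    },
    "Gemma 4 E4B": {
        "rocm-6.4.4": (1512.61, 30.28),
        "rocm-7.2.3": (1536.94, 28.59),
        "rocm7-nightlies": (1801.36, 30.81),
    },
}
BASELINE = "rocm-6.4.4"


def pct_change(new: float, base: float) -> float:
    """Percent change of `new` relative to `base`."""
    return (new / base - 1.0) * 100.0


def deltas(results: dict, baseline: str = BASELINE) -> list[tuple]:
    """Return (container, model, PP delta %, TG delta %) for each candidate."""
    rows = []
    for model, by_container in results.items():
        base_pp, base_tg = by_container[baseline]
        for container, (pp, tg) in by_container.items():
            if container == baseline:
                continue  # the baseline is the reference, not a candidate
            rows.append((container, model,
                         pct_change(pp, base_pp), pct_change(tg, base_tg)))
    return rows


for container, model, d_pp, d_tg in deltas(RESULTS):
    print(f"{container:16s} {model:12s} PP {d_pp:+6.1f}%  TG {d_tg:+6.1f}%")
```

Sorting the rows by the PP or TG column is a quick way to find the best and worst cases when more models are added.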

#The Large Model Problem

My production workloads are mostly 27B to 35B class models: Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma 4 31B. I tried loading Gemma 4 31B Q8 through llama-bench in all three containers, and all three failed with `cudaMalloc failed: out of memory` (llama.cpp’s HIP backend reuses its CUDA code, so the error keeps the CUDA name). The reported GPU-visible memory was only about 14 GiB, even though the machine has 128 GB of unified RAM.

This is a llama-bench allocation path issue, not a runtime issue. These same models load and run fine through llama-server in the toolbox (that’s my production setup). But it means the numbers above are directional for smaller models only. I still need to run a proper serving-path benchmark with the actual 27-35B models to make a real migration decision.
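One way to get serving-path numbers is to ask a running llama-server for its own timings. The sketch below is an outline under several assumptions, not a finished benchmark: the server is already running at the address in `SERVER_URL`, the `/completion` endpoint is used, and the response includes a `timings` object with `prompt_per_second` and `predicted_per_second` fields. Check those field names against the llama.cpp build in your container. `run_once` and `summarize` are names made up for this example.

```python
import json
import statistics
import urllib.request

# Assumption: llama-server is listening here; adjust host and port as needed.
SERVER_URL = "http://127.0.0.1:8080"


def run_once(prompt: str, n_predict: int = 128, url: str = SERVER_URL) -> dict:
    """Send one completion request and return the server-reported timings."""
    payload = json.dumps({
        "prompt": prompt,
        "n_predict": n_predict,
        "cache_prompt": False,  # avoid prompt-cache hits inflating PP numbers
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumption: the response carries a "timings" object with the keys
    # read in summarize() below.
    return body["timings"]


def summarize(timings: list[dict]) -> dict:
    """Average prompt-processing and generation throughput over several runs."""
    return {
        "pp_tps": statistics.mean(t["prompt_per_second"] for t in timings),
        "tg_tps": statistics.mean(t["predicted_per_second"] for t in timings),
    }
```

A run would look like `summarize([run_once(prompt) for _ in range(3)])`, with a prompt long enough to approximate the 512-token prompt test. The numbers are not directly comparable to llama-bench output, since the server adds sampling and HTTP overhead, but they are comparable across containers if the server flags are kept identical.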

#What I’m Taking Away

The ROCm 7.13 release notes call out “new optimizations for Ryzen AI Max 300,” but the highlighted improvement is RCCL multi-node clustering for distributed inference over Ethernet. That’s relevant for multi-machine tensor parallelism setups, not single-node llama.cpp inference. The other library changes (hipBLASLt batched GEMM, Composable Kernel quantization kernels) are more likely to show up in PyTorch/vLLM workloads than in llama.cpp’s hand-rolled HIP kernels.

This matches what the benchmarks show. There’s no dramatic across-the-board improvement from moving to a newer ROCm for this specific use case. Results swing more from model to model than from backend to backend: the nightlies container alone ranges from -8.7% to +19.1% on prompt processing depending on the model. That is the same pattern I’ve seen with every other tuning parameter on this machine. Speculative decoding is still the biggest single throughput lever. Backend choice is secondary.
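The model-versus-backend comparison can be made concrete with the prompt processing deltas from the table. This is a rough illustration with only three models per backend, so it shows the shape of the data and not a statistical result. `backend_summary` is a name made up for this example.

```python
import statistics

# Prompt-processing deltas vs the rocm-6.4.4 baseline, in percent, copied
# from the deltas table (order: Qwen3.5-9B, Qwen3.5-4B, Gemma 4 E4B).
PP_DELTAS = {
    "rocm-7.2.3": [-9.0, 6.2, 1.6],
    "rocm7-nightlies": [-6.5, -8.7, 19.1],
}


def backend_summary(deltas: dict[str, list[float]]) -> dict[str, dict]:
    """For each backend: mean delta and best-to-worst spread across models."""
    return {
        backend: {
            "mean": statistics.mean(values),
            "spread": max(values) - min(values),
        }
        for backend, values in deltas.items()
    }


summary = backend_summary(PP_DELTAS)
for backend, s in summary.items():
    print(f"{backend:16s} mean {s['mean']:+.1f}%  spread {s['spread']:.1f} points")
```

On these numbers the average prompt processing change per backend is about -0.4% for rocm-7.2.3 and +1.3% for the nightlies, while the spread across models within one backend is roughly 15 and 28 percentage points. The TG column can be checked the same way.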

I’m keeping llama-rocm-6.4.4 as my production toolbox for now. The plan is to run a second pass with the larger models through the normal llama-server serving path, where memory allocation works differently, and only switch if there’s a consistent win with no stability regressions. If you’re running Strix Halo, the same advice applies: benchmark your specific models with your specific flags before migrating. The effect of a ROCm upgrade depends on the model family; it is not universally better or worse.

#Links

ROCm 7.13 release notes: rocm.docs.amd.com
TheRock 7.13 release: github.com/ROCm/TheRock
Toolbox images: kyuz0/amd-strix-halo-toolboxes
Infrastructure context: Local LLM Infrastructure on Strix Halo
Speculative decoding benchmarks: Speculative Decoding on Strix Halo