Most of my work on the Strix Halo has been inference-focused: building llama.cpp toolboxes, benchmarking speculative decoding, tuning Vulkan and ROCm backends for generation throughput. Training has always been the missing piece. Strix Halo has 128 GB of unified memory that both CPU and GPU can address directly, which is a lot of VRAM by consumer GPU standards. The question was whether anyone had made the software stack work.

Unsloth Studio is an open-source, no-code web UI for training, running, and exporting models. It handles LoRA, QLoRA, full fine-tuning, data preparation, model export to GGUF and safetensors, and chat inference, all from a browser. The catch is that on AMD, the official line is “Studio support coming soon.” In practice, it works right now if you fix a few things.

# The Problem

Unsloth Studio runs as a Python web app backed by PyTorch. On NVIDIA, the installer handles everything. On AMD, you need ROCm-enabled PyTorch, which means either system-level ROCm (/opt/rocm, hipcc, the full stack) or pip-packaged ROCm from TheRock.

My Strix Halo setup doesn’t have system ROCm. The GPU runs through the kernel’s amdgpu module with /dev/kfd exposed, and I access ROCm through podman toolbox containers for llama.cpp inference. The Unsloth installer runs on the host, can’t find rocminfo or amd-smi, and installs CPU-only PyTorch. Studio starts, but with "chat_only": true, meaning no training, no safetensors inference, just GGUF chat through llama.cpp.

The additional complication is that gfx1151 (the Radeon 8060S integrated GPU) isn’t in the standard PyTorch ROCm wheels from pytorch.org. Those target mainstream discrete AMD GPUs. For Strix Halo, you need the gfx1151-specific nightly builds.
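If you want to confirm the gfx target without any ROCm userspace installed, the kernel’s KFD topology already exposes it:

grep -H gfx_target_version /sys/class/kfd/kfd/topology/nodes/*/properties

The GPU node should report gfx_target_version 110501, which decodes as gfx target 11.5.1, i.e. gfx1151; CPU-only nodes report 0.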

# What Worked

A community guide validated this on Ubuntu. I adapted it for Fedora 43. The approach is:

  1. Let the official installer set up the Studio venv and all non-PyTorch dependencies
  2. Replace CPU PyTorch with ROCm PyTorch from the gfx1151 nightlies
  3. Fix three known ROCm-specific issues via a sitecustomize.py in the venv

# Running the Installer

curl -fsSL https://unsloth.ai/install.sh | sh -s -- --python 3.13

The --python 3.13 flag is important. My system Python is 3.14, and PyTorch doesn’t publish wheels for cp314 yet. The installer creates a venv at ~/.unsloth/studio/unsloth_studio/ using uv, installs Unsloth and its dependencies, builds llama.cpp for GGUF inference, and sets up the Studio frontend.

One quirk on Fedora: the installer’s first run can fail silently (exit 1 with no visible error). Running the downloaded script directly with bash -x to get a trace showed it completing fine. If the piped command fails, download the script and run it directly.
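Concretely (the argument pass-through is the same as the sh -s -- form above, and -x leaves a trace you can inspect if it fails again):

curl -fsSLO https://unsloth.ai/install.sh
bash -x install.sh --python 3.13 2>&1 | tee unsloth-install.log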

After install, PyTorch is 2.10.0+cpu. Expected, since the installer couldn’t detect an AMD GPU.
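You can confirm the CPU-only state from the venv before changing anything:

"$HOME/.unsloth/studio/unsloth_studio/bin/python" -c \
  "import torch; print(torch.__version__, torch.cuda.is_available())"

which prints 2.10.0+cpu False at this stage.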

# Replacing PyTorch with gfx1151 Nightlies

TheRock publishes pip-packaged ROCm builds at https://rocm.nightlies.amd.com/v2/gfx1151/. These bundle the entire ROCm runtime as Python wheels: libamdhip64, libhsa-runtime64, libhsakmt, the works. No system /opt/rocm needed. The pip-installed runtime talks directly to the kernel via /dev/kfd.

VENV_PY="$HOME/.unsloth/studio/unsloth_studio/bin/python"
IDX="https://rocm.nightlies.amd.com/v2/gfx1151/"
"$VENV_PY" -m pip install --index-url "$IDX" "rocm[libraries,devel]"
"$VENV_PY" -m pip install --index-url "$IDX" --pre --force-reinstall torch torchvision torchaudio

The --force-reinstall is necessary because the existing CPU torch satisfies pip’s version constraints and won’t be replaced otherwise. The install pulls about 2.7 GB of ROCm SDK packages and a new PyTorch build.

After this, PyTorch reports torch 2.11.0+rocm7.13.0a20260424 with HIP 7.13.26162, and torch.cuda.is_available() returns True. ROCm PyTorch exposes the CUDA API surface over HIP, so torch.cuda.* works exactly like it does on NVIDIA.
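A quick way to read that back from the venv (torch.version.hip is None on non-ROCm builds, so this doubles as a check that the CPU wheel is really gone):

"$VENV_PY" -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"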

Unsloth declares torch<2.11.0 in its metadata while the gfx1151 nightlies ship 2.11.0. This is a metadata lag, not a functional issue.
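pip check will flag the pin; as long as the torch<2.11.0 requirement is the only complaint, it can be ignored:

"$VENV_PY" -m pip check

Expect a single line noting unsloth’s torch<2.11.0 requirement against the installed 2.11.0 build.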

# The bitsandbytes Fix

QLoRA (4-bit quantized LoRA) requires bitsandbytes, and the standard PyPI version has a 4-bit decode NaN bug on every AMD GPU. The fix is the preview wheel from the continuous release:

"$HOME/.unsloth/studio/unsloth_studio/bin/pip" install --force-reinstall --no-cache-dir --no-deps \
  "https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl"

This installs 0.50.0.dev0, replacing the broken 0.49.2.

# Three Issues Solved by sitecustomize.py

I created a sitecustomize.py in the venv’s site-packages/; Python runs it automatically at interpreter startup, before any application imports. It covers three issues:

MoE segfault: import unsloth calls torch._grouped_mm to probe MoE support. On ROCm, this probe doesn’t raise a Python exception; it SIGSEGVs (exit 139). Setting UNSLOTH_MOE_BACKEND=native_torch bypasses the probe entirely.

Compile cache scatter: Unsloth writes a compile cache to ./unsloth_compiled_cache/ relative to the current working directory. Pinning UNSLOTH_COMPILE_LOCATION to an absolute path under ~/.unsloth/studio/ prevents it from littering project directories.

bitsandbytes library name: HIP 7.13 makes bitsandbytes look for libbitsandbytes_rocm713.so, but the preview wheel only ships libbitsandbytes_rocm71.so. Setting BNB_ROCM_VERSION=71 tells it to load the bundled library.

"""Unsloth on ROCm - Studio venv bootstrap."""
from __future__ import annotations
import os
from pathlib import Path

_compile_dir = Path.home() / ".unsloth" / "studio" / "unsloth_compiled_cache"
_compile_dir.mkdir(parents=True, exist_ok=True)
os.environ.setdefault("UNSLOTH_COMPILE_LOCATION", str(_compile_dir.resolve()))
os.environ.setdefault("UNSLOTH_MOE_BACKEND", "native_torch")
os.environ.setdefault("BNB_ROCM_VERSION", "71")
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

This file lives in site-packages/ and will get overwritten if you recreate the venv or upgrade packages aggressively. Keep a copy.
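To restore it later, resolve site-packages from the venv itself rather than hardcoding a Python-versioned path (the ~/backups location is just an example):

VENV_PY="$HOME/.unsloth/studio/unsloth_studio/bin/python"
SP="$("$VENV_PY" -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
cp ~/backups/sitecustomize.py "$SP/sitecustomize.py"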

# Verification

The hard gate is a minimal GPU tensor allocation:

"$HOME/.unsloth/studio/unsloth_studio/bin/python" -c \
  "import torch; x=torch.tensor([1.0], device='cuda'); print(x)"

This must exit 0, not 139. Then import unsloth from a neutral working directory (not inside any git repo) to confirm the MoE fix works. Both passed on the first try with kernel 6.19.14, which is newer than the guide’s validated 6.17.
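To exercise the bitsandbytes fix end to end, a 4-bit quantize/dequantize round trip makes a decent smoke test. This is a minimal sketch against bitsandbytes’ functional API; with sitecustomize.py in place, BNB_ROCM_VERSION is already set by the time the interpreter starts:

"$HOME/.unsloth/studio/unsloth_studio/bin/python" - <<'PY'
import torch
import bitsandbytes.functional as F

x = torch.randn(64, 64, device="cuda", dtype=torch.float16)
q, state = F.quantize_4bit(x, quant_type="nf4")  # pack to NF4
y = F.dequantize_4bit(q, state)                  # the decode path that NaN'd on 0.49.2
print("NaNs after round trip:", torch.isnan(y).any().item())  # want: False
PY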

Starting Studio:

~/.unsloth/studio/unsloth_studio/bin/unsloth studio -H 0.0.0.0 -p 8888

The startup log shows the key line: “Hardware detected: ROCm (HIP 7.13.26162) — Radeon 8060S Graphics”. The health endpoint confirms it.
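Queried here with curl (the /health path is my assumption; substitute whatever route your Studio build actually serves):

curl -s http://localhost:8888/health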

{
  "status": "healthy",
  "chat_only": false,
  "version": "2026.5.2"
}

chat_only: false means the full training backend is available.

# What You Get

With this setup, every Studio feature works:

  • Chat/inference: Both GGUF (via built-in llama.cpp) and safetensors (via PyTorch on ROCm)
  • Training: LoRA, QLoRA with 4-bit quantization, full fine-tuning across 500+ model families
  • Data recipes: Upload PDFs, CSVs, JSON, and auto-generate training datasets
  • Model export: Save to GGUF, safetensors, upload to Hugging Face
  • Model arena: Side-by-side comparison of base vs fine-tuned models
  • API endpoint: OpenAI-compatible API for use with external tools (see the sketch after this list)
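A minimal sketch of hitting that API, assuming Studio mirrors the standard OpenAI route layout under /v1 on the same host and port (both the path and the model name are placeholders I haven’t verified against Studio’s docs):

curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-finetuned-model", "messages": [{"role": "user", "content": "Hello"}]}'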

Flash Attention 2 is not available on AMD. Unsloth falls back to xFormers automatically, with comparable performance.

# The Setup

| Component | Detail |
| --- | --- |
| CPU/GPU | AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo, gfx1151) |
| RAM | 128 GB LPDDR5X unified memory |
| OS | Fedora 43, kernel 6.19.14-200.fc43.x86_64 |
| ROCm | TheRock pip packages, HIP 7.13.26162 |
| PyTorch | 2.11.0+rocm7.13.0a20260424 |
| Python | 3.13.10 (in Studio venv) |
| bitsandbytes | 0.50.0.dev0 (preview) |
| Studio | Unsloth Studio 2026.5.2 |

# Known Limitations

No system rocminfo: bitsandbytes logs a warning about not being able to detect the GPU architecture, but the BNB_ROCM_VERSION override makes it load the correct library anyway.

64 GB allocation cap: The TheRock nightlies may cap GPU memory allocation to 64 GB. With 128 GB of unified memory, this means you can’t use the full pool for training. Smaller models and LoRA/QLoRA work fine within this limit.
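To see the ceiling PyTorch actually reports, query the device properties (total_memory is whatever the runtime exposes, not necessarily the full unified pool):

"$HOME/.unsloth/studio/unsloth_studio/bin/python" -c \
  "import torch; p = torch.cuda.get_device_properties(0); print(f'{p.total_memory / 2**30:.1f} GiB visible')"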

Upgrade fragility: Running pip install -U unsloth can pull CUDA PyTorch from PyPI, silently replacing the ROCm build. After any Unsloth upgrade, re-run the PyTorch install from the gfx1151 index. Or upgrade with --no-deps and manage PyTorch separately.
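The --no-deps route looks like this, followed by a check that the ROCm build survived:

"$HOME/.unsloth/studio/unsloth_studio/bin/pip" install -U --no-deps unsloth
"$HOME/.unsloth/studio/unsloth_studio/bin/python" -c "import torch; print(torch.__version__)"

If the version string has lost its +rocm suffix, re-run the PyTorch install from the gfx1151 index.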

# Practical Takeaway

The combination of pip-packaged ROCm and the gfx1151 nightly index means you can run the full Unsloth training stack on Strix Halo without installing a single system-level ROCm package. The kernel driver (amdgpu + /dev/kfd) is enough. The three workarounds (force-reinstall PyTorch, bitsandbytes preview, sitecustomize.py) are well-documented and took about 10 minutes to apply after the base install.

For Strix Halo owners who’ve been limited to inference, this opens up fine-tuning on the same hardware. 128 GB of unified memory is enough for LoRA on models up to ~70B parameters at 4-bit quantization, which covers most of the open model landscape.