Dreams | Sleeping Robots

> Jun 8, 2026

by Zetaphor

Benchmarking dots.tts on Strix Halo

RedNote's 2B continuous autoregressive TTS hits RTF 0.35 on Strix Halo with the MeanFlow-distilled checkpoint, putting it among the fastest voice-cloning-capable models I've tested locally.

> Jun 7, 2026

llm local

by Zetaphor

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

BeeLlama.cpp's DFlash speculative decoding nearly triples dense model throughput on AMD Strix Halo, but in a strict head-to-head against my existing MTP setups, MTP still wins by 23-67% depending on the configuration.

> May 30, 2026

amd local

by Zetaphor

Running the MOSS-TTS Family on Strix Halo

Benchmarking OpenMOSS's 8B llama.cpp GGUF backend and 100M ONNX Nano model on AMD's Ryzen AI MAX+ 395, with thread scaling analysis and a surprising Nano result that beats everything else I've tested.

> May 27, 2026

opinion local

by Zetaphor

Meet People Where They Are

The AI discourse is stuck between two poles that refuse to talk to each other. If we actually care about this technology being used well, we need to stop evangelizing and start listening.

> May 26, 2026

llm local

by Zetaphor

Testing llama.cpp PR #21344: Faster MoE Prefill, but MTP Fights Back

A community PR optimizing CUDA kernels for GFX1151 delivers +24% prefill throughput on MoE models, but combining those same kernel changes with MTP speculative decoding makes inference slower. Not every optimization stacks.

> May 17, 2026

amd local

by Zetaphor

ROCm 7 on Strix Halo: Benchmarking the New Toolbox Images

AMD released ROCm 7.13 with Strix Halo optimizations. I benchmarked kyuz0's latest toolbox images against my current ROCm 6.4.4 production baseline to see if upgrading my llama-swap stack is worth it. The answer is complicated.

> May 15, 2026

amd local

by Zetaphor

Running DramaBox on Strix Halo

Getting Resemble AI's expressive TTS model running on AMD Strix Halo with no NVIDIA hardware. TheRock gfx1151 nightlies, a bitsandbytes preview for ROCm, reduced step counts, and torch.compile bring the 3.3B DiT from RTF 4.0 down to 1.75.
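
For readers new to the metric: real-time factor (RTF) is commonly defined as synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. Here is a minimal sketch under that assumption (the post itself may measure it differently). The function names and the 10-second clip length are illustrative, not taken from the post.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / length of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf_before: float, rtf_after: float) -> float:
    """How many times faster the 'after' configuration is than 'before'."""
    if rtf_after <= 0:
        raise ValueError("rtf_after must be positive")
    return rtf_before / rtf_after


if __name__ == "__main__":
    # RTF figures quoted in the summary: 4.0 before, 1.75 after.
    print(f"{speedup(4.0, 1.75):.2f}x faster")
    # Hypothetical clip: 10 s of audio that took 17.5 s to synthesize.
    print(real_time_factor(17.5, 10.0))
```

Under this definition, going from RTF 4.0 to 1.75 is a bit more than a 2x improvement, though still slower than real time.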

> May 11, 2026

llm local

by Zetaphor

Unsloth Studio on Strix Halo: Full GPU Training Without System ROCm

Getting Unsloth Studio's full training pipeline running on AMD Strix Halo (gfx1151) using pip-packaged ROCm nightlies, no /opt/rocm required. Chat, training, data recipes, and model export all working on Fedora 43.

> May 10, 2026

llm local

by Zetaphor

Gemma 4 MTP Assistant: 3.7x Faster 31B and +45% Faster 26B-A4B on Strix Halo

Google's official Gemma 4 MTP assistant heads bring speculative decoding to MoE models that couldn't benefit before, and nearly quadruple dense model throughput on AMD Strix Halo's bandwidth-limited unified memory.

> May 7, 2026

amd local

by Zetaphor

Optimizing Echo-TTS: CPU Beats GPU

Eight optimization attempts on Echo-TTS CPU inference, the five that worked, quality evaluation with voice cloning, and how the optimized CPU path ended up faster than the GPU hybrid.

> May 6, 2026

llm local

by Zetaphor

MTP Speculative Decoding: 4.8x Faster Qwen 3.6 27B on Strix Halo

Multi-Token Prediction turns Qwen 3.6 27B from 6 t/s to 30 t/s on AMD Strix Halo, succeeding where draft models and ngram decoding failed, by using prediction heads baked into the model itself.

> May 3, 2026

amd local

by Zetaphor

Benchmarking Echo-TTS on Strix Halo

Running a diffusion-based TTS model on AMD's Strix Halo, patching CUDA-only code for CPU, discovering a bf16 GPU hang on gfx1151, and a hybrid GPU/CPU trick that beats every other TTS model I've tested.

> Apr 17, 2026

amd local

by Zetaphor

Running HY-World 2.0 on Strix Halo: 3D World Reconstruction on an AMD iGPU

Porting Tencent's CUDA-only 3D world model to AMD's Radeon 8060S via ROCm Docker, flash-attention CK kernels, a fully compiled gsplat with wave32 patches, and complete 3D reconstruction output including Gaussian splats.

> Apr 15, 2026

llm local

by Zetaphor

Friends Don't Let Friends Use Ollama

Ollama gained traction by being the first easy llama.cpp wrapper, then spent years dodging attribution, misleading users, and pivoting to cloud, all while riding VC money earned on someone else's engine. Here's the full history, and why the alternatives are better.

> Apr 12, 2026

llm local

by Zetaphor

Speculative Decoding on Strix Halo: 2x Faster Gemma 4 31B Token Generation

Benchmarking speculative decoding with Gemma 4 E2B as a draft model for Gemma 4 31B on AMD Strix Halo, a bandwidth-bound setup where the optimal draft-max differs from discrete GPUs.
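
As a rough illustration of why the best draft length depends on the hardware, here is a sketch of the standard analytical model for speculative decoding. It assumes each drafted token is accepted independently with probability `acceptance`, and that one draft-model step costs `draft_cost` times one target-model step. Both values below are hypothetical, not measurements from the post, and the function names are illustrative only.

```python
def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes i.i.d. acceptance: (1 - a^(k+1)) / (1 - a), with k drafted tokens.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if acceptance == 1.0:
        return float(draft_len + 1)  # every draft accepted, plus one bonus token
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


def estimated_speedup(acceptance: float, draft_len: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding under the simple cost model.

    One cycle costs draft_len draft steps plus one target step.
    """
    cycle_cost = draft_len * draft_cost + 1.0
    return expected_tokens_per_pass(acceptance, draft_len) / cycle_cost


if __name__ == "__main__":
    # Hypothetical inputs: 80% acceptance, draft step at 10% of a target step.
    for k in (1, 2, 4, 8, 16):
        print(k, round(estimated_speedup(0.8, k, 0.1), 2))
```

In this model the gain from a longer draft flattens out while the drafting cost keeps growing, so the best draft length shifts with the relative cost of the draft model, which is one plausible reason a bandwidth-bound system would favor a different setting than a discrete GPU.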

> Apr 11, 2026

agents llm

by Zetaphor

Pi Web UI: A Browser Interface for the Pi Coding Agent

A full-stack web interface that puts the Pi coding agent in the browser, with system-level access, session history, and model switching through a local LiteLLM proxy.

> Apr 10, 2026

strix-halo llm

by Zetaphor

Local LLM Infrastructure on Strix Halo

How LiteLLM, llama-swap, and Lemonade Server compose into a unified local inference platform, routing dozens of models across GPU and NPU through a single API endpoint, accessible anywhere via Tailscale and a local reverse proxy.

> Apr 9, 2026

llm amd

by Zetaphor

Running LLMs on the AMD NPU with Lemonade Server

Setting up AMD's Lemonade Server on Strix Halo to run LLM and Whisper inference on the XDNA 2 NPU: driver builds, architecture decisions, and benchmarks against the integrated GPU.

> Apr 9, 2026

amd local

by Zetaphor

Benchmarking OmniVoice on Strix Halo

Running a 600+ language zero-shot TTS model on an AMD integrated GPU: voice cloning benchmarks, ROCm compatibility adventures, and the container workaround that actually worked.

> Apr 9, 2026

amd local

by Zetaphor

Benchmarking VoxCPM2 on Strix Halo

Running a 2B-parameter tokenizer-free TTS model in both Python and C++ on AMD's integrated GPU: near-real-time speech synthesis on CPU, and the Vulkan crash that stopped GPU acceleration in its tracks.

> Mar 15, 2026

voice amd

by Zetaphor

Self-Hosting Fish Audio on Strix Halo

Running Fish Audio's 4B-parameter S2-Pro text-to-speech model locally on an AMD Strix Halo integrated GPU via ROCm and Podman.

> Mar 15, 2026

agents local

by Zetaphor

Medium-Claw: A Persistent AI Companion on Telegram

A Telegram bot backed by the Pi coding agent with autonomous scheduling, persistent memory, cross-session search, and a web dashboard.

> Mar 5, 2026

music local

by Zetaphor

LoopMaker Web

A browser-based AI music generation tool powered by ACE-Step, ported to Linux for local generation on AMD Strix Halo hardware.

> Feb 20, 2026

local tools

by Zetaphor

QuizForge: Self-Learning Quiz Maker

A full-stack quiz platform that turns markdown files and YouTube transcripts into mixed-format quizzes with AI grading, contextual chat, and performance analytics.

> Jan 30, 2026

agents local

by Zetaphor

Oneiros: A Personal AI Agent Platform

A modular collection of services for building a personal AI agent: tool use, memory, browser automation, TTS, and multi-platform chat interfaces.

> Dec 28, 2025

ocr local

by Zetaphor

OCR List Maker

Snap a photo of a handwritten list, OCR it with a local vision model, and print a formatted checklist on a thermal receipt printer.

> Nov 3, 2025

llm docker

by Zetaphor

llama-cpp-python in Docker

A Dockerfile and docker-compose setup for running llama.cpp with its Python bindings in a container, because finding a working one shouldn't be this hard.

> Aug 10, 2025

llm docker

by Zetaphor