llama-cpp-python in Docker
A Dockerfile and docker-compose setup for running llama.cpp with its Python bindings in a container, because finding a working one shouldn't be this hard.
This one’s short because the project is short. I wanted to run llama-cpp-python in a Docker container. You’d think this would be a solved problem — it’s one of the most popular local inference libraries — but every existing container I found was either broken, outdated, or required more configuration than just building the thing from scratch.
So I threw together a minimal Dockerfile and docker-compose setup based on an example from the llama-cpp-python repo. Point it at a GGUF model file, run docker compose up, and you’re running inference. That’s it.
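The shape of the setup looks roughly like this. This is a hedged sketch, not the exact files from the repo: the `llama-cpp-python[server]` extra and the `llama_cpp.server` module are from the library’s own docs, while the image base, model path, and port choice are assumptions.

```dockerfile
# Sketch of a minimal image; build deps are needed because pip compiles llama.cpp from source
FROM python:3.11-slim
RUN apt-get update && apt-get install -y build-essential && rm -rf /var/lib/apt/lists/*
RUN pip install "llama-cpp-python[server]"
EXPOSE 8000
# The server reads its model path from configuration (e.g. the MODEL env var)
CMD ["python3", "-m", "llama_cpp.server", "--host", "0.0.0.0"]
```

And a compose file along these lines, mounting a local `./models` directory (the service name and paths are illustrative):

```yaml
services:
  llama:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    environment:
      - MODEL=/models/your-model.gguf
```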
Why Bother
Local LLM inference is load-bearing infrastructure for basically every other project I work on. Having a reliable, reproducible container means I can spin up an inference endpoint on any machine without debugging build dependencies every time. It’s the kind of small utility project that saves a disproportionate amount of time over its lifetime.
The compose file mounts your model directory and exposes the API, so swapping models is just changing an environment variable and restarting. Nothing clever, just the boring plumbing that makes everything else possible.
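Once the container is up, talking to it is just HTTP. llama-cpp-python’s server exposes an OpenAI-compatible API, so a client can be as plain as this sketch, which assumes the server is reachable on `localhost:8000`; the helper names here are mine, not the library’s.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host/port for the container


def build_completion_request(prompt, max_tokens=64, temperature=0.7):
    """Build an OpenAI-style payload for the server's /v1/completions route."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, **kwargs):
    """POST the payload to the running container and return the generated text."""
    body = json.dumps(build_completion_request(prompt, **kwargs)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # OpenAI-compatible responses put the output under choices[0]["text"]
    return data["choices"][0]["text"]
```

Swapping models doesn’t touch this client at all; only the `MODEL` environment variable on the container side changes.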