llama-cpp-python in Docker
A Dockerfile and docker-compose setup for running llama.cpp with its Python bindings in a container, because finding a working one shouldn't be this hard.
This one’s short because the project is short. I wanted to run llama-cpp-python in a Docker container. You’d think this would be a solved problem — it’s one of the most popular local inference libraries — but every existing container I found was either broken, outdated, or required more configuration than just building the thing from scratch.
So I threw together a minimal Dockerfile and docker-compose setup based on an example from the llama-cpp-python repo. Point it at a GGUF model file, run docker compose up, and you’re running inference. That’s it.
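The shape of the setup looks roughly like this. This is a hedged sketch, not the exact files from the repo: the `llama-cpp-python[server]` extra and the `llama_cpp.server` module are from the library’s own docs, while the image base, model path, and port choice are assumptions.

```dockerfile
# Sketch of a minimal image; build deps are needed because pip compiles llama.cpp from source
FROM python:3.11-slim
RUN apt-get update && apt-get install -y build-essential && rm -rf /var/lib/apt/lists/*
RUN pip install "llama-cpp-python[server]"
EXPOSE 8000
# The server reads its model path from configuration (e.g. the MODEL env var)
CMD ["python3", "-m", "llama_cpp.server", "--host", "0.0.0.0"]
```

And a compose file along these lines, mounting a local `./models` directory (the service name and paths are illustrative):

```yaml
services:
  llama:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    environment:
      - MODEL=/models/your-model.gguf
```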
Why Bother
Local LLM inference is load-bearing infrastructure for basically every other project I work on. Having a reliable, reproducible container means I can spin up an inference endpoint on any machine without debugging build dependencies every time. It’s the kind of small utility project that saves a disproportionate amount of time over its lifetime.
The compose file mounts your model directory and exposes the API, so swapping models is just changing an environment variable and restarting. Nothing clever, just the boring plumbing that makes everything else possible.
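Once the container is up, talking to it is just HTTP. llama-cpp-python’s server exposes an OpenAI-compatible API, so a client can be as plain as this sketch, which assumes the server is reachable on `localhost:8000`; the helper names here are mine, not the library’s.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host/port for the container


def build_completion_request(prompt, max_tokens=64, temperature=0.7):
    """Build an OpenAI-style payload for the server's /v1/completions route."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, **kwargs):
    """POST the payload to the running container and return the generated text."""
    body = json.dumps(build_completion_request(prompt, **kwargs)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # OpenAI-compatible responses put the output under choices[0]["text"]
    return data["choices"][0]["text"]
```

Swapping models doesn’t touch this client at all; only the `MODEL` environment variable on the container side changes.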