Observability-Driven LLM Infrastructure: Scaling Ollama with Podman & Prometheus
Pull up a chair. Clear the terminal. Most people treat local LLMs as a “set-and-forget” binary: they run a container, chat with a model, and call it a day. But in an engineering context, that is a liability. If you cannot measure your inference latency, monitor your VRAM pressure, or track request throughput, you aren’t running a service; you’re running a hobby.
In this guide, we are going to build a production-grade local LLM stack using Ollama, Podman, and the Prometheus ecosystem. We will move from basic inference to a fully observable architecture that treats your local model as a first-class citizen of your infrastructure.
The Foundation: Rootless Containerization & Hardware Acceleration
Before we touch the models, we need an execution environment that doesn’t compromise the host system but still has direct access to the silicon. We use Podman for rootless execution, ensuring that our LLM service runs with the least privilege necessary.
The Hardware Problem: GFX and iGPU
If you are running on recent AMD hardware (like the Radeon 700M series integrated GPUs), Ollama needs a nudge to recognize the GPU correctly. This is where HSA_OVERRIDE_GFX_VERSION comes into play: it makes the ROCm runtime treat the GPU as a different, supported gfx target so that the prebuilt kernels for that target are used.
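The override value is a dotted form of a gfx target name. Here is a minimal sketch of that mapping, assuming the usual naming scheme in which the last two characters of the target are the minor version and a hexadecimal stepping. The helper name is hypothetical, and which target your particular GPU should present itself as is something to confirm against the ROCm documentation rather than take from this example.

```python
def gfx_target_to_override(target: str) -> str:
    """Convert a gfx target name such as 'gfx1030' into the dotted
    'major.minor.stepping' form used by HSA_OVERRIDE_GFX_VERSION."""
    if not target.startswith("gfx") or len(target) < 6:
        raise ValueError(f"not a gfx target name: {target!r}")
    digits = target[3:]
    major = int(digits[:-2])        # everything before the last two characters
    minor = int(digits[-2], 16)     # second-to-last character
    stepping = int(digits[-1], 16)  # last character is hexadecimal ('a' -> 10)
    return f"{major}.{minor}.{stepping}"


# Illustrative only: present the GPU as gfx1100. Whether that is the right
# target for your hardware is an assumption you need to verify.
print(f"HSA_OVERRIDE_GFX_VERSION={gfx_target_to_override('gfx1100')}")
```

The resulting line is what would go into the environment file described in the next step.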
Step-by-Step: Preparing the Host
- Persistence Layer: Create a dedicated directory for model weights. Storing models in the container’s writable layer bloats the container and loses the weights whenever it is removed and recreated.
sudo mkdir -p /var/apps/ollama-models
sudo chown -R $USER:$USER /var/apps/ollama-models
- Environment Configuration: Create a tuning file, /var/apps/ollama.env. This separates the “what” (the application) from the “how” (the hardware optimization).
Module 1: The Inference Engine - Optimized Ollama
We aren’t using default settings. To make an LLM usable for agentic workflows (like Claude Code or custom ReAct agents), we need to expand the context window and optimize memory layouts.
Deep-Dive: The Tuning Blueprint
In our /var/apps/ollama.env, we implement the following strategic overrides:
1. Context Window Expansion: Ollama defaults to a small context window (2k or 4k tokens, depending on the version). For technical documentation analysis, this is insufficient. We push this to 64k.
OLLAMA_NUM_CTX=65536
OLLAMA_CONTEXT_LENGTH=65536
2. Memory Efficiency:
Setting OLLAMA_KV_CACHE_TYPE=q4_0 quantizes the Key-Value cache to 4-bit values, reducing VRAM pressure during long conversations, usually without significantly degrading output quality. Ollama only applies a quantized cache type when Flash Attention is enabled.
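3. Latency Reduction:
- OLLAMA_KEEP_ALIVE=24h: Keeps the model in VRAM for 24 hours after its last request instead of the default 5 minutes. No more waiting 10 seconds for the “cold start” after every idle period.
- LLAMA_ARG_FLASH_ATTN=1: Enables Flash Attention to speed up token generation. This is llama.cpp’s name for the option; Ollama’s own documented switch is OLLAMA_FLASH_ATTENTION=1.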
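Collected into one place, the overrides above form a small KEY=VALUE file. The sketch below renders it from a dictionary so the values live in one reviewable spot. The variable names are the ones used in this guide, with one exception: the OLLAMA_MODELS entry is an added assumption (it is not part of the tuning list above) that points Ollama at the mounted model directory. render_env_file is a hypothetical helper name.

```python
# Settings named in this guide, plus one labeled assumption.
SETTINGS = {
    "OLLAMA_NUM_CTX": "65536",
    "OLLAMA_CONTEXT_LENGTH": "65536",
    "OLLAMA_KV_CACHE_TYPE": "q4_0",
    "OLLAMA_KEEP_ALIVE": "24h",
    "LLAMA_ARG_FLASH_ATTN": "1",
    # Assumption, not from the original list: store weights on the
    # persistent volume instead of the image's default location.
    "OLLAMA_MODELS": "/var/apps/ollama-models",
}


def render_env_file(settings: dict[str, str]) -> str:
    """Render one KEY=VALUE line per setting, as an env file expects."""
    lines = []
    for key, value in settings.items():
        if not key or "=" in key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid variable name: {key!r}")
        if "\n" in value:
            raise ValueError(f"value for {key} must be a single line")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Print the file body; redirect it to /var/apps/ollama.env when satisfied.
print(render_env_file(SETTINGS), end="")
```

Values are written without quotes on purpose, since quote characters in an env file are typically passed through to the container literally.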
Deployment Execution
Run the engine with direct device passthrough:
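# /dev/kfd and /dev/dri expose the AMD GPU to the container; the :rocm tag
# is the image variant built with ROCm support.
# The models volume only persists weights if OLLAMA_MODELS in ollama.env
# points at the mounted path; the image's default is /root/.ollama.
podman run -d \
--name=ollama \
-p 11434:11434 \
--device /dev/dri \
--device /dev/kfd \
--ulimit nofile=1048576:1048576 \
--env-file /var/apps/ollama.env \
-v /var/apps/ollama-models/:/var/apps/ollama-models/:z \
docker.io/ollama/ollama:rocm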
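Once the container is up, a quick check that the API answers and sees your models saves guesswork later. A minimal sketch, assuming the port mapping above and the documented shape of the /api/tags response (a "models" list whose entries carry "name" and "size" in bytes); the function names are hypothetical.

```python
import json
import urllib.request


def summarize_tags(payload: dict) -> list[str]:
    """Turn an /api/tags response into 'name (size GiB)' strings."""
    return [
        f"{model['name']} ({model['size'] / 2**30:.1f} GiB)"
        for model in payload.get("models", [])
    ]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama instance.

    Raises OSError (connection refused, timeout, HTTP error) if the
    service is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling summarize_tags(fetch_tags()) against the running container returns one entry per pulled model; an empty list right after a fresh start simply means nothing has been pulled yet.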
Module 2: The Observability Stack - Metrics & Monitoring
An LLM is a black box. To open it, we introduce two sidecar components that transform raw API calls into time-series data.
Component A: The Ollama Exporter
The ollama-exporter bridges the gap between Ollama’s internal state and Prometheus. It polls the /api/tags endpoint and the current model status to report how many models are loaded and their health.
Deployment:
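# OLLAMA_HOST refers to the Ollama container by name. That name only
# resolves if both containers are attached to the same user-defined
# Podman network; otherwise point it at an address the exporter can reach.
podman run -d \
--name=ollama_exporter \
-p 9400:9400 \
-e OLLAMA_HOST=http://ollama:11434 \
ghcr.io/maravexa/ollama-exporter:latest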
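Before wiring the exporter into Prometheus, it is worth reading its output directly. The sketch below parses the Prometheus text exposition format into a dictionary. It assumes the exporter serves that format at the conventional /metrics path on port 9400, and the metric names in the sample are hypothetical placeholders, not the exporter's real names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {series: value}.

    The label set stays part of the key. Comments, blank lines, and
    optional trailing timestamps are ignored.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if rest:
            samples[series] = float(rest[0])
    return samples


# Hypothetical sample; real metric names depend on the exporter.
SAMPLE = """\
# HELP ollama_models_loaded Illustrative gauge name only.
# TYPE ollama_models_loaded gauge
ollama_models_loaded 1
ollama_model_size_bytes{model="llama3:latest"} 4.66e+09
"""

for series, value in parse_metrics(SAMPLE).items():
    print(series, value)
```

Feeding it the body of the exporter's metrics page gives a quick list of which series actually exist, which is what the dashboard queries will need.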
Component B: The Metrics Proxy
While the exporter tells us about the state, the ollama-metrics proxy tells us about the traffic. It intercepts every request to measure:
- TTFT (Time To First Token): The critical metric for perceived responsiveness.
- TPS (Tokens Per Second): The actual throughput of your hardware setup.
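Both numbers can be reproduced by hand from a single streamed request, which is a useful cross-check on whatever the proxy reports. A minimal sketch, assuming Ollama's streaming /api/generate format: newline-delimited JSON chunks with a "response" text field, and a final chunk with "done": true that carries eval_count and eval_duration (in nanoseconds). The function name and the started_at convention are hypothetical.

```python
import json
import time
from typing import Callable, Iterable


def measure_stream(lines: Iterable[bytes], started_at: float,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Compute TTFT and TPS from the chunks of one streamed response.

    `started_at` is the clock reading taken just before the request was
    sent, so TTFT includes model load and prompt processing time.
    """
    ttft = None
    final = {}
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if ttft is None and chunk.get("response"):
            ttft = clock() - started_at  # first chunk with token text
        if chunk.get("done"):
            final = chunk
    tps = None
    if final.get("eval_duration"):
        # eval_duration is in nanoseconds; convert to seconds first.
        tps = final["eval_count"] / (final["eval_duration"] / 1e9)
    return {"ttft_seconds": ttft, "tokens_per_second": tps}
```

In practice, record started_at with time.perf_counter(), open the request with urllib.request.urlopen, and pass the response object as lines, since iterating over it yields the body line by line.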
Module 3: Integration & Verification
Now we verify that our “Hardened” stack is actually performing as expected.
Testing the Context Window
To verify that the 64k context window is active, use a large prompt or a long-form document and monitor for OOM (Out of Memory) errors. Because we set OLLAMA_KV_CACHE_TYPE=q4_0, your VRAM usage should remain stable even as the conversation grows.
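To know what to expect before running that test, it helps to estimate the KV cache size up front. This is a rough sketch, assuming the standard sizing formula (keys plus values for every layer, KV head, and position) and llama.cpp-style block quantization. The model shape below is hypothetical, so substitute the layer count, KV head count, and head dimension of the model you actually run.

```python
# Bytes per cached element. q8_0 and q4_0 store blocks of 32 values plus a
# 2-byte scale, which gives 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, cache_type: str = "f16") -> float:
    """Estimate KV cache size in bytes for a full context window."""
    # Factor 2: one key vector and one value vector per position.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(32, 8, 128, 65536, cache_type)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

For that example shape at a 65536-token context, the estimate is 8.00 GiB for f16, 4.25 GiB for q8_0, and 2.25 GiB for q4_0, on top of the model weights themselves.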
The Observability Loop
- Prometheus Scrape: Configure Prometheus to scrape port 9400.
- Grafana Dashboard: Build a dashboard focusing on:
  - VRAM Saturation: Is the model fully offloaded? (LLAMA_ARG_N_GPU_LAYERS=999).
  - Inference Latency: Compare your TPS against different quantization levels (q4 vs q8).
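For the quantization comparison, the numbers can also be pulled straight from Prometheus instead of being read off a panel. The sketch below uses the instant-query endpoint of the Prometheus HTTP API (/api/v1/query). The metric name ollama_tokens_per_second and the model label are hypothetical placeholders for whatever your proxy actually exports, and localhost:9090 assumes a default Prometheus install.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def values_by_label(payload: dict, label: str) -> dict[str, float]:
    """Map one label's value to the sample value of each returned series."""
    return {
        series["metric"].get(label, ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def run_query(base_url: str, promql: str) -> dict:
    """Execute the query; raises OSError if Prometheus is unreachable."""
    with urllib.request.urlopen(instant_query_url(base_url, promql), timeout=5) as resp:
        return json.load(resp)


# Hypothetical metric and label names, for illustration only.
PROMQL = "avg by (model) (ollama_tokens_per_second)"
print(instant_query_url("http://localhost:9090", PROMQL))
```

Passing the result of run_query to values_by_label with the label "model" yields one throughput figure per model tag, which makes a q4 versus q8 comparison a simple dictionary lookup.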
Final Wrap-up: From Hobby to Infrastructure
We’ve moved from a simple binary to a professional AI inference stack. By implementing the following, we’ve ensured our local LLM is production-ready:
- Hardware Alignment: GFX overrides for GPU stability.
- Performance Tuning: 64k context and Flash Attention for agentic utility.
- Infrastructure Standards: Rootless Podman with volume persistence.
- Full Observability: Prometheus integration to eliminate the “black box” problem.
The goal isn’t just to run a model; it’s to know exactly why that model is slow, where the bottleneck is, and how to scale it. Now you have the blueprint. Go build your own ironclad inference station.