End-to-end observability stack for LLM inference workloads. Correlates GPU utilization, inference latency, token throughput, and cost efficiency on a single Grafana dashboard — built as a live demo for a 25-minute conference presentation.
LLM inference is one of the most expensive workloads in the cloud. GPUs cost $2–30/hour, inference latency directly impacts user experience, and traditional observability stacks fail to answer critical questions:
- Why is my P99 latency spiking? Is it the model, the GPU, or the request queue?
- Why is my GPU at 30% utilization when I'm getting timeouts?
- How much am I paying per million tokens, and is it getting worse?
- Which requests are slow, and what do they have in common?
This project demonstrates how to build visibility into all of these using open-source tools.
```
┌────────────────────────── Host ───────────────────────────┐
│                                                           │
│  ┌─────────────┐          ┌──────────────────────┐        │
│  │   Ollama    │◄─────────│   Locust Benchmark   │        │
│  │ (LLM server)│  OpenAI  │   (load generator)   │────┐   │
│  │   :11434    │   API    │                      │    │   │
│  └─────────────┘          └──────────────────────┘    │   │
│                                                       │   │
└───────────────────────────────────────────────────────│───┘
                                                        │
┌───────────── Docker Compose (Observability) ──────────│───┐
│                                                       ▼   │
│  ┌──────────────┐    Locust metrics    ┌──────────────┐   │
│  │ TimescaleDB  │◄─────────────────────│   Grafana    │   │
│  │ (PostgreSQL) │   (request table)    │    :3000     │   │
│  └──────────────┘                      │              │   │
│                                        │ ┌──────────┐ │   │
│  ┌──────────────┐    DCGM metrics      │ │Dashboard │ │   │
│  │  Prometheus  │◄── GPU Simulator ───►│ │  Panels  │ │   │
│  │    :9090     │       :9400          │ └──────────┘ │   │
│  └──────────────┘                      │              │   │
│                                        │ ┌──────────┐ │   │
│  ┌──────────────┐    container logs    │ │   Logs   │ │   │
│  │     Loki     │◄── Promtail ────────►│ │  Panel   │ │   │
│  │    :3100     │                      │ └──────────┘ │   │
│  └──────────────┘                      └──────────────┘   │
└───────────────────────────────────────────────────────────┘
```
- Locust sends concurrent inference requests to Ollama via the OpenAI-compatible API
- Per-request metrics (TTFT, latency, throughput, token counts) are pushed to TimescaleDB in real time via the `locust-plugins` TimescaleDB listener
- A GPU Metrics Simulator generates realistic NVIDIA DCGM metrics (A100-style) correlated with actual benchmark load, scraped by Prometheus
- Promtail tails benchmark logs and ships them to Loki
- Grafana queries all three datasources to render the LLM Inference Observatory dashboard
The LLM Inference Observatory dashboard has six sections:
**Request Overview**

| Panel | Source | Description |
|---|---|---|
| Request Rate | TimescaleDB | Requests per second over time (10s buckets) |
| Active Users | TimescaleDB | Current concurrent user count |
| Error Rate | TimescaleDB | Percentage of failed requests |
| Total Requests | TimescaleDB | Cumulative request count |
| Avg Latency | TimescaleDB | Average end-to-end latency (ms) |
| Latency Percentiles | TimescaleDB | P50 / P95 / P99 total latency over time |
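Under the hood, these request panels are time-bucketed aggregations over the `request` table. A minimal Python sketch of the same logic (the sample data and helper names are illustrative, not the dashboard's actual SQL):

```python
from bisect import insort

def bucket_10s(ts: float) -> int:
    """Floor a Unix timestamp to its 10-second bucket."""
    return int(ts // 10) * 10

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    idx = max(0, round(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[idx]

# (timestamp, latency_ms) pairs, as the request hypertable would hold them
requests = [(0.5, 120), (3.2, 95), (9.9, 400), (11.0, 130), (14.7, 90)]

buckets: dict[int, list[float]] = {}
for ts, latency in requests:
    insort(buckets.setdefault(bucket_10s(ts), []), latency)

for start, lats in sorted(buckets.items()):
    print(start,
          len(lats) / 10.0,                     # requests per second in this bucket
          percentile(lats, 50),                 # P50
          percentile(lats, 95),                 # P95
          percentile(lats, 99))                 # P99
```

Grafana's TimescaleDB queries express the same idea with `time_bucket('10s', time)` and `percentile_cont`, keeping the heavy lifting in the database.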
**Token Latency & Throughput**

| Panel | Source | Description |
|---|---|---|
| Time to First Token | TimescaleDB | TTFT avg and P95 over time |
| Throughput | TimescaleDB | Output tokens per second |
| Per-Token Latency | TimescaleDB | Inter-token latency avg and P95 |
**Token Counts**

| Panel | Source | Description |
|---|---|---|
| Output Tokens / Request | TimescaleDB | Average generation length over time |
| Prompt Tokens / Request | TimescaleDB | Average prompt length over time |
**GPU Metrics**

| Panel | Source | Description |
|---|---|---|
| GPU Utilization % | Prometheus | SM utilization (0–100%) |
| GPU Memory Used | Prometheus | VRAM usage gauge (80 GB HBM2e) |
| GPU Temperature | Prometheus | Die temperature in Celsius |
| GPU Power Draw | Prometheus | Power consumption in watts |
**Cost & Efficiency**

| Panel | Source | Description |
|---|---|---|
| Est. Cost / 1M Tokens | TimescaleDB | Based on $2.21/hr A100 on-demand pricing |
| Tokens per GPU-Hour | TimescaleDB | Throughput efficiency metric |
| Efficiency Trend | TimescaleDB | Aggregate tokens/s over time |
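The cost panels reduce to simple arithmetic over throughput. A hedged sketch of that math (the $2.21/hr rate is the dashboard's A100 assumption; the 100 tok/s input is made up for illustration):

```python
HOURLY_RATE_USD = 2.21          # A100 on-demand price assumed by the dashboard

def cost_per_million_tokens(tokens_per_sec: float) -> float:
    """USD per 1M generated tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return HOURLY_RATE_USD / tokens_per_hour * 1_000_000

def tokens_per_gpu_hour(tokens_per_sec: float) -> float:
    """Throughput efficiency: tokens produced per GPU-hour paid for."""
    return tokens_per_sec * 3600

# e.g. 100 tok/s aggregate throughput (illustrative)
print(round(cost_per_million_tokens(100), 2))   # → 6.14
print(int(tokens_per_gpu_hour(100)))            # → 360000
```

The key insight the panel surfaces: cost per token is inversely proportional to throughput, so a GPU idling at low utilization is paying the full hourly rate for very few tokens.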
**Logs**

| Panel | Source | Description |
|---|---|---|
| Benchmark Logs | Loki | Filterable output with per-request details |
| Tool | Install | Purpose |
|---|---|---|
| Docker | Docker Desktop | Runs the observability stack |
| uv | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | Python package manager |
| Ollama | `brew install ollama` | Local LLM inference server |
```bash
git clone https://github.com/rudrakshkarpe/llm-inference-observatory.git
cd llm-inference-observatory

# Start Ollama and pull a model
ollama serve &
ollama pull llama3.2

# Start the observability stack
docker compose up -d --build

# Install Python dependencies
uv sync

# Run the benchmark (5 minutes)
uv run locust \
  --config locust-observability.conf \
  -u 4 -r 1 -t 5m \
  --provider vllm --chat --stream \
  -p 128 -o 64 \
  -m llama3.2 \
  2>&1 | tee logs/locust.log

# Open Grafana
open http://localhost:3000   # login: admin / admin
```

Or run everything with one command:
```bash
MODEL=llama3.2 ./scripts/demo.sh
```

```
llm-inference-observatory/
├── pyproject.toml              # uv project — Locust + benchmark deps
├── uv.lock                     # Deterministic lockfile
├── .python-version             # Python 3.11
├── docker-compose.yml          # Observability stack (6 services)
├── locust-observability.conf   # Locust → Ollama + TimescaleDB config
│
├── benchmark/                  # LLM load testing suite
│   └── llm_bench/
│       ├── load_test.py        # Locust-based load generator
│       ├── bench_serving.py    # Async request-count benchmark
│       └── plot_bench.py       # CSV → PNG plotting utility
│
├── gpu-sim/                    # Simulated NVIDIA GPU metrics
│   ├── Dockerfile
│   ├── pyproject.toml
│   └── exporter.py             # DCGM-compatible Prometheus exporter
│
├── timescaledb/
│   └── init.sql                # Schema: request, testrun, user_count, events
│
├── prometheus/
│   └── prometheus.yml          # Scrape config for GPU simulator
│
├── loki/
│   └── loki-config.yml         # Loki single-binary configuration
│
├── promtail/
│   └── promtail-config.yml     # Log shipping: benchmark logs → Loki
│
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── datasources.yml # TimescaleDB + Prometheus + Loki
│       └── dashboards/
│           ├── dashboards.yml       # File-based dashboard provider
│           └── llm-inference.json   # LLM Inference Observatory dashboard
│
├── scripts/
│   └── demo.sh                 # One-command demo launcher
│
└── logs/                       # Benchmark output (tee'd for Promtail)
```
The benchmark suite uses Locust to simulate continuous production-like load against OpenAI-compatible LLM endpoints. It supports configurable concurrency, token length distributions (constant, uniform, normal, exponential), streaming, and both time-bounded and request-count-bounded test modes.
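As a rough sketch of how such token-length distributions might be sampled (the function shape and spread parameters below are illustrative assumptions, not the benchmark's actual code):

```python
import random

def sample_max_tokens(dist: str, base: int, rng: random.Random) -> int:
    """Draw a per-request output-token budget; never below 1."""
    if dist == "constant":
        value = base
    elif dist == "uniform":
        value = rng.uniform(0.5 * base, 1.5 * base)   # ±50% spread, an assumption
    elif dist == "normal":
        value = rng.gauss(base, 0.2 * base)           # 20% stddev, an assumption
    elif dist == "exponential":
        value = rng.expovariate(1.0 / base)           # mean = base
    else:
        raise ValueError(f"unknown distribution: {dist}")
    return max(1, int(value))

rng = random.Random(42)
for dist in ("constant", "uniform", "normal", "exponential"):
    print(dist, [sample_max_tokens(dist, 64, rng) for _ in range(3)])
```

Non-constant distributions matter for realism: production traffic mixes short and long generations, which stresses batching and queueing very differently from a fixed output length.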
Per-request custom metrics are emitted via Locust's event system:
| Metric | Unit | Description |
|---|---|---|
| `time_to_first_token` | ms | Time from request send to first streamed token |
| `total_latency` | ms | End-to-end request duration |
| `throughput` | tok/s | Output tokens per second per request |
| `latency_per_token` | ms/tok | Inter-token generation latency |
| `num_tokens` | count | Tokens generated |
| `prompt_tokens` | count | Tokens in the prompt |
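Each of these derives from a few timestamps plus a token count. A minimal sketch of the arithmetic (the `RequestTiming` type is invented for illustration; this is not the actual `load_test.py` implementation):

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    sent_at: float         # request dispatched (seconds, monotonic clock)
    first_token_at: float  # first streamed token received
    done_at: float         # final token received
    num_tokens: int        # output tokens generated

def derive_metrics(t: RequestTiming) -> dict[str, float]:
    total_ms = (t.done_at - t.sent_at) * 1000
    ttft_ms = (t.first_token_at - t.sent_at) * 1000
    gen_seconds = t.done_at - t.first_token_at    # time spent generating tokens
    return {
        "time_to_first_token": ttft_ms,
        "total_latency": total_ms,
        "throughput": t.num_tokens / gen_seconds if gen_seconds > 0 else 0.0,
        "latency_per_token": (gen_seconds * 1000) / t.num_tokens if t.num_tokens else 0.0,
    }

m = derive_metrics(RequestTiming(sent_at=0.0, first_token_at=0.25, done_at=2.25, num_tokens=64))
print(m)  # TTFT 250 ms, total 2250 ms, 32 tok/s, 31.25 ms/tok
```

Note that throughput is computed over the generation window (after the first token), which is why a long TTFT can coexist with healthy tok/s.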
The `locust-plugins` TimescaleDB listener writes every request event into PostgreSQL. The schema (`timescaledb/init.sql`) creates four tables:

- `request` — hypertable; per-request metrics with `time`, `name`, `response_time`, `success`, `request_type`
- `testrun` — benchmark run metadata (start/end time, user count, avg RPS)
- `user_count` — periodic user count snapshots
- `events` — lifecycle events (start, ramp-up complete, stop)
Custom LLM metrics are stored as rows with `request_type = 'METRIC'` and the metric value in `response_time`.
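That convention makes reading a custom metric back a simple filter on `name` and `request_type`. A pure-Python sketch of what the dashboard's queries effectively do (the row shape is illustrative):

```python
# Rows as (time, name, request_type, response_time), mimicking the request table
rows = [
    (1.0, "time_to_first_token",  "METRIC", 240.0),
    (1.0, "total_latency",        "METRIC", 2100.0),
    (1.2, "/v1/chat/completions", "POST",   2100.0),  # ordinary Locust request row
    (2.0, "time_to_first_token",  "METRIC", 260.0),
]

def avg_metric(rows, metric_name: str) -> float:
    """Average a custom metric, ignoring ordinary request rows."""
    vals = [rt for _, name, kind, rt in rows
            if kind == "METRIC" and name == metric_name]
    return sum(vals) / len(vals)

print(avg_metric(rows, "time_to_first_token"))  # → 250.0
```

Overloading `response_time` this way avoids schema changes but means every panel query must filter on `request_type` to keep METRIC rows out of ordinary latency aggregates.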
Since this POC targets macOS (no NVIDIA GPU), the `gpu-sim` service generates DCGM-compatible Prometheus metrics:
- Polls TimescaleDB for recent request rate (RPS)
- Maps RPS → GPU utilization (1 RPS → ~22%, 5 RPS → ~70%, 10 RPS → ~98%)
- Derives correlated memory, temperature, power, and SM clock metrics
- Applies exponential smoothing + Gaussian noise for visual realism
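The mapping and smoothing steps above can be sketched as follows (the breakpoints come from the list above; the smoothing constant and noise level are assumptions, not the exporter's actual values):

```python
import random

# Breakpoints from the RPS → utilization mapping above: (RPS, GPU utilization %)
BREAKPOINTS = [(0, 0), (1, 22), (5, 70), (10, 98)]

def rps_to_util(rps: float) -> float:
    """Piecewise-linear interpolation over the breakpoints, clamped to [0, 98]."""
    rps = max(0.0, min(rps, BREAKPOINTS[-1][0]))
    for (x0, y0), (x1, y1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if rps <= x1:
            return y0 + (y1 - y0) * (rps - x0) / (x1 - x0)
    return float(BREAKPOINTS[-1][1])

def smooth(prev: float, target: float, alpha: float = 0.3,
           rng: random.Random = random.Random()) -> float:
    """Exponential smoothing toward the target, plus Gaussian noise for realism."""
    noisy = target + rng.gauss(0, 1.5)            # ±1.5% noise, an assumption
    return max(0.0, min(100.0, prev + alpha * (noisy - prev)))

util = 0.0
for rps in (1, 3, 5, 8, 10):                      # simulated load ramp
    util = smooth(util, rps_to_util(rps))
    print(f"{rps} RPS -> {util:.1f}% util")
```

Correlated memory, temperature, and power values can be derived from the same utilization number, which is what keeps the simulated panels visually consistent with the benchmark load.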
Exposed metrics: `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_MEM_COPY_UTIL`, `DCGM_FI_DEV_GPU_TEMP`, `DCGM_FI_DEV_FB_USED`, `DCGM_FI_DEV_FB_FREE`, `DCGM_FI_DEV_POWER_USAGE`, `DCGM_FI_DEV_SM_CLOCK`
In production, replace `gpu-sim` with `nvidia-dcgm-exporter` — the dashboard queries work without modification.
| Flag | Description | Default |
|---|---|---|
| `-u` | Concurrent users | 4 |
| `-r` | Spawn rate (users/sec) | 1 |
| `-t` | Test duration | 5m |
| `-p` | Prompt length (tokens) | 128 |
| `-o` | Max output tokens | 64 |
| `-m` | Model name | — |
| `--stream` | Enable token streaming | false |
| `--chat` | Use `/v1/chat/completions` | false |
| `--qps` | Fixed queries-per-second mode | — |
| `--max-tokens-distribution` | Output length distribution (constant/uniform/normal/exponential) | constant |
| Variable | Default | Description |
|---|---|---|
| `MODEL` | tinyllama | Ollama model to benchmark |
| `USERS` | 4 | Concurrent Locust users |
| `SPAWN_RATE` | 1 | Users spawned per second |
| `DURATION` | 5m | Benchmark duration |
| `PROMPT_TOKENS` | 128 | Prompt token count |
| `MAX_TOKENS` | 64 | Max generation tokens |
| Service | Port | URL |
|---|---|---|
| Grafana | 3000 | http://localhost:3000 |
| Prometheus | 9090 | http://localhost:9090 |
| Loki | 3100 | http://localhost:3100 |
| TimescaleDB | 5432 | postgres://postgres:password@localhost:5432/locust |
| GPU Simulator | 9400 | http://localhost:9400/metrics |
| Ollama | 11434 | http://localhost:11434 |
Designed for a 25-minute live demo:
- **Show the empty dashboard** — `docker compose up -d`, open Grafana
- **Start the benchmark** — run Locust, watch panels populate in real time
- **Walk through each row** — explain what each metric tells you about inference health
- **Spike the load** — increase to 10 users, show latency degradation and GPU saturation
- **Correlate metrics** — "GPU is at 95% but TTFT is spiking — we're compute-bound, not memory-bound"
- **Check the logs** — filter for slow requests in the Loki panel
- **Cost analysis** — "At current throughput we're paying $X per million tokens"
- **Bridge to production** — "Replace gpu-sim with dcgm-exporter, same dashboard"
To deploy with actual NVIDIA GPUs:
- Replace the `gpu-sim` service in `docker-compose.yml`:

  ```yaml
  gpu-metrics:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  ```

- Update `prometheus/prometheus.yml` to scrape `gpu-metrics:9400`
- Dashboard queries use standard DCGM metric names — no changes needed
| Problem | Solution |
|---|---|
| Dashboard shows "No data" | Check the time range picker — data may be outside the window. Set to "Last 1 hour" or re-run the benchmark |
| `psycogreen` is not installed | Run `uv sync` — the `locust-plugins[dashboards]` extra installs it |
| Ollama connection refused | Ensure `ollama serve` is running on the host |
| TimescaleDB not ready | Wait ~10s after `docker compose up` for the health check |
| GPU sim shows 0% utilization | The simulator needs active benchmark load to correlate with |
```bash
docker compose down -v   # stop containers, remove volumes
rm -rf .venv             # remove Python virtual environment
```

Apache License 2.0