
Why Is My GPU Idle? — Observability for LLM Inference in Grafana


End-to-end observability stack for LLM inference workloads. Correlates GPU utilization, inference latency, token throughput, and cost efficiency on a single Grafana dashboard — built as a live demo for a 25-minute conference presentation.

The Problem

LLM inference is one of the most expensive workloads in the cloud. GPUs cost $2–30/hour, inference latency directly impacts user experience, and traditional observability stacks fail to answer critical questions:

  • Why is my P99 latency spiking? Is it the model, the GPU, or the request queue?
  • Why is my GPU at 30% utilization when I'm getting timeouts?
  • How much am I paying per million tokens, and is it getting worse?
  • Which requests are slow, and what do they have in common?

This project demonstrates how to build visibility into all of these using open-source tools.

Architecture

┌──────────────────────── Host ────────────────────────────┐
│                                                          │
│  ┌──────────────┐         ┌──────────────────────┐       │
│  │   Ollama     │◄────────│  Locust Benchmark    │       │
│  │ (LLM server) │ OpenAI  │  (load generator)    │────┐  │
│  │   :11434     │   API   │                      │    │  │
│  └──────────────┘         └──────────────────────┘    │  │
│                                                       │  │
└───────────────────────────────────────────────────────│──┘
                                                        │
┌───────────── Docker Compose (Observability) ──────────│──┐
│                                                       ▼  │
│  ┌──────────────┐    Locust metrics    ┌──────────────┐  │
│  │ TimescaleDB  │◄─────────────────────│   Grafana    │  │
│  │ (PostgreSQL) │    (request table)   │    :3000     │  │
│  └──────────────┘                      │              │  │
│                                        │ ┌──────────┐ │  │
│  ┌──────────────┐    DCGM metrics      │ │Dashboard │ │  │
│  │  Prometheus  │◄── GPU Simulator ───►│ │  Panels  │ │  │
│  │    :9090     │       :9400          │ └──────────┘ │  │
│  └──────────────┘                      │              │  │
│                                        │ ┌──────────┐ │  │
│  ┌──────────────┐    container logs    │ │   Logs   │ │  │
│  │     Loki     │◄── Promtail ────────►│ │  Panel   │ │  │
│  │    :3100     │                      │ └──────────┘ │  │
│  └──────────────┘                      └──────────────┘  │
└──────────────────────────────────────────────────────────┘

Data Flow

  1. Locust sends concurrent inference requests to Ollama via the OpenAI-compatible API
  2. Per-request metrics (TTFT, latency, throughput, token counts) are pushed to TimescaleDB in real time via the locust-plugins TimescaleDB listener
  3. A GPU Metrics Simulator generates realistic NVIDIA DCGM metrics (A100-style) correlated with actual benchmark load, scraped by Prometheus
  4. Promtail tails benchmark logs and ships them to Loki
  5. Grafana queries all three datasources to render the LLM Inference Observatory dashboard

Dashboard Panels

The LLM Inference Observatory dashboard has six sections:

Request Overview

Panel                 Source        Description
Request Rate          TimescaleDB   Requests per second over time (10s buckets)
Active Users          TimescaleDB   Current concurrent user count
Error Rate            TimescaleDB   Percentage of failed requests
Total Requests        TimescaleDB   Cumulative request count
Avg Latency           TimescaleDB   Average end-to-end latency (ms)
Latency Percentiles   TimescaleDB   P50 / P95 / P99 total latency over time

Token Performance

Panel                 Source        Description
Time to First Token   TimescaleDB   TTFT avg and P95 over time
Throughput            TimescaleDB   Output tokens per second
Per-Token Latency     TimescaleDB   Inter-token latency avg and P95

Token Counts

Panel                     Source        Description
Output Tokens / Request   TimescaleDB   Average generation length over time
Prompt Tokens / Request   TimescaleDB   Average prompt length over time

GPU Utilization (Simulated A100)

Panel               Source       Description
GPU Utilization %   Prometheus   SM utilization (0–100%)
GPU Memory Used     Prometheus   VRAM usage gauge (80 GB HBM2e)
GPU Temperature     Prometheus   Die temperature in Celsius
GPU Power Draw      Prometheus   Power consumption in watts

Cost Efficiency

Panel                   Source        Description
Est. Cost / 1M Tokens   TimescaleDB   Based on $2.21/hr A100 on-demand pricing
Tokens per GPU-Hour     TimescaleDB   Throughput efficiency metric
Efficiency Trend        TimescaleDB   Aggregate tokens/s over time
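The arithmetic behind these panels is simple enough to sketch. The $2.21/hr rate comes from the panel description above; the function names are illustrative, not the dashboard's actual query logic:

```python
# Sketch of the Cost Efficiency panel arithmetic. The $2.21/hr A100
# on-demand rate is taken from the dashboard; names are illustrative.

GPU_HOURLY_RATE_USD = 2.21  # A100 on-demand, $/hour


def cost_per_million_tokens(tokens_per_second: float) -> float:
    """USD per 1M generated tokens at the given aggregate throughput."""
    cost_per_second = GPU_HOURLY_RATE_USD / 3600
    seconds_per_million = 1_000_000 / tokens_per_second
    return cost_per_second * seconds_per_million


def tokens_per_gpu_hour(tokens_per_second: float) -> float:
    """Throughput efficiency: tokens generated per GPU-hour."""
    return tokens_per_second * 3600


# At 100 tok/s aggregate, a GPU-hour yields 360k tokens,
# so a million tokens costs roughly $6.14.
print(f"${cost_per_million_tokens(100):.2f} per 1M tokens")
```

The takeaway the panels make visible: cost per token is inversely proportional to throughput, so a GPU idling at low utilization makes every token dramatically more expensive.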

Logs

Panel            Source   Description
Benchmark Logs   Loki     Filterable output with per-request details

Prerequisites

Tool     Install                                           Purpose
Docker   Docker Desktop                                    Runs the observability stack
uv       curl -LsSf https://astral.sh/uv/install.sh | sh   Python package manager
Ollama   brew install ollama                               Local LLM inference server

Quick Start

git clone https://github.com/rudrakshkarpe/llm-inference-observatory.git
cd llm-inference-observatory

# Start Ollama and pull a model
ollama serve &
ollama pull llama3.2

# Start the observability stack
docker compose up -d --build

# Install Python dependencies
uv sync

# Run the benchmark (5 minutes)
uv run locust \
  --config locust-observability.conf \
  -u 4 -r 1 -t 5m \
  --provider vllm --chat --stream \
  -p 128 -o 64 \
  -m llama3.2 \
  2>&1 | tee logs/locust.log

# Open Grafana
open http://localhost:3000    # login: admin / admin

Or run everything with one command:

MODEL=llama3.2 ./scripts/demo.sh

Project Structure

llm-inference-observatory/
├── pyproject.toml                 # uv project — Locust + benchmark deps
├── uv.lock                        # Deterministic lockfile
├── .python-version                # Python 3.11
├── docker-compose.yml             # Observability stack (6 services)
├── locust-observability.conf      # Locust → Ollama + TimescaleDB config
│
├── benchmark/                     # LLM load testing suite
│   └── llm_bench/
│       ├── load_test.py           # Locust-based load generator
│       ├── bench_serving.py       # Async request-count benchmark
│       └── plot_bench.py          # CSV → PNG plotting utility
│
├── gpu-sim/                       # Simulated NVIDIA GPU metrics
│   ├── Dockerfile
│   ├── pyproject.toml
│   └── exporter.py                # DCGM-compatible Prometheus exporter
│
├── timescaledb/
│   └── init.sql                   # Schema: request, testrun, user_count, events
│
├── prometheus/
│   └── prometheus.yml             # Scrape config for GPU simulator
│
├── loki/
│   └── loki-config.yml            # Loki single-binary configuration
│
├── promtail/
│   └── promtail-config.yml        # Log shipping: benchmark logs → Loki
│
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── datasources.yml    # TimescaleDB + Prometheus + Loki
│       └── dashboards/
│           ├── dashboards.yml     # File-based dashboard provider
│           └── llm-inference.json # LLM Inference Observatory dashboard
│
├── scripts/
│   └── demo.sh                    # One-command demo launcher
│
└── logs/                          # Benchmark output (tee'd for Promtail)

How It Works

Benchmark Engine

The benchmark suite uses Locust to simulate continuous production-like load against OpenAI-compatible LLM endpoints. It supports configurable concurrency, token length distributions (constant, uniform, normal, exponential), streaming, and both time-bounded and request-count-bounded test modes.
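A minimal sketch of how those output-length distributions might be sampled. The parameter choices and names here are illustrative, not the benchmark's actual internals:

```python
import random


def sample_max_tokens(distribution: str, mean: int,
                      rng: random.Random) -> int:
    """Sample an output-token budget from one of the supported modes.

    Illustrative sketch of the constant/uniform/normal/exponential
    distributions described above; spread parameters are assumptions.
    """
    if distribution == "constant":
        value = float(mean)
    elif distribution == "uniform":
        value = rng.uniform(0.5 * mean, 1.5 * mean)
    elif distribution == "normal":
        value = rng.gauss(mean, 0.25 * mean)
    elif distribution == "exponential":
        value = rng.expovariate(1 / mean)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return max(1, round(value))  # token budgets must be at least 1


rng = random.Random(42)
print([sample_max_tokens("normal", 64, rng) for _ in range(3)])
```

Varying the output length per request matters for realism: production traffic rarely generates a fixed number of tokens, and length variance is what exercises the request queue and tail latencies.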

Per-request custom metrics are emitted via Locust's event system:

Metric                   Unit     Description
time_to_first_token      ms       Time from request send to first streamed token
total_latency            ms       End-to-end request duration
throughput (out tok/s)   tok/s    Output tokens per second per request
latency_per_token        ms/tok   Inter-token generation latency
num_tokens               count    Tokens generated
prompt_tokens            count    Tokens in the prompt
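All of these can be derived from three timestamps plus a token count per request; a self-contained sketch (the real listener's code may differ):

```python
from dataclasses import dataclass


@dataclass
class StreamMetrics:
    time_to_first_token_ms: float
    total_latency_ms: float
    throughput_tok_s: float
    latency_per_token_ms: float


def metrics_from_timestamps(sent_at: float, first_token_at: float,
                            done_at: float,
                            num_tokens: int) -> StreamMetrics:
    """Derive the per-request metrics in the table above from three
    wall-clock timestamps (in seconds) and a generated-token count."""
    generation_seconds = done_at - first_token_at
    return StreamMetrics(
        time_to_first_token_ms=(first_token_at - sent_at) * 1000,
        total_latency_ms=(done_at - sent_at) * 1000,
        throughput_tok_s=(num_tokens / generation_seconds
                          if generation_seconds > 0 else 0.0),
        latency_per_token_ms=(generation_seconds * 1000) / num_tokens,
    )


m = metrics_from_timestamps(sent_at=0.0, first_token_at=0.25,
                            done_at=2.25, num_tokens=64)
print(m)  # TTFT 250 ms, total 2250 ms, 32 tok/s, 31.25 ms/token
```

Note that throughput is computed over the generation window only (after the first token), which is why TTFT and throughput can move independently: a congested queue inflates TTFT without touching per-token speed.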

TimescaleDB Schema

The locust-plugins TimescaleDB listener writes every request event into PostgreSQL. The schema (timescaledb/init.sql) creates four tables:

  • request — hypertable; per-request metrics with time, name, response_time, success, request_type
  • testrun — benchmark run metadata (start/end time, user count, avg RPS)
  • user_count — periodic user count snapshots
  • events — lifecycle events (start, ramp-up complete, stop)

Custom LLM metrics are stored as rows with request_type = 'METRIC' and the metric value in response_time.

GPU Metrics Simulator

Since this POC targets macOS hosts (no NVIDIA GPU), the gpu-sim service generates DCGM-compatible Prometheus metrics:

  1. Polls TimescaleDB for recent request rate (RPS)
  2. Maps RPS → GPU utilization (1 RPS → ~22%, 5 RPS → ~70%, 10 RPS → ~98%)
  3. Derives correlated memory, temperature, power, and SM clock metrics
  4. Applies exponential smoothing + Gaussian noise for visual realism

Exposed metrics: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_SM_CLOCK

In production, replace gpu-sim with nvidia-dcgm-exporter — the dashboard queries work without modification.

Configuration Reference

Locust Parameters

Flag                        Description                                                         Default
-u                          Concurrent users                                                    4
-r                          Spawn rate (users/sec)                                              1
-t                          Test duration                                                       5m
-p                          Prompt length (tokens)                                              128
-o                          Max output tokens                                                   64
-m                          Model name
--stream                    Enable token streaming                                              false
--chat                      Use /v1/chat/completions                                            false
--qps                       Fixed queries-per-second mode
--max-tokens-distribution   Output length distribution (constant/uniform/normal/exponential)   constant

Environment Variables (demo.sh)

Variable        Default     Description
MODEL           tinyllama   Ollama model to benchmark
USERS           4           Concurrent Locust users
SPAWN_RATE      1           Users spawned per second
DURATION        5m          Benchmark duration
PROMPT_TOKENS   128         Prompt token count
MAX_TOKENS      64          Max generation tokens

Service Ports

Service         Port    URL
Grafana         3000    http://localhost:3000
Prometheus      9090    http://localhost:9090
Loki            3100    http://localhost:3100
TimescaleDB     5432    postgres://postgres:password@localhost:5432/locust
GPU Simulator   9400    http://localhost:9400/metrics
Ollama          11434   http://localhost:11434

Presentation Demo Flow

Designed for a 25-minute live demo:

  1. Show the empty dashboard — docker compose up -d, open Grafana
  2. Start the benchmark — run Locust, watch panels populate in real time
  3. Walk through each row — explain what each metric tells you about inference health
  4. Spike the load — increase to 10 users, show latency degradation and GPU saturation
  5. Correlate metrics — "GPU is at 95% but TTFT is spiking — we're compute-bound, not memory-bound"
  6. Check the logs — filter for slow requests in the Loki panel
  7. Cost analysis — "At current throughput we're paying $X per million tokens"
  8. Bridge to production — "Replace gpu-sim with dcgm-exporter, same dashboard"

Using with Real GPUs

To deploy with actual NVIDIA GPUs:

  1. Replace the gpu-sim service in docker-compose.yml:
    gpu-metrics:
      image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
      deploy:
        resources:
          reservations:
            devices:
              - capabilities: [gpu]
  2. Update prometheus/prometheus.yml to scrape gpu-metrics:9400
  3. Dashboard queries use standard DCGM metric names — no changes needed

Troubleshooting

Problem                        Solution
Dashboard shows "No data"      Check the time range picker — data may be outside the window. Set to "Last 1 hour" or re-run the benchmark
psycogreen is not installed    Run uv sync — the locust-plugins[dashboards] extra installs it
Ollama connection refused      Ensure ollama serve is running on the host
TimescaleDB not ready          Wait ~10s after docker compose up for the health check
GPU sim shows 0% utilization   The simulator needs active benchmark load to correlate with

Cleanup

docker compose down -v    # stop containers, remove volumes
rm -rf .venv              # remove Python virtual environment

License

Apache License 2.0
