
Why Is My GPU Idle? — Observability for LLM Inference in Grafana


End-to-end observability stack for LLM inference workloads. Correlates GPU utilization, inference latency, token throughput, and cost efficiency on a single Grafana dashboard — built as a live demo for a 25-minute conference presentation.

The Problem

LLM inference is one of the most expensive workloads in the cloud. GPUs cost $2–30/hour, inference latency directly impacts user experience, and traditional observability stacks fail to answer critical questions:

  • Why is my P99 latency spiking? Is it the model, the GPU, or the request queue?
  • Why is my GPU at 30% utilization when I'm getting timeouts?
  • How much am I paying per million tokens, and is it getting worse?
  • Which requests are slow, and what do they have in common?

This project demonstrates how to build visibility into all of these using open-source tools.

Architecture

┌──────────────────────── Host ────────────────────────────┐
│                                                          │
│  ┌──────────────┐         ┌──────────────────────┐       │
│  │   Ollama     │◄────────│  Locust Benchmark    │       │
│  │ (LLM server) │ OpenAI  │  (load generator)    │────┐  │
│  │   :11434     │   API   │                      │    │  │
│  └──────────────┘         └──────────────────────┘    │  │
│                                                       │  │
└───────────────────────────────────────────────────────│──┘
                                                        │
┌───────────── Docker Compose (Observability) ──────────│──┐
│                                                       ▼  │
│  ┌──────────────┐    Locust metrics    ┌──────────────┐  │
│  │ TimescaleDB  │◄─────────────────────│   Grafana    │  │
│  │ (PostgreSQL) │    (request table)   │    :3000     │  │
│  └──────────────┘                      │              │  │
│                                        │ ┌──────────┐ │  │
│  ┌──────────────┐    DCGM metrics      │ │Dashboard │ │  │
│  │  Prometheus  │◄── GPU Simulator ───►│ │  Panels  │ │  │
│  │    :9090     │       :9400          │ └──────────┘ │  │
│  └──────────────┘                      │              │  │
│                                        │ ┌──────────┐ │  │
│  ┌──────────────┐    container logs    │ │   Logs   │ │  │
│  │     Loki     │◄── Promtail ────────►│ │  Panel   │ │  │
│  │    :3100     │                      │ └──────────┘ │  │
│  └──────────────┘                      └──────────────┘  │
└──────────────────────────────────────────────────────────┘

Data Flow

  1. Locust sends concurrent inference requests to Ollama via the OpenAI-compatible API
  2. Per-request metrics (TTFT, latency, throughput, token counts) are pushed to TimescaleDB in real time via the locust-plugins TimescaleDB listener
  3. A GPU Metrics Simulator generates realistic NVIDIA DCGM metrics (A100-style) correlated with actual benchmark load, scraped by Prometheus
  4. Promtail tails benchmark logs and ships them to Loki
  5. Grafana queries all three datasources to render the LLM Inference Observatory dashboard

Dashboard Panels

The LLM Inference Observatory dashboard has six sections:

Request Overview

Panel                 Source        Description
Request Rate          TimescaleDB   Requests per second over time (10s buckets)
Active Users          TimescaleDB   Current concurrent user count
Error Rate            TimescaleDB   Percentage of failed requests
Total Requests        TimescaleDB   Cumulative request count
Avg Latency           TimescaleDB   Average end-to-end latency (ms)
Latency Percentiles   TimescaleDB   P50 / P95 / P99 total latency over time

Token Performance

Panel                 Source        Description
Time to First Token   TimescaleDB   TTFT avg and P95 over time
Throughput            TimescaleDB   Output tokens per second
Per-Token Latency     TimescaleDB   Inter-token latency avg and P95

Token Counts

Panel                     Source        Description
Output Tokens / Request   TimescaleDB   Average generation length over time
Prompt Tokens / Request   TimescaleDB   Average prompt length over time

GPU Utilization (Simulated A100)

Panel               Source       Description
GPU Utilization %   Prometheus   SM utilization (0–100%)
GPU Memory Used     Prometheus   VRAM usage gauge (80 GB HBM2e)
GPU Temperature     Prometheus   Die temperature in Celsius
GPU Power Draw      Prometheus   Power consumption in watts

Cost Efficiency

Panel                   Source        Description
Est. Cost / 1M Tokens   TimescaleDB   Based on $2.21/hr A100 on-demand pricing
Tokens per GPU-Hour     TimescaleDB   Throughput efficiency metric
Efficiency Trend        TimescaleDB   Aggregate tokens/s over time
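The arithmetic behind these panels is simple enough to sketch. The $2.21/hr rate comes from the panel description above; the function names are illustrative, not the dashboard's actual query logic:

```python
# Sketch of the Cost Efficiency panel arithmetic. The $2.21/hr A100
# on-demand rate is taken from the dashboard; names are illustrative.

GPU_HOURLY_RATE_USD = 2.21  # A100 on-demand, $/hour


def cost_per_million_tokens(tokens_per_second: float) -> float:
    """USD per 1M generated tokens at the given aggregate throughput."""
    cost_per_second = GPU_HOURLY_RATE_USD / 3600
    seconds_per_million = 1_000_000 / tokens_per_second
    return cost_per_second * seconds_per_million


def tokens_per_gpu_hour(tokens_per_second: float) -> float:
    """Throughput efficiency: tokens generated per GPU-hour."""
    return tokens_per_second * 3600


# At 100 tok/s aggregate, a GPU-hour yields 360k tokens,
# so a million tokens costs roughly $6.14.
print(f"${cost_per_million_tokens(100):.2f} per 1M tokens")
```

The takeaway the panels make visible: cost per token is inversely proportional to throughput, so a GPU idling at low utilization makes every token dramatically more expensive.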

Logs

Panel            Source   Description
Benchmark Logs   Loki     Filterable output with per-request details

Prerequisites

Tool     Install                                           Purpose
Docker   Docker Desktop                                    Runs the observability stack
uv       curl -LsSf https://astral.sh/uv/install.sh | sh   Python package manager
Ollama   brew install ollama                               Local LLM inference server

Quick Start

git clone https://github.com/rudrakshkarpe/llm-inference-observatory.git
cd llm-inference-observatory

# Start Ollama and pull a model
ollama serve &
ollama pull llama3.2

# Start the observability stack
docker compose up -d --build

# Install Python dependencies
uv sync

# Run the benchmark (5 minutes)
uv run locust \
  --config locust-observability.conf \
  -u 4 -r 1 -t 5m \
  --provider vllm --chat --stream \
  -p 128 -o 64 \
  -m llama3.2 \
  2>&1 | tee logs/locust.log

# Open Grafana
open http://localhost:3000    # login: admin / admin

Or run everything with one command:

MODEL=llama3.2 ./scripts/demo.sh

Project Structure

llm-inference-observatory/
├── pyproject.toml                 # uv project — Locust + benchmark deps
├── uv.lock                        # Deterministic lockfile
├── .python-version                # Python 3.11
├── docker-compose.yml             # Observability stack (6 services)
├── locust-observability.conf      # Locust → Ollama + TimescaleDB config
│
├── benchmark/                     # LLM load testing suite
│   └── llm_bench/
│       ├── load_test.py           # Locust-based load generator
│       ├── bench_serving.py       # Async request-count benchmark
│       └── plot_bench.py          # CSV → PNG plotting utility
│
├── gpu-sim/                       # Simulated NVIDIA GPU metrics
│   ├── Dockerfile
│   ├── pyproject.toml
│   └── exporter.py                # DCGM-compatible Prometheus exporter
│
├── timescaledb/
│   └── init.sql                   # Schema: request, testrun, user_count, events
│
├── prometheus/
│   └── prometheus.yml             # Scrape config for GPU simulator
│
├── loki/
│   └── loki-config.yml            # Loki single-binary configuration
│
├── promtail/
│   └── promtail-config.yml        # Log shipping: benchmark logs → Loki
│
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── datasources.yml    # TimescaleDB + Prometheus + Loki
│       └── dashboards/
│           ├── dashboards.yml     # File-based dashboard provider
│           └── llm-inference.json # LLM Inference Observatory dashboard
│
├── scripts/
│   └── demo.sh                    # One-command demo launcher
│
└── logs/                          # Benchmark output (tee'd for Promtail)

How It Works

Benchmark Engine

The benchmark suite uses Locust to simulate continuous production-like load against OpenAI-compatible LLM endpoints. It supports configurable concurrency, token length distributions (constant, uniform, normal, exponential), streaming, and both time-bounded and request-count-bounded test modes.
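A minimal sketch of how those output-length distributions might be sampled. The parameter choices and names here are illustrative, not the benchmark's actual internals:

```python
import random


def sample_max_tokens(distribution: str, mean: int,
                      rng: random.Random) -> int:
    """Sample an output-token budget from one of the supported modes.

    Illustrative sketch of the constant/uniform/normal/exponential
    distributions described above; spread parameters are assumptions.
    """
    if distribution == "constant":
        value = float(mean)
    elif distribution == "uniform":
        value = rng.uniform(0.5 * mean, 1.5 * mean)
    elif distribution == "normal":
        value = rng.gauss(mean, 0.25 * mean)
    elif distribution == "exponential":
        value = rng.expovariate(1 / mean)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return max(1, round(value))  # token budgets must be at least 1


rng = random.Random(42)
print([sample_max_tokens("normal", 64, rng) for _ in range(3)])
```

Varying the output length per request matters for realism: production traffic rarely generates a fixed number of tokens, and length variance is what exercises the request queue and tail latencies.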

Per-request custom metrics are emitted via Locust's event system:

Metric                   Unit     Description
time_to_first_token      ms       Time from request send to first streamed token
total_latency            ms       End-to-end request duration
throughput (out tok/s)   tok/s    Output tokens per second per request
latency_per_token        ms/tok   Inter-token generation latency
num_tokens               count    Tokens generated
prompt_tokens            count    Tokens in the prompt
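All of these can be derived from three timestamps plus a token count per request; a self-contained sketch (the real listener's code may differ):

```python
from dataclasses import dataclass


@dataclass
class StreamMetrics:
    time_to_first_token_ms: float
    total_latency_ms: float
    throughput_tok_s: float
    latency_per_token_ms: float


def metrics_from_timestamps(sent_at: float, first_token_at: float,
                            done_at: float,
                            num_tokens: int) -> StreamMetrics:
    """Derive the per-request metrics in the table above from three
    wall-clock timestamps (in seconds) and a generated-token count."""
    generation_seconds = done_at - first_token_at
    return StreamMetrics(
        time_to_first_token_ms=(first_token_at - sent_at) * 1000,
        total_latency_ms=(done_at - sent_at) * 1000,
        throughput_tok_s=(num_tokens / generation_seconds
                          if generation_seconds > 0 else 0.0),
        latency_per_token_ms=(generation_seconds * 1000) / num_tokens,
    )


m = metrics_from_timestamps(sent_at=0.0, first_token_at=0.25,
                            done_at=2.25, num_tokens=64)
print(m)  # TTFT 250 ms, total 2250 ms, 32 tok/s, 31.25 ms/token
```

Note that throughput is computed over the generation window only (after the first token), which is why TTFT and throughput can move independently: a congested queue inflates TTFT without touching per-token speed.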

TimescaleDB Schema

The locust-plugins TimescaleDB listener writes every request event into PostgreSQL. The schema (timescaledb/init.sql) creates four tables:

  • request — hypertable; per-request metrics with time, name, response_time, success, request_type
  • testrun — benchmark run metadata (start/end time, user count, avg RPS)
  • user_count — periodic user count snapshots
  • events — lifecycle events (start, ramp-up complete, stop)

Custom LLM metrics are stored as rows with request_type = 'METRIC' and the metric value in response_time.

GPU Metrics Simulator

Since this POC targets macOS hosts (no NVIDIA GPU), the gpu-sim service generates DCGM-compatible Prometheus metrics:

  1. Polls TimescaleDB for recent request rate (RPS)
  2. Maps RPS → GPU utilization (1 RPS → ~22%, 5 RPS → ~70%, 10 RPS → ~98%)
  3. Derives correlated memory, temperature, power, and SM clock metrics
  4. Applies exponential smoothing + Gaussian noise for visual realism

Exposed metrics: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_SM_CLOCK

In production, replace gpu-sim with nvidia-dcgm-exporter — the dashboard queries work without modification.

Configuration Reference

Locust Parameters

Flag                        Description                                                         Default
-u                          Concurrent users                                                    4
-r                          Spawn rate (users/sec)                                              1
-t                          Test duration                                                       5m
-p                          Prompt length (tokens)                                              128
-o                          Max output tokens                                                   64
-m                          Model name
--stream                    Enable token streaming                                              false
--chat                      Use /v1/chat/completions                                            false
--qps                       Fixed queries-per-second mode
--max-tokens-distribution   Output length distribution (constant/uniform/normal/exponential)   constant

Environment Variables (demo.sh)

Variable        Default     Description
MODEL           tinyllama   Ollama model to benchmark
USERS           4           Concurrent Locust users
SPAWN_RATE      1           Users spawned per second
DURATION        5m          Benchmark duration
PROMPT_TOKENS   128         Prompt token count
MAX_TOKENS      64          Max generation tokens

Service Ports

Service         Port    URL
Grafana         3000    http://localhost:3000
Prometheus      9090    http://localhost:9090
Loki            3100    http://localhost:3100
TimescaleDB     5432    postgres://postgres:password@localhost:5432/locust
GPU Simulator   9400    http://localhost:9400/metrics
Ollama          11434   http://localhost:11434

Presentation Demo Flow

Designed for a 25-minute live demo:

  1. Show the empty dashboard — docker compose up -d, open Grafana
  2. Start the benchmark — run Locust, watch panels populate in real time
  3. Walk through each row — explain what each metric tells you about inference health
  4. Spike the load — increase to 10 users, show latency degradation and GPU saturation
  5. Correlate metrics — "GPU is at 95% but TTFT is spiking — we're compute-bound, not memory-bound"
  6. Check the logs — filter for slow requests in the Loki panel
  7. Cost analysis — "At current throughput we're paying $X per million tokens"
  8. Bridge to production — "Replace gpu-sim with dcgm-exporter, same dashboard"

Using with Real GPUs

To deploy with actual NVIDIA GPUs:

  1. Replace the gpu-sim service in docker-compose.yml:
    gpu-metrics:
      image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
      deploy:
        resources:
          reservations:
            devices:
              - capabilities: [gpu]
  2. Update prometheus/prometheus.yml to scrape gpu-metrics:9400
  3. Dashboard queries use standard DCGM metric names — no changes needed

Troubleshooting

Problem                        Solution
Dashboard shows "No data"      Check the time range picker — data may be outside the window. Set to "Last 1 hour" or re-run the benchmark
psycogreen is not installed    Run uv sync — the locust-plugins[dashboards] extra installs it
Ollama connection refused      Ensure ollama serve is running on the host
TimescaleDB not ready          Wait ~10s after docker compose up for the health check
GPU sim shows 0% utilization   The simulator needs active benchmark load to correlate with

Cleanup

docker compose down -v    # stop containers, remove volumes
rm -rf .venv              # remove Python virtual environment

License

Apache License 2.0
