The gateway exposes OpenTelemetry metrics via a Prometheus exporter. When enabled, metrics are available at GET /metrics in the standard Prometheus text format.

    metrics:
      enabled: true

All metric names start with the prefix llm_gateway. (including the trailing dot).
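
The dotted names on this page are the OpenTelemetry instrument names. In scraped output they appear in Prometheus form, as the example queries further down this page show (llm_gateway_requests_total, llm_gateway_request_duration_seconds_sum). The following is a minimal sketch of that mapping, inferred from those queries; `prometheus_names` is a hypothetical helper, not part of the gateway, and the exact suffix rules depend on the exporter version.

```python
def prometheus_names(otel_name: str, kind: str, unit: str = "") -> list[str]:
    """Return the Prometheus series names expected for one instrument.

    Assumption: dots become underscores, a spelled-out unit is appended,
    counters gain `_total`, and histograms expand to the usual
    `_bucket` / `_sum` / `_count` series.
    """
    base = otel_name.replace(".", "_")
    if unit:
        base = f"{base}_{unit}"
    if kind == "counter":
        return [f"{base}_total"]
    if kind == "histogram":
        return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]
    if kind == "up_down_counter":
        # Exposed as a plain gauge-style series with no suffix.
        return [base]
    raise ValueError(f"unknown instrument kind: {kind}")


requests_series = prometheus_names("llm_gateway.requests", "counter")
duration_series = prometheus_names("llm_gateway.request.duration", "histogram", "seconds")
inflight_series = prometheus_names("llm_gateway.requests.inflight", "up_down_counter")
```

If a query returns no data, comparing its metric name against the raw /metrics output is the quickest way to confirm the actual naming.
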
llm_gateway.requests (counter) -- total chat completion requests.

| Attribute | Values | Description |
|---|---|---|
| provider | openai, anthropic, ollama | which provider handled the request |
| model | model name | the model used |
| streaming | true, false | whether the request was streaming |
| key | key name or empty | the API key name (when virtual API keys are enabled) |

llm_gateway.request.duration (histogram, seconds) -- end-to-end request duration including upstream provider latency.

| Attribute | Values | Description |
|---|---|---|
| provider | openai, anthropic, ollama | which provider handled the request |
| model | model name | the model used |
| key | key name or empty | the API key name (when virtual API keys are enabled) |

llm_gateway.requests.inflight (up-down counter) -- number of requests currently being processed. Incremented when a request enters the handler, decremented when it completes. Useful for understanding concurrency and detecting request pileups.
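
The increment-on-entry, decrement-on-completion pattern can be sketched as follows. This is an illustration only: the gateway's own handler code is not shown in this document, and `InflightGauge` and `track_inflight` are hypothetical stand-ins for a real up-down counter instrument.

```python
import threading
from contextlib import contextmanager


class InflightGauge:
    """Thread-safe stand-in for an up-down counter (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = 0

    def add(self, delta: int) -> None:
        with self._lock:
            self._value += delta

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


@contextmanager
def track_inflight(gauge: InflightGauge):
    """Count a request as in flight for the duration of the `with` block."""
    gauge.add(1)
    try:
        yield
    finally:
        # Runs on success and on error, so failed requests never leak a +1.
        gauge.add(-1)


inflight = InflightGauge()
with track_inflight(inflight):
    during = inflight.value  # the request is being processed here
after = inflight.value
```

Putting the decrement in a `finally` clause matters: without it, every request that raises would leave the value permanently inflated and look like a pileup.
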
llm_gateway.tokens.prompt (counter) -- total prompt (input) tokens across all requests.

| Attribute | Values | Description |
|---|---|---|
| provider | provider name | which provider reported the usage |
| model | model name | the model used |

llm_gateway.tokens.completion (counter) -- total completion (output) tokens across all requests.

| Attribute | Values | Description |
|---|---|---|
| provider | provider name | which provider reported the usage |
| model | model name | the model used |

Token metrics are recorded from the usage field in non-streaming chat completion responses. Streaming responses typically do not include token counts.
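
As a rough sketch of what reading the usage field involves, the function below pulls both token counts out of a response body and returns nothing when they are absent. It assumes the OpenAI-style field names `prompt_tokens` and `completion_tokens`, which this document does not confirm; `extract_token_usage` is a hypothetical helper.

```python
from typing import Optional


def extract_token_usage(response: dict) -> Optional[tuple[int, int]]:
    """Return (prompt_tokens, completion_tokens), or None if usage is missing.

    Assumption: the response follows the OpenAI chat completion shape,
    with a top-level "usage" object.
    """
    usage = response.get("usage")
    if not isinstance(usage, dict):
        return None
    prompt = usage.get("prompt_tokens")
    completion = usage.get("completion_tokens")
    if not isinstance(prompt, int) or not isinstance(completion, int):
        return None
    return prompt, completion


with_usage = extract_token_usage(
    {"usage": {"prompt_tokens": 12, "completion_tokens": 30}}
)
without_usage = extract_token_usage({"choices": []})
```

Returning None rather than zeros keeps "no usage reported" distinct from "zero tokens used", so the counters are simply not incremented for responses that carry no counts.
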
llm_gateway.routing.decisions (counter) -- semantic routing decisions, counted each time the router selects a model.

| Attribute | Values | Description |
|---|---|---|
| method | explicit, heuristic, semantic, classifier, default | which routing layer made the decision |

A high proportion of default decisions may indicate that thresholds are too strict or that route examples don't cover your traffic well.
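
To make "a high proportion" concrete, the share of default decisions can be computed from per-method counts over some window. The sketch below uses made-up numbers, and `default_share` is a hypothetical helper; what counts as "too high" depends on your traffic.

```python
def default_share(decisions: dict[str, float]) -> float:
    """Fraction of routing decisions that fell through to the default method."""
    total = sum(decisions.values())
    if total == 0:
        return 0.0
    return decisions.get("default", 0.0) / total


# Illustrative counts per routing method over one window (not real data).
window = {"explicit": 10, "semantic": 50, "default": 40}
share = default_share(window)
```
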
llm_gateway.provider.errors (counter) -- errors returned by upstream providers.

| Attribute | Values | Description |
|---|---|---|
| error_type | invalid_request_error, authentication_error, rate_limit_error, server_error, not_found_error, service_unavailable, unknown | the error category |

llm_gateway.endpoint.healthy (up-down counter) -- per-endpoint health status for multi-endpoint mode. Value is 1 for healthy endpoints and 0 for unhealthy endpoints. The endpoint attribute identifies the endpoint by name.
Point your Prometheus instance at the gateway's /metrics endpoint:

    # prometheus.yml
    scrape_configs:
      - job_name: llm-gateway
        scrape_interval: 15s
        static_configs:
          - targets: ["localhost:8080"]

Requests per minute by provider:

    sum by (provider) (rate(llm_gateway_requests_total[5m])) * 60

Average request duration by model:

    sum by (model) (rate(llm_gateway_request_duration_seconds_sum[5m])) / sum by (model) (rate(llm_gateway_request_duration_seconds_count[5m]))

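
The query above works because a histogram exposes a running sum of observed durations and a running count of observations; dividing the increase of one by the increase of the other gives the mean over the window, and the window length cancels out. A small worked sketch with made-up sample values (`average_duration` is a hypothetical helper):

```python
from typing import Optional


def average_duration(
    sum_start: float, sum_end: float, count_start: int, count_end: int
) -> Optional[float]:
    """Mean request duration in seconds between two scrapes of a histogram."""
    requests = count_end - count_start
    if requests <= 0:
        # No new observations in the window, so the average is undefined.
        return None
    return (sum_end - sum_start) / requests


# Illustrative values: 60 requests took 30 seconds in total during the window.
mean_seconds = average_duration(120.0, 150.0, 400, 460)
```

An average hides tail latency, so it is best read alongside the in-flight gauge or a quantile derived from the histogram buckets.
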
Token throughput (tokens per second):

    sum(rate(llm_gateway_tokens_prompt_total[5m])) + sum(rate(llm_gateway_tokens_completion_total[5m]))

Error rate as a percentage of total requests:

    sum(rate(llm_gateway_provider_errors_total[5m])) / sum(rate(llm_gateway_requests_total[5m])) * 100

Routing method distribution:

    sum by (method) (rate(llm_gateway_routing_decisions_total[5m]))

Current in-flight requests:

    llm_gateway_requests_inflight