Proxying upstream prometheus metrics #824

simcop2387 · 2026-06-07T12:28:00Z


Unless this already exists and I'm just missing it this morning while I'm not caffeinated, I'm thinking of building this myself.

The idea is to add an /upstream/metrics endpoint (or maybe another location to avoid path conflicts) that proxies the upstream server's prometheus metrics, if enabled:

models:
  LLama-9.0-AGI:
    proxyMetrics: true
    cmd: |
       llama-server --metrics ....
...

and then the upstream metrics would be proxied from http://llama-swap/upstream/LLama-9.0-AGI/metrics

The original output from llama.cpp's server:

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 73163
# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total 83.483
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total 7637
# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total 160.979
# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total 1736
# HELP llamacpp:n_tokens_max Largest observed n_tokens.
# TYPE llamacpp:n_tokens_max counter
llamacpp:n_tokens_max 63182
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 876.382
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 47.441
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred 0
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode gauge
llamacpp:n_busy_slots_per_decode 1

would become

# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total{model="LLama-9.0-AGI"} 73163
# HELP llamacpp:prompt_seconds_total Prompt process time
# TYPE llamacpp:prompt_seconds_total counter
llamacpp:prompt_seconds_total{model="LLama-9.0-AGI"} 83.483
# HELP llamacpp:tokens_predicted_total Number of generation tokens processed.
# TYPE llamacpp:tokens_predicted_total counter
llamacpp:tokens_predicted_total{model="LLama-9.0-AGI"} 7637
# HELP llamacpp:tokens_predicted_seconds_total Predict process time
# TYPE llamacpp:tokens_predicted_seconds_total counter
llamacpp:tokens_predicted_seconds_total{model="LLama-9.0-AGI"} 160.979
# HELP llamacpp:n_decode_total Total number of llama_decode() calls
# TYPE llamacpp:n_decode_total counter
llamacpp:n_decode_total{model="LLama-9.0-AGI"} 1736
# HELP llamacpp:n_tokens_max Largest observed n_tokens.
# TYPE llamacpp:n_tokens_max counter
llamacpp:n_tokens_max{model="LLama-9.0-AGI"} 63182
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds{model="LLama-9.0-AGI"} 876.382
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds{model="LLama-9.0-AGI"} 47.441
# HELP llamacpp:requests_processing Number of requests processing.
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing{model="LLama-9.0-AGI"} 0
# HELP llamacpp:requests_deferred Number of requests deferred.
# TYPE llamacpp:requests_deferred gauge
llamacpp:requests_deferred{model="LLama-9.0-AGI"} 0
# HELP llamacpp:n_busy_slots_per_decode Average number of busy slots per llama_decode() call
# TYPE llamacpp:n_busy_slots_per_decode gauge
llamacpp:n_busy_slots_per_decode{model="LLama-9.0-AGI"} 1

and could be combined with the metrics from whatever other models are loaded, then parsed and handled properly. This way I could set up prometheus (or whatever is collecting them) to get all the upstream metrics. Right now they don't have a URL through llama-swap that wouldn't cause issues (i.e. directly hitting them is going to constantly cause it to swap models in and out without making actual requests).
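A minimal sketch of the relabeling step described above, in Go. The function name `addModelLabel` is hypothetical and not part of llama-swap. It assumes the upstream output is the plain Prometheus text format shown above and that model names need no escaping beyond what Go's `%q` produces.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// addModelLabel (hypothetical helper) rewrites Prometheus text-format samples
// so that each one carries a model="<name>" label. Comment lines (# HELP,
// # TYPE) and blank lines pass through unchanged.
func addModelLabel(exposition, model string) string {
	// Assumption: %q quoting is close enough to Prometheus label escaping
	// for ordinary model names.
	label := fmt.Sprintf("model=%q", model)
	var out strings.Builder
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "" || strings.HasPrefix(line, "#"):
			out.WriteString(line)
		case strings.Contains(line, "{"):
			// Sample already has a label set: put the model label first.
			i := strings.Index(line, "{")
			rest := line[i+1:]
			if strings.HasPrefix(rest, "}") {
				out.WriteString(line[:i+1] + label + rest)
			} else {
				out.WriteString(line[:i+1] + label + "," + rest)
			}
		default:
			// Bare sample of the form "name value [timestamp]".
			i := strings.IndexAny(line, " \t")
			if i < 0 {
				out.WriteString(line) // not a sample; leave untouched
			} else {
				out.WriteString(line[:i] + "{" + label + "}" + line[i:])
			}
		}
		out.WriteByte('\n')
	}
	return out.String()
}
```

Calling `addModelLabel(body, "LLama-9.0-AGI")` on the original output turns `llamacpp:requests_processing 0` into `llamacpp:requests_processing{model="LLama-9.0-AGI"} 0`. When merging output from several models, the repeated `# HELP` and `# TYPE` lines would probably need de-duplicating, since some parsers reject a second `# TYPE` line for the same metric; this sketch does not handle that.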

I'm thinking this would need a flag on each model to allow it to work since --metrics isn't on by default in llama-server, and proxying to other external services isn't going to expose prometheus metrics for users either.

I'm also looking at a PR for llama.cpp first that's going to expose the VRAM allocations and their breakdown in the prometheus metrics too, since that'll let me do some scanning and planning of parameters for tuning context sizes and -np flags for models.

mostlygeek (Maintainer) · 2026-06-07T14:17:28Z

The /upstream endpoint isn't the right place to put it as it would mess with the routing.

There's currently a prometheus endpoint at /metrics that would be better to use. What you could do is to query all currently loaded servers in parallel with a very short connection timeout and just let the ones that don't support /metrics fail. For the ones that fail I would keep a map[string]bool skip list.

There can be a new model config param for "metricsEndpoint" which defaults to /metrics. I would suggest a top level config "fetchModelMetrics" default false.
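As a rough illustration of how those two settings might interact, here is a Go sketch. The struct and field names are assumptions modeled on the parameter names suggested above, not llama-swap's actual config types, and YAML parsing is left out.

```go
package main

// modelConfig and config are hypothetical shapes for illustration only.
type modelConfig struct {
	MetricsEndpoint string // per-model path; empty means use the default
}

type config struct {
	FetchModelMetrics bool // top-level switch; the zero value (false) disables fetching
	Models            map[string]modelConfig
}

const defaultMetricsEndpoint = "/metrics"

// metricsPath returns the path to scrape for a model, and false when
// fetching is disabled or the model is not configured.
func (c config) metricsPath(model string) (string, bool) {
	if !c.FetchModelMetrics {
		return "", false
	}
	m, ok := c.Models[model]
	if !ok {
		return "", false
	}
	if m.MetricsEndpoint == "" {
		return defaultMetricsEndpoint, true
	}
	return m.MetricsEndpoint, true
}
```

With this shape, leaving both settings out of the config keeps the feature off, and a model only needs its own `metricsEndpoint` when its upstream serves metrics somewhere other than /metrics.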

Something to keep in mind is that there will likely be gaps in the data depending on whether the model is loaded or not. Grafana and prometheus seem to handle this well so not a big deal.
