Proxying upstream prometheus metrics #824
Replies: 1 comment
The /upstream endpoint isn't the right place for this, as it would interfere with routing. There's already a Prometheus endpoint at /metrics that would be better to use. What you could do is query all currently loaded servers in parallel with a very short connection timeout and just let the ones that don't support /metrics fail. For the ones that fail, I would keep a map[string]bool skip list. There could be a new model config param, "metricsEndpoint", which defaults to /metrics. I would also suggest a top-level config option, "fetchModelMetrics", defaulting to false. Something to keep in mind is that there will likely be gaps in the data depending on whether the model is loaded or not. Grafana and Prometheus seem to handle this well, so it's not a big deal.
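A rough sketch of the parallel fetch with a skip list described above, in Go. None of this is existing llama-swap code: `collector`, `fetchFunc`, `httpFetch`, `errNoMetrics`, the model name, and the address in `main` are all hypothetical names chosen for illustration. Note also that `http.Client.Timeout` bounds the whole request, not just the connection.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// errNoMetrics marks an upstream that answered but did not serve metrics.
var errNoMetrics = errors.New("upstream does not expose metrics")

// fetchFunc retrieves the raw metrics text from one URL (hypothetical type).
type fetchFunc func(url string) (string, error)

// httpFetch returns a fetchFunc backed by an HTTP client with a short timeout.
func httpFetch(timeout time.Duration) fetchFunc {
	client := &http.Client{Timeout: timeout}
	return func(url string) (string, error) {
		resp, err := client.Get(url)
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return "", fmt.Errorf("%s: %w (status %d)", url, errNoMetrics, resp.StatusCode)
		}
		body, err := io.ReadAll(resp.Body)
		return string(body), err
	}
}

// collector remembers which models failed so they are not queried again.
type collector struct {
	mu    sync.Mutex
	skip  map[string]bool
	fetch fetchFunc
}

func newCollector(fetch fetchFunc) *collector {
	return &collector{skip: make(map[string]bool), fetch: fetch}
}

// collect queries every loaded model in parallel. upstreams maps a model
// name to its base URL; endpoint is the metrics path, e.g. "/metrics".
// Models whose fetch fails are added to the skip list and left out.
func (c *collector) collect(upstreams map[string]string, endpoint string) map[string]string {
	results := make(map[string]string)
	var wg sync.WaitGroup
	for model, base := range upstreams {
		c.mu.Lock()
		skipped := c.skip[model]
		c.mu.Unlock()
		if skipped {
			continue
		}
		wg.Add(1)
		go func(model, url string) {
			defer wg.Done()
			body, err := c.fetch(url)
			c.mu.Lock()
			defer c.mu.Unlock()
			if err != nil {
				c.skip[model] = true
				return
			}
			results[model] = body
		}(model, base+endpoint)
	}
	wg.Wait()
	return results
}

func main() {
	c := newCollector(httpFetch(200 * time.Millisecond))
	// Hypothetical model name and address, for illustration only.
	out := c.collect(map[string]string{"example-model": "http://127.0.0.1:9999"}, "/metrics")
	fmt.Println(len(out), "upstreams returned metrics")
}
```

One design question this leaves open: the sketch skips a model after any error, including a timeout, so a real implementation would probably want to clear the skip entry when a model is reloaded.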
Unless this already exists and I'm just missing it this morning while I'm not caffeinated, I'm thinking of building this myself.
Adding an /upstream/metrics (or maybe another location to avoid path conflicts) that proxies data from the upstream server's prometheus metrics if enabled:
and then the upstream metrics would be proxied from http://llama-swap/upstream/LLama-9.0-AGI/metrics
original from llama.cpp's server
would become
and could be combined with any metrics from whatever models are loaded, then parsed and handled properly. This way I could set up Prometheus (or whatever is collecting them) to get all the upstream metrics, since they don't currently have a URL through llama-swap that wouldn't cause issues (i.e. directly hitting them is going to constantly cause it to swap models in and out without making actual requests).
I'm thinking this would need a flag on each model to allow it to work, since --metrics isn't on by default in llama-server, and proxying to other external services isn't going to expose Prometheus metrics for users either.
I'm also looking at a PR for llama.cpp first that's going to expose the VRAM allocations and their breakdown in the Prometheus metrics too, since that'll let me do some scanning and planning of parameters for tuning context sizes and -np flags for models.