feat(monitoring): add llm oriented metrics to grafana dashboard#703
feat(monitoring): add llm oriented metrics to grafana dashboard#703tibo-pdn wants to merge 10 commits into
Conversation
…llm-oriented-metrics-to-grafana-dashboard
| metric.labels(endpoint=endpoint, model=model, type="prompt").inc(usage.prompt_tokens) | ||
| if usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens) | ||
| except Exception: |
Check notice
Code scanning / CodeQL
Empty except Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
In general, the fix is to stop silently swallowing all exceptions. For non-critical metrics code, the usual pattern is: keep the broad except Exception (so metrics never break requests) but add lightweight logging in the handler so that failures are visible. This maintains existing behavior (no exception propagation) but avoids losing information.
The best fix here is to:
- Keep the
try/except Exception:structure so that metrics failures never affect the main application. - In each
exceptblock, call a logger to record the exception with context (e.g., which instrumentation function failed). - Reuse a single module-level logger (using Python’s standard
loggingmodule) so that the rest of the system can route these logs appropriately.
Concretely in api/helpers/_metricsmiddleware.py:
- Add
import loggingat the top and definelogger = logging.getLogger(__name__)after the imports. - For each of the four
instrumentationfunctions shown, replaceexcept Exception:\n passwithexcept Exception:\n logger.exception("..."), using a message that identifies the specific metric (e.g.,"Error recording inference_requests_total metric"). This keeps external behavior the same (no raised exceptions), but ensures errors are visible.
No additional third‑party dependencies are needed; we use Python’s built‑in logging module.
| @@ -1,11 +1,14 @@ | ||
| from collections.abc import Callable | ||
|
|
||
| import logging | ||
| from prometheus_client import Counter, Histogram | ||
| from prometheus_fastapi_instrumentator.metrics import Info | ||
|
|
||
| from api.utils.context import request_context | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def _build_metric_name(namespace: str, name: str) -> str: | ||
| return f"{namespace}_{name}" if namespace else name | ||
|
|
||
| @@ -30,7 +27,7 @@ | ||
| status_code=info.modified_status, | ||
| ).inc() | ||
| except Exception: | ||
| pass | ||
| logger.exception("Error recording inference_requests_total metric") | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -160,7 +157,7 @@ | ||
| status_code=info.modified_status, | ||
| ).observe(ttft) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Error recording inference_ttft_milliseconds metric") | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -184,7 +181,7 @@ | ||
| if model and endpoint and usage and latency and usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000)) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Error recording inference_output_tokens_per_second metric") | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -209,6 +206,6 @@ | ||
| if usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Error recording inference_tokens_total metric") | ||
|
|
||
| return instrumentation |
…s (after nginx timeout) - add output speed generation metric
| push: | ||
| branches: | ||
| - main | ||
| - 624-add-llm-oriented-metrics-to-grafana-dashboard |
There was a problem hiding this comment.
NB: This will be removed in the future. It aims to allow a deployment from a specific branch.
| model=model, | ||
| status_code=info.modified_status, | ||
| ).inc() | ||
| except Exception: |
Check notice
Code scanning / CodeQL
Empty except Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
In general, empty except blocks should be replaced with handling that either (a) narrows the exception type and/or (b) logs the error and, if appropriate, re-raises or returns a safe default. For metrics middleware, we typically want to ensure that exceptions in metrics code never interfere with request processing, but we should still log them so they can be diagnosed.
The best fix here is to keep the try/except around the metric updates, but replace the except Exception: pass blocks with a handler that logs the exception, scoped clearly as a metrics failure. Since this is FastAPI/Prometheus code, using the standard library logging module is appropriate and doesn’t introduce external dependencies. We’ll add a module-level logger (e.g. logger = logging.getLogger(__name__)) and in each except Exception: block call logger.exception(...) with a short message explaining which metric failed. This preserves the existing behavior of not raising beyond the instrumentation function while eliminating the silent failure.
Concretely:
- In
api/helpers/_metricsmiddleware.py, addimport loggingand alogger = logging.getLogger(__name__)definition near the top. - In
inference_requests_total.instrumentation, replace theexcept Exception: passwithexcept Exception: logger.exception("Failed to record inference_requests_total metric"). - In
inference_requests_duration_seconds.instrumentation, replace similarly with a message like"Failed to record inference_requests_duration_seconds metric". - In
inference_output_tokens_per_second.instrumentation, replace with"Failed to record inference_output_tokens_per_second metric".
No new methods are needed beyond the logger definition; no change in function signatures or existing metric logic is required.
| @@ -1,11 +1,14 @@ | ||
| from collections.abc import Callable | ||
| import logging | ||
|
|
||
| from prometheus_client import Counter, Histogram | ||
| from prometheus_fastapi_instrumentator.metrics import Info | ||
|
|
||
| from api.utils.context import request_context | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def _build_metric_name(namespace: str, name: str) -> str: | ||
| return f"{namespace}_{name}" if namespace else name | ||
|
|
||
| @@ -30,7 +26,7 @@ | ||
| status_code=info.modified_status, | ||
| ).inc() | ||
| except Exception: | ||
| pass | ||
| logger.exception("Failed to record inference_requests_total metric") | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -95,7 +91,7 @@ | ||
| status_code=info.modified_status, | ||
| ).observe(latency / 1000) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Failed to record inference_requests_duration_seconds metric") | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -184,7 +180,7 @@ | ||
| if model and endpoint and usage and latency and usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000)) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Failed to record inference_output_tokens_per_second metric") | ||
|
|
||
| return instrumentation | ||
|
|
| model=model, | ||
| status_code=info.modified_status, | ||
| ).observe(latency / 1000) | ||
| except Exception: |
Check notice
Code scanning / CodeQL
Empty except Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
General approach: keep the “do not break the request due to metrics failures” behavior, but avoid completely silent exception handling. Add a brief comment stating that errors in metrics should not affect the main flow and log the exception in a non-intrusive way (e.g., via the standard logging module).
Concrete fix:
- In
api/helpers/_metricsmiddleware.py, add an import for the standard-libraryloggingmodule. - Replace the two
except Exception: passblocks inside:inference_requests_duration_seconds(...).instrumentationinference_ttft_milliseconds(...).instrumentation
- With
except Exception:blocks that:- include a short comment explaining that metrics errors are intentionally ignored for request safety, and
- log the exception with
logging.getLogger(__name__).exception(...), e.g.logging.getLogger(__name__).exception("Failed to record inference request duration metric").
This preserves existing functionality (no exception propagates to the caller), but prevents completely silent failures and documents the intent.
Specific locations:
- Add
import loggingnear the top ofapi/helpers/_metricsmiddleware.py. - Modify lines 97–98 and 162–163 accordingly.
No additional non-standard dependencies are needed; logging is from the Python standard library.
| @@ -4,6 +4,7 @@ | ||
| from prometheus_fastapi_instrumentator.metrics import Info | ||
|
|
||
| from api.utils.context import request_context | ||
| import logging | ||
|
|
||
|
|
||
| def _build_metric_name(namespace: str, name: str) -> str: | ||
| @@ -95,7 +96,10 @@ | ||
| status_code=info.modified_status, | ||
| ).observe(latency / 1000) | ||
| except Exception: | ||
| pass | ||
| # Metrics collection must not interfere with request handling; log and continue. | ||
| logging.getLogger(__name__).exception( | ||
| "Failed to record inference request duration metric" | ||
| ) | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -160,7 +164,10 @@ | ||
| status_code=info.modified_status, | ||
| ).observe(ttft) | ||
| except Exception: | ||
| pass | ||
| # Metrics collection must not interfere with request handling; log and continue. | ||
| logging.getLogger(__name__).exception( | ||
| "Failed to record inference TTFT metric" | ||
| ) | ||
|
|
||
| return instrumentation | ||
|
|
| model=model, | ||
| status_code=info.modified_status, | ||
| ).observe(ttft) | ||
| except Exception: |
Check notice
Code scanning / CodeQL
Empty except Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
To fix the problem, keep the broad except Exception to protect the main request handling from metric failures, but replace the empty body with minimal logging that records the error. This preserves existing behavior (exceptions are not re-raised) while avoiding silent failure. Since we must not change existing imports except to add well-known libraries, the least intrusive approach is to use the standard-library logging module.
Concretely, in api/helpers/_metricsmiddleware.py:
-
Add
import loggingnear the top of the file alongside the existing imports. -
In each
instrumentationinner function that currently has:except Exception: pass
replace it with a logging call, for example:
except Exception: logging.getLogger(__name__).exception( "Error while recording %s metric", "<metric_name>" )
where
<metric_name>is a short identifier like"inference_requests_total","inference_ttft_milliseconds", or"inference_tokens_total"corresponding to the function.
This way, any unexpected issues in metric collection are visible in logs, but they still do not interfere with normal request processing. No additional methods or helper functions are required beyond the standard logging import.
| @@ -4,6 +4,7 @@ | ||
| from prometheus_fastapi_instrumentator.metrics import Info | ||
|
|
||
| from api.utils.context import request_context | ||
| import logging | ||
|
|
||
|
|
||
| def _build_metric_name(namespace: str, name: str) -> str: | ||
| @@ -30,7 +31,9 @@ | ||
| status_code=info.modified_status, | ||
| ).inc() | ||
| except Exception: | ||
| pass | ||
| logging.getLogger(__name__).exception( | ||
| "Error while recording inference_requests_total metric" | ||
| ) | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -160,7 +163,9 @@ | ||
| status_code=info.modified_status, | ||
| ).observe(ttft) | ||
| except Exception: | ||
| pass | ||
| logging.getLogger(__name__).exception( | ||
| "Error while recording inference_ttft_milliseconds metric" | ||
| ) | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -209,6 +214,8 @@ | ||
| if usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens) | ||
| except Exception: | ||
| pass | ||
| logging.getLogger(__name__).exception( | ||
| "Error while recording inference_tokens_total metric" | ||
| ) | ||
|
|
||
| return instrumentation |
| latency = context.latency | ||
| if model and endpoint and usage and latency and usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000)) | ||
| except Exception: |
Check notice
Code scanning / CodeQL
Empty except Note
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 months ago
In general, the fix is to avoid completely empty except blocks. Either narrow the exception type and handle it appropriately or, if you must catch broad Exception, at least log it or add an explicit comment justifying an intentional ignore.
For this file, the best fix that does not change existing functionality for callers is:
- Keep catching
Exceptionto avoid breaking the request due to metrics failures. - Add lightweight logging of the exception with enough context (which metric instrumentation failed).
- Re-raise is not appropriate here because we want metrics failures to be non-fatal; instead, we just log and continue.
Concretely:
- Introduce a logger at module level using the standard library
loggingmodule (a well-known dependency). - In each
instrumentationfunction’sexcept Exception:block (inference_ttft_milliseconds,inference_output_tokens_per_second,inference_tokens_total), replacepasswith alogger.exception(...)call that records the failure, possibly including the metric name or function name as context. - This requires adding
import loggingand defininglogger = logging.getLogger(__name__)near the top ofapi/helpers/_metricsmiddleware.py.
| @@ -1,11 +1,14 @@ | ||
| from collections.abc import Callable | ||
| import logging | ||
|
|
||
| from prometheus_client import Counter, Histogram | ||
| from prometheus_fastapi_instrumentator.metrics import Info | ||
|
|
||
| from api.utils.context import request_context | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def _build_metric_name(namespace: str, name: str) -> str: | ||
| return f"{namespace}_{name}" if namespace else name | ||
|
|
||
| @@ -160,7 +156,7 @@ | ||
| status_code=info.modified_status, | ||
| ).observe(ttft) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Failed to record inference TTFT metric") | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -184,7 +180,7 @@ | ||
| if model and endpoint and usage and latency and usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000)) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Failed to record inference output tokens per second metric") | ||
|
|
||
| return instrumentation | ||
|
|
||
| @@ -209,6 +205,6 @@ | ||
| if usage.completion_tokens: | ||
| metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens) | ||
| except Exception: | ||
| pass | ||
| logger.exception("Failed to record inference tokens total metric") | ||
|
|
||
| return instrumentation |
|
Close due to rebase in #768 |




Add LLM-oriented metrics to Grafana
This PR aims to add LLM-specific metrics to a Grafana Dashboard with models and endpoint details so we can create a link between requests and models (instead of just pure LLM metrics - e.g. the vLLM dashboard that does provide information about requests).
Major Updates
Added a
_metricsmiddleware.pyfile that contains 5 new custom LLM-oriented metrics on the API Prometheus:inference_requests_total: A counter that tracks the total number of LLM inference requests, labeled by endpoint, model, and HTTP status code.inference_requests_duration_seconds: A histogram that measures the end-to-end duration of LLM requests in seconds, labeled by endpoint, model, and status code, with fine-grained buckets ranging from 50ms to 5 minutes.inference_ttft_milliseconds: A histogram that measures the time to first token (TTFT) for streaming LLM responses in milliseconds, labeled by endpoint, model, and status code, with buckets ranging from 5ms to 5minutes.
inference_output_tokens_per_second: A histogram that measures the output generation speed in tokens per second (completion tokens divided by total request duration), labeled by endpoint and model.inference_tokens_total: A counter that tracks the total number of tokens consumed, labeled by endpoint, model, and token type (prompt or completion).Added a Grafana dashboard (title: "Inference") that contains several rows:
Traffic: total request count, request rate, and success rate stat panels; a bar gauge of requests broken down by model; time series of request rate and error rate per model & status code.Latency: time series and bar gauge of end-to-end request duration per model at a configurable percentile (p50–p99).Time To First Token (TTFT): overall TTFT stat, bar gauge of TTFT per model, and a time series of TTFT evolution per model, all at the selected percentile.Tokens: total prompt, completion, and combined token count stats; bar gauges of prompt and completion tokens per model; time series of prompt and completion token rates per model.Output Generation Speed: overall tokens/s stat, bar gauge of generation speed per model, and a time series of generation speed evolution per model, all at the selected percentile.datasource,model,endpoint, andpercentile.Warning
inference_requests_duration_seconds,inference_output_tokens_per_second) have a slight overhead, we should be careful and check the cardinality / storage in memory (there are over 20-30 buckets per histogram on some of the metrics). This can cause storage saturation or monitoring server crash, especially if we increase the Prometheus retention duration.Output Generation Speed (p95, tokens/s)panel, the modelmistral-medium-2508is often pretty high (>1000 tokens/sec). This seems unrealistic but it seems that this behavior is caused by the KV cache due to my similar prompts during the testing phase. We should check this behavior in production.Request Duration by Model (p95)panel, the modelmistralai/Ministral-3-8B-Instruct-2512seems to always display the same similar value: about8.90s. This seems too consistent for different prompt./chat/completionshave not been tested yet. They shouldn't cause any problem.How to tests
The tests have already been deployed and tests on the
dev(andstagingin progress) environments with the latest API and Grafana dashboards versions (of this branch).Dev Grafana: http://albert.monitoring.001.dev.etalab.gouv.fr/d/opengatellm-inference/inference
Staging Grafana : https://albert.monitoring.001.staging.etalab.gouv.fr/d/opengatellm-inference/inference
Note: The display can be different between the above screenshots and the dashboards on the deployed environments (e.g. the screenshot below). This can be due to a different Grafana version between local and deployed environments.
Minor Updates
/healthendpoint has been took out of the metrics function and got its own endpoint file.@staticmethodwhen applicable.Note Bene
624-add-llm-oriented-metrics-to-grafana-dashboardbranch has been added into the GitHub CI to deploy this specific branch without having to merge it.