feat(monitoring): add llm oriented metrics to grafana dashboard by tibo-pdn · Pull Request #703 · etalab-ia/OpenGateLLM

tibo-pdn · 2026-02-23T22:56:57Z

Add LLM-oriented metrics to Grafana

This PR adds LLM-specific metrics to a Grafana dashboard, with model and endpoint details, so that requests can be linked to models (instead of relying only on pure LLM metrics, e.g. the vLLM dashboard that does provide information about requests).

Major Updates

Added a _metricsmiddleware.py file that defines 5 new custom LLM-oriented metrics exposed by the API to Prometheus:
1. inference_requests_total: A counter that tracks the total number of LLM inference requests, labeled by endpoint, model, and HTTP status code.
2. inference_requests_duration_seconds: A histogram that measures the end-to-end duration of LLM requests in seconds, labeled by endpoint, model, and status code, with fine-grained buckets ranging from 50ms to 5 minutes.
3. inference_ttft_milliseconds: A histogram that measures the time to first token (TTFT) for streaming LLM responses in milliseconds, labeled by endpoint, model, and status code, with buckets ranging from 5ms to 5 minutes.
4. inference_output_tokens_per_second: A histogram that measures the output generation speed in tokens per second (completion tokens divided by total request duration), labeled by endpoint and model.
5. inference_tokens_total: A counter that tracks the total number of tokens consumed, labeled by endpoint, model, and token type (prompt or completion).
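
As a rough illustration of the five metrics listed above, the sketch below declares them with `prometheus_client`. The help strings, the dedicated `registry`, the example label values, and all bucket boundaries are assumptions made for this example: the PR description gives only the overall bucket ranges, not the exact values.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# Dedicated registry for the example (assumption), keeping the default one untouched.
registry = CollectorRegistry()

# Illustrative bucket boundaries; the PR gives only the overall ranges.
DURATION_BUCKETS_S = (0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300)
TTFT_BUCKETS_MS = (5, 10, 25, 50, 100, 250, 500, 1_000, 5_000, 30_000, 300_000)
SPEED_BUCKETS_TPS = (1, 5, 10, 25, 50, 100, 250, 500, 1_000)

REQUEST_LABELS = ["endpoint", "model", "status_code"]

inference_requests_total = Counter(
    "inference_requests_total", "Total LLM inference requests.",
    REQUEST_LABELS, registry=registry,
)
inference_requests_duration_seconds = Histogram(
    "inference_requests_duration_seconds", "End-to-end request duration (s).",
    REQUEST_LABELS, buckets=DURATION_BUCKETS_S, registry=registry,
)
inference_ttft_milliseconds = Histogram(
    "inference_ttft_milliseconds", "Time to first token (ms).",
    REQUEST_LABELS, buckets=TTFT_BUCKETS_MS, registry=registry,
)
inference_output_tokens_per_second = Histogram(
    "inference_output_tokens_per_second", "Output generation speed (tokens/s).",
    ["endpoint", "model"], buckets=SPEED_BUCKETS_TPS, registry=registry,
)
inference_tokens_total = Counter(
    "inference_tokens_total", "Total tokens consumed.",
    ["endpoint", "model", "type"], registry=registry,
)

# Recording one hypothetical request.
labels = {"endpoint": "/chat/completions", "model": "example-model", "status_code": "200"}
inference_requests_total.labels(**labels).inc()
inference_requests_duration_seconds.labels(**labels).observe(1.8)
inference_tokens_total.labels(
    endpoint=labels["endpoint"], model=labels["model"], type="completion"
).inc(120)
```
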
Added a Grafana dashboard (title: "Inference") that contains several rows:
1. Traffic: total request count, request rate, and success rate stat panels; a bar gauge of requests broken down by model; time series of request rate and error rate per model & status code.
2. Latency: time series and bar gauge of end-to-end request duration per model at a configurable percentile (p50–p99).
3. Time To First Token (TTFT): overall TTFT stat, bar gauge of TTFT per model, and a time series of TTFT evolution per model, all at the selected percentile.
4. Tokens: total prompt, completion, and combined token count stats; bar gauges of prompt and completion tokens per model; time series of prompt and completion token rates per model.
5. Output Generation Speed: overall tokens/s stat, bar gauge of generation speed per model, and a time series of generation speed evolution per model, all at the selected percentile.

The dashboard includes template variables to filter by datasource, model, endpoint, and percentile.
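
The panel queries are not shown in this description. Percentile panels built on Prometheus histograms are commonly computed with PromQL's `histogram_quantile`, which interpolates linearly inside the bucket that contains the requested rank. Assuming the dashboard works this way, the following standalone sketch reproduces that estimate; the function name and the sample bucket counts are hypothetical.

```python
import math


def estimate_percentile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with at least two entries and ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the requested rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - previous) / (cumulative - previous)


# Hypothetical request-duration buckets (seconds) with cumulative counts.
sample = [(0.5, 10), (1.0, 60), (2.5, 90), (math.inf, 100)]
p50 = estimate_percentile(0.50, sample)
p95 = estimate_percentile(0.95, sample)
```

Because the estimate can never exceed the highest finite bucket bound, the bucket layout directly limits how precise the p95 and p99 panels can be.
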

Warning

The Grafana dashboard thresholds (red, orange, green colors) should reflect the target SLAs (e.g. TTFT, output generation speed).
Some metrics (e.g. inference_requests_duration_seconds, inference_output_tokens_per_second) add a slight overhead. We should keep an eye on their cardinality and memory footprint (some histograms have more than 20-30 buckets), since this can saturate storage or crash the monitoring server, especially if we increase the Prometheus retention duration.
On the Output Generation Speed (p95, tokens/s) panel, the model mistral-medium-2508 is often very high (>1000 tokens/s). This seems unrealistic; it appears to be caused by the KV cache, since my prompts were similar during the testing phase. We should check this behavior in production.
On the Request Duration by Model (p95) panel, the model mistralai/Ministral-3-8B-Instruct-2512 seems to always display nearly the same value: about 8.90s. This seems too consistent for different prompts.
Endpoints other than /chat/completions have not been tested yet. They shouldn't cause any problems.
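
To make the cardinality warning above more concrete, here is a back-of-the-envelope sketch. A Prometheus histogram exposes one `_bucket` series per finite bucket, one for `+Inf`, plus `_sum` and `_count`, for every label combination. The function name and the numbers of models, endpoints, and status codes are purely hypothetical, and the extra `_created` series that some clients expose is ignored.

```python
def histogram_series_count(finite_buckets: int, label_combinations: int) -> int:
    """Rough number of time series exposed by one Prometheus histogram."""
    # finite buckets + the +Inf bucket + _sum + _count, per label combination
    return (finite_buckets + 3) * label_combinations


# Hypothetical deployment: 10 models x 3 endpoints x 5 status codes.
combinations = 10 * 3 * 5
series = histogram_series_count(finite_buckets=30, label_combinations=combinations)
print(series)  # 4950
```
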

How to test

The changes have already been deployed and tested on the dev environment (staging is in progress) with the latest API and Grafana dashboard versions from this branch.

Dev Grafana: http://albert.monitoring.001.dev.etalab.gouv.fr/d/opengatellm-inference/inference
Staging Grafana : https://albert.monitoring.001.staging.etalab.gouv.fr/d/opengatellm-inference/inference

Note: The display may differ between the screenshots above and the dashboards on the deployed environments (e.g. the screenshot below). This may be due to a different Grafana version between the local and deployed environments.

Minor Updates

The /health endpoint has been taken out of the metrics function and given its own endpoint file.
The PR template developed many weeks ago has been renamed so that it works (the behaviour will be checked in the future).
Outdated inline documentation has been removed.
An unused variable has been removed.
Some methods have been set as @staticmethod when applicable.

Nota Bene

The 624-add-llm-oriented-metrics-to-grafana-dashboard branch has been added into the GitHub CI to deploy this specific branch without having to merge it.

+                    metric.labels(endpoint=endpoint, model=model, type="prompt").inc(usage.prompt_tokens)
+                if usage.completion_tokens:
+                    metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens)
+        except Exception:


In general, the fix is to stop silently swallowing all exceptions. For non-critical metrics code, the usual pattern is: keep the broad except Exception (so metrics never break requests) but add lightweight logging in the handler so that failures are visible. This maintains existing behavior (no exception propagation) but avoids losing information.

The best fix here is to:

Keep the try/except Exception: structure so that metrics failures never affect the main application.

In each except block, call a logger to record the exception with context (e.g., which instrumentation function failed).

Reuse a single module-level logger (using Python’s standard logging module) so that the rest of the system can route these logs appropriately.

Concretely in api/helpers/_metricsmiddleware.py:

Add import logging at the top and define logger = logging.getLogger(__name__) after the imports.

For each of the four instrumentation functions shown, replace except Exception:\n pass with except Exception:\n logger.exception("..."), using a message that identifies the specific metric (e.g., "Error recording inference_requests_total metric"). This keeps external behavior the same (no raised exceptions), but ensures errors are visible.

No additional third‑party dependencies are needed; we use Python’s built‑in logging module.
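
The pattern described above can be sketched with the standard library alone. The helper `record_safely` and the failing update are hypothetical and only illustrate the "log but never raise" behaviour; the actual instrumentation functions in the PR inline the try/except instead.

```python
import logging
from collections.abc import Callable

# Module-level logger, as suggested in the review comment.
logger = logging.getLogger(__name__)


def record_safely(metric_name: str, record: Callable[[], None]) -> None:
    """Run a metric update; log any failure instead of propagating it."""
    try:
        record()
    except Exception:
        # Metrics must never break request handling, but failures stay visible.
        logger.exception("Error recording %s metric", metric_name)


def failing_update() -> None:
    """Stand-in for a metric update that goes wrong."""
    raise ValueError("label value missing")


# The exception is logged with its traceback and the call returns normally.
record_safely("inference_requests_total", failing_update)
```

`logger.exception` logs at ERROR level and attaches the current traceback, so it should only be called from inside an `except` block.
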

tibo-pdn · 2026-03-01T21:52:58Z

  push:
    branches:
      - main
+      - 624-add-llm-oriented-metrics-to-grafana-dashboard


NB: This will be removed in the future. It aims to allow a deployment from a specific branch.

tibo-pdn · 2026-03-02T15:23:57Z

Some work has been done so that the dashboard still renders properly when there is no data. Below are several screenshots of the same dashboard when the time series are empty.

+                    model=model,
+                    status_code=info.modified_status,
+                ).inc()
+        except Exception:


In general, empty except blocks should be replaced with handling that either (a) narrows the exception type and/or (b) logs the error and, if appropriate, re-raises or returns a safe default. For metrics middleware, we typically want to ensure that exceptions in metrics code never interfere with request processing, but we should still log them so they can be diagnosed.

The best fix here is to keep the try/except around the metric updates, but replace the except Exception: pass blocks with a handler that logs the exception, scoped clearly as a metrics failure. Since this is FastAPI/Prometheus code, using the standard library logging module is appropriate and doesn’t introduce external dependencies. We’ll add a module-level logger (e.g. logger = logging.getLogger(__name__)) and in each except Exception: block call logger.exception(...) with a short message explaining which metric failed. This preserves the existing behavior of not raising beyond the instrumentation function while eliminating the silent failure.

Concretely:

In api/helpers/_metricsmiddleware.py, add import logging and a logger = logging.getLogger(__name__) definition near the top.

In inference_requests_total.instrumentation, replace the except Exception: pass with except Exception: logger.exception("Failed to record inference_requests_total metric").

In inference_requests_duration_seconds.instrumentation, replace similarly with a message like "Failed to record inference_requests_duration_seconds metric".

In inference_output_tokens_per_second.instrumentation, replace with "Failed to record inference_output_tokens_per_second metric".

No new methods are needed beyond the logger definition; no change in function signatures or existing metric logic is required.

+                    model=model,
+                    status_code=info.modified_status,
+                ).observe(latency / 1000)
+        except Exception:


General approach: keep the “do not break the request due to metrics failures” behavior, but avoid completely silent exception handling. Add a brief comment stating that errors in metrics should not affect the main flow and log the exception in a non-intrusive way (e.g., via the standard logging module).

Concrete fix:

In api/helpers/_metricsmiddleware.py, add an import for the standard-library logging module.

Replace the two except Exception: pass blocks inside:

inference_requests_duration_seconds(...).instrumentation

inference_ttft_milliseconds(...).instrumentation

With except Exception: blocks that:

include a short comment explaining that metrics errors are intentionally ignored for request safety, and

log the exception with logging.getLogger(__name__).exception(...), e.g. logging.getLogger(__name__).exception("Failed to record inference request duration metric").

This preserves existing functionality (no exception propagates to the caller), but prevents completely silent failures and documents the intent.

Specific locations:

Add import logging near the top of api/helpers/_metricsmiddleware.py.

Modify lines 97–98 and 162–163 accordingly.

No additional non-standard dependencies are needed; logging is from the Python standard library.

+                    model=model,
+                    status_code=info.modified_status,
+                ).observe(ttft)
+        except Exception:


To fix the problem, keep the broad except Exception to protect the main request handling from metric failures, but replace the empty body with minimal logging that records the error. This preserves existing behavior (exceptions are not re-raised) while avoiding silent failure. Since we must not change existing imports except to add well-known libraries, the least intrusive approach is to use the standard-library logging module.

Concretely, in api/helpers/_metricsmiddleware.py:

Add import logging near the top of the file alongside the existing imports.

In each instrumentation inner function that currently has:

except Exception: pass

replace it with a logging call, for example:

except Exception: logging.getLogger(__name__).exception( "Error while recording %s metric", "<metric_name>" )

where <metric_name> is a short identifier like "inference_requests_total", "inference_ttft_milliseconds", or "inference_tokens_total" corresponding to the function.

This way, any unexpected issues in metric collection are visible in logs, but they still do not interfere with normal request processing. No additional methods or helper functions are required beyond the standard logging import.
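
For context on the `ttft` value observed in the diff above, the sketch below shows one possible way to time the first chunk of a streamed response. It is not the PR's implementation: the wrapper `stream_with_ttft`, the `on_ttft` callback, and the fake stream are all hypothetical.

```python
import time
from collections.abc import Callable, Iterable, Iterator


def stream_with_ttft(
    chunks: Iterable[str], on_ttft: Callable[[float], None]
) -> Iterator[str]:
    """Yield chunks unchanged and report the time to first chunk in milliseconds."""
    # The clock starts when the consumer first pulls from this generator.
    start = time.perf_counter()
    first = True
    for chunk in chunks:
        if first:
            on_ttft((time.perf_counter() - start) * 1000)
            first = False
        yield chunk


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response."""
    time.sleep(0.01)  # simulated prompt processing before the first token
    yield "Hello"
    yield " world"


measurements: list[float] = []
text = "".join(stream_with_ttft(fake_stream(), measurements.append))
```

In a real middleware the callback would call `.observe(...)` on the TTFT histogram, wrapped in the same logged try/except as the other metrics.
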

+            latency = context.latency
+            if model and endpoint and usage and latency and usage.completion_tokens:
+                metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000))
+        except Exception:


In general, the fix is to avoid completely empty except blocks. Either narrow the exception type and handle it appropriately or, if you must catch broad Exception, at least log it or add an explicit comment justifying an intentional ignore.

For this file, the best fix that does not change existing functionality for callers is:

Keep catching Exception to avoid breaking the request due to metrics failures.

Add lightweight logging of the exception with enough context (which metric instrumentation failed).

Re-raise is not appropriate here because we want metrics failures to be non-fatal; instead, we just log and continue.

Concretely:

Introduce a logger at module level using the standard library logging module (a well-known dependency).

In each instrumentation function’s except Exception: block (inference_ttft_milliseconds, inference_output_tokens_per_second, inference_tokens_total), replace pass with a logger.exception(...) call that records the failure, possibly including the metric name or function name as context.

This requires adding import logging and defining logger = logging.getLogger(__name__) near the top of api/helpers/_metricsmiddleware.py.
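
The computation guarded by that try/except is simple enough to check by hand. The sketch below mirrors the expression in the diff, assuming (as the division by 1000 suggests) that the latency is expressed in milliseconds; the function name and sample values are hypothetical.

```python
from typing import Optional


def output_tokens_per_second(
    completion_tokens: Optional[int], latency_ms: Optional[float]
) -> Optional[float]:
    """Completion tokens divided by the total request duration in seconds."""
    # Mirror the guard in the diff: skip when either value is missing or zero.
    if not completion_tokens or not latency_ms:
        return None
    return completion_tokens / (latency_ms / 1000)


typical = output_tokens_per_second(450, 9000)  # 450 tokens in 9 s
fast = output_tokens_per_second(300, 250)      # 300 tokens in 0.25 s
print(typical, fast)  # 50.0 1200.0
```

Since the denominator is the whole request duration rather than the decoding time alone, any request that returns quickly pushes the value up, which may be relevant to the >1000 tokens/s readings mentioned in the PR description.
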

leoguillaume · 2026-03-03T17:25:36Z

Closed due to rebase in #768

tibo-pdn added 4 commits February 20, 2026 00:07

feat(monitoring): add inference_ Prometheus metrics

defac34

Merge branch 'main' of github.com:etalab-ia/OpenGateLLM into 624-add-llm-oriented-metrics-to-grafana-dashboard

85b1949

fix(health): put /health endpoint in its own router

87ba173

chore(github): rename PR template file to make it appear when opening a new PR

a6f54cf

tibo-pdn self-assigned this Feb 23, 2026

github-advanced-security AI found potential problems Feb 23, 2026

View reviewed changes

Comment thread api/helpers/_metricsmiddleware.py Fixed

feat(metrics): refactor metrics with non OOP paradigm

eecc39b

github-advanced-security AI found potential problems Feb 24, 2026

View reviewed changes

tibo-pdn and others added 4 commits February 24, 2026 13:52

chore(deploy): add new temp branch into build_and_deploy.yml workflow

04c6047

feat(metrics): update metrics buckets

d6d2bcd

Update pyproject.toml version

0365ae4

feat(monitoring): add more buckets to metrics to extend until 5minutes (after nginx timeout) - add output speed generation metric

9ae724b

github-advanced-security AI found potential problems Mar 1, 2026

View reviewed changes

Comment thread api/helpers/_metricsmiddleware.py Fixed

tibo-pdn added the improvment label Mar 1, 2026

tibo-pdn commented Mar 1, 2026

View reviewed changes

tibo-pdn requested review from benjaminpilia and leoguillaume March 1, 2026 22:05

feat(monitoring): add prefix to all ogl metrics

0e604d5

github-advanced-security AI found potential problems Mar 3, 2026

View reviewed changes

leoguillaume marked this pull request as ready for review March 3, 2026 16:56

leoguillaume changed the title ~~624 add llm oriented metrics to grafana dashboard~~ feat(monitoring): add llm oriented metrics to grafana dashboard Mar 3, 2026

leoguillaume closed this Mar 3, 2026

leoguillaume deleted the 624-add-llm-oriented-metrics-to-grafana-dashboard branch March 3, 2026 17:37

@@ -1,11 +1,14 @@
             from collections.abc import Callable
+            import logging
             from prometheus_client import Counter, Histogram
             from prometheus_fastapi_instrumentator.metrics import Info
             from api.utils.context import request_context
+            logger = logging.getLogger(__name__)
             def _build_metric_name(namespace: str, name: str) -> str:
                 return f"{namespace}_{name}" if namespace else name
@@ -30,7 +27,7 @@
                                 status_code=info.modified_status,
                             ).inc()
                     except Exception:
-                        pass
+                        logger.exception("Error recording inference_requests_total metric")
                 return instrumentation
@@ -160,7 +157,7 @@
                                 status_code=info.modified_status,
                             ).observe(ttft)
                     except Exception:
-                        pass
+                        logger.exception("Error recording inference_ttft_milliseconds metric")
                 return instrumentation
@@ -184,7 +181,7 @@
                         if model and endpoint and usage and latency and usage.completion_tokens:
                             metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000))
                     except Exception:
-                        pass
+                        logger.exception("Error recording inference_output_tokens_per_second metric")
                 return instrumentation
@@ -209,6 +206,6 @@
                             if usage.completion_tokens:
                                 metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens)
                     except Exception:
-                        pass
+                        logger.exception("Error recording inference_tokens_total metric")
                 return instrumentation

@@ -1,11 +1,14 @@
             from collections.abc import Callable
+            import logging
             from prometheus_client import Counter, Histogram
             from prometheus_fastapi_instrumentator.metrics import Info
             from api.utils.context import request_context
+            logger = logging.getLogger(__name__)
             def _build_metric_name(namespace: str, name: str) -> str:
                 return f"{namespace}_{name}" if namespace else name
@@ -160,7 +156,7 @@
                                 status_code=info.modified_status,
                             ).observe(ttft)
                     except Exception:
-                        pass
+                        logger.exception("Failed to record inference TTFT metric")
                 return instrumentation
@@ -184,7 +180,7 @@
                         if model and endpoint and usage and latency and usage.completion_tokens:
                             metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000))
                     except Exception:
-                        pass
+                        logger.exception("Failed to record inference output tokens per second metric")
                 return instrumentation
@@ -209,6 +205,6 @@
                             if usage.completion_tokens:
                                 metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens)
                     except Exception:
-                        pass
+                        logger.exception("Failed to record inference tokens total metric")
                 return instrumentation

tibo-pdn wants to merge 10 commits into main from 624-add-llm-oriented-metrics-to-grafana-dashboard

4 participants
