
feat(monitoring): add llm oriented metrics to grafana dashboard#703

Closed
tibo-pdn wants to merge 10 commits into main from 624-add-llm-oriented-metrics-to-grafana-dashboard

Conversation

@tibo-pdn
Contributor

@tibo-pdn tibo-pdn commented Feb 23, 2026

Add LLM-oriented metrics to Grafana

This PR adds LLM-specific metrics to a Grafana dashboard, labeled with model and endpoint details, so that we can link requests to models (instead of having only pure LLM metrics, e.g. the vLLM dashboard that does provide information about requests).

Major Updates

  • Added a _metricsmiddleware.py file that defines 5 new custom LLM-oriented metrics exposed by the API to Prometheus:

    1. inference_requests_total: A counter that tracks the total number of LLM inference requests, labeled by endpoint, model, and HTTP status code.

    2. inference_requests_duration_seconds: A histogram that measures the end-to-end duration of LLM requests in seconds, labeled by endpoint, model, and status code, with fine-grained buckets ranging from 50ms to 5 minutes.

    3. inference_ttft_milliseconds: A histogram that measures the time to first token (TTFT) for streaming LLM responses in milliseconds, labeled by endpoint, model, and status code, with buckets ranging from 5ms to 5 minutes.

    4. inference_output_tokens_per_second: A histogram that measures the output generation speed in tokens per second (completion tokens divided by total request duration), labeled by endpoint and model.

    5. inference_tokens_total: A counter that tracks the total number of tokens consumed, labeled by endpoint, model, and token type (prompt or completion).

  • Added a Grafana dashboard (title: "Inference") that contains several rows:

    1. Traffic: total request count, request rate, and success rate stat panels; a bar gauge of requests broken down by model; time series of request rate and error rate per model & status code.
       Screenshot 2026-03-02 at 00 11 17
    2. Latency: time series and bar gauge of end-to-end request duration per model at a configurable percentile (p50–p99).
       Screenshot 2026-03-02 at 00 11 32
    3. Time To First Token (TTFT): overall TTFT stat, bar gauge of TTFT per model, and a time series of TTFT evolution per model, all at the selected percentile.
       Screenshot 2026-03-02 at 00 11 48
    4. Tokens: total prompt, completion, and combined token count stats; bar gauges of prompt and completion tokens per model; time series of prompt and completion token rates per model.
       Screenshot 2026-03-02 at 00 12 04
    5. Output Generation Speed: overall tokens/s stat, bar gauge of generation speed per model, and a time series of generation speed evolution per model, all at the selected percentile.
       Screenshot 2026-03-02 at 00 12 16
  • The dashboard includes template variables to filter by datasource, model, endpoint, and percentile.
Screenshot 2026-03-02 at 00 02 57
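
As an illustration of the metrics described above, here is a minimal sketch of how four of them could be declared with prometheus_client. It is not the code from _metricsmiddleware.py: the bucket boundaries, the private registry, and the record_request helper are assumptions made for this example, and the TTFT histogram is left out for brevity.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# Private registry so this sketch does not touch the process-wide default one.
registry = CollectorRegistry()

requests_total = Counter(
    "inference_requests_total",
    "Total number of LLM inference requests",
    labelnames=("endpoint", "model", "status_code"),
    registry=registry,
)

# Bucket boundaries are illustrative (50 ms to 5 min), not the PR's actual values.
duration_seconds = Histogram(
    "inference_requests_duration_seconds",
    "End-to-end duration of LLM requests in seconds",
    labelnames=("endpoint", "model", "status_code"),
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300),
    registry=registry,
)

# Illustrative buckets again; the real ones should follow the target SLAs.
tokens_per_second = Histogram(
    "inference_output_tokens_per_second",
    "Output generation speed in tokens per second",
    labelnames=("endpoint", "model"),
    buckets=(1, 5, 10, 25, 50, 100, 250, 500, 1000),
    registry=registry,
)

tokens_total = Counter(
    "inference_tokens_total",
    "Total number of tokens consumed",
    labelnames=("endpoint", "model", "type"),
    registry=registry,
)


def record_request(endpoint, model, status_code, latency_ms, prompt_tokens, completion_tokens):
    """Hypothetical helper: update the four metrics for one finished request."""
    requests_total.labels(endpoint=endpoint, model=model, status_code=status_code).inc()
    duration_seconds.labels(endpoint=endpoint, model=model, status_code=status_code).observe(latency_ms / 1000)
    if prompt_tokens:
        tokens_total.labels(endpoint=endpoint, model=model, type="prompt").inc(prompt_tokens)
    if completion_tokens:
        tokens_total.labels(endpoint=endpoint, model=model, type="completion").inc(completion_tokens)
        if latency_ms:
            # Completion tokens divided by the total request duration in seconds.
            tokens_per_second.labels(endpoint=endpoint, model=model).observe(completion_tokens / (latency_ms / 1000))
```

For example, record_request("/chat/completions", "demo-model", "200", 2000, 10, 100) would add one request, one 2-second duration observation, 110 tokens, and one 50 tokens/s observation. The millisecond latency input mirrors the latency / 1000 conversion visible in the review threads below.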

Warning

  • The Grafana dashboard thresholds (red, orange, green) should reflect the target SLAs (e.g. TTFT, Output Generation Speed).
  • Some metrics (e.g. inference_requests_duration_seconds, inference_output_tokens_per_second) add a slight overhead. We should check their cardinality and in-memory storage footprint, since some of the histograms have more than 20 to 30 buckets. This can cause storage saturation or a monitoring server crash, especially if we increase the Prometheus retention duration.
  • On the Output Generation Speed (p95, tokens/s) panel, the model mistral-medium-2508 often shows very high values (>1000 tokens/s). This seems unrealistic, and it seems to be caused by the KV cache, since my prompts were similar during the testing phase. We should check this behavior in production.
  • On the Request Duration by Model (p95) panel, the model mistralai/Ministral-3-8B-Instruct-2512 seems to always display nearly the same value: about 8.90s. This seems too consistent across different prompts.
  • Endpoints other than /chat/completions have not been tested yet. They are not expected to cause any problems.
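
To make the cardinality concern above more concrete, here is a rough back-of-the-envelope sketch. It relies on the fact that a Prometheus histogram exposes, per label combination, one _bucket series per explicit bucket, one more for +Inf, plus _sum and _count. The function name and the label cardinalities in the usage note are hypothetical, and the estimate is a worst case in which every label combination actually occurs.

```python
import math


def histogram_series_count(bucket_count: int, label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series exposed by a single histogram.

    bucket_count: number of explicit buckets (the implicit +Inf bucket is added here).
    label_cardinalities: number of distinct values per label name.
    """
    # Every combination of label values gets its own full set of series.
    combinations = math.prod(label_cardinalities.values())
    # Explicit buckets + the +Inf bucket + _sum + _count.
    series_per_combination = bucket_count + 1 + 2
    return combinations * series_per_combination
```

With assumed figures of 25 buckets, 3 endpoints, 10 models, and 5 status codes, histogram_series_count(25, {"endpoint": 3, "model": 10, "status_code": 5}) gives 4200 series for one histogram. Some client libraries also expose an extra _created series, which this sketch ignores.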

How to test

These changes have already been deployed and tested on the dev environment (staging is in progress) with the latest API and Grafana dashboard versions from this branch.

Dev Grafana: http://albert.monitoring.001.dev.etalab.gouv.fr/d/opengatellm-inference/inference
Staging Grafana: https://albert.monitoring.001.staging.etalab.gouv.fr/d/opengatellm-inference/inference

Note: The display can be different between the above screenshots and the dashboards on the deployed environments (e.g. the screenshot below). This can be due to a different Grafana version between local and deployed environments.

Screenshot 2026-03-02 at 00 20 01

Minor Updates

  • The /health endpoint has been taken out of the metrics function and now has its own endpoint file.
  • The PR template developed many weeks ago has been renamed so that it works (the behaviour will be checked in the future).
  • Outdated inline documentation has been removed.
  • An unused variable has been removed.
  • Some methods have been set as @staticmethod when applicable.

Nota Bene

  • The 624-add-llm-oriented-metrics-to-grafana-dashboard branch has been added into the GitHub CI to deploy this specific branch without having to merge it.

@tibo-pdn tibo-pdn self-assigned this Feb 23, 2026
Comment thread api/helpers/_metricsmiddleware.py Fixed
                    metric.labels(endpoint=endpoint, model=model, type="prompt").inc(usage.prompt_tokens)
                if usage.completion_tokens:
                    metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens)
        except Exception:

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 3 months ago

In general, the fix is to stop silently swallowing all exceptions. For non-critical metrics code, the usual pattern is: keep the broad except Exception (so metrics never break requests) but add lightweight logging in the handler so that failures are visible. This maintains existing behavior (no exception propagation) but avoids losing information.

The best fix here is to:

  • Keep the try/except Exception: structure so that metrics failures never affect the main application.
  • In each except block, call a logger to record the exception with context (e.g., which instrumentation function failed).
  • Reuse a single module-level logger (using Python’s standard logging module) so that the rest of the system can route these logs appropriately.

Concretely in api/helpers/_metricsmiddleware.py:

  • Add import logging at the top and define logger = logging.getLogger(__name__) after the imports.
  • For each of the four instrumentation functions shown, replace except Exception:\n pass with except Exception:\n logger.exception("..."), using a message that identifies the specific metric (e.g., "Error recording inference_requests_total metric"). This keeps external behavior the same (no raised exceptions), but ensures errors are visible.

No additional third‑party dependencies are needed; we use Python’s built‑in logging module.
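
The pattern described above (keep the broad except, but log instead of passing silently) can be sketched with the standard library alone. This is a simplified illustration, not the actual middleware: make_instrumentation and its observe argument are hypothetical stand-ins for the metric functions in _metricsmiddleware.py.

```python
import logging
from collections.abc import Callable
from typing import Any

# Single module-level logger, so the application can route these logs as it wishes.
logger = logging.getLogger(__name__)


def make_instrumentation(observe: Callable[[Any], None], metric_name: str) -> Callable[[Any], None]:
    """Wrap a metric update so that its failures are logged but never raised."""

    def instrumentation(info: Any) -> None:
        try:
            observe(info)
        except Exception:
            # Metrics collection must not interfere with request handling:
            # record the full traceback, then carry on.
            logger.exception("Error recording %s metric", metric_name)

    return instrumentation
```

Calling the returned instrumentation with an observe callable that raises returns None as usual and emits one ERROR-level log record with the traceback attached, so the request path is unaffected while the failure stays visible.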

Suggested changeset 1
api/helpers/_metricsmiddleware.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/helpers/_metricsmiddleware.py b/api/helpers/_metricsmiddleware.py
--- a/api/helpers/_metricsmiddleware.py
+++ b/api/helpers/_metricsmiddleware.py
@@ -1,11 +1,14 @@
 from collections.abc import Callable
 
+import logging
 from prometheus_client import Counter, Histogram
 from prometheus_fastapi_instrumentator.metrics import Info
 
 from api.utils.context import request_context
 
+logger = logging.getLogger(__name__)
 
+
 def _build_metric_name(namespace: str, name: str) -> str:
     return f"{namespace}_{name}" if namespace else name
 
@@ -30,7 +27,7 @@
                     status_code=info.modified_status,
                 ).inc()
         except Exception:
-            pass
+            logger.exception("Error recording inference_requests_total metric")
 
     return instrumentation
 
@@ -160,7 +157,7 @@
                     status_code=info.modified_status,
                 ).observe(ttft)
         except Exception:
-            pass
+            logger.exception("Error recording inference_ttft_milliseconds metric")
 
     return instrumentation
 
@@ -184,7 +181,7 @@
             if model and endpoint and usage and latency and usage.completion_tokens:
                 metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000))
         except Exception:
-            pass
+            logger.exception("Error recording inference_output_tokens_per_second metric")
 
     return instrumentation
 
@@ -209,6 +206,6 @@
                 if usage.completion_tokens:
                     metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens)
         except Exception:
-            pass
+            logger.exception("Error recording inference_tokens_total metric")
 
     return instrumentation
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Comment thread api/helpers/_metricsmiddleware.py Fixed
push:
  branches:
    - main
    - 624-add-llm-oriented-metrics-to-grafana-dashboard
Contributor Author


NB: This will be removed in the future. It aims to allow a deployment from a specific branch.

@tibo-pdn
Contributor Author

tibo-pdn commented Mar 2, 2026

Some work has been done so that the dashboard still renders properly when there is no data. Below are several screenshots of the same dashboard when the time series are empty.

Screenshot 2026-03-01 at 12 27 13 Screenshot 2026-03-01 at 12 27 26 Screenshot 2026-03-01 at 12 27 34 Screenshot 2026-03-01 at 12 27 41

                    model=model,
                    status_code=info.modified_status,
                ).inc()
        except Exception:

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 3 months ago

In general, empty except blocks should be replaced with handling that either (a) narrows the exception type and/or (b) logs the error and, if appropriate, re-raises or returns a safe default. For metrics middleware, we typically want to ensure that exceptions in metrics code never interfere with request processing, but we should still log them so they can be diagnosed.

The best fix here is to keep the try/except around the metric updates, but replace the except Exception: pass blocks with a handler that logs the exception, scoped clearly as a metrics failure. Since this is FastAPI/Prometheus code, using the standard library logging module is appropriate and doesn’t introduce external dependencies. We’ll add a module-level logger (e.g. logger = logging.getLogger(__name__)) and in each except Exception: block call logger.exception(...) with a short message explaining which metric failed. This preserves the existing behavior of not raising beyond the instrumentation function while eliminating the silent failure.

Concretely:

  • In api/helpers/_metricsmiddleware.py, add import logging and a logger = logging.getLogger(__name__) definition near the top.
  • In inference_requests_total.instrumentation, replace the except Exception: pass with except Exception: logger.exception("Failed to record inference_requests_total metric").
  • In inference_requests_duration_seconds.instrumentation, replace similarly with a message like "Failed to record inference_requests_duration_seconds metric".
  • In inference_output_tokens_per_second.instrumentation, replace with "Failed to record inference_output_tokens_per_second metric".

No new methods are needed beyond the logger definition; no change in function signatures or existing metric logic is required.

Suggested changeset 1
api/helpers/_metricsmiddleware.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/helpers/_metricsmiddleware.py b/api/helpers/_metricsmiddleware.py
--- a/api/helpers/_metricsmiddleware.py
+++ b/api/helpers/_metricsmiddleware.py
@@ -1,11 +1,14 @@
 from collections.abc import Callable
+import logging
 
 from prometheus_client import Counter, Histogram
 from prometheus_fastapi_instrumentator.metrics import Info
 
 from api.utils.context import request_context
 
+logger = logging.getLogger(__name__)
 
+
 def _build_metric_name(namespace: str, name: str) -> str:
     return f"{namespace}_{name}" if namespace else name
 
@@ -30,7 +26,7 @@
                     status_code=info.modified_status,
                 ).inc()
         except Exception:
-            pass
+            logger.exception("Failed to record inference_requests_total metric")
 
     return instrumentation
 
@@ -95,7 +91,7 @@
                     status_code=info.modified_status,
                 ).observe(latency / 1000)
         except Exception:
-            pass
+            logger.exception("Failed to record inference_requests_duration_seconds metric")
 
     return instrumentation
 
@@ -184,7 +180,7 @@
             if model and endpoint and usage and latency and usage.completion_tokens:
                 metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000))
         except Exception:
-            pass
+            logger.exception("Failed to record inference_output_tokens_per_second metric")
 
     return instrumentation
 
EOF
                    model=model,
                    status_code=info.modified_status,
                ).observe(latency / 1000)
        except Exception:

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 3 months ago

General approach: keep the “do not break the request due to metrics failures” behavior, but avoid completely silent exception handling. Add a brief comment stating that errors in metrics should not affect the main flow and log the exception in a non-intrusive way (e.g., via the standard logging module).

Concrete fix:

  • In api/helpers/_metricsmiddleware.py, add an import for the standard-library logging module.
  • Replace the two except Exception: pass blocks inside:
    • inference_requests_duration_seconds(...).instrumentation
    • inference_ttft_milliseconds(...).instrumentation
  • With except Exception: blocks that:
    • include a short comment explaining that metrics errors are intentionally ignored for request safety, and
    • log the exception with logging.getLogger(__name__).exception(...), e.g. logging.getLogger(__name__).exception("Failed to record inference request duration metric").

This preserves existing functionality (no exception propagates to the caller), but prevents completely silent failures and documents the intent.

Specific locations:

  • Add import logging near the top of api/helpers/_metricsmiddleware.py.
  • Modify lines 97–98 and 162–163 accordingly.

No additional non-standard dependencies are needed; logging is from the Python standard library.


Suggested changeset 1
api/helpers/_metricsmiddleware.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/helpers/_metricsmiddleware.py b/api/helpers/_metricsmiddleware.py
--- a/api/helpers/_metricsmiddleware.py
+++ b/api/helpers/_metricsmiddleware.py
@@ -4,6 +4,7 @@
 from prometheus_fastapi_instrumentator.metrics import Info
 
 from api.utils.context import request_context
+import logging
 
 
 def _build_metric_name(namespace: str, name: str) -> str:
@@ -95,7 +96,10 @@
                     status_code=info.modified_status,
                 ).observe(latency / 1000)
         except Exception:
-            pass
+            # Metrics collection must not interfere with request handling; log and continue.
+            logging.getLogger(__name__).exception(
+                "Failed to record inference request duration metric"
+            )
 
     return instrumentation
 
@@ -160,7 +164,10 @@
                     status_code=info.modified_status,
                 ).observe(ttft)
         except Exception:
-            pass
+            # Metrics collection must not interfere with request handling; log and continue.
+            logging.getLogger(__name__).exception(
+                "Failed to record inference TTFT metric"
+            )
 
     return instrumentation
 
EOF
                    model=model,
                    status_code=info.modified_status,
                ).observe(ttft)
        except Exception:

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 3 months ago

To fix the problem, keep the broad except Exception to protect the main request handling from metric failures, but replace the empty body with minimal logging that records the error. This preserves existing behavior (exceptions are not re-raised) while avoiding silent failure. Since we must not change existing imports except to add well-known libraries, the least intrusive approach is to use the standard-library logging module.

Concretely, in api/helpers/_metricsmiddleware.py:

  1. Add import logging near the top of the file alongside the existing imports.

  2. In each instrumentation inner function that currently has:

    except Exception:
        pass

    replace it with a logging call, for example:

    except Exception:
        logging.getLogger(__name__).exception(
            "Error while recording %s metric", "<metric_name>"
        )

    where <metric_name> is a short identifier like "inference_requests_total", "inference_ttft_milliseconds", or "inference_tokens_total" corresponding to the function.

This way, any unexpected issues in metric collection are visible in logs, but they still do not interfere with normal request processing. No additional methods or helper functions are required beyond the standard logging import.

Suggested changeset 1
api/helpers/_metricsmiddleware.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/helpers/_metricsmiddleware.py b/api/helpers/_metricsmiddleware.py
--- a/api/helpers/_metricsmiddleware.py
+++ b/api/helpers/_metricsmiddleware.py
@@ -4,6 +4,7 @@
 from prometheus_fastapi_instrumentator.metrics import Info
 
 from api.utils.context import request_context
+import logging
 
 
 def _build_metric_name(namespace: str, name: str) -> str:
@@ -30,7 +31,9 @@
                     status_code=info.modified_status,
                 ).inc()
         except Exception:
-            pass
+            logging.getLogger(__name__).exception(
+                "Error while recording inference_requests_total metric"
+            )
 
     return instrumentation
 
@@ -160,7 +163,9 @@
                     status_code=info.modified_status,
                 ).observe(ttft)
         except Exception:
-            pass
+            logging.getLogger(__name__).exception(
+                "Error while recording inference_ttft_milliseconds metric"
+            )
 
     return instrumentation
 
@@ -209,6 +214,8 @@
                 if usage.completion_tokens:
                     metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens)
         except Exception:
-            pass
+            logging.getLogger(__name__).exception(
+                "Error while recording inference_tokens_total metric"
+            )
 
     return instrumentation
EOF
            latency = context.latency
            if model and endpoint and usage and latency and usage.completion_tokens:
                metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000))
        except Exception:

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 3 months ago

In general, the fix is to avoid completely empty except blocks. Either narrow the exception type and handle it appropriately or, if you must catch broad Exception, at least log it or add an explicit comment justifying an intentional ignore.

For this file, the best fix that does not change existing functionality for callers is:

  • Keep catching Exception to avoid breaking the request due to metrics failures.
  • Add lightweight logging of the exception with enough context (which metric instrumentation failed).
  • Re-raise is not appropriate here because we want metrics failures to be non-fatal; instead, we just log and continue.

Concretely:

  • Introduce a logger at module level using the standard library logging module (a well-known dependency).
  • In each instrumentation function’s except Exception: block (inference_ttft_milliseconds, inference_output_tokens_per_second, inference_tokens_total), replace pass with a logger.exception(...) call that records the failure, possibly including the metric name or function name as context.
  • This requires adding import logging and defining logger = logging.getLogger(__name__) near the top of api/helpers/_metricsmiddleware.py.
Suggested changeset 1
api/helpers/_metricsmiddleware.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/api/helpers/_metricsmiddleware.py b/api/helpers/_metricsmiddleware.py
--- a/api/helpers/_metricsmiddleware.py
+++ b/api/helpers/_metricsmiddleware.py
@@ -1,11 +1,14 @@
 from collections.abc import Callable
+import logging
 
 from prometheus_client import Counter, Histogram
 from prometheus_fastapi_instrumentator.metrics import Info
 
 from api.utils.context import request_context
 
+logger = logging.getLogger(__name__)
 
+
 def _build_metric_name(namespace: str, name: str) -> str:
     return f"{namespace}_{name}" if namespace else name
 
@@ -160,7 +156,7 @@
                     status_code=info.modified_status,
                 ).observe(ttft)
         except Exception:
-            pass
+            logger.exception("Failed to record inference TTFT metric")
 
     return instrumentation
 
@@ -184,7 +180,7 @@
             if model and endpoint and usage and latency and usage.completion_tokens:
                 metric.labels(endpoint=endpoint, model=model).observe(usage.completion_tokens / (latency / 1000))
         except Exception:
-            pass
+            logger.exception("Failed to record inference output tokens per second metric")
 
     return instrumentation
 
@@ -209,6 +205,6 @@
                 if usage.completion_tokens:
                     metric.labels(endpoint=endpoint, model=model, type="completion").inc(usage.completion_tokens)
         except Exception:
-            pass
+            logger.exception("Failed to record inference tokens total metric")
 
     return instrumentation
EOF
@leoguillaume leoguillaume marked this pull request as ready for review March 3, 2026 16:56
@leoguillaume leoguillaume changed the title from "624 add llm oriented metrics to grafana dashboard" to "feat(monitoring): add llm oriented metrics to grafana dashboard" Mar 3, 2026
@leoguillaume
Member

Closed due to the rebase in #768.

@leoguillaume leoguillaume deleted the 624-add-llm-oriented-metrics-to-grafana-dashboard branch March 3, 2026 17:37
4 participants