feat(metrics): characterize-metric-anomaly endpoint and MCP tool#63131
feat(metrics): characterize-metric-anomaly endpoint and MCP tool#63131DanielVisca wants to merge 3 commits into
Conversation
MCP UI Apps size report
|
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
products/metrics/mcp/prompts/characterize-metric-anomaly.md:7-12
Numbered lists in prompts can cause confusion for AI consumers that may interpret the numbers as part of a user response. The custom rule prefers bullet points or letters instead.
```suggestion
Read the report in this order:
- `direction` + `change_ratio` + `anomaly_peak`: how bad is it? (`flat` means the windows don't meaningfully differ — widen the anomaly window or check you have the right metric.)
- `onset_time`: when it started. Use this exact timestamp as the pivot for cross-signal correlation.
- `top_movers`: which label values changed. One label value moving alone (e.g. a single pod or shard) points at a localized culprit; everything moving together points at a shared cause upstream.
- `series`: the full baseline+anomaly time series if you need to eyeball the shape.
```
### Issue 2 of 3
products/metrics/backend/tests/test_anomaly.py:86-90
Two distinct invalid-window cases inside one test — the codebase prefers parameterised tests. Each condition (`anomaly_to <= anomaly_from` and `baseline_to > anomaly_from`) would be a separate row in a `@pytest.mark.parametrize` (or `subTest`) table, making it easier to see which guard failed when either assertion breaks.
### Issue 3 of 3
products/metrics/backend/anomaly.py:231
Candidate-key discovery queries only the anomaly window (`anomaly_from` → `anomaly_to`). A label value that was present during the baseline but disappeared during the anomaly — e.g. a pod that crashed or a shard that went offline — will never be discovered here, so its disappearance won't surface in `top_movers`. Extending the scan to the full combined window (`baseline_from` → `anomaly_to`) would catch this class of "drop to zero" signals.
Reviews (1): Last reviewed commit: "chore: update OpenAPI generated types" | Re-trigger Greptile |
| Read the report in this order: | ||
|
|
||
| 1. `direction` + `change_ratio` + `anomaly_peak`: how bad is it? (`flat` means the windows don't meaningfully differ — widen the anomaly window or check you have the right metric.) | ||
| 2. `onset_time`: when it started. Use this exact timestamp as the pivot for cross-signal correlation. | ||
| 3. `top_movers`: which label values changed. One label value moving alone (e.g. a single pod or shard) points at a localized culprit; everything moving together points at a shared cause upstream. | ||
| 4. `series`: the full baseline+anomaly time series if you need to eyeball the shape. |
There was a problem hiding this comment.
Numbered lists in prompts can cause confusion for AI consumers that may interpret the numbers as part of a user response. The custom rule prefers bullet points or letters instead.
| Read the report in this order: | |
| 1. `direction` + `change_ratio` + `anomaly_peak`: how bad is it? (`flat` means the windows don't meaningfully differ — widen the anomaly window or check you have the right metric.) | |
| 2. `onset_time`: when it started. Use this exact timestamp as the pivot for cross-signal correlation. | |
| 3. `top_movers`: which label values changed. One label value moving alone (e.g. a single pod or shard) points at a localized culprit; everything moving together points at a shared cause upstream. | |
| 4. `series`: the full baseline+anomaly time series if you need to eyeball the shape. | |
| Read the report in this order: | |
| - `direction` + `change_ratio` + `anomaly_peak`: how bad is it? (`flat` means the windows don't meaningfully differ — widen the anomaly window or check you have the right metric.) | |
| - `onset_time`: when it started. Use this exact timestamp as the pivot for cross-signal correlation. | |
| - `top_movers`: which label values changed. One label value moving alone (e.g. a single pod or shard) points at a localized culprit; everything moving together points at a shared cause upstream. | |
| - `series`: the full baseline+anomaly time series if you need to eyeball the shape. |
Rule Used: In prompts, avoid using numbers in lists to preven... (source)
Learned From
PostHog/posthog#31624
Prompt To Fix With AI
This is a comment left during a code review.
Path: products/metrics/mcp/prompts/characterize-metric-anomaly.md
Line: 7-12
Comment:
Numbered lists in prompts can cause confusion for AI consumers that may interpret the numbers as part of a user response. The custom rule prefers bullet points or letters instead.
```suggestion
Read the report in this order:
- `direction` + `change_ratio` + `anomaly_peak`: how bad is it? (`flat` means the windows don't meaningfully differ — widen the anomaly window or check you have the right metric.)
- `onset_time`: when it started. Use this exact timestamp as the pivot for cross-signal correlation.
- `top_movers`: which label values changed. One label value moving alone (e.g. a single pod or shard) points at a localized culprit; everything moving together points at a shared cause upstream.
- `series`: the full baseline+anomaly time series if you need to eyeball the shape.
```
**Rule Used:** In prompts, avoid using numbers in lists to preven... ([source](https://app.greptile.com/posthog-org-19734/-/custom-context?memory=1dd45b94-c307-4dd0-a025-5aa16059229b))
**Learned From**
[PostHog/posthog#31624](https://github.com/PostHog/posthog/pull/31624)
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| def test_rejects_inverted_windows(self): | ||
| with self.assertRaises(ValueError): | ||
| self._characterize(anomaly_to=self.anomaly_from - dt.timedelta(minutes=1)) | ||
| with self.assertRaises(ValueError): | ||
| self._characterize(baseline_to=self.anomaly_from + dt.timedelta(minutes=5)) |
There was a problem hiding this comment.
Two distinct invalid-window cases inside one test — the codebase prefers parameterised tests. Each condition (
anomaly_to <= anomaly_from and baseline_to > anomaly_from) would be a separate row in a @pytest.mark.parametrize (or subTest) table, making it easier to see which guard failed when either assertion breaks.
Prompt To Fix With AI
This is a comment left during a code review.
Path: products/metrics/backend/tests/test_anomaly.py
Line: 86-90
Comment:
Two distinct invalid-window cases inside one test — the codebase prefers parameterised tests. Each condition (`anomaly_to <= anomaly_from` and `baseline_to > anomaly_from`) would be a separate row in a `@pytest.mark.parametrize` (or `subTest`) table, making it easier to see which guard failed when either assertion breaks.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| filters: tuple[MetricFilter, ...], | ||
| candidate_keys: tuple[str, ...] | None, | ||
| ) -> tuple[MetricAnomalyDimension, ...]: | ||
| keys = candidate_keys or _discover_candidate_keys(team, metric_name, anomaly_from, anomaly_to) |
There was a problem hiding this comment.
Candidate-key discovery queries only the anomaly window (
anomaly_from → anomaly_to). A label value that was present during the baseline but disappeared during the anomaly — e.g. a pod that crashed or a shard that went offline — will never be discovered here, so its disappearance won't surface in top_movers. Extending the scan to the full combined window (baseline_from → anomaly_to) would catch this class of "drop to zero" signals.
Prompt To Fix With AI
This is a comment left during a code review.
Path: products/metrics/backend/anomaly.py
Line: 231
Comment:
Candidate-key discovery queries only the anomaly window (`anomaly_from` → `anomaly_to`). A label value that was present during the baseline but disappeared during the anomaly — e.g. a pod that crashed or a shard that went offline — will never be discovered here, so its disappearance won't surface in `top_movers`. Extending the scan to the full combined window (`baseline_from` → `anomaly_to`) would catch this class of "drop to zero" signals.
How can I resolve this? If you propose a fix, please make it concise.|
Size Change: 0 B Total Size: 72.4 MB ℹ️ View Unchanged
|
113a13f to
98f6b8e
Compare
4c1ea15 to
53cc0f6
Compare
| filters: tuple[MetricFilter, ...], | ||
| candidate_keys: tuple[str, ...] | None, | ||
| ) -> tuple[MetricAnomalyDimension, ...]: | ||
| keys = candidate_keys or _discover_candidate_keys(team, metric_name, anomaly_from, anomaly_to) |
There was a problem hiding this comment.
Auto-discovery uses only the anomaly window to find candidate keys, which will miss labels that existed in the baseline but disappeared during the anomaly. This could fail to identify the root cause when services/shards stop reporting.
# Bug: discovers from anomaly window only
keys = candidate_keys or _discover_candidate_keys(team, metric_name, anomaly_from, anomaly_to)
# Fix: discover from the full baseline+anomaly span
keys = candidate_keys or _discover_candidate_keys(team, metric_name, baseline_from, anomaly_to)The function already has baseline_from available via the closure (line 105), so pass it instead of anomaly_from to ensure labels from both windows are considered.
| keys = candidate_keys or _discover_candidate_keys(team, metric_name, anomaly_from, anomaly_to) | |
| keys = candidate_keys or _discover_candidate_keys(team, metric_name, baseline_from, anomaly_to) | |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
fb9d3f3 to
56996b6
Compare
| if anomaly_to <= anomaly_from: | ||
| raise ValueError("anomaly_to must be after anomaly_from") | ||
| if baseline_to is None: | ||
| baseline_to = anomaly_from | ||
| if baseline_from is None: | ||
| baseline_from = baseline_to - (anomaly_to - anomaly_from) | ||
| if baseline_to > anomaly_from: | ||
| raise ValueError("the baseline window must end at or before anomaly_from") |
There was a problem hiding this comment.
Missing validation for baseline_from < baseline_to. The code validates that anomaly_to > anomaly_from and baseline_to <= anomaly_from, but never checks if baseline_from < baseline_to.
If a caller explicitly passes baseline_from >= baseline_to, the function will:
- Run the query from
baseline_fromtoanomaly_to - Filter points with
time < anomaly_from_isoas baseline (line 135) - Produce incorrect or empty baseline_values, leading to invalid statistical calculations
if baseline_from >= baseline_to:
raise ValueError("baseline_from must be before baseline_to")Add this check after line 105 to prevent invalid window configurations.
| if anomaly_to <= anomaly_from: | |
| raise ValueError("anomaly_to must be after anomaly_from") | |
| if baseline_to is None: | |
| baseline_to = anomaly_from | |
| if baseline_from is None: | |
| baseline_from = baseline_to - (anomaly_to - anomaly_from) | |
| if baseline_to > anomaly_from: | |
| raise ValueError("the baseline window must end at or before anomaly_from") | |
| if anomaly_to <= anomaly_from: | |
| raise ValueError("anomaly_to must be after anomaly_from") | |
| if baseline_to is None: | |
| baseline_to = anomaly_from | |
| if baseline_from is None: | |
| baseline_from = baseline_to - (anomaly_to - anomaly_from) | |
| if baseline_from >= baseline_to: | |
| raise ValueError("baseline_from must be before baseline_to") | |
| if baseline_to > anomaly_from: | |
| raise ValueError("the baseline window must end at or before anomaly_from") | |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
56996b6 to
25d58a1
Compare
1752d82 to
06bae3f
Compare
There was a problem hiding this comment.
Two unresolved substantive bugs flagged in inline comments: (1) candidate-key discovery only scans the anomaly window, silently missing labels that disappeared during the anomaly (root-cause blind spot); (2) no validation that baseline_from < baseline_to, which can produce silently incorrect statistical output. Both are behavioral correctness issues in new production logic, not style nits.
POST /metrics/characterize compares an anomaly window against a baseline
(default: the equal-length window immediately before) and reports
magnitude (mean/peak/change ratio), direction, the onset bucket (first
bucket beyond 3 baseline stddevs with a 50% relative floor), and the
top label-value movers — drilling into up to four candidate keys,
auto-discovered from the metric's attributes when not given. The
aggregation auto-picks from the metric's OTel type (counter -> rate,
gauge -> avg, histogram -> p95). Exposed as the
characterize-metric-anomaly MCP tool whose prompt tells the agent how
to read the report and to pivot into logs/traces at onset_time.
How to validate manually:
- hogli test products/metrics/backend/tests/test_anomaly.py (8 tests)
- POST /api/environments/:id/metrics/characterize with
{"query": {"metricName": "metrics_rate_limiter_message_lag_seconds",
"anomalyFrom": "<30m ago>"}} returns direction/onset/top_movers
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
9ad6641 to
a4aecc5
Compare
79d8ba1 to
eebda3f
Compare
|
|
||
| # One query over the combined window keeps baseline and anomaly on a | ||
| # single grid; the interval comes from the combined span. | ||
| interval = _pick_interval(baseline_from, anomaly_to) |
There was a problem hiding this comment.
Medium: Unbounded metrics scans
A user with metrics:read can submit an anomalyFrom far in the past and omit candidateKeys; this endpoint then scans the selected metric once for the series, once to discover candidate keys, and up to four more times grouped by labels. Add a maximum combined baseline/anomaly window or query cost guard before running these queries so one API/MCP request cannot monopolize the metrics ClickHouse workload.
PR overviewThis PR adds a metric anomaly characterization capability, including a backend endpoint and corresponding MCP tool for analyzing anomalous metric ranges and related label candidates. There is one open security concern around resource consumption: a user with metrics read access can request a very large historical range and trigger multiple expensive ClickHouse scans in a single request. Because no issues have been addressed yet, the PR still needs a query window or cost guard to prevent one request from monopolizing metrics infrastructure. Open issues (1)
Fixed/addressed: 0 · PR risk: 6/10 |
7dca38d to
eebda3f
Compare

Problem
"Metric X is rising — why?" starts with three questions an agent (or human) currently has to answer with many manual queries: how big is the change, when did it start, and which label values moved. One endpoint should answer all three.
Changes
POST /metrics/characterize+ facadecharacterize_metric_anomaly()+ engine inproducts/metrics/backend/anomaly.py:up/down/flatwith 5% tolerance).service_name, or caller-provided) and ranks label values by behavior change.characterize-metric-anomalyMCP tool; its prompt tells agents how to read the report and to pivot into logs/traces atonset_time.Known limitation (documented):
change_ratiois confusing when the baseline mean is negative (seen with clock-skew message ages);direction/onset/top_moversare unaffected.How did you test this code?
I'm an agent. Automated:
hogli test products/metrics/backend/tests/test_anomaly.py(8 tests: magnitude/onset/movers on a seeded spike, auto key discovery, flat negative-control, window validation, counter auto-pick, API path). End-to-end: ran it against a real induced incident — stopped the local logs consumer ~3 min while traffic flowed, restarted it, called the tool over MCP: it reported directionup, onset within one bucket of the restart, andservice_name=logs-ingestionas the only mover. Reproduce: stopingestion-logsin the TUI for a few minutes, restart, call the tool withanomalyFrom= the stop time.🤖 Agent context
Autonomy: Human-driven (agent-assisted) — directed by @DanielVisca
Stats are deliberately simple (mean/stddev/threshold-onset) — robust enough for triage, no model dependencies; the report shape leaves room to upgrade the detector without changing the wire contract.