feat(metrics): group_by on a shared, zero-filled interval grid (P4)#63126
feat(metrics): group_by on a shared, zero-filled interval grid (P4)#63126DanielVisca wants to merge 1 commit into
Conversation
|
Size Change: 0 B Total Size: 72.4 MB ℹ️ View Unchanged
|
|
| default=list, | ||
| help_text="Labels to split the result into separate series by. Series share one time grid and are capped at the 100 largest.", | ||
| ) | ||
| interval = serializers.ChoiceField( |
There was a problem hiding this comment.
Medium: Unbounded metrics query aggregation
A user with metrics:read can submit interval: "second" over a very wide dateFrom/dateTo range, optionally with many groupBy entries, and force ClickHouse to build a high-cardinality grouped time grid. The LIMIT 10000 in the query and MAX_SERIES_PER_CLAUSE in the facade apply after aggregation, so they don't bound the work done by the database; reject explicit intervals whose bucket count exceeds a server-side cap and cap the number of group-by dimensions before constructing the query.
PR overviewThis pull request adds grouped metrics queries over a shared, zero-filled interval grid, enabling results to be bucketed consistently across time ranges and group-by dimensions. The backend metrics API is being updated to construct these interval-based aggregations for client requests. There is one open security concern: metrics readers can request very fine-grained intervals across large date ranges, potentially combined with multiple group-by fields, causing expensive high-cardinality aggregation work in ClickHouse before result limits take effect. This creates a concrete resource-exhaustion risk for the metrics backend until server-side caps are applied to bucket counts and group-by dimensions. No issues have been fixed yet, so the PR still needs a bounding control before it is in a good security state. Open issues (1)
Fixed/addressed: 0 · PR risk: 6/10 |
cf2ee47 to
8465408
Compare
d70e0ae to
22d92bf
Compare
8465408 to
9dc3731
Compare
22d92bf to
7762afe
Compare
There was a problem hiding this comment.
The bot reviewer raised a substantive unresolved concern: passing interval: "second" with a wide date range forces ClickHouse to aggregate across millions of time buckets before any limit applies — MAX_SERIES_PER_CLAUSE=100 and LIMIT 10000 only truncate the output, not the DB work. The current diff adds no bucket-count validation, so the vulnerability is still present.
🕸️ Eager graphHow much code each root forces the browser to download and decode through static imports — the regression class total bundle size can't see.
✅ Largest files eagerly reachable from
|
| Size | File |
|---|---|
| 878.2 KiB | src/styles/global.scss |
| 609.0 KiB | public/hedgehog/burning-money-hog.png |
| 541.9 KiB | public/hedgehog/waving-hog.png |
| 448.2 KiB | public/hedgehog/stop-sign-hog.png |
| 362.0 KiB | public/hedgehog/phone-pair-hogs.png |
| 354.8 KiB | ../node_modules/.pnpm/@posthog+icons@0.36.6_react-dom@18.3.1_react@18.3.1__react@18.3.1/node_modules/@posthog/icons/dist/posthog-icons.es.js |
| 335.6 KiB | public/hedgehog/desk-hog.png |
| 323.2 KiB | public/hedgehog/3-bears-hogs.png |
| 297.4 KiB | src/taxonomy/core-filter-definitions-by-group.json |
| 285.8 KiB | src/lib/api.ts |
Largest files eagerly reachable from src/scenes/AuthenticatedShell.tsx
| Size | File |
|---|---|
| 878.2 KiB | src/styles/global.scss |
| 760.0 KiB | src/queries/validators.js |
| 609.0 KiB | public/hedgehog/burning-money-hog.png |
| 541.9 KiB | public/hedgehog/waving-hog.png |
| 448.2 KiB | public/hedgehog/stop-sign-hog.png |
| 398.7 KiB | ../node_modules/.pnpm/chart.js@4.5.1/node_modules/chart.js/dist/chart.js |
| 362.0 KiB | public/hedgehog/phone-pair-hogs.png |
| 354.8 KiB | ../node_modules/.pnpm/@posthog+icons@0.36.6_react-dom@18.3.1_react@18.3.1__react@18.3.1/node_modules/@posthog/icons/dist/posthog-icons.es.js |
| 335.6 KiB | public/hedgehog/desk-hog.png |
| 323.2 KiB | public/hedgehog/3-bears-hogs.png |
Posted automatically by check-eager-graph · sizes are input-source bytes from the esbuild metafile · part of #32479
groupBy: [{key, scope}] splits a clause into one series per label-set.
All series share one bucket grid (union of observed buckets, zero-filled)
so cross-series and future cross-clause formula math line up; interval
moves onto the request (explicit choice or auto ~60 buckets). Series per
clause are capped at the 100 largest by summed |value|.
Also names the metrics scope/interval enums via ENUM_NAME_OVERRIDES
(scope collided between the filter and group-by serializers; the new
interval field surfaced a pre-existing BatchExport interval collision,
now BatchExportIntervalEnum).
How to validate manually:
- hogli test products/metrics/backend/tests/ (74 tests)
- with the local pipe running: POST /metrics/query with
{"groupBy": [{"key": "service_name"}], "interval": "minute_5"}
for nodejs_eventloop_lag_seconds returns one series per scraped
service on identical timestamps
9dc3731 to
4facf57
Compare
7762afe to
b3943ad
Compare
|
🎭 Playwright report
These issues are not necessarily caused by your changes. |

Problem
One series per query isn't observability — "break this down by pod/service/endpoint" is the core dashboard interaction, and formulas (next PRs) need every series on one time grid to be able to do math across them.
Changes
groupBy: [{key, scope}]splits a clause into one series per label value. Group columns are spliced into the HogQL AST (placeholders can't express a variable column count).intervalmoves onto the request (second…week, or auto ≈60 buckets across the range).ENUM_NAME_OVERRIDES: names the metrics interval enum and resolves a pre-existing BatchExportintervalcollision the new field surfaced (Interval837Enum→BatchExportIntervalEnum; generated-type rename only, no handwritten consumers — grep-verified).How did you test this code?
I'm an agent. Automated: grid/zero-fill/cap/ordering tests against real ClickHouse. Live check with the local pipe:
🤖 Agent context
Autonomy: Human-driven (agent-assisted) — directed by @DanielVisca
Zero-fill semantics: filled with 0.0 for all aggregations (chart- and formula-friendly); Prometheus-style absent-as-gap was considered and rejected for v1 since the formula evaluator needs aligned grids. Worth revisiting if avg-of-sparse-gauges confuses anyone.