refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330
Conversation
18676b6 to
323c1d1
Compare
There was a problem hiding this comment.
Pull request overview
Migrates the legacy GraphQL live_stat resolver path from Valkey-based session statistics to Prometheus-backed container live-stat queries, while preserving the existing dict[metric_name, MetricValue] wire shape for WebUI/GQL consumers.
Changes:
- Replaced
KernelNode.batch_load_live_statValkey calls with a Prometheus-based batch loader (_batch_load_kernel_live_stat) and a newLegacyLiveStatConverter. - Moved legacy
MetricValue/MovingStatValueintoai.backend.common.metrics.types, adding metric classifications andresolve_unit_hint()for unit derivation. - Added unit tests covering conversion behavior across gauge/rate/diff metrics, pct derivation, defaults, and multi-kernel isolation.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/manager/api/gql_legacy/test_stat_converter.py | Adds unit tests for the legacy live_stat converter behavior. |
| src/ai/backend/manager/repositories/metric/repository.py | Minor refactor of Prometheus query result unpacking for live stats. |
| src/ai/backend/manager/clients/prometheus/fixed_query_builder.py | Removes unreachable guard in template selection for metric types. |
| src/ai/backend/manager/api/gql_legacy/statistics.py | Removes an unused/legacy batch loader wrapper method. |
| src/ai/backend/manager/api/gql_legacy/stat_converter.py | Introduces LegacyLiveStatConverter to adapt Prometheus results to legacy shape. |
| src/ai/backend/manager/api/gql_legacy/kernel.py | Switches live_stat resolver to Prometheus action + legacy conversion via dataloader. |
| src/ai/backend/common/types.py | Removes legacy MetricValue/MovingStatValue from the common types module. |
| src/ai/backend/common/metrics/types.py | Adds MetricValue/MovingStatValue + unit hint resolution and metric classifications. |
| src/ai/backend/common/clients/valkey_client/valkey_stat/client.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/client/output/formatters.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/appproxy/worker/types.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| src/ai/backend/agent/stats.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| changes/11330.enhance.md | Adds changelog entry for the migration. |
Comments suppressed due to low confidence (1)
src/ai/backend/manager/clients/prometheus/fixed_query_builder.py:154
_get_template()no longer has a fallback branch. If an unexpected/newMetricTypevalue ever reaches this function (e.g., enum extended in the future), Python will implicitly returnNone, which then propagates intoMetricPreset(template=...)and fails later with a less clear error. Please add an explicit default case that raises (e.g.,UnreachableError/ValueError) so failures are immediate and actionable.
def _get_template(self, metric_type: MetricType) -> str:
match metric_type:
case MetricType.GAUGE:
return _GAUGE_TEMPLATE
case MetricType.RATE:
return _RATE_TEMPLATE
case MetricType.DIFF:
return _DIFF_TEMPLATE
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cde8b1f to
1a2dbeb
Compare
| Merge order from upstream is gauge -> diff -> rate, so for | ||
| RATE/DIFF metrics the same `(name, CURRENT)` tuple appears twice; | ||
| `currents[0]` is the raw gauge sample, `currents[-1]` is the | ||
| rate/diff query result. |
There was a problem hiding this comment.
In a future PR, I plan to refactor the code so that it can be converted from an index-based to a type-based approach. Since this would require modifying the response merge logic, I did not include it in this PR.
| graph_ctx, self.batch_load_live_stat | ||
| graph_ctx, _batch_load_kernel_live_stat | ||
| ) | ||
| return cast(dict[str, Any] | None, await loader.load(self.row_id)) | ||
|
|
||
| @classmethod | ||
| async def batch_load_live_stat( | ||
| cls, ctx: GraphQueryContext, kernel_ids: Sequence[KernelId] | ||
| ) -> list[dict[str, Any] | None]: | ||
| kernel_ids_str = [str(kid) for kid in kernel_ids] | ||
| return await ctx.valkey_stat.get_session_statistics_batch(kernel_ids_str) | ||
|
|
There was a problem hiding this comment.
It seems like code that doesn't really need to be changed is being modified, and this makes it difficult to read the code.
5c54cab to
8095984
Compare
Summary
KernelNode.batch_load_live_stat(Valkey →valkey_stat.get_session_statistics_batch) is replaced by_batch_load_kernel_live_stat, which calls the metric processor and adapts the result through a newLegacyLiveStatConverter. Wire shape (dict[metric_name, MetricValue]) is preserved so GQL/WebUI consumers stay compatible.MetricValue/MovingStatValuemove fromcommon/types.pytocommon/metrics/types.pynext to the newRATE_STAT_METRICS/DIFF_STAT_METRICSclassifications andresolve_unit_hint()helper (with naming-convention fallback so plugin metrics still get a usable unit hint).LegacyLiveStatConverterunit tests covering gauge / rate / diff / capacity-default / pct-derivation / unknown-metric / multi-kernel isolation.Known wire-level gaps
The remaining gaps are addressed by follow-up PRs that wire additional sources into the same converter; this PR's converter will pick them up as those land.
current/pct(with capacity) /unit_hint/stats.diff/stats.ratecapacityforcpu_used/net_*/io_*(and dependentpct)CAPACITYsample as-is, so it picks up the synthesized value automatically once #11535 lands.stats.max(all metrics) /stats.avg(cpu_util,cuda_util, plugin accel metrics)*_mem/*_util/*_power/*_temperature), socuda_*and other accelerator families getstats.max/stats.avgfor free. Converter wiring (mapping newvalue_type=max/value_type=avgsamples into the legacystats.max/stats.avgslots) is the remaining piece on top of #11360.*_clock/*_voltage)_ACCEL_GAUGE_SUFFIXES_*incommon/clients/prometheus/metric_types.py).Test plan
pants test tests/unit/manager/api/gql_legacy/test_stat_converter.pypants checkon the modified filespants fmt/pants lintcleanscripts/test-live-stat-equivalence.shagainst a live session (stash → A label, pop → B label, difflive-stat-eqv/Avslive-stat-eqv/B)🤖 Generated with Claude Code
📚 Documentation preview 📚: https://sorna--11330.org.readthedocs.build/en/11330/
📚 Documentation preview 📚: https://sorna-ko--11330.org.readthedocs.build/ko/11330/