
refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus #11330

Merged
HyeockJinKim merged 1 commit into main from refactor/BA-5744 on May 12, 2026

Conversation

@seedspirit
Contributor

@seedspirit seedspirit commented Apr 27, 2026

Summary

  • KernelNode.batch_load_live_stat (Valkey → valkey_stat.get_session_statistics_batch) is replaced by _batch_load_kernel_live_stat, which calls the metric processor and adapts the result through a new LegacyLiveStatConverter. Wire shape (dict[metric_name, MetricValue]) is preserved so GQL/WebUI consumers stay compatible.
  • MetricValue / MovingStatValue move from common/types.py to common/metrics/types.py next to the new RATE_STAT_METRICS / DIFF_STAT_METRICS classifications and resolve_unit_hint() helper (with naming-convention fallback so plugin metrics still get a usable unit hint).
  • New: LegacyLiveStatConverter unit tests covering gauge / rate / diff / capacity-default / pct-derivation / unknown-metric / multi-kernel isolation.
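The preserved wire shape can be illustrated with a minimal sketch. The field names (`current`, `capacity`, `pct`, `unit_hint`, `stats.diff`, `stats.rate`) come from this description; the string encoding of values, the helper name `to_legacy_metric`, and the zero default for `pct` when capacity is missing are assumptions for illustration, not the converter's actual code.

```python
from typing import Optional, TypedDict


class MovingStatValue(TypedDict, total=False):
    # Only the fields named in this PR description; the real type may have more.
    diff: str
    rate: str


class MetricValue(TypedDict):
    current: str
    capacity: Optional[str]
    pct: str
    unit_hint: str
    stats: MovingStatValue


def to_legacy_metric(
    current: float, capacity: Optional[float], unit_hint: str
) -> MetricValue:
    """Hypothetical helper: build one legacy-shaped entry.

    pct is derived from current/capacity only when a non-zero capacity is known.
    """
    pct = current / capacity * 100 if capacity else 0.0
    return MetricValue(
        current=str(current),
        capacity=None if capacity is None else str(capacity),
        pct=f"{pct:.2f}",
        unit_hint=unit_hint,
        stats=MovingStatValue(),
    )


# The resolver result keeps the dict[metric_name, MetricValue] shape.
live_stat = {"mem": to_legacy_metric(512.0, 2048.0, "bytes")}
```

Keeping the outer mapping keyed by metric name is what lets GQL/WebUI consumers read the result without changes.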

Known wire-level gaps

The remaining gaps are addressed by follow-up PRs that wire additional sources into the same converter; this PR's converter will pick them up as those land.

| Field | Status |
| --- | --- |
| current / pct (with capacity) / unit_hint / stats.diff / stats.rate | ✅ legacy-equivalent |
| capacity for cpu_used / `net_*` / `io_*` (and dependent pct) | 🔜 sentinel-synthesized in follow-up #11535 (BA-5806). The converter consumes the CAPACITY sample as-is, so it picks up the synthesized value automatically once #11535 lands. |
| stats.max (all metrics) / stats.avg (cpu_util, cuda_util, plugin accel metrics) | 🔜 re-supplied from Prometheus in follow-up #11360 (BA-5878). Plugin metrics are covered by the Backend.AI accelerator naming convention (`*_mem` / `*_util` / `*_power` / `*_temperature`), so `cuda_*` and other accelerator families get stats.max / stats.avg for free. Converter wiring (mapping new value_type=max / value_type=avg samples into the legacy stats.max / stats.avg slots) is the remaining piece on top of #11360. |
| Non-suffix-conforming new accelerator metric kinds (e.g., hypothetical `*_clock` / `*_voltage`) | ⚠️ requires extending the accel-suffix list in #11360 (single-edit extension point: `_ACCEL_GAUGE_SUFFIXES_*` in common/clients/prometheus/metric_types.py). |
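To show why a suffix list is a single-edit extension point, here is a minimal sketch of suffix-based classification. The constant `ACCEL_GAUGE_SUFFIXES` and the function `is_accel_gauge` are hypothetical names for illustration; the real `_ACCEL_GAUGE_SUFFIXES_*` definitions belong to #11360 and are not shown on this page.

```python
# Hypothetical stand-in for the accel-suffix list; the four suffixes are the
# ones named in the naming convention above.
ACCEL_GAUGE_SUFFIXES = ("_mem", "_util", "_power", "_temperature")


def is_accel_gauge(metric_name: str) -> bool:
    """Return True when the metric name follows the accelerator suffix convention."""
    # str.endswith accepts a tuple and matches any of its entries.
    return metric_name.endswith(ACCEL_GAUGE_SUFFIXES)


covered = is_accel_gauge("cuda_util")
not_covered = is_accel_gauge("cuda_clock")
```

Under this scheme, supporting a new kind such as `*_clock` would mean adding one suffix to the tuple, with no change to the matching logic.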

Test plan

  • pants test tests/unit/manager/api/gql_legacy/test_stat_converter.py
  • pants check on the modified files
  • pants fmt / pants lint clean
  • A/B equivalence run with scripts/test-live-stat-equivalence.sh against a live session (stash → A label, pop → B label, diff live-stat-eqv/A vs live-stat-eqv/B)
  • WebUI smoke (kernel/session detail page → live_stat panel)

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://sorna--11330.org.readthedocs.build/en/11330/


📚 Documentation preview 📚: https://sorna-ko--11330.org.readthedocs.build/ko/11330/

@github-actions github-actions Bot added the size:XL 500~ LoC label Apr 27, 2026
@github-actions github-actions Bot added area:docs Documentations comp:manager Related to Manager component comp:agent Related to Agent component comp:client Related to Client component comp:common Related to Common component comp:app-proxy Related to App Proxy component labels Apr 27, 2026
seedspirit added a commit that referenced this pull request Apr 27, 2026
@seedspirit seedspirit marked this pull request as ready for review April 27, 2026 04:06
Copilot AI review requested due to automatic review settings April 27, 2026 04:06
Contributor

Copilot AI left a comment


Pull request overview

Migrates the legacy GraphQL live_stat resolver path from Valkey-based session statistics to Prometheus-backed container live-stat queries, while preserving the existing dict[metric_name, MetricValue] wire shape for WebUI/GQL consumers.

Changes:

  • Replaced KernelNode.batch_load_live_stat Valkey calls with a Prometheus-based batch loader (_batch_load_kernel_live_stat) and a new LegacyLiveStatConverter.
  • Moved legacy MetricValue / MovingStatValue into ai.backend.common.metrics.types, adding metric classifications and resolve_unit_hint() for unit derivation.
  • Added unit tests covering conversion behavior across gauge/rate/diff metrics, pct derivation, defaults, and multi-kernel isolation.
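The unit-hint derivation with a naming-convention fallback could look roughly like the sketch below. The real `resolve_unit_hint()` signature and lookup tables are not shown on this page, so the function name `resolve_unit_hint_sketch`, both tables, and the `"count"` default are assumptions.

```python
# Hypothetical tables: the real mappings live in common/metrics/types.py
# and are not reproduced here.
KNOWN_UNIT_HINTS = {"mem": "bytes", "cpu_util": "percent"}
SUFFIX_UNIT_HINTS = (("_mem", "bytes"), ("_util", "percent"))


def resolve_unit_hint_sketch(metric_name: str, default: str = "count") -> str:
    """Return an explicit hint if one is registered, else fall back to the suffix."""
    if metric_name in KNOWN_UNIT_HINTS:
        return KNOWN_UNIT_HINTS[metric_name]
    # Fallback: plugin metrics that follow the naming convention still get a hint.
    for suffix, hint in SUFFIX_UNIT_HINTS:
        if metric_name.endswith(suffix):
            return hint
    return default
```

The fallback is what lets a plugin metric such as `cuda_mem` get a usable hint without being registered explicitly.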

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/manager/api/gql_legacy/test_stat_converter.py | Adds unit tests for the legacy live_stat converter behavior. |
| src/ai/backend/manager/repositories/metric/repository.py | Minor refactor of Prometheus query result unpacking for live stats. |
| src/ai/backend/manager/clients/prometheus/fixed_query_builder.py | Removes unreachable guard in template selection for metric types. |
| src/ai/backend/manager/api/gql_legacy/statistics.py | Removes an unused/legacy batch loader wrapper method. |
| src/ai/backend/manager/api/gql_legacy/stat_converter.py | Introduces LegacyLiveStatConverter to adapt Prometheus results to legacy shape. |
| src/ai/backend/manager/api/gql_legacy/kernel.py | Switches live_stat resolver to Prometheus action + legacy conversion via dataloader. |
| src/ai/backend/common/types.py | Removes legacy MetricValue/MovingStatValue from the common types module. |
| src/ai/backend/common/metrics/types.py | Adds MetricValue/MovingStatValue + unit hint resolution and metric classifications. |
| src/ai/backend/common/clients/valkey_client/valkey_stat/client.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/client/output/formatters.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/appproxy/worker/types.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| src/ai/backend/agent/stats.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| changes/11330.enhance.md | Adds changelog entry for the migration. |
Comments suppressed due to low confidence (1)

src/ai/backend/manager/clients/prometheus/fixed_query_builder.py:154

  • _get_template() no longer has a fallback branch. If an unexpected/new MetricType value ever reaches this function (e.g., enum extended in the future), Python will implicitly return None, which then propagates into MetricPreset(template=...) and fails later with a less clear error. Please add an explicit default case that raises (e.g., UnreachableError/ValueError) so failures are immediate and actionable.
    def _get_template(self, metric_type: MetricType) -> str:
        match metric_type:
            case MetricType.GAUGE:
                return _GAUGE_TEMPLATE
            case MetricType.RATE:
                return _RATE_TEMPLATE
            case MetricType.DIFF:
                return _DIFF_TEMPLATE



Comment thread src/ai/backend/common/metrics/types.py Outdated
@seedspirit seedspirit requested a review from a team April 27, 2026 04:41
Comment thread src/ai/backend/manager/api/gql_legacy/kernel.py
Comment thread src/ai/backend/manager/api/gql_legacy/stat_converter.py Outdated
Comment thread src/ai/backend/manager/api/gql_legacy/stat_converter.py
@jopemachine jopemachine requested a review from a team April 27, 2026 09:27
Comment thread src/ai/backend/manager/api/gql_legacy/kernel.py Outdated
Comment on lines +21 to +24
Merge order from upstream is gauge -> diff -> rate, so for
RATE/DIFF metrics the same `(name, CURRENT)` tuple appears twice;
`currents[0]` is the raw gauge sample, `currents[-1]` is the
rate/diff query result.
Contributor Author


In a follow-up PR, I plan to refactor this lookup from an index-based approach to a type-based one. That change requires modifying the response merge logic, so I left it out of this PR.
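
The difference between the two approaches can be sketched as follows. The index-based part mirrors the docstring quoted above (merge order gauge -> diff -> rate, `currents[0]` versus `currents[-1]`); the `source` tag and the type-based lookup are hypothetical, since the planned refactor is not part of this PR.

```python
from collections import defaultdict

# (metric_name, value_type, source, value), listed in merge order.
# The "source" tag is hypothetical: it is what a type-based approach would add.
merged = [
    ("net_rx", "CURRENT", "gauge", 1000.0),
    ("net_rx", "CURRENT", "rate", 12.5),
]

# Index-based: the same (name, CURRENT) key appears twice, so position decides.
by_key = defaultdict(list)
for name, value_type, _source, value in merged:
    by_key[(name, value_type)].append(value)
currents = by_key[("net_rx", "CURRENT")]
raw_gauge = currents[0]     # first entry is the raw gauge sample
rate_result = currents[-1]  # last entry is the rate/diff query result

# Type-based (hypothetical): the source is part of the key, so order is irrelevant.
by_source = {
    (name, value_type, source): value
    for name, value_type, source, value in merged
}
rate_by_type = by_source[("net_rx", "CURRENT", "rate")]
```

The index-based form silently depends on the upstream merge order, which is why changing it touches the response merge logic.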

Comment thread src/ai/backend/common/metrics/types.py Outdated
Comment on lines -193 to -203
-            graph_ctx, self.batch_load_live_stat
+            graph_ctx, _batch_load_kernel_live_stat
         )
         return cast(dict[str, Any] | None, await loader.load(self.row_id))

-    @classmethod
-    async def batch_load_live_stat(
-        cls, ctx: GraphQueryContext, kernel_ids: Sequence[KernelId]
-    ) -> list[dict[str, Any] | None]:
-        kernel_ids_str = [str(kid) for kid in kernel_ids]
-        return await ctx.valkey_stat.get_session_statistics_batch(kernel_ids_str)

Collaborator


This change appears to modify code that does not need to change, which makes the diff harder to read.

Comment thread src/ai/backend/common/metrics/types.py Outdated
@github-actions github-actions Bot added the size:L 100~500 LoC label May 12, 2026
@github-actions github-actions Bot removed the size:XL 500~ LoC label May 12, 2026
@HyeockJinKim HyeockJinKim merged commit 2901314 into main May 12, 2026
36 checks passed
@HyeockJinKim HyeockJinKim deleted the refactor/BA-5744 branch May 12, 2026 06:41

Labels

area:docs Documentations comp:agent Related to Agent component comp:app-proxy Related to App Proxy component comp:client Related to Client component comp:common Related to Common component comp:manager Related to Manager component size:L 100~500 LoC


4 participants