@rebel-mskim

Motivation

Rebellions’ NPU exporter currently emits only bare device metrics, with no workload or health context. Operators need to see which Kubernetes pod owns each device and whether the device is healthy, along with driver/firmware metadata, so they can troubleshoot workloads quickly.

Summary of Changes

  • add a kubelet-based pod resource mapper, node-name detection, and shared label definitions so all collectors can tag metrics with pod/namespace/container/hostname and device version info
  • extend the daemon client and existing collectors to populate the enriched labels, and introduce a device health collector plus factory wiring to export per-device health status gauges
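
To illustrate the pod mapping idea, here is a minimal sketch of how a device-to-pod index could be built from the kubelet's pod resources listing. It is not the code in this PR: the `containerDevices` type is a simplified, hypothetical stand-in for one entry of the kubelet response, `buildDeviceIndex` is an illustrative name, and the `rebellions.ai/` resource prefix is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// PodInfo identifies the workload that owns a device.
type PodInfo struct {
	Namespace string
	Pod       string
	Container string
}

// containerDevices is a hypothetical, flattened view of one container's
// device allocation as reported by the kubelet pod resources listing.
type containerDevices struct {
	Namespace    string
	Pod          string
	Container    string
	ResourceName string
	DeviceIDs    []string
}

// buildDeviceIndex maps each device ID to its owning pod, keeping only
// allocations whose resource name starts with resourcePrefix.
func buildDeviceIndex(entries []containerDevices, resourcePrefix string) map[string]PodInfo {
	index := make(map[string]PodInfo)
	for _, e := range entries {
		if !strings.HasPrefix(e.ResourceName, resourcePrefix) {
			continue // allocation belongs to another device plugin
		}
		for _, id := range e.DeviceIDs {
			index[id] = PodInfo{Namespace: e.Namespace, Pod: e.Pod, Container: e.Container}
		}
	}
	return index
}

func main() {
	// Assumed sample data; real entries would come from the kubelet.
	entries := []containerDevices{
		{Namespace: "ml", Pod: "infer-0", Container: "server",
			ResourceName: "rebellions.ai/ATOM", DeviceIDs: []string{"dev-0", "dev-1"}},
	}
	index := buildDeviceIndex(entries, "rebellions.ai/")
	owner, ok := index["dev-1"]
	fmt.Println(owner.Namespace, owner.Pod, owner.Container, ok)
}
```

A collector can then look up each device ID in the index and leave the pod, namespace, and container labels empty when the device is not allocated to any pod.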

Technical Details

  • wire the new PodResourceMapper and node name through config → collector factory → hardware/memory/utilization collectors, ensuring Gauges use the expanded label set consistently
  • augment the daemon client with version info and GetTotalInfo lookups so DeviceStatus carries driver/firmware/SMC versions and health status, which the new DeviceHealthCollector exposes via RBLN_DEVICE_STATUS:HEALTH
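
As a rough sketch of what a shared label set buys, the example below renders one health sample in the Prometheus text format from a single ordered list of label names. It uses only the standard library rather than a metrics client. The label names, the `DeviceStatus` fields, the `healthSample` helper, and the 1/0 encoding of health are assumptions made for illustration; only the metric name `RBLN_DEVICE_STATUS:HEALTH` comes from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// labelNames is a hypothetical shared label set; every collector would
// emit values in exactly this order.
var labelNames = []string{
	"hostname", "namespace", "pod", "container",
	"device", "driver_version", "firmware_version", "smc_version",
}

// PodInfo is the pod that owns a device; zero value means unallocated.
type PodInfo struct {
	Namespace, Pod, Container string
}

// DeviceStatus is an assumed shape for per-device state from the daemon.
type DeviceStatus struct {
	Name            string
	DriverVersion   string
	FirmwareVersion string
	SMCVersion      string
	Healthy         bool
}

// labelValues returns values aligned index-by-index with labelNames.
func labelValues(node string, d DeviceStatus, p PodInfo) []string {
	return []string{
		node, p.Namespace, p.Pod, p.Container,
		d.Name, d.DriverVersion, d.FirmwareVersion, d.SMCVersion,
	}
}

// healthSample renders one gauge sample line; 1 means healthy, 0 unhealthy
// (an assumed encoding).
func healthSample(node string, d DeviceStatus, p PodInfo) string {
	values := labelValues(node, d, p)
	pairs := make([]string, len(labelNames))
	for i, name := range labelNames {
		// %q quotes and escapes the value; adequate for plain label text.
		pairs[i] = fmt.Sprintf("%s=%q", name, values[i])
	}
	value := 0
	if d.Healthy {
		value = 1
	}
	return fmt.Sprintf("RBLN_DEVICE_STATUS:HEALTH{%s} %d", strings.Join(pairs, ","), value)
}

func main() {
	dev := DeviceStatus{Name: "rbln0", DriverVersion: "1.0",
		FirmwareVersion: "2.0", SMCVersion: "3.0", Healthy: true}
	pod := PodInfo{Namespace: "ml", Pod: "infer-0", Container: "server"}
	fmt.Println(healthSample("node-a", dev, pod))
}
```

Keeping the label names in one place means the hardware, memory, utilization, and health collectors cannot drift apart in label order or naming.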

@rebel-mskim rebel-mskim self-assigned this Nov 18, 2025
@rebel-mskim rebel-mskim changed the title feat(metrics): enrich device metrics with pod mapping and health status feat(metrics): enrich device metrics with pod mapping and health status Nov 18, 2025
@rebel-mskim rebel-mskim force-pushed the feat/pod-resource-labels branch 2 times, most recently from 43bbe5b to cd3e48e Compare November 18, 2025 11:38
@rebel-mskim rebel-mskim force-pushed the feat/pod-resource-labels branch from cd3e48e to 8d1a0ce Compare November 19, 2025 02:22
@rebel-mskim rebel-mskim merged commit 59c5fad into rebellions-sw:main Nov 19, 2025
4 checks passed