docs(proposal): add EP-003 GPU Power Monitoring enhancement proposal #2367
Conversation
SamYuan1990 left a comment
LGTM
> NVIDIA MIG partitions a physical GPU into isolated instances, each with dedicated compute and memory. NVML reports each MIG instance as a separate device with its own UUID. Our implementation handles this naturally:
>
> - `meter.Devices()` returns MIG instances as separate devices
> - Each instance has independent power monitoring
> - No special handling required - works out of the box
Is the claim verified?
No, not verified in the reference implementation. For Kepler, MIG instances would appear as separate devices.
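For reference, a minimal sketch of how this claim could be checked on real hardware, assuming the go-nvml bindings (github.com/NVIDIA/go-nvml); this is a verification aid, not the proposal's implementation:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		current, _, ret := dev.GetMigMode()
		if ret != nvml.SUCCESS || current != nvml.DEVICE_MIG_ENABLE {
			continue // GPU is not in MIG mode
		}
		maxMig, ret := dev.GetMaxMigDeviceCount()
		if ret != nvml.SUCCESS {
			continue
		}
		for j := 0; j < maxMig; j++ {
			mig, ret := dev.GetMigDeviceHandleByIndex(j)
			if ret != nvml.SUCCESS {
				continue // MIG slot not populated
			}
			uuid, _ := mig.GetUUID()
			// This is the unverified part of the claim: power queries
			// against a MIG handle may return ERROR_NOT_SUPPORTED on
			// some GPUs.
			mw, ret := mig.GetPowerUsage() // milliwatts
			fmt.Printf("MIG %s power=%dmW status=%s\n", uuid, mw, nvml.ErrorString(ret))
		}
	}
}
```

Each MIG instance does enumerate as a separate device with its own UUID; whether `GetPowerUsage` succeeds per instance is exactly what needs verifying.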
Not sure how NVML works for MIG-based instances, considering its legacy: https://massedcompute.com/faq-answers/?question=What%20is%20the%20difference%20between%20NVIDIA%20DCGM%20and%20NVML%20in%20terms%20of%20monitoring%20and%20management%20capabilities?
Answered.
> GPU power is attributed to processes proportionally by compute utilization:
>
> ```text
> active_power = total_power - idle_power
> ```
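As a concrete illustration of that formula, here is a sketch of proportional attribution by SM utilization; the function and parameter names are illustrative, not the proposal's actual API:

```go
package main

import "fmt"

// attributeGPUPower splits active power (total minus idle) across
// processes in proportion to their SM utilization. Processes that did
// no work receive none of the active power.
func attributeGPUPower(totalW, idleW float64, smUtil map[int]float64) map[int]float64 {
	activeW := totalW - idleW
	var sum float64
	for _, u := range smUtil {
		sum += u
	}
	out := make(map[int]float64, len(smUtil))
	for pid, u := range smUtil {
		if sum > 0 {
			out[pid] = activeW * u / sum
		}
	}
	return out
}

func main() {
	// 250 W total, 60 W idle; two processes at 70% and 30% SM utilization.
	// 190 W of active power is split 70/30: 133 W and 57 W.
	fmt.Println(attributeGPUPower(250, 60, map[int]float64{1234: 70, 5678: 30}))
}
```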
Can we test this scenario? In an AI-based workload, GPUs may see a constant stream of requests and never drop to a true idle state; in that case idle_power would equal total_power, making active_power = 0. Correct me if I am wrong here.
There is no ideal idle-power handling; that is why it is mentioned in the Open Questions section.
We could either take the idle power from configuration, or let Kepler detect it by letting the GPU go idle for some time. There is no third way.
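To make the two options concrete, a rough sketch (hypothetical names, not Kepler code) of a configured-or-detected idle estimator, where detection takes the minimum power observed over a sliding window:

```go
package gpupower

import "math"

// IdleEstimator returns a configured idle power if set, otherwise the
// minimum power observed over a sliding window of recent readings.
type IdleEstimator struct {
	ConfiguredW float64   // from config; 0 means "detect"
	maxSamples  int
	window      []float64 // recent readings in watts
}

func NewIdleEstimator(configuredW float64, maxSamples int) *IdleEstimator {
	return &IdleEstimator{ConfiguredW: configuredW, maxSamples: maxSamples}
}

// Observe records a power sample, evicting the oldest once the window is full.
func (e *IdleEstimator) Observe(powerW float64) {
	e.window = append(e.window, powerW)
	if len(e.window) > e.maxSamples {
		e.window = e.window[1:]
	}
}

// IdlePower returns the current idle-power estimate.
func (e *IdleEstimator) IdlePower() float64 {
	if e.ConfiguredW > 0 {
		return e.ConfiguredW
	}
	min := math.Inf(1)
	for _, p := range e.window {
		min = math.Min(min, p)
	}
	if math.IsInf(min, 1) {
		return 0 // no samples yet
	}
	// Caveat raised in this thread: if the GPU never goes idle within
	// the window, this floor approaches total power and the computed
	// active power collapses toward zero.
	return min
}
```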
Force-pushed from 91f334e to 9a0f3ca
sunya-ch left a comment
/lgtm
Please fix the markdownlint issues.
Force-pushed from 9a0f3ca to dbe7fea
Add comprehensive design document for GPU power monitoring feature:

- Vendor-agnostic architecture with pluggable backends
- Per-process energy attribution based on compute utilization (SM%)
- Kubernetes GPU allocation mapping via kubelet pod-resources API
- Support for GPU sharing modes (exclusive, time-slicing, MIG, MPS)
- Detailed metrics specification for node/process/container/pod levels
- Idle power detection and attribution logic

Signed-off-by: Vimal Kumar <[email protected]>
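On the kubelet pod-resources mapping listed above, a minimal sketch of listing which pods hold which GPU device IDs via the standard podresources v1 gRPC API; the socket path is the kubelet default and may differ per deployment, and error handling is reduced to `log.Fatalf`:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresources "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default kubelet pod-resources socket; deployments may mount it elsewhere.
const socketPath = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial kubelet: %v", err)
	}
	defer conn.Close()

	client := podresources.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresources.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("list pod resources: %v", err)
	}
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, d := range c.GetDevices() {
				if d.GetResourceName() != "nvidia.com/gpu" {
					continue
				}
				// Device IDs are GPU (or MIG instance) UUIDs that can
				// be matched against what the power meter reports.
				fmt.Printf("%s/%s %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), c.GetName(), d.GetDeviceIds())
			}
		}
	}
}
```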
Force-pushed from dbe7fea to 722282a
SamYuan1990 left a comment
LGTM
Document the vendor-agnostic GPU power monitoring architecture. Includes Grafana screenshots showing node, per-process, and idle power.