Conversation

@vimalk78 (Collaborator)
Document vendor-agnostic GPU power monitoring architecture:

  • NVIDIA NVML backend with per-process attribution via SM%
  • Registry pattern for multi-vendor support (AMD ROCm, Intel Level Zero)
  • Kubelet pod-resources API for GPU-to-pod mapping
  • Idle power auto-detection and energy attribution math
  • GPU sharing modes (exclusive, time-slicing, MIG, MPS)

Includes Grafana screenshots showing node, per-process, and idle power.
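The energy attribution math mentioned in the description could be sketched roughly as follows. This is an illustrative model only, not Kepler's actual API; the names `ProcSample` and `AttributePower` are made up for this example. Active power (total minus idle) is split across processes in proportion to their SM utilization.

```go
// Hypothetical sketch of per-process GPU power attribution by SM%.
// Not the real Kepler implementation; type and function names are invented.
package main

import "fmt"

// ProcSample is one process's GPU compute utilization sample,
// e.g. as reported by NVML process accounting.
type ProcSample struct {
	PID int
	SM  float64 // SM utilization, percent
}

// AttributePower splits the GPU's active power (total minus idle)
// across processes in proportion to their SM utilization.
func AttributePower(totalW, idleW float64, procs []ProcSample) map[int]float64 {
	active := totalW - idleW
	if active < 0 {
		active = 0 // guard against a mis-detected idle baseline
	}
	var sumSM float64
	for _, p := range procs {
		sumSM += p.SM
	}
	out := make(map[int]float64, len(procs))
	for _, p := range procs {
		if sumSM > 0 {
			out[p.PID] = active * p.SM / sumSM
		} else {
			out[p.PID] = 0
		}
	}
	return out
}

func main() {
	procs := []ProcSample{{PID: 100, SM: 60}, {PID: 200, SM: 20}}
	for pid, w := range AttributePower(250, 50, procs) {
		fmt.Printf("pid=%d power=%.1fW\n", pid, w)
	}
}
```

With 250 W total and a 50 W idle baseline, the 200 W of active power splits 150 W / 50 W between the two processes (60% vs. 20% SM share).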

@github-actions bot added the docs (Documentation changes) label Dec 10, 2025
SamYuan1990 (Collaborator) previously approved these changes Dec 12, 2025

LGTM

NVIDIA MIG partitions a physical GPU into isolated instances, each with dedicated compute and memory. NVML reports each MIG instance as a separate device with its own UUID. Our implementation handles this naturally:
- `meter.Devices()` returns MIG instances as separate devices
- Each instance has independent power monitoring
- No special handling required - works out of the box

Collaborator: Is the claim verified?

Collaborator (Author): No, not verified in the reference implementation. For Kepler, MIG instances would appear as separate devices.
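The "MIG instances appear as separate devices" behavior under discussion, together with the registry pattern for multi-vendor backends, could be sketched like this. The `Meter` interface, `Device` struct, and `fakeNVML` backend are assumptions for illustration, not the actual Kepler types; a real NVML backend would enumerate devices through the NVML bindings, where each MIG instance carries its own UUID.

```go
// Illustrative sketch of a vendor backend registry where MIG instances
// surface as ordinary devices. Names are invented for this example.
package main

import "fmt"

// Device is one monitorable GPU. NVML reports each MIG instance with its
// own UUID, so a MIG partition appears here like any other device.
type Device struct {
	UUID string
	MIG  bool
}

// Meter is the vendor-agnostic backend interface (NVML, ROCm, Level Zero...).
type Meter interface {
	Name() string
	Devices() []Device
}

// registry maps backend name to implementation.
var registry = map[string]Meter{}

func Register(m Meter) { registry[m.Name()] = m }

// fakeNVML stands in for an NVML-backed meter on a MIG-partitioned GPU.
type fakeNVML struct{}

func (fakeNVML) Name() string { return "nvml" }
func (fakeNVML) Devices() []Device {
	return []Device{
		{UUID: "MIG-aaaa", MIG: true},
		{UUID: "MIG-bbbb", MIG: true},
	}
}

func main() {
	Register(fakeNVML{})
	for _, m := range registry {
		for _, d := range m.Devices() {
			fmt.Printf("%s device %s mig=%v\n", m.Name(), d.UUID, d.MIG)
		}
	}
}
```

Because the monitoring loop only sees `Devices()`, MIG needs no special-casing in the caller; whether that holds on real hardware is exactly the unverified claim above.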

Collaborator (Author): answered


Collaborator (commenting on the attribution formula):
GPU power is attributed to processes proportionally by compute utilization:

```text
active_power = total_power - idle_power
```

Can we test this scenario? In an AI-based workload, GPUs may receive a constant stream of requests and never drop to a true idle state. In that case idle_power would equal total_power, making active_power = 0. Correct me if I am wrong here?

Collaborator (Author): There is no ideal idle-power handling; that is why it is mentioned in the Open Questions section. We could either make idle power a configuration option, or let Kepler detect idle power by letting the GPU go idle for some time. There is no third way.
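The two options the author describes (configured idle power, or detection from observed samples) could be sketched as a minimum-tracking baseline with a configured override. This is a hypothetical design, not Kepler code; note that on a GPU that never truly idles, the observed floor stays near total power, which is exactly the failure mode raised in the question above.

```go
// Sketch of idle-power handling: prefer an operator-configured value,
// else fall back to the minimum power sample observed so far.
// Hypothetical names; not the actual Kepler implementation.
package main

import "fmt"

type IdleDetector struct {
	configured float64 // operator-provided idle watts (< 0 means unset)
	observed   float64 // lowest power sample seen so far
	seen       bool
}

func NewIdleDetector(configuredW float64) *IdleDetector {
	return &IdleDetector{configured: configuredW}
}

// Observe feeds one power sample and keeps the running minimum.
func (d *IdleDetector) Observe(powerW float64) {
	if !d.seen || powerW < d.observed {
		d.observed = powerW
		d.seen = true
	}
}

// IdleWatts returns the configured idle power if set; otherwise the
// observed minimum (0 until any sample arrives).
func (d *IdleDetector) IdleWatts() float64 {
	if d.configured >= 0 {
		return d.configured
	}
	if d.seen {
		return d.observed
	}
	return 0
}

func main() {
	d := NewIdleDetector(-1) // no configured value: detect from samples
	for _, w := range []float64{180, 220, 45, 190} {
		d.Observe(w)
	}
	fmt.Printf("idle=%.0fW\n", d.IdleWatts()) // prints idle=45W
}
```

Under a constant load the samples never dip, the detected baseline equals the load power, and active_power collapses to zero, so the configured override remains the safety valve.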

@vimalk78 vimalk78 force-pushed the propose-gpu-power branch 2 times, most recently from 91f334e to 9a0f3ca Compare December 15, 2025 07:01
sunya-ch (Collaborator) previously approved these changes Dec 15, 2025

/lgtm

Please fix the markdownlint issues.

  Add comprehensive design document for GPU power monitoring feature:
  - Vendor-agnostic architecture with pluggable backends
  - Per-process energy attribution based on compute utilization (SM%)
  - Kubernetes GPU allocation mapping via kubelet pod-resources API
  - Support for GPU sharing modes (exclusive, time-slicing, MIG, MPS)
  - Detailed metrics specification for node/process/container/pod levels
  - Idle power detection and attribution logic

Signed-off-by: Vimal Kumar <[email protected]>
@SamYuan1990 (Collaborator) left a comment

LGTM
