docs(proposal): add EP-003 GPU Power Monitoring enhancement proposal #2367
Conversation
SamYuan1990 left a comment
LGTM
> NVIDIA MIG partitions a physical GPU into isolated instances, each with dedicated compute and memory. NVML reports each MIG instance as a separate device with its own UUID. Our implementation handles this naturally:
>
> - `meter.Devices()` returns MIG instances as separate devices
> - Each instance has independent power monitoring
> - No special handling required - works out of the box
Is the claim verified?
No, not verified in the reference implementation. For Kepler, MIG instances would appear as separate devices.
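For reference, a minimal sketch of how this claim could be checked on real hardware, assuming the go-nvml bindings (github.com/NVIDIA/go-nvml); this is a verification aid, not the proposal's implementation:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		current, _, ret := dev.GetMigMode()
		if ret != nvml.SUCCESS || current != nvml.DEVICE_MIG_ENABLE {
			continue // GPU is not in MIG mode
		}
		maxMig, ret := dev.GetMaxMigDeviceCount()
		if ret != nvml.SUCCESS {
			continue
		}
		for j := 0; j < maxMig; j++ {
			mig, ret := dev.GetMigDeviceHandleByIndex(j)
			if ret != nvml.SUCCESS {
				continue // MIG slot not populated
			}
			uuid, _ := mig.GetUUID()
			// This is the unverified part of the claim: power queries
			// against a MIG handle may return ERROR_NOT_SUPPORTED on
			// some GPUs.
			mw, ret := mig.GetPowerUsage() // milliwatts
			fmt.Printf("MIG %s power=%dmW status=%s\n", uuid, mw, nvml.ErrorString(ret))
		}
	}
}
```

Each MIG instance does enumerate as a separate device with its own UUID; whether `GetPowerUsage` succeeds per instance is exactly what needs verifying.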
Not sure how NVML works for MIG-based instances, considering its legacy: https://massedcompute.com/faq-answers/?question=What%20is%20the%20difference%20between%20NVIDIA%20DCGM%20and%20NVML%20in%20terms%20of%20monitoring%20and%20management%20capabilities?
Answered.
> GPU power is attributed to processes proportionally by compute utilization:
>
> ```text
> active_power = total_power - idle_power
> ```
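As a concrete illustration of that formula, here is a sketch of proportional attribution by SM utilization; the function and parameter names are illustrative, not the proposal's actual API:

```go
package main

import "fmt"

// attributeGPUPower splits active power (total minus idle) across
// processes in proportion to their SM utilization. Processes that did
// no work receive none of the active power.
func attributeGPUPower(totalW, idleW float64, smUtil map[int]float64) map[int]float64 {
	activeW := totalW - idleW
	var sum float64
	for _, u := range smUtil {
		sum += u
	}
	out := make(map[int]float64, len(smUtil))
	for pid, u := range smUtil {
		if sum > 0 {
			out[pid] = activeW * u / sum
		}
	}
	return out
}

func main() {
	// 250 W total, 60 W idle; two processes at 70% and 30% SM utilization.
	// 190 W of active power is split 70/30: 133 W and 57 W.
	fmt.Println(attributeGPUPower(250, 60, map[int]float64{1234: 70, 5678: 30}))
}
```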
Can we test this scenario? In an AI-based workload, GPUs may see a constant stream of requests and never drop to a true idle state; in that case idle_power would equal total_power, making active_power = 0. Correct me if I am wrong here.
There is no ideal idle-power handling; that is why it is mentioned in the Open Questions section.
We could either take the idle power from configuration, or let Kepler detect it by letting the GPU go idle for some time. There is no third way.
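To make the two options concrete, a rough sketch (hypothetical names, not Kepler code) of a configured-or-detected idle estimator, where detection takes the minimum power observed over a sliding window:

```go
package gpupower

import "math"

// IdleEstimator returns a configured idle power if set, otherwise the
// minimum power observed over a sliding window of recent readings.
type IdleEstimator struct {
	ConfiguredW float64   // from config; 0 means "detect"
	maxSamples  int
	window      []float64 // recent readings in watts
}

func NewIdleEstimator(configuredW float64, maxSamples int) *IdleEstimator {
	return &IdleEstimator{ConfiguredW: configuredW, maxSamples: maxSamples}
}

// Observe records a power sample, evicting the oldest once the window is full.
func (e *IdleEstimator) Observe(powerW float64) {
	e.window = append(e.window, powerW)
	if len(e.window) > e.maxSamples {
		e.window = e.window[1:]
	}
}

// IdlePower returns the current idle-power estimate.
func (e *IdleEstimator) IdlePower() float64 {
	if e.ConfiguredW > 0 {
		return e.ConfiguredW
	}
	min := math.Inf(1)
	for _, p := range e.window {
		min = math.Min(min, p)
	}
	if math.IsInf(min, 1) {
		return 0 // no samples yet
	}
	// Caveat raised in this thread: if the GPU never goes idle within
	// the window, this floor approaches total power and the computed
	// active power collapses toward zero.
	return min
}
```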
Force-pushed from 91f334e to 9a0f3ca
sunya-ch left a comment
/lgtm
Please fix the markdownlint issues.
Force-pushed from 9a0f3ca to dbe7fea
Add comprehensive design document for GPU power monitoring feature:

- Vendor-agnostic architecture with pluggable backends
- Per-process energy attribution based on compute utilization (SM%)
- Kubernetes GPU allocation mapping via kubelet pod-resources API
- Support for GPU sharing modes (exclusive, time-slicing, MIG, MPS)
- Detailed metrics specification for node/process/container/pod levels
- Idle power detection and attribution logic

Signed-off-by: Vimal Kumar <[email protected]>
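On the kubelet pod-resources mapping listed above, a minimal sketch of listing which pods hold which GPU device IDs via the standard podresources v1 gRPC API; the socket path is the kubelet default and may differ per deployment, and error handling is reduced to `log.Fatalf`:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresources "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default kubelet pod-resources socket; deployments may mount it elsewhere.
const socketPath = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial kubelet: %v", err)
	}
	defer conn.Close()

	client := podresources.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresources.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("list pod resources: %v", err)
	}
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			for _, d := range c.GetDevices() {
				if d.GetResourceName() != "nvidia.com/gpu" {
					continue
				}
				// Device IDs are GPU (or MIG instance) UUIDs that can
				// be matched against what the power meter reports.
				fmt.Printf("%s/%s %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), c.GetName(), d.GetDeviceIds())
			}
		}
	}
}
```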
Force-pushed from dbe7fea to 722282a
SamYuan1990 left a comment
LGTM
Document the vendor-agnostic GPU power monitoring architecture. Includes Grafana screenshots showing node, per-process, and idle power.