Skip to content

Expose device UUIDs to node label#1116

Closed
xiongzubiao wants to merge 1 commit into
NVIDIA:mainfrom
xiongzubiao:uuid
Closed

Expose device UUIDs to node label#1116
xiongzubiao wants to merge 1 commit into
NVIDIA:mainfrom
xiongzubiao:uuid

Conversation

@xiongzubiao
Copy link
Copy Markdown

Closes #1015

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jan 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread internal/lm/nvml.go Outdated
@xiongzubiao xiongzubiao force-pushed the uuid branch 2 times, most recently from eb8fdec to 2ad1041 Compare January 10, 2025 01:49
@elezar
Copy link
Copy Markdown
Member

elezar commented Jan 10, 2025

@xiongzubiao could you please provide information on how these labels will be used?

@xiongzubiao
Copy link
Copy Markdown
Author

xiongzubiao commented Jan 10, 2025

@xiongzubiao could you please provide information on how these labels will be used?

@elezar, we want to provide some sort of visualization to user. User can click each GPU to check its properties, status, and metrics. The device UUID is the natural choice for indexing. There are other ways to get UUID, but it is most straightforward to get it from node labels, because it is a part of node properties.

There is another use case mentioned in #1015: scheduling pod to a specific GPU using node label matching.

Signed-off-by: Zubiao Xiong <zubiao.xiong@memverge.com>
@ramazanyich
Copy link
Copy Markdown

would like to have such a label in a scope of our project too. Where we need to see the GPU UUISs linked to node. So we can use that info in our workflows.

func NewDeviceMock(migEnabled bool) *DeviceMock {
d := DeviceMock{resource.DeviceMock{
GetNameFunc: func() (string, error) { return "MOCKMODEL", nil },
GetUUIDFunc: func() (string, error) { return "MOCKUUID", nil },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a negative test case too maybe. e.g. when we couldn't get the uuid.

Copy link
Copy Markdown
Contributor

@rajatchopra rajatchopra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@rajatchopra
Copy link
Copy Markdown
Contributor

While the PR is appropriate and the implementation is good, we are not sure if we want to add extra labels unless necessary.
The use case of metrics lookup becoming easier isn't very convincing because GFD running as one-shot in replaceable VMs (as happens with GKE node pool break-fix) may result in stale UUIDs.
The right source of truth should be dcgm-exporter.

Closing the PR. Please re-open if anyone disagrees.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add gpu uuids to node labels

5 participants