Use dcgm v4 on HWO #466
- automatic detection to use the right channel for dcgm based on driver version
- check if dcgm channel is valid
- check for compatibility of v3 and v4
Thanks for working on this.
Would like to address the life-cycle issue/question first if possible.
- increase coverage
the driver is not installed or loaded
- add comments
- adapt unit tests
👍🏼
dcgm_v3_compatible - automatic channel selection just for auto
LGTM
Implementation of v4 and the possible dcgm snap channel configuration:
- Automatic detection of dcgm when using `auto`
- `V3`
- `V4`
- `auto`, however it would block a unit if a cuda10 driver is detected

General changes
- `check` in snap exporter. Discovered that sometimes, when refreshing and checking within a short period of time, services might not be ready yet, which gives a false status that the exporter is failing

With this change the default value will be set to `auto`. With that, existing deployments using v3 will be updated to the corresponding version on v4.

Note: The auto logic installs from `edge` because there isn't a `stable` channel yet for the v4 tracks.

Implementation of: #465
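
For readers skimming the thread, the `auto` behaviour described above could look roughly like the sketch below. It is not the code from this PR: the function names, the channel strings (`v4-cuda11/edge`, `v4-cuda12/edge`) and the driver-version thresholds used to infer the CUDA major version are all assumptions made for illustration. Only the overall behaviour (pick a v4 channel from the driver version, install from `edge`, block on a cuda10 driver, leave explicit channels alone) comes from the description.

```python
class UnsupportedDriverError(Exception):
    """Raised when `auto` cannot map the detected driver to a v4 channel."""


# Hypothetical channel names; the real snap tracks may be named differently.
# `edge` is used because the description says v4 tracks have no `stable` yet.
V4_CHANNELS = {11: "v4-cuda11/edge", 12: "v4-cuda12/edge"}


def cuda_major_for_driver(driver_major: int) -> int:
    """Infer the newest CUDA major version a driver supports.

    The thresholds are assumptions for this sketch, not values from the PR.
    """
    if driver_major >= 525:
        return 12
    if driver_major >= 450:
        return 11
    return 10


def resolve_dcgm_channel(configured: str, driver_version: str) -> str:
    """Return the snap channel to install for the given config value."""
    # An explicitly configured channel is passed through unchanged;
    # automatic selection applies only to `auto`.
    if configured != "auto":
        return configured

    driver_major = int(driver_version.split(".")[0])
    cuda_major = cuda_major_for_driver(driver_major)
    if cuda_major not in V4_CHANNELS:
        # Mirrors "it would block a unit if a cuda10 driver is detected".
        raise UnsupportedDriverError(
            f"driver {driver_version} maps to cuda{cuda_major}; "
            "no dcgm v4 channel available"
        )
    return V4_CHANNELS[cuda_major]
```

In a charm, the caller would presumably catch `UnsupportedDriverError` and set a blocked status on the unit rather than let the exception propagate.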