@gabrielcocenza gabrielcocenza commented Sep 19, 2025

Implements DCGM v4 support and the possible values of the dcgm snap channel configuration:

Automatic detection of the dcgm version when the channel is set to auto

  • If the driver is cuda10 compatible, run v3/stable and log that an upgrade to a newer driver version is necessary
  • If the unit is on v3 and the driver is cuda11 or cuda12 compatible, block the unit and ask to use v4
  • If the driver is cuda11 compatible and the unit is on v4, warn users to upgrade to a driver version compatible with cuda12

V3

  • If the driver is cuda13 compatible, block the unit and ask for manual intervention
  • If the driver is cuda10 compatible, run v3/stable and log that an upgrade to a newer driver version is necessary
  • If the driver is cuda11 compatible and the unit is on v3, warn users to upgrade to a driver version compatible with cuda12
  • If the unit is on v4, block the unit and ask for manual intervention

V4

  • Behavior is similar to auto, except that the unit is blocked if a cuda10 driver is detected
  • If the driver is cuda11 compatible and the unit is on v4, warn users to upgrade to a driver version compatible with cuda12
  • If the unit is on v3, block the unit and ask to use the v4 channel
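
The rules above can be read as a small decision table. The following is a minimal sketch of that table, not the charm's actual code: `Decision`, `decide`, and the message strings are hypothetical names chosen for illustration, and the driver is reduced to its compatible CUDA major version. It follows the bullets literally; combinations the bullets do not mention are assumed to proceed without a message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    """Outcome of the channel check (hypothetical type, not from the charm)."""

    track: Optional[str]  # "v3" or "v4"; None when the unit is blocked
    blocked: bool
    note: str = ""  # log or status message for the operator


UPGRADE_DRIVER = "upgrade to a newer driver version"
UPGRADE_TO_CUDA12 = "upgrade to a driver version compatible with cuda12"


def decide(config: str, cuda_major: int, installed: Optional[str] = None) -> Decision:
    """Map (config value, driver CUDA major, installed track) to a decision."""
    if config == "v3":
        if installed == "v4":
            return Decision(None, True, "unit is on v4: manual intervention required")
        if cuda_major >= 13:
            return Decision(None, True, "cuda13 driver: manual intervention required")
        if cuda_major == 10:
            return Decision("v3", False, UPGRADE_DRIVER)
        if cuda_major == 11:
            return Decision("v3", False, UPGRADE_TO_CUDA12)
        return Decision("v3", False)  # assumption: other drivers run v3 silently

    if config in ("auto", "v4"):
        if cuda_major == 10:
            if config == "v4":
                return Decision(None, True, "cuda10 driver detected on the v4 channel")
            return Decision("v3", False, UPGRADE_DRIVER)
        # auto blocks a v3 unit only for cuda11/cuda12; v4 blocks any v3 unit.
        if installed == "v3" and (config == "v4" or cuda_major in (11, 12)):
            return Decision(None, True, "use the v4 channel")
        if cuda_major == 11:
            return Decision("v4", False, UPGRADE_TO_CUDA12)
        return Decision("v4", False)  # assumption: other drivers run v4 silently

    raise ValueError(f"unknown channel config: {config!r}")
```

For example, `decide("auto", 10)` selects the v3 track with an upgrade note, while `decide("v4", 10)` returns a blocked decision.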

General changes

  • Added a retry to the check in the snap exporter. When the check runs shortly after a refresh, services might not be ready yet, which gives a false status that the exporter is failing
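
A retry of this kind could look roughly like the sketch below. It is only an illustration under assumptions: `check_with_retry`, its parameters, and the default values are hypothetical, and the actual change may use a different mechanism (for example a retry library).

```python
import time
from typing import Callable


def check_with_retry(check: Callable[[], bool], attempts: int = 3, delay: float = 2.0) -> bool:
    """Return True as soon as `check` passes; False after all attempts fail."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            # Give freshly refreshed services time to come up before re-checking.
            time.sleep(delay)
    return False
```

The health check is passed in as a callable, so the exporter is reported as failing only after every attempt has failed rather than on the first check after a refresh.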

With this change, the default value is set to auto. As a result, existing deployments using v3 will be updated to the corresponding version on v4

Note: The auto logic installs from edge because the v4 tracks do not have a stable channel yet

Implementation of: #465

- check for compatibility of v3 and v4
@gabrielcocenza gabrielcocenza marked this pull request as draft September 29, 2025 22:51

@jneo8 jneo8 left a comment

Thanks for working on this.

Would like to address the life-cycle issue/question first if possible.

@gabrielcocenza gabrielcocenza self-assigned this Sep 30, 2025
@gabrielcocenza gabrielcocenza marked this pull request as ready for review September 30, 2025 20:38
jneo8 previously approved these changes Oct 1, 2025
- add comments
- adapt unit tests
chanchiwai-ray previously approved these changes Oct 2, 2025

@chanchiwai-ray chanchiwai-ray left a comment

👍🏼

  dcgm_v3_compatible
- automatic channel selection just for auto

@Pjack Pjack left a comment

LGTM

@gabrielcocenza gabrielcocenza merged commit 089075a into canonical:main Oct 3, 2025
18 checks passed
4 participants