Skip to content
Closed
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions internal/rm/health.go
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ const (
// disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that
// this is in addition to the Application errors that are already ignored.
envDisableHealthChecks = "DP_DISABLE_HEALTHCHECKS"
envEnableHealthChecks = "DP_ENABLE_HEALTHCHECKS"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this DP_FORCE_HEALTHCHECKS or CRITICAL_XIDS or FATAL_XIDS instead? (I would prefer the latter, but I could hear arguments for alternatives.

Copy link
Contributor Author

@robertdavidsmith robertdavidsmith Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DP_DISABLE_HEALTHCHECKS is the exact opposite of DP_ENABLE_HEALTHCHECKS and IMO the naming should reflect this. So I'd favour renaming DP_ENABLE_HEALTHCHECKS to DP_FORCE_ENABLE_HEALTHCHECKS, that way you get the "force" but it's still very clear it's the exact opposite of DP_DISABLE_HEALTHCHECKS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: What is the order of precedence of these? If DP_DISABLE_HEALTHCHECKS=all|xids does adding a specific XID to DP_FORCE_ENABLE_HEALTHCHECKS take precedence over this?

Then a comment on the naming. I don't think this envvar is the oposite of the existing one. While the existing one removes the predefined XIDs that we treat as fatal, this one adds ADDITIONAL ones.

Copy link
Contributor Author

@robertdavidsmith robertdavidsmith Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On balance I'd make DP_ADDITIONAL_HEALTHCHECKS /DP_FORCE_ENABLE_HEALTHCHECKS support all, and have it override DP_DISABLE_HEALTHCHECKS. This matters more if we choose a name with FORCE in it, but is probably worth doing regardless of name. This would complicate the code a little, but would make the externally visible behavior to be what a reasonable user would expect.

allHealthChecks = "xids"
)

Expand All @@ -45,6 +46,8 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
return nil
}

enableHealthChecks := strings.ToLower(os.Getenv(envEnableHealthChecks))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to wehre we actually use the value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair comment, though if we add support for "all" this all changes anyway.


ret := r.nvml.Init()
if ret != nvml.SUCCESS {
if *r.config.Flags.FailOnInitError {
Expand Down Expand Up @@ -80,6 +83,12 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic
skippedXids[additionalXid] = true
}

for _, additionalXid := range getAdditionalXids(enableHealthChecks) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to allow something like all to be specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's reasonable, I'm happy to do this

delete(skippedXids, additionalXid)
}

klog.Infof("Health checks are disabled for xids: %v", skippedXids)

eventSet, ret := r.nvml.EventSetCreate()
if ret != nvml.SUCCESS {
return fmt.Errorf("failed to create event set: %v", ret)
Expand Down