Add env var DP_ENABLE_HEALTHCHECKS #1335
Changes from 2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -32,6 +32,7 @@ const (` |
| 32 | 32 | `// disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that` |
| 33 | 33 | `// this is in addition to the Application errors that are already ignored.` |
| 34 | 34 | `envDisableHealthChecks = "DP_DISABLE_HEALTHCHECKS"` |
| | 35 | `+ envEnableHealthChecks = "DP_ENABLE_HEALTHCHECKS"` |
| 35 | 36 | `allHealthChecks = "xids"` |
| 36 | 37 | `)` |
| | | `@@ -45,6 +46,8 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic` |
| 45 | 46 | `return nil` |
| 46 | 47 | `}` |
| | 49 | `+ enableHealthChecks := strings.ToLower(os.Getenv(envEnableHealthChecks))` |
| 48 | 51 | `ret := r.nvml.Init()` |
| 49 | 52 | `if ret != nvml.SUCCESS {` |
| 50 | 53 | `if *r.config.Flags.FailOnInitError {` |
| | | `@@ -80,6 +83,12 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic` |
| 80 | 83 | `skippedXids[additionalXid] = true` |
| 81 | 84 | `}` |
| | 86 | `+ for _, additionalXid := range getAdditionalXids(enableHealthChecks) {` |
| | 87 | `+ delete(skippedXids, additionalXid)` |
| | 88 | `+ }` |
| | 90 | `+ klog.Infof("Health checks are disabled for xids: %v", skippedXids)` |
| 83 | 92 | `eventSet, ret := r.nvml.EventSetCreate()` |
| 84 | 93 | `if ret != nvml.SUCCESS {` |
| 85 | 94 | `return fmt.Errorf("failed to create event set: %v", ret)` |
Can we rename this `DP_FORCE_HEALTHCHECKS` or `CRITICAL_XIDS` or `FATAL_XIDS` instead? (I would prefer the latter, but I could hear arguments for alternatives.)

`DP_DISABLE_HEALTHCHECKS` is the exact opposite of `DP_ENABLE_HEALTHCHECKS` and IMO the naming should reflect this. So I'd favour renaming `DP_ENABLE_HEALTHCHECKS` to `DP_FORCE_ENABLE_HEALTHCHECKS`; that way you get the "force", but it's still very clear it's the exact opposite of `DP_DISABLE_HEALTHCHECKS`.

One question: What is the order of precedence of these? If `DP_DISABLE_HEALTHCHECKS=all|xids`, does adding a specific XID to `DP_FORCE_ENABLE_HEALTHCHECKS` take precedence over this?

Then a comment on the naming. I don't think this envvar is the opposite of the existing one. While the existing one removes the predefined XIDs that we treat as fatal, this one adds ADDITIONAL ones.

On balance I'd make `DP_ADDITIONAL_HEALTHCHECKS`/`DP_FORCE_ENABLE_HEALTHCHECKS` support `all`, and have it override `DP_DISABLE_HEALTHCHECKS`. This matters more if we choose a name with `FORCE` in it, but is probably worth doing regardless of name. This would complicate the code a little, but would make the externally visible behavior what a reasonable user would expect.
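
To make the precedence question concrete, here is one possible reading of the "enable overrides disable" proposal, written as a standalone sketch rather than as a patch. Everything in it is an assumption: `xidPolicy`, `newXidPolicy`, `isChecked`, `parseXidSet`, and `isAll` are hypothetical names, and treating both `all` and `xids` as "everything" on the enable side is a guess based on the values the disable variable accepts.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xidPolicy is a hypothetical type (not part of the plugin) that records
// what each environment variable asked for.
type xidPolicy struct {
	disableAll bool            // disable variable set to "all" or "xids"
	enableAll  bool            // enable variable set to "all" or "xids"
	skipped    map[uint64]bool // defaults plus Xids listed in the disable variable
	forced     map[uint64]bool // Xids listed in the enable variable
}

// parseXidSet collects the numeric entries of a comma-separated list.
func parseXidSet(input string) map[uint64]bool {
	set := make(map[uint64]bool)
	for _, field := range strings.Split(input, ",") {
		xid, err := strconv.ParseUint(strings.TrimSpace(field), 10, 64)
		if err != nil {
			continue // skips "all", "xids", empty fields, and typos
		}
		set[xid] = true
	}
	return set
}

// isAll reports whether the value requests every health check.
func isAll(value string) bool {
	v := strings.ToLower(strings.TrimSpace(value))
	return v == "all" || v == "xids"
}

func newXidPolicy(defaults []uint64, disable, enable string) xidPolicy {
	p := xidPolicy{
		disableAll: isAll(disable),
		enableAll:  isAll(enable),
		skipped:    parseXidSet(disable),
		forced:     parseXidSet(enable),
	}
	for _, xid := range defaults {
		p.skipped[xid] = true
	}
	return p
}

// isChecked applies the proposed precedence: the enable variable always wins.
func (p xidPolicy) isChecked(xid uint64) bool {
	if p.enableAll || p.forced[xid] {
		return true
	}
	if p.disableAll {
		return false
	}
	return !p.skipped[xid]
}

func main() {
	// Health checks are disabled globally, but Xid 31 is forced back on.
	p := newXidPolicy([]uint64{13, 31}, "all", "31")
	fmt.Println(p.isChecked(31), p.isChecked(48)) // prints "true false"
}
```

Under this reading the early return for `DP_DISABLE_HEALTHCHECKS=all` could no longer be unconditional, which is the extra complexity the comment above alludes to.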