Saw the following CD daemon log excerpt in a user's environment:
Flags: (*main.Flags)({
cliqueID: (string) "",
computeDomainUUID: (string) "",
computeDomainName: (string) "",
computeDomainNamespace: (string) (len=7) "default",
...
})
I1113 01:08:57.594698 1 main.go:210] no cliqueID: register with ComputeDomain, but do not run IMEX daemon
Error: writeIMEXConfig failed: error parsing template file: open /imexd/imexd.cfg.tmpl: no such file or directory
That seems to be consistent with CDI not being enabled. When CDI is not enabled, the following container edits are not applied:
https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/55fc7b0921da59c6b5ac9c843372baffa0b1eeb7/cmd/compute-domain-kubelet-plugin/computedomain.go#L172
Also relevant: https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/55fc7b0921da59c6b5ac9c843372baffa0b1eeb7/cmd/compute-domain-daemon/main.go#L118
We should make it easier for users to detect when there is a problem with CDI (when it's not enabled, or when it's not working), so that they have a chance to fix the problem themselves without having to reach out for support.
For starters, I propose building a pragmatic CDI feature check into the system: we could set an environment variable (a hard-coded key/value pair) in GetCDIContainerEditsCommon(). If that environment variable isn't set in the CD daemon, we can crash-loop the CD daemon with an error message like "CDI container edits did not apply -- is CDI enabled in your system?"
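A minimal sketch of what that check could look like, to make the proposal concrete. The marker name `CDI_EDITS_APPLIED`, its value, and the helpers `cdiMarkerEnv` and `checkCDIEdits` are hypothetical and do not exist in the driver today. The sketch assumes the kubelet plugin appends the `KEY=VALUE` string to the env list of the container edits it already returns, and that the CD daemon runs the check at startup, before it tries to write the IMEX config.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Hypothetical marker: the proposal only asks for "a hard-coded key/value
// pair", so the concrete name and value here are placeholders.
const (
	cdiMarkerKey   = "CDI_EDITS_APPLIED"
	cdiMarkerValue = "true"
)

var errCDINotApplied = errors.New(
	"CDI container edits did not apply -- is CDI enabled in your system?")

// cdiMarkerEnv returns the KEY=VALUE entry that the kubelet plugin would add
// to the env list of its container edits (plugin side of the check).
func cdiMarkerEnv() string {
	return fmt.Sprintf("%s=%s", cdiMarkerKey, cdiMarkerValue)
}

// checkCDIEdits reports an error when the marker is missing or has an
// unexpected value (daemon side of the check). The lookup function is a
// parameter so the check can be exercised without touching the real
// process environment.
func checkCDIEdits(lookup func(string) (string, bool)) error {
	if v, ok := lookup(cdiMarkerKey); !ok || v != cdiMarkerValue {
		return errCDINotApplied
	}
	return nil
}

func main() {
	// Exiting non-zero lets the pod's restart policy produce the crash loop,
	// with the actionable message showing up in the container log.
	if err := checkCDIEdits(os.LookupEnv); err != nil {
		fmt.Fprintf(os.Stderr, "Error: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("CDI container edits detected")
}
```

Failing early like this would replace the misleading "open /imexd/imexd.cfg.tmpl: no such file or directory" error with one that points directly at CDI.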