
CD daemon: make it easier to see when CDI edits did not apply #720

@jgehrcke

Saw the following CD daemon log excerpt in a user's environment:

Flags: (*main.Flags)({
  cliqueID: (string) "",
  computeDomainUUID: (string) "",
  computeDomainName: (string) "",
  computeDomainNamespace: (string) (len=7) "default",
...
})
I1113 01:08:57.594698       1 main.go:210] no cliqueID: register with ComputeDomain, but do not run IMEX daemon
Error: writeIMEXConfig failed: error parsing template file: open /imexd/imexd.cfg.tmpl: no such file or directory

That seems to be consistent with CDI not being enabled. When CDI is not enabled, the following container edits do not apply:
https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/55fc7b0921da59c6b5ac9c843372baffa0b1eeb7/cmd/compute-domain-kubelet-plugin/computedomain.go#L172

Also relevant: https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/55fc7b0921da59c6b5ac9c843372baffa0b1eeb7/cmd/compute-domain-daemon/main.go#L118

We should make it easier for users to detect a problem with CDI (when it's not enabled, or when it's not working), so that they have a chance to fix it themselves without having to reach out for support.

For starters, I propose building a pragmatic CDI feature check into the system: we could set an environment variable (a hard-coded key/value pair) in GetCDIContainerEditsCommon(). If that environment variable isn't set in the CD daemon, we can crash-loop the daemon with an error message like "CDI container edits did not apply -- is CDI enabled in your system?"
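
A minimal sketch of what such a check could look like, using only the Go standard library. The variable name CDI_EDITS_APPLIED, its value, and the helper names cdiSentinelEnv and checkCDIEditsApplied are hypothetical; the issue only asks for some hard-coded key/value pair. The actual wiring into GetCDIContainerEditsCommon() and into the daemon's startup path is not shown and would need to follow the existing code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel key/value pair. The issue does not name one.
const (
	cdiSentinelKey   = "CDI_EDITS_APPLIED"
	cdiSentinelValue = "true"
)

// Error text taken from the proposal in the issue.
var errCDIEditsNotApplied = errors.New(
	"CDI container edits did not apply -- is CDI enabled in your system?")

// cdiSentinelEnv returns the KEY=VALUE entry that the kubelet plugin
// could add to the environment list of the common container edits.
func cdiSentinelEnv() string {
	return cdiSentinelKey + "=" + cdiSentinelValue
}

// checkCDIEditsApplied reports an error when the sentinel is missing or
// has an unexpected value. The lookup function is injected so the check
// can be exercised without touching the process environment; in the
// daemon it would presumably be os.LookupEnv.
func checkCDIEditsApplied(lookup func(string) (string, bool)) error {
	v, ok := lookup(cdiSentinelKey)
	if !ok || v != cdiSentinelValue {
		return errCDIEditsNotApplied
	}
	return nil
}

// lookupIn adapts a map to the lookup signature, for demonstration only.
func lookupIn(env map[string]string) func(string) (string, bool) {
	return func(key string) (string, bool) {
		v, ok := env[key]
		return v, ok
	}
}

func main() {
	fmt.Println("entry the plugin would inject:", cdiSentinelEnv())

	// Simulated container environments with and without the CDI edits.
	withCDI := map[string]string{cdiSentinelKey: cdiSentinelValue}
	withoutCDI := map[string]string{}

	fmt.Println("with CDI edits:   ", checkCDIEditsApplied(lookupIn(withCDI)))
	fmt.Println("without CDI edits:", checkCDIEditsApplied(lookupIn(withoutCDI)))
}
```

In the daemon, a non-nil result would be logged and followed by a non-zero exit early in startup, so the container crash-loops with this message rather than with the less obvious missing-template error from the log above.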

Metadata

Labels

debuggability: issue/pr related to the ability to debug the system

Projects

Status

Done
