Support mode with long-running, cluster-wide deployment of ComputeDomain daemons #920

@klueska

Description

Right now, nvidia-imex daemons live and die with the ComputeDomain that spawns them. When a ComputeDomain is created, the controller provisions a per-CD DaemonSet running compute-domain-daemon pods on each labeled node, and those pods go away when the CD is deleted. This works, but it means every new workload pays a cold-start cost to bootstrap the IMEX domain, and any disruption to the ComputeDomain object or its DaemonSet tears the whole thing down and forces a rebuild.

We want to support a mode where IMEX daemons run as a permanent, cluster-wide DaemonSet deployed at driver install time alongside the GPU kubelet plugin, running continuously on all NVLink-capable nodes regardless of whether any workload is active. In this mode a new ComputeDomain would attach to already-running daemons rather than spinning up fresh ones, eliminating the cold-start overhead and decoupling daemon uptime from the lifecycle of workload resource objects.

The core architectural change is separating the IMEX daemon lifecycle from the ComputeDomain lifecycle entirely. Rather than scoping daemon identity to a per-workload CD, we'd ground it in hardware topology — NVLink clique membership, which is stable. The ComputeDomainCliques infrastructure we've already built is the natural anchor here: cliques are fixed hardware groupings, and a persistent daemon-per-clique-per-node is a coherent model.
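To make the lifecycle separation concrete, here is a minimal Go sketch of what grounding daemon identity in hardware topology could look like. This is purely illustrative: `DaemonKey` and `KeyFor` are hypothetical names, not the driver's actual API; the point is that no ComputeDomain identifier participates in the key, so the identity survives CD churn.

```go
package main

import "fmt"

// DaemonKey identifies one persistent IMEX daemon instance. It is derived
// only from stable hardware topology (node + NVLink clique), never from a
// per-workload ComputeDomain UID. (Hypothetical sketch, not the driver's API.)
type DaemonKey struct {
	NodeName string // Kubernetes node the daemon runs on
	CliqueID string // NVLink clique the node's GPUs belong to
}

// KeyFor derives the stable daemon key for a node. Because no ComputeDomain
// identifier is involved, creating or deleting CDs never changes the key.
func KeyFor(nodeName, cliqueID string) DaemonKey {
	return DaemonKey{NodeName: nodeName, CliqueID: cliqueID}
}

func main() {
	k1 := KeyFor("node-a", "clique-0")
	k2 := KeyFor("node-a", "clique-0")
	fmt.Println(k1 == k2) // same hardware topology → same daemon identity
}
```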

That said, there are real design questions to work through. In this mode, multiple ComputeDomain objects on the same nodes would share the same underlying IMEX daemon(s), so the semantics of CD creation and deletion need to change.
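One way to frame the changed semantics is that CD creation and deletion become attach/detach operations against an already-running daemon, tracked by reference counting, rather than daemon start/stop. The sketch below assumes this framing; `SharedDaemon` and its methods are hypothetical names for illustration, not an existing type in the driver.

```go
package main

import "fmt"

// SharedDaemon tracks which ComputeDomains are attached to one persistent
// IMEX daemon. (Hypothetical sketch of shared-daemon semantics.)
type SharedDaemon struct {
	attached map[string]bool // ComputeDomain UIDs currently attached
}

func NewSharedDaemon() *SharedDaemon {
	return &SharedDaemon{attached: map[string]bool{}}
}

// Attach registers a ComputeDomain with the daemon. The daemon is already
// running, so there is no cold-start work to do here.
func (d *SharedDaemon) Attach(cdUID string) { d.attached[cdUID] = true }

// Detach removes a ComputeDomain. The daemon keeps running even when the
// count drops to zero, which is the key lifecycle difference from the
// per-CD DaemonSet model.
func (d *SharedDaemon) Detach(cdUID string) { delete(d.attached, cdUID) }

// Attached reports how many ComputeDomains currently share this daemon.
func (d *SharedDaemon) Attached() int { return len(d.attached) }

func main() {
	d := NewSharedDaemon()
	d.Attach("cd-1")
	d.Attach("cd-2")
	d.Detach("cd-1")
	fmt.Println(d.Attached()) // daemon stays up with one CD still attached
}
```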

Channel allocation is another open question. Today every ComputeDomain uses channel 0, which works because each CD gets its own isolated set of IMEX daemons. In long-running mode, where a single set of daemons serves multiple concurrent ComputeDomain objects, each CD will need its own unique channel number, which means we'll need a cluster-wide mechanism to assign and track them.
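A minimal sketch of what such a cluster-wide channel assignment mechanism might look like, assuming a lowest-free-channel policy: `ChannelAllocator` and the channel range are illustrative assumptions, and a real implementation would persist this state somewhere durable (e.g. in object status) rather than in memory.

```go
package main

import "fmt"

// maxChannels is an assumed upper bound for illustration, not a documented
// IMEX limit.
const maxChannels = 2048

// ChannelAllocator hands out unique IMEX channel numbers to concurrent
// ComputeDomains. (Hypothetical sketch; real state would need to be
// persisted cluster-wide.)
type ChannelAllocator struct {
	byCD  map[string]int // ComputeDomain UID → assigned channel
	inUse map[int]bool   // channels currently allocated
}

func NewChannelAllocator() *ChannelAllocator {
	return &ChannelAllocator{byCD: map[string]int{}, inUse: map[int]bool{}}
}

// Allocate returns the lowest free channel, remembering the assignment so
// repeated calls for the same CD are idempotent.
func (a *ChannelAllocator) Allocate(cdUID string) (int, error) {
	if ch, ok := a.byCD[cdUID]; ok {
		return ch, nil
	}
	for ch := 0; ch < maxChannels; ch++ {
		if !a.inUse[ch] {
			a.byCD[cdUID] = ch
			a.inUse[ch] = true
			return ch, nil
		}
	}
	return -1, fmt.Errorf("no free IMEX channels")
}

// Release frees the channel when its ComputeDomain is deleted.
func (a *ChannelAllocator) Release(cdUID string) {
	if ch, ok := a.byCD[cdUID]; ok {
		delete(a.byCD, cdUID)
		delete(a.inUse, ch)
	}
}

func main() {
	a := NewChannelAllocator()
	c1, _ := a.Allocate("cd-1")
	c2, _ := a.Allocate("cd-2")
	fmt.Println(c1, c2) // two concurrent CDs get distinct channels
}
```

Note that today's behavior (every CD on channel 0) falls out naturally as the degenerate case where only one CD is ever attached at a time.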

The cliqueID assignment and node labeling that currently flow through CDI injection into daemon pods will also need a new home, most likely moved into the permanent DaemonSet configuration directly. This work also ties closely into supporting multiple ComputeDomain objects on the same node simultaneously, which is a related open topic.

With anything we do, we want to stay fully Kubernetes-native, with daemon pods managed by the Kubernetes pod lifecycle. The existing model of supervising nvidia-imex as a containerized child process inside compute-domain-daemon works well and should carry over intact.

Please see the following issues for some of the motivating reasons for introducing this change:

Metadata


Labels

kind/feature: Categorizes issue or PR as related to a new feature.


Status

In Progress
