Support mode with long-running, cluster-wide deployment of ComputeDomain daemons #920

@klueska

Description

Right now, nvidia-imex daemons live and die with the ComputeDomain that spawns them. When a ComputeDomain is created, the controller provisions a per-CD DaemonSet running compute-domain-daemon pods on each labeled node, and those pods go away when the CD is deleted. This works, but it means every new workload pays a cold-start cost to bootstrap the IMEX domain, and any disruption to the ComputeDomain object or its DaemonSet tears the whole thing down and forces a rebuild.

We want to support a mode where IMEX daemons run as a permanent, cluster-wide DaemonSet deployed at driver install time alongside the GPU kubelet plugin, running continuously on all NVLink-capable nodes regardless of whether any workload is active. In this mode a new ComputeDomain would attach to already-running daemons rather than spinning up fresh ones, eliminating the cold-start overhead and decoupling daemon uptime from the lifecycle of workload resource objects.

The core architectural change is separating the IMEX daemon lifecycle from the ComputeDomain lifecycle entirely. Rather than scoping daemon identity to a per-workload CD, we'd ground it in hardware topology — NVLink clique membership, which is stable. The ComputeDomainCliques infrastructure we've already built is the natural anchor here: cliques are fixed hardware groupings, and a persistent daemon-per-clique-per-node is a coherent model.
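To make the lifecycle separation concrete, here is a minimal Go sketch of what grounding daemon identity in hardware topology could look like. This is purely illustrative: `DaemonKey` and `KeyFor` are hypothetical names, not the driver's actual API; the point is that no ComputeDomain identifier participates in the key, so the identity survives CD churn.

```go
package main

import "fmt"

// DaemonKey identifies one persistent IMEX daemon instance. It is derived
// only from stable hardware topology (node + NVLink clique), never from a
// per-workload ComputeDomain UID. (Hypothetical sketch, not the driver's API.)
type DaemonKey struct {
	NodeName string // Kubernetes node the daemon runs on
	CliqueID string // NVLink clique the node's GPUs belong to
}

// KeyFor derives the stable daemon key for a node. Because no ComputeDomain
// identifier is involved, creating or deleting CDs never changes the key.
func KeyFor(nodeName, cliqueID string) DaemonKey {
	return DaemonKey{NodeName: nodeName, CliqueID: cliqueID}
}

func main() {
	k1 := KeyFor("node-a", "clique-0")
	k2 := KeyFor("node-a", "clique-0")
	fmt.Println(k1 == k2) // same hardware topology → same daemon identity
}
```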

That said, there are real design questions to work through. In this mode, multiple ComputeDomain objects on the same nodes would share the same underlying IMEX daemon(s), so the semantics of CD creation and deletion need to change.
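One way to frame the changed semantics is that CD creation and deletion become attach/detach operations against an already-running daemon, tracked by reference counting, rather than daemon start/stop. The sketch below assumes this framing; `SharedDaemon` and its methods are hypothetical names for illustration, not an existing type in the driver.

```go
package main

import "fmt"

// SharedDaemon tracks which ComputeDomains are attached to one persistent
// IMEX daemon. (Hypothetical sketch of shared-daemon semantics.)
type SharedDaemon struct {
	attached map[string]bool // ComputeDomain UIDs currently attached
}

func NewSharedDaemon() *SharedDaemon {
	return &SharedDaemon{attached: map[string]bool{}}
}

// Attach registers a ComputeDomain with the daemon. The daemon is already
// running, so there is no cold-start work to do here.
func (d *SharedDaemon) Attach(cdUID string) { d.attached[cdUID] = true }

// Detach removes a ComputeDomain. The daemon keeps running even when the
// count drops to zero, which is the key lifecycle difference from the
// per-CD DaemonSet model.
func (d *SharedDaemon) Detach(cdUID string) { delete(d.attached, cdUID) }

// Attached reports how many ComputeDomains currently share this daemon.
func (d *SharedDaemon) Attached() int { return len(d.attached) }

func main() {
	d := NewSharedDaemon()
	d.Attach("cd-1")
	d.Attach("cd-2")
	d.Detach("cd-1")
	fmt.Println(d.Attached()) // daemon stays up with one CD still attached
}
```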

Channel allocation is another open question. Today every ComputeDomain uses channel 0, which works because each CD gets its own isolated set of IMEX daemons. In long-running mode, where a single set of daemons serves multiple concurrent ComputeDomain objects, each CD will need its own unique channel number, which means we'll need a cluster-wide mechanism to assign and track them.
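A minimal sketch of what such a cluster-wide channel assignment mechanism might look like, assuming a lowest-free-channel policy: `ChannelAllocator` and the channel range are illustrative assumptions, and a real implementation would persist this state somewhere durable (e.g. in object status) rather than in memory.

```go
package main

import "fmt"

// maxChannels is an assumed upper bound for illustration, not a documented
// IMEX limit.
const maxChannels = 2048

// ChannelAllocator hands out unique IMEX channel numbers to concurrent
// ComputeDomains. (Hypothetical sketch; real state would need to be
// persisted cluster-wide.)
type ChannelAllocator struct {
	byCD  map[string]int // ComputeDomain UID → assigned channel
	inUse map[int]bool   // channels currently allocated
}

func NewChannelAllocator() *ChannelAllocator {
	return &ChannelAllocator{byCD: map[string]int{}, inUse: map[int]bool{}}
}

// Allocate returns the lowest free channel, remembering the assignment so
// repeated calls for the same CD are idempotent.
func (a *ChannelAllocator) Allocate(cdUID string) (int, error) {
	if ch, ok := a.byCD[cdUID]; ok {
		return ch, nil
	}
	for ch := 0; ch < maxChannels; ch++ {
		if !a.inUse[ch] {
			a.byCD[cdUID] = ch
			a.inUse[ch] = true
			return ch, nil
		}
	}
	return -1, fmt.Errorf("no free IMEX channels")
}

// Release frees the channel when its ComputeDomain is deleted.
func (a *ChannelAllocator) Release(cdUID string) {
	if ch, ok := a.byCD[cdUID]; ok {
		delete(a.byCD, cdUID)
		delete(a.inUse, ch)
	}
}

func main() {
	a := NewChannelAllocator()
	c1, _ := a.Allocate("cd-1")
	c2, _ := a.Allocate("cd-2")
	fmt.Println(c1, c2) // two concurrent CDs get distinct channels
}
```

Note that today's behavior (every CD on channel 0) falls out naturally as the degenerate case where only one CD is ever attached at a time.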

The cliqueID assignment and node labeling that currently flow through CDI injection into daemon pods will also need a new home, most likely moved into the permanent DaemonSet configuration directly. This work also ties closely into supporting multiple ComputeDomain objects on the same node simultaneously, which is a related open topic.

With anything we do, we want to stay fully Kubernetes-native, with daemon pods managed by the Kubernetes pod lifecycle. The existing model of supervising nvidia-imex as a containerized child process inside compute-domain-daemon works well and should carry over intact.

Please see the following issues for some of the motivating reasons for introducing this change:

Metadata


Labels

kind/feature: Categorizes issue or PR as related to a new feature.


Status

In Progress
