
Use SSA for conflict-free status.nodes list updates #821

@jgehrcke

Description


Context: #816, #723

Quoting a high-level idea from #732:

I think there are many different dimensions in solution space that we can explore. [...] And/or server-side applies: https://kubernetes.io/docs/reference/using-api/server-side-apply/ (I think that has a lot of potential)

Today, I started looking into how exactly we could leverage server-side apply (SSA) to achieve conflict-free patches from many writers against the nodes list in a ComputeDomain object -- without having to update the CD schema (because we also need a solution for 25.8.x, and here I think we really want to try hard to get away without changing the CRD).

Generally, SSA is explicitly designed so that controllers can avoid read-before-write and also avoid specifying resourceVersion. But for leveraging SSA properly, I thought we would probably also have to update the CRD to make it more "SSA-compatible".

After looking at the details, I got my hopes up that we may actually be able to leverage SSA without having to change the current CD CRD schema.

One key insight is: SSA allows different owners to own different items in a list when the list type is set to map. Turns out, we already do this:

https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/0cefba8118b94195ecb0f15f2e6251b1206eebd3/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml#L146C1-L147C23

Specifically, the status.nodes list is already configured as x-kubernetes-list-type: map. That precisely means that different managers can manage different entries of that list.

We use x-kubernetes-list-map-keys: name, which means that SSA can merge by node name -- so each writer can really own its own node-specific item in the list.

What do we gain? Conflict-free updates, basically. That would be huge.
What does this cost?

  • Each owner/manager creates a little bit of overhead (at the very least, field ownership is recorded in the object's managedFields metadata -- not sure yet how impactful that is)
  • The unknowns -- could any distributed-system bugs creep in with this strategy? What if a manager/owner is lost? ... (I think we're good, but this may be important to think through anyway)

It seems like we can decide between:

  • have as many managers as unique clique IDs (and allow for smaller-scale conflict resolution as before)
  • have as many managers as nodes (and prevent any conflicts)

I haven't tried this out yet -- but it seems like code changes can actually be pretty simple. The gist of it:

1) create node-specific patch

A patch would really just specify an individual nodes list item, and it doesn't need to specify a resource version. So, a patch would only contain the new payload -- something like

{
  "status": {
    "nodes": [{
      "name": "node-1",
      "cliqueID": "clique-1", 
      "ipAddress": "1.1.1.1",
      "index": 0,
      "status": "Ready"
    }]
  }
}

The magic can happen through "name": "node-1" -- only one owner may exist for a list item with this k/v pair, and that owner may update this item at will.

Notably, to create that patch we probably do not need to re-fetch the most recent version of the object.
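To make this concrete, here is a rough, untested Go sketch of what building such a patch could look like (the package and function names are hypothetical, and the apiVersion is an assumption that would need to match the served version of the installed CRD):

package cdstatus

import "encoding/json"

// buildNodeStatusPatch constructs the SSA apply configuration for a single
// status.nodes entry. The payload carries apiVersion and kind, but no
// resourceVersion -- no read-before-write required.
func buildNodeStatusPatch(cdName, nodeName, cliqueID, ip string, index int) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"apiVersion": "resource.nvidia.com/v1beta1", // assumption: adjust to the CRD's served version
		"kind":       "ComputeDomain",
		"metadata": map[string]interface{}{
			"name": cdName,
		},
		"status": map[string]interface{}{
			"nodes": []interface{}{
				map[string]interface{}{
					"name":      nodeName, // the list-map key: SSA merges entries by this field
					"cliqueID":  cliqueID,
					"ipAddress": ip,
					"index":     index,
					"status":    "Ready",
				},
			},
		},
	})
}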

2) PATCH object with specific fieldManager

Let's see about the specific code, but it should be a small change: we'd use Patch() instead of Update().
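Continuing the sketch from step 1, a rough, untested outline of what that apply call could look like with the dynamic client (the GVR version, the per-node fieldManager name, and the assumption that the CRD enables the status subresource are placeholders to be adapted):

package cdstatus

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

var computeDomainGVR = schema.GroupVersionResource{
	Group:    "resource.nvidia.com",
	Version:  "v1beta1", // assumption: adjust to the CRD's served version
	Resource: "computedomains",
}

// applyNodeStatus sends the apply patch from step 1 against the status
// subresource. No prior GET and no resourceVersion: ownership is resolved via
// the fieldManager and the name list-map key.
func applyNodeStatus(ctx context.Context, cfg *rest.Config, namespace, cdName, nodeName string, patch []byte) error {
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	force := true // conventional for SSA controllers; with per-node entries, genuine conflicts should not occur
	_, err = client.Resource(computeDomainGVR).Namespace(namespace).Patch(
		ctx,
		cdName,
		types.ApplyPatchType,
		patch,
		metav1.PatchOptions{
			FieldManager: "compute-domain-node-writer-" + nodeName, // hypothetical per-node manager name
			Force:        &force,
		},
		"status", // assumes the CRD enables the status subresource
	)
	return err
}

With a per-node fieldManager like this, each writer only ever owns its own status.nodes entry, so concurrent writers should no longer trip over resourceVersion conflicts.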

Resources:
