Description
We want to manage a scenario where users request the attachment of resources for specific nodes. We start from a state where no resources are attached to any of the nodes in the cluster.
Scenario
The following events take place in chronological order.
Events:
- I am user A and I want to attach 2 A100 GPUs on node1
- I am user A and I want to attach 2 V100 GPUs on node2
- I am user B and I want to attach 2 V100 GPUs on node3
- I am user B and I want to attach 2 additional A100 GPUs on node1
- I am user A and I want to attach 2 additional A100 GPUs on node1
In this scenario, users specify the node to which the resources are attached, which adds constraints compared with the case where ComposabilityRequests are defined without target_node.
Potential Approaches
For the first event, user A creates a ComposabilityRequest as follows:
CR1:
type: gpu
size: 2
model: A100
target_node: node1
The request is handled by the operator, which asks the CDI Manager to attach 2 A100 GPUs to node1.
For the second event, user A creates a ComposabilityRequest as follows:
CR2:
type: gpu
size: 2
model: V100
target_node: node2
The request is handled like the first one as we do not have any conflicts.
For the third event, user B creates a ComposabilityRequest as follows:
CR3:
type: gpu
size: 2
model: V100
target_node: node3
As before, the request can be handled without any issues because no GPUs are attached to node3.
For event 4, things start to get interesting, and we need to decide how to handle the situation. User B wants to attach 2 A100 GPUs to node1, but a request for A100 GPUs on node1 already exists (CR1).
In my opinion, we have two options:
Option 1:
- we prevent the user from creating an additional CR,
- we return an error message stating that a CR requesting the attachment of A100 GPUs on node1 already exists (CR1),
- user B has to update CR1, changing size from 2 to 4.
Considerations:
- user B might not have permission to view or edit CR1
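As a rough illustration of Option 1, the check below rejects a new request when another one already covers the same type, model, and target_node. It is only a sketch: `Option1Admission`, `DuplicateRequestError`, and the in-memory dictionary are hypothetical names invented for this example and are not part of the operator.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ComposabilityRequest:
    name: str
    type: str
    size: int
    model: str
    target_node: str

    def key(self):
        # Option 1 treats (type, model, target_node) as the uniqueness key.
        return (self.type, self.model, self.target_node)


class DuplicateRequestError(Exception):
    """Raised when a CR with the same key already exists (hypothetical)."""


class Option1Admission:
    """Hypothetical admission check: at most one CR per key."""

    def __init__(self):
        self._by_key = {}

    def create(self, cr):
        existing = self._by_key.get(cr.key())
        if existing is not None:
            raise DuplicateRequestError(
                f"{existing.name} already requests {cr.model} on "
                f"{cr.target_node}; update its size instead"
            )
        self._by_key[cr.key()] = cr
        return cr

    def update_size(self, key, new_size):
        # The only way to get more devices is to resize the existing CR.
        updated = replace(self._by_key[key], size=new_size)
        self._by_key[key] = updated
        return updated
```

With this check, event 4 fails with an error that names CR1, and the extra GPUs can only be obtained through `update_size`, which is exactly where the permission concern above applies.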
Option 2:
- user B creates CR4 as shown below,
- the operator forwards the request to the CDI Manager,
- if CR4 or CR1 is deleted, the operator can detach the correct GPUs, provided it holds a reference to the physical device or an identifier passed by the CDI Manager. (Such information is already available in the new design proposed by @hase1128 and lei zhangl)
CR4:
type: gpu
size: 2
model: A100
target_node: node1
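The bookkeeping that Option 2 relies on can be sketched as follows: the operator records, per CR, the device identifiers returned on attachment, and detaches only those when the CR is deleted. `FakeCDIManager`, its `attach`/`detach` methods, and the `gpu-N` identifiers are assumptions made up for this example; the real CDI Manager interface is not described here.

```python
import itertools


class FakeCDIManager:
    """Stand-in for the CDI Manager (hypothetical interface)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.attached = {}  # device_id -> (node, model)

    def attach(self, node, model, count):
        # Return one identifier per attached device.
        ids = []
        for _ in range(count):
            device_id = f"gpu-{next(self._ids)}"
            self.attached[device_id] = (node, model)
            ids.append(device_id)
        return ids

    def detach(self, device_id):
        del self.attached[device_id]


class Option2Operator:
    """Hypothetical operator that allows several CRs per node and model."""

    def __init__(self, cdi):
        self.cdi = cdi
        self.devices_by_cr = {}  # CR name -> device ids owned by that CR

    def create(self, name, model, target_node, size):
        self.devices_by_cr[name] = self.cdi.attach(target_node, model, size)

    def delete(self, name):
        # Detach only the devices that were attached for this CR.
        for device_id in self.devices_by_cr.pop(name):
            self.cdi.detach(device_id)


cdi = FakeCDIManager()
operator = Option2Operator(cdi)
operator.create("CR1", "A100", "node1", 2)
operator.create("CR4", "A100", "node1", 2)
operator.delete("CR4")  # CR1's two GPUs stay attached to node1
```

Without the per-CR device list, the operator would only know that node1 should lose two A100 GPUs, not which two.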
Fundamentally, the difference between the two options is how tightly we want to constrain the proliferation of ComposabilityRequest objects. With Option 1, users cannot create more than one CR with the same type, model, and target_node.
Event 5 should be handled using the same approach selected for event 4.
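To compare the end state after all five events, the sketch below groups the events by (type, model, target_node). It assumes that under Option 1 each later request for an existing key becomes a size update on the existing CR, as described for event 4, and that under Option 2 every event yields its own CR. The `EVENTS` list and the helper name are illustrative only.

```python
from collections import defaultdict

# The five events from the scenario: (user, type, model, target_node, size).
EVENTS = [
    ("A", "gpu", "A100", "node1", 2),
    ("A", "gpu", "V100", "node2", 2),
    ("B", "gpu", "V100", "node3", 2),
    ("B", "gpu", "A100", "node1", 2),
    ("A", "gpu", "A100", "node1", 2),
]


def group_by_key(events):
    # Collect the requested sizes for each (type, model, target_node) key.
    groups = defaultdict(list)
    for _user, type_, model, node, size in events:
        groups[(type_, model, node)].append(size)
    return dict(groups)


groups = group_by_key(EVENTS)

# Option 1: one CR per key, whose size is the sum of all requests for it.
option1_sizes = {key: sum(sizes) for key, sizes in groups.items()}

# Option 2: one CR per event, so a key may be backed by several CRs.
option2_cr_counts = {key: len(sizes) for key, sizes in groups.items()}
```

Under these assumptions, Option 1 ends with three CRs, one of which (A100 on node1) has size 6 and is shared by users A and B, while Option 2 ends with five CRs of size 2, three of them targeting A100 on node1.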