Define a policy to handle the creation of multiple ComposabilityRequests targeting the same node #6

@mgazz

We want to manage a scenario where users request the attachment of resources for specific nodes. We start from a state where no resources are attached to any of the nodes in the cluster.

Scenario

The following events take place in chronological order.

Events:

  1. I am user A and I want to attach 2 A100 GPUs on node1

  2. I am user A and I want to attach 2 V100 GPUs on node2

  3. I am user B and I want to attach 2 V100 GPUs on node3

  4. I am user B and I want to attach 2 additional A100 GPUs on node1

  5. I am user A and I want to attach 2 additional A100 GPUs on node1

In this scenario, users specify the node for the resource attachment (target_node). This adds constraints compared to the case where ComposabilityRequests are defined without target_node.
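To make the overlap concrete, the five events can be written as data and grouped by the fields that identify an attachment target. This is only an illustrative sketch: the `Event` class, the `overlapping_events` helper, and the choice of `(type, model, target_node)` as the grouping key are assumptions made for this example, not part of the operator.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One attachment request from the scenario (hypothetical helper type)."""
    number: int
    user: str
    type: str
    model: str
    size: int
    target_node: str


# The five events from the scenario, in chronological order.
EVENTS = [
    Event(1, "A", "gpu", "A100", 2, "node1"),
    Event(2, "A", "gpu", "V100", 2, "node2"),
    Event(3, "B", "gpu", "V100", 2, "node3"),
    Event(4, "B", "gpu", "A100", 2, "node1"),
    Event(5, "A", "gpu", "A100", 2, "node1"),
]


def overlapping_events(events):
    """Group events by (type, model, target_node); keep groups with more than one event."""
    groups = defaultdict(list)
    for event in events:
        groups[(event.type, event.model, event.target_node)].append(event.number)
    return {key: numbers for key, numbers in groups.items() if len(numbers) > 1}


print(overlapping_events(EVENTS))
# {('gpu', 'A100', 'node1'): [1, 4, 5]}
```

Only events 1, 4, and 5 share a key, so they are the ones that need a policy.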

Potential Approaches

For the first event, user A creates a ComposabilityRequest as follows:

CR1:
  type: gpu
  size: 2
  model: A100
  target_node: node1

The request is handled by the operator, which asks the CDI Manager to attach 2 A100 GPUs on node1.

For the second event, user A creates a ComposabilityRequest as follows:

CR2:
  type: gpu
  size: 2
  model: V100
  target_node: node2

The request is handled like the first one because there are no conflicts.

For the third event, user B creates a ComposabilityRequest as follows:

CR3:
  type: gpu 
  size: 2
  model: V100 
  target_node: node3

As before, the request can be handled without any issues because no GPUs are attached on node3.

For event 4, things start to get interesting and we need to decide how to handle the situation. User B wants to attach 2 A100 GPUs on node1, but a request for A100 GPUs on node1 already exists (CR1).

In my opinion, we have two options:

Option 1:

  • We prevent the user from creating an additional CR.
  • We return an error message stating that a CR requesting the attachment of A100 on node1 already exists (CR1).
  • User B has to update CR1 instead, changing size from 2 to 4.

Considerations:

  • User B might not be able to see CR1, or might not have permission to view or edit it.
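Option 1 can be sketched as a uniqueness check at creation time. This is a minimal in-memory illustration under stated assumptions, not the operator's implementation: `ComposabilityRequest` here is a plain dataclass, and `Option1Store`, `DuplicateRequestError`, and the `owner` field are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class ComposabilityRequest:
    """Simplified stand-in for the CR object (fields taken from the examples above)."""
    name: str
    owner: str  # hypothetical: the user who created the CR
    type: str
    model: str
    size: int
    target_node: str

    def key(self):
        # Option 1 allows at most one CR per (type, model, target_node).
        return (self.type, self.model, self.target_node)


class DuplicateRequestError(Exception):
    """Raised when a CR with the same key already exists."""


class Option1Store:
    """In-memory store that enforces the Option 1 uniqueness rule."""

    def __init__(self):
        self._by_key = {}

    def create(self, cr):
        existing = self._by_key.get(cr.key())
        if existing is not None:
            raise DuplicateRequestError(
                f"{existing.name} already requests {cr.model} on {cr.target_node}; "
                f"update its size instead of creating {cr.name}"
            )
        self._by_key[cr.key()] = cr

    def get(self, key):
        return self._by_key[key]

    def resize(self, key, new_size):
        # The only way to get more devices for an existing key.
        self._by_key[key].size = new_size


store = Option1Store()
store.create(ComposabilityRequest("CR1", "A", "gpu", "A100", 2, "node1"))
try:
    store.create(ComposabilityRequest("CR4", "B", "gpu", "A100", 2, "node1"))
except DuplicateRequestError as err:
    print(f"rejected: {err}")
store.resize(("gpu", "A100", "node1"), 4)
```

The sketch also makes the consideration above visible: the error message names CR1, and the resize has to touch an object created by user A, so user B needs both visibility and edit permission on it.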

Option 2:

  • User B creates CR4, shown below.
  • The operator forwards the request to the CDI Manager.
  • When CR4 or CR1 is deleted, the operator can detach the correct GPUs if it holds a reference to the physical device or an identifier passed by the CDI Manager. (Such information is already available in the new design proposed by @hase1128 and lei zhangl.)
CR4: 
  type: gpu 
  size: 2
  model: A100
  target_node: node1
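The bookkeeping that Option 2 relies on can be sketched as follows: the operator records, per CR, the device identifiers returned at attach time and detaches exactly those on deletion. Everything here is hypothetical scaffolding for illustration. `FakeCDIManager` is a stand-in whose `attach`/`detach` methods and identifier format are assumptions, since the real CDI Manager API is not described in this issue.

```python
import itertools


class FakeCDIManager:
    """Stand-in for the CDI Manager (hypothetical API, for illustration only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.attached = {}  # device_id -> (model, node)

    def attach(self, model, node, size):
        """Attach `size` devices and return one identifier per device."""
        device_ids = []
        for _ in range(size):
            device_id = f"{model}-{next(self._ids)}"
            self.attached[device_id] = (model, node)
            device_ids.append(device_id)
        return device_ids

    def detach(self, device_id):
        del self.attached[device_id]


class Option2Operator:
    """Allows several CRs per (type, model, target_node) by tracking device IDs per CR."""

    def __init__(self, cdi):
        self.cdi = cdi
        self.devices_by_cr = {}  # CR name -> device IDs attached for that CR

    def create(self, name, model, node, size):
        self.devices_by_cr[name] = self.cdi.attach(model, node, size)

    def delete(self, name):
        # Detach only the devices that were attached for this CR.
        for device_id in self.devices_by_cr.pop(name):
            self.cdi.detach(device_id)


cdi = FakeCDIManager()
operator = Option2Operator(cdi)
operator.create("CR1", "A100", "node1", 2)  # event 1
operator.create("CR4", "A100", "node1", 2)  # event 4
operator.delete("CR1")
print(sorted(cdi.attached))
# ['A100-3', 'A100-4']
```

After CR1 is deleted, only the two devices attached for CR4 remain on node1. Without the per-CR identifiers, the operator would have no way to tell which two of the four A100 GPUs to release.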

Fundamentally, the difference between the two options is how strictly we want to limit the proliferation of ComposabilityRequest objects. With Option 1, users cannot create more than one CR with the same type, model, and target_node. With Option 2, several CRs can share these fields, and the operator has to track which devices belong to which CR.

Event 5 should be handled using the same approach selected for event 4.
