Open
Description
We want to support sharing a single RDMA NIC across multiple pods.
Context (based on a discussion on Slack): Some LLM inference use cases require scheduling multiple pods on a single Node, each consuming accelerators (GPU/TPU). The Node itself may have only one RDMA device, so for multi-node inference that device needs to be shared among all pods on the Node to allow RDMA traffic to other nodes.
E.g. a Node where four GPUs and a single NIC sit under the same PCIe root complex:
CPU / PCIe Root Complex 0
├── GPU0
├── GPU1
├── GPU2
├── GPU3
└── NIC0
Possible solution:
- Create an IPVLAN interface for each pod requesting an RDMA NIC, so that all such pods share the underlying RDMA device.
- Use the Consumable Capacity feature (kep.k8s.io/5075) in DRA to model this resource-sharing capability.
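
To make the two ideas above concrete, here is a rough Python sketch. It is a toy model only and does not use the DRA API or any driver code. The class `SharedRdmaNic`, the helper `ipvlan_commands`, the bandwidth-in-Gbps capacity dimension, the interface name `eth0`, and the namespace name are all hypothetical, chosen for illustration. The `ip link` argument lists follow standard iproute2 syntax but are only built here, not executed, and whether RDMA traffic actually works over an IPVLAN interface depends on the NIC and driver.

```python
from dataclasses import dataclass, field


@dataclass
class SharedRdmaNic:
    """Toy model of one RDMA NIC whose capacity is consumed by several pods."""
    parent_ifname: str   # hypothetical host interface name, e.g. "eth0"
    total_gbps: int      # hypothetical capacity dimension (bandwidth)
    allocations: dict = field(default_factory=dict)  # pod name -> Gbps

    def remaining(self) -> int:
        """Capacity not yet consumed by any pod."""
        return self.total_gbps - sum(self.allocations.values())

    def allocate(self, pod: str, gbps: int) -> bool:
        """Reserve capacity for a pod; refuse if the NIC would be oversubscribed."""
        if gbps <= 0 or pod in self.allocations or gbps > self.remaining():
            return False
        self.allocations[pod] = gbps
        return True

    def release(self, pod: str) -> None:
        """Return a pod's share to the pool (no-op if the pod holds nothing)."""
        self.allocations.pop(pod, None)


def ipvlan_commands(nic: SharedRdmaNic, pod: str, netns: str) -> list:
    """Build (but do not run) iproute2 commands for a per-pod IPVLAN interface."""
    ifname = f"ipvl-{pod}"[:15]  # Linux interface names are limited to 15 characters
    return [
        ["ip", "link", "add", "link", nic.parent_ifname,
         "name", ifname, "type", "ipvlan", "mode", "l2"],
        ["ip", "link", "set", ifname, "netns", netns],
    ]


if __name__ == "__main__":
    nic = SharedRdmaNic(parent_ifname="eth0", total_gbps=400)
    for pod in ("pod-a", "pod-b", "pod-c", "pod-d"):
        granted = nic.allocate(pod, 100)
        print(pod, "granted" if granted else "rejected")
    # A fifth request exceeds the remaining capacity and is refused.
    print("pod-e", "granted" if nic.allocate("pod-e", 100) else "rejected")
    for cmd in ipvlan_commands(nic, "pod-a", "pod-a-netns"):
        print(" ".join(cmd))
```

The point of the sketch is the split of responsibilities: the scheduler-side model only tracks how much of the single device each pod consumes, while the node-side step gives every admitted pod its own interface on top of the same parent NIC.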