Support RDMA NIC sharing #63

@gauravkghildiyal

We want to support sharing a single RDMA NIC across multiple pods.

Context (based on a discussion on Slack): Some LLM inference use cases require scheduling multiple pods on a single node, each consuming accelerators (GPU/TPU). The node itself may have only one RDMA device, so for multi-node inference that device needs to be shared among all pods on the node to carry their RDMA traffic to other nodes.

For example, a node with four GPUs and a single NIC under one PCIe root complex:

CPU / PCIe Root Complex 0
 ├── GPU0
 ├── GPU1
 ├── GPU2
 ├── GPU3
 └── NIC0

Possible solution:

  1. Create an IPVLAN interface for each pod that requests an RDMA NIC, so that all such pods share the underlying RDMA device.
  2. Use the Consumable Capacity feature (kep.k8s.io/5075) in DRA to model this resource-sharing capability.
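
To make the two steps concrete, here is a rough sketch (not a proposed implementation) of how a driver might track a bounded number of shares on one NIC and derive the `ip` commands that create an IPVLAN interface per pod. The `SharedNic` class, the `max_shares` limit, the interface names, and the namespace names are all hypothetical; the share counter only stands in for what Consumable Capacity would model in DRA. The sketch builds command lists without running them, and it does not address whether RDMA traffic actually works over the resulting interface, which likely depends on the driver and the RDMA namespace mode.

```python
from dataclasses import dataclass, field


@dataclass
class SharedNic:
    """A parent NIC that a bounded number of pods may share (hypothetical model)."""
    parent: str        # host interface name, e.g. "eth0" (assumption)
    max_shares: int    # assumed sharing limit, stands in for consumable capacity
    allocations: dict = field(default_factory=dict)  # pod netns -> ipvlan name

    def allocate(self, pod_netns: str) -> list:
        """Reserve one share and return the ip(8) commands to set it up."""
        if pod_netns in self.allocations:
            raise ValueError(f"{pod_netns} already has an interface")
        if len(self.allocations) >= self.max_shares:
            raise RuntimeError(f"{self.parent} has no remaining shares")
        name = f"ipvl{len(self.allocations)}"
        self.allocations[pod_netns] = name
        return [
            # Create an IPVLAN (L2 mode) child of the parent NIC.
            ["ip", "link", "add", "link", self.parent,
             "name", name, "type", "ipvlan", "mode", "l2"],
            # Move the child interface into the pod's network namespace.
            ["ip", "link", "set", name, "netns", pod_netns],
        ]


# Example: two pods sharing one NIC; commands are printed, not executed.
nic = SharedNic(parent="eth0", max_shares=4)
for ns in ("pod-a", "pod-b"):
    for cmd in nic.allocate(ns):
        print(" ".join(cmd))
```

Once `max_shares` allocations exist, a further `allocate` call raises, which is roughly the behaviour a scheduler-side capacity model would need to enforce before a pod is placed on the node.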
