Commit 8af37c6

address comments
Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
1 parent d1b3d60 commit 8af37c6

1 file changed

Lines changed: 8 additions & 8 deletions

kars/0010-high-performance-pod-to-pod-communication/README.md

@@ -10,7 +10,7 @@ Forward-looking: Once the network resource supports DRA, then the platform shoul
 
 AI/ML workloads, particularly distributed training and inference, require high-throughput, low-latency pod-to-pod communication. These workloads often rely on specialized network hardware (e.g., RDMA-capable NICs, high-speed interconnects) that must be explicitly attached to pods. Without a standardized mechanism for discovering and allocating these network resources, users face fragmented tooling, inconsistent behavior across platforms, and difficulty ensuring workloads are scheduled on nodes with the appropriate network capabilities.
 
-By leveraging Dynamic Resource Allocation (DRA) for network resources, platforms can provide a consistent, Kubernetes-native way to expose high-performance network interfaces to pods. This enables workloads to discover available network characteristics and make informed scheduling decisions, improving portability and reducing the operational burden of running distributed AI/ML workloads on Kubernetes.
+By leveraging Dynamic Resource Allocation (DRA) for network resources, platforms can provide a consistent, Kubernetes-native way to expose secondary network interfaces to pods. This enables workloads to discover available network characteristics and make informed scheduling decisions, improving portability and reducing the operational burden of running distributed AI/ML workloads on Kubernetes.
 
 ## Graduation Criteria

@@ -29,19 +29,19 @@ By leveraging Dynamic Resource Allocation (DRA) for network resources, platforms
 
 Validate the following observable outcomes:
 
-1. **Multiple network interfaces are available to pods:** A pod scheduled on a node with high-performance network hardware has access to additional network interfaces beyond the default pod network.
-2. **Network resource characteristics are discoverable:** The characteristics of available high-performance network resources (e.g., interface type, bandwidth, RDMA capability) are published and queryable within the cluster, enabling workloads and schedulers to make informed decisions.
-3. **Pods are scheduled to nodes with the required network resources:** When a workload requires a specific high-performance network capability, it is scheduled only on nodes where that capability is available.
-4. **Pod-to-pod communication functions over the high-performance interface:** Two pods on nodes with high-performance networking can exchange data over the specialized interface, confirming end-to-end connectivity.
+1. **Multiple network interfaces are available to pods:** A pod scheduled on a node with multiple network interfaces has access to secondary network interfaces beyond the default pod network.
+2. **Network resource characteristics are discoverable:** The characteristics of available secondary network interfaces (e.g., interface type, bandwidth, RDMA capability) are published and queryable within the cluster, enabling workloads and schedulers to make informed decisions.
+3. **Pods are scheduled to nodes with the required network resources:** When a workload requires a secondary network interface, it is scheduled only on nodes where that network interface is available.
+4. **Pod-to-pod communication functions over the secondary interface:** Two pods on nodes with secondary network interfaces can exchange data over the specialized interface, confirming end-to-end connectivity.
 
 ### Automated Tests
 
 Automated tests should verify the outcomes above:
 
-- Deploy a pod to a node with high-performance network hardware and confirm that the pod's network namespace contains the expected additional network interface(s).
-- Query the cluster for published network resource characteristics and validate that they accurately describe the available high-performance network capabilities.
+- Deploy a pod to a node with secondary network interface and confirm that the pod's network namespace contains the expected additional network interface(s).
+- Query the cluster for published secondary network resource.
 - Deploy a workload requesting a specific network capability and verify it is scheduled on an appropriate node.
-- Deploy two pods with access to high-performance network interfaces and verify successful pod-to-pod data transfer over those interfaces.
+- Deploy two pods with access to secondary network interfaces and verify successful pod-to-pod data transfer over those interfaces.
 
 ## Implementation History
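The first automated test in the hunk above (confirming that a pod's network namespace contains the expected additional interfaces) could be checked along the following lines. This is only a minimal sketch, not part of the commit: it assumes the interface list is captured as JSON from iproute2's `ip -json link show` run inside the pod, and it assumes the default pod interface is named `eth0`. The function names and the `net1` interface name are hypothetical.

```python
import json

# Assumptions (not from the KAR): the default pod network interface is named
# "eth0" and the loopback interface is "lo"; anything else counts as secondary.
DEFAULT_INTERFACES = frozenset({"lo", "eth0"})


def secondary_interfaces(ip_json: str, default=DEFAULT_INTERFACES) -> list[str]:
    """Return sorted names of non-default interfaces from `ip -json link show` output."""
    links = json.loads(ip_json)
    return sorted(link["ifname"] for link in links if link["ifname"] not in default)


def has_expected_interfaces(ip_json: str, expected: set[str]) -> bool:
    """Return True if every expected secondary interface is present."""
    return expected.issubset(secondary_interfaces(ip_json))
```

A test harness might obtain `ip_json` by executing the command in the pod (for example through `kubectl exec`, which requires iproute2 in the container image) and then assert `has_expected_interfaces(ip_json, {"net1"})`.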
