diff --git a/kars/0010-high-performance-pod-to-pod-communication/README.md b/kars/0010-high-performance-pod-to-pod-communication/README.md
new file mode 100644
index 0000000..31beb9d
--- /dev/null
+++ b/kars/0010-high-performance-pod-to-pod-communication/README.md
@@ -0,0 +1,52 @@
+# KAR-0010: High-Performance Pod-to-Pod Communication
+
+## Description
+
+If high-performance pod-to-pod communication is needed, provide well-defined mechanisms to manage and expose the specialized network resources that support it. These mechanisms should make the resources' characteristics discoverable, to enable informed scheduling or workload configuration, and should enable pods to attach to multiple network interfaces. Platforms should use Dynamic Resource Allocation (DRA) (e.g., DRANET) as the mechanism to manage and expose these specialized network resources.
+
+## Motivation
+
+AI/ML workloads, particularly distributed training and inference, require high-throughput, low-latency pod-to-pod communication. These workloads often rely on specialized network hardware (e.g., RDMA-capable NICs, high-speed interconnects) that must be explicitly attached to pods. Without a standardized mechanism for discovering and allocating these network resources, users face fragmented tooling, inconsistent behavior across platforms, and difficulty ensuring workloads are scheduled on nodes with the appropriate network capabilities.
+
+By leveraging DRA for network resources, platforms can provide a consistent, Kubernetes-native way to expose secondary network interfaces to pods. This enables workloads to discover available network characteristics and make informed scheduling decisions, improving portability and reducing the operational burden of running distributed AI/ML workloads on Kubernetes.
+
+## Graduation Criteria
+
+**SHOULD**
+- [X] Describe how users can test it for self-attestation with scripts, documentation, etc.
+- [ ] Starting with v1.37, new SHOULDs must include proposed automated tests in the Automated Tests section below
+
+**MUST**
+- [ ] Starting with v1.37, new MUSTs must include automated tests that have been added to the AI conformance test suite
+- [ ] Demonstrate at least two real-world usages of the SHOULD before graduating to MUST
+- [ ] Kubernetes core APIs must be GA
+
+## Test Plan
+
+### How We Might Test It
+
+Validate the following observable outcomes:
+
+1. **Multiple network interfaces are available to pods:** A pod scheduled on a node with multiple network interfaces has access to secondary network interfaces beyond the default pod network.
+2. **Network resource characteristics are discoverable:** The characteristics of available secondary network interfaces (e.g., interface type, bandwidth, RDMA capability, PCI bus ID, NUMA node affinity) are published and queryable within the cluster, enabling workloads and schedulers to make informed decisions.
+3. **Pods are scheduled to nodes with the required network resources:** When a workload requires a secondary network interface, it is scheduled only on nodes where that network interface is available.
+4. **Pod-to-pod communication functions over the secondary interface:** Two pods on nodes with secondary network interfaces can exchange data over the specialized interface, confirming end-to-end connectivity.
+
+### Automated Tests
+
+Automated tests should verify the outcomes above:
+
+- Deploy a pod to a node with a secondary network interface and confirm that the pod's network namespace contains the expected additional network interface(s).
+- Query the cluster for published secondary network resources and confirm that their characteristics are reported.
+- Deploy a workload requesting a specific network capability and verify it is scheduled on an appropriate node.
+- Deploy two pods with access to secondary network interfaces and verify successful pod-to-pod data transfer over those interfaces.
+
+## Implementation History
+
+2026-02-22: KAR created
+
+## Related KARs
+
+
diff --git a/kars/0010-high-performance-pod-to-pod-communication/kar.yaml b/kars/0010-high-performance-pod-to-pod-communication/kar.yaml
new file mode 100644
index 0000000..8231484
--- /dev/null
+++ b/kars/0010-high-performance-pod-to-pod-communication/kar.yaml
@@ -0,0 +1,30 @@
+title: "High-Performance Pod-to-Pod Communication"
+kar-number: 10
+participating-sigs:
+  - sig-network
+  - sig-node
+# The current status of this KAR.
+# Implementable: Part of a Kubernetes release as a SHOULD or MUST, implementation is ongoing.
+# Implemented: Has been part of one or more Kubernetes releases and has graduated to MUST.
+# Implementation is complete. Further changes should be made via new KARs.
+status: implementable
+creation-date: "2026-02-22"
+
+# see-also:
+#   - "/kars/1234-another-kar"
+# replaces:
+#   - "/kars/3456-replaced-kar"
+
+# The target maturity stage in the current dev cycle for this KAR.
+# If the purpose of this KAR is to deprecate an existing requirement
+# then they should be deprecated|disabled|removed.
+stage: should
+
+# The most recent milestone for which work toward delivery of this KAR has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.36"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  should: "v1.36"