KAR-10: High-Performance Pod-to-Pod Communication #44
# KAR-0010: High-Performance Pod-to-Pod Communication

## Description
If high-performance pod-to-pod communication is needed, the platform must provide well-defined mechanisms for managing and exposing these specialized network resources. Their characteristics should be discoverable, both to enable informed scheduling or workload configuration and to allow pods to attach to multiple network interfaces.
Forward-looking: once network resources support DRA, the platform should use the DRA mechanism.
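As an illustrative sketch only (not part of this KAR's requirements), a workload might request a secondary high-performance NIC through DRA roughly as follows. The device class name `rdma.example.com`, the template name, and the image are all hypothetical; real names depend on the installed DRA network driver:

```yaml
# Hypothetical DRA request for an RDMA-capable secondary NIC.
# All names here are illustrative, not a standardized API surface.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: rdma-nic-template
spec:
  spec:
    devices:
      requests:
        - name: rdma-nic
          deviceClassName: rdma.example.com   # hypothetical device class
---
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  containers:
    - name: worker
      image: example.com/training:latest      # placeholder image
  resourceClaims:
    - name: net
      resourceClaimTemplateName: rdma-nic-template
```

With a pod-level claim like this, the driver attaches the allocated interface to the pod's network namespace in addition to the default pod network.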
## Motivation

AI/ML workloads, particularly distributed training and inference, require high-throughput, low-latency pod-to-pod communication. These workloads often rely on specialized network hardware (e.g., RDMA-capable NICs, high-speed interconnects) that must be explicitly attached to pods. Without a standardized mechanism for discovering and allocating these network resources, users face fragmented tooling, inconsistent behavior across platforms, and difficulty ensuring workloads are scheduled on nodes with the appropriate network capabilities.

By leveraging Dynamic Resource Allocation (DRA) for network resources, platforms can provide a consistent, Kubernetes-native way to expose high-performance network interfaces to pods. This enables workloads to discover available network characteristics and make informed scheduling decisions, improving portability and reducing the operational burden of running distributed AI/ML workloads on Kubernetes.
## Graduation Criteria

**SHOULD**
- [X] Describe how users can test it for self-attestation with scripts, documentation, etc.
- [ ] Starting with v1.37, new SHOULDs must include proposed automated tests in the automated tests section below.

**MUST**
- [ ] Starting with v1.37, new MUSTs must include automated tests that have been added to the AI conformance test suite.
- [ ] Demonstrate at least two real-world usages of the SHOULD before graduating to MUST.
- [ ] Kubernetes core APIs must be GA.
## Test Plan

### How We Might Test It

Validate the following observable outcomes:

1. **Multiple network interfaces are available to pods:** A pod scheduled on a node with high-performance network hardware has access to additional network interfaces beyond the default pod network.
2. **Network resource characteristics are discoverable:** The characteristics of available high-performance network resources (e.g., interface type, bandwidth, RDMA capability) are published and queryable within the cluster, enabling workloads and schedulers to make informed decisions.
3. **Pods are scheduled to nodes with the required network resources:** When a workload requires a specific high-performance network capability, it is scheduled only on nodes where that capability is available.
4. **Pod-to-pod communication functions over the high-performance interface:** Two pods on nodes with high-performance networking can exchange data over the specialized interface, confirming end-to-end connectivity.
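The first outcome could be checked from inside a pod with a manifest along these lines. This is a sketch under assumptions: the interface name `net1` and the claim template name `rdma-nic-template` are hypothetical, since the actual interface naming is driver-specific:

```yaml
# Hypothetical check that a secondary interface exists in the pod netns.
apiVersion: v1
kind: Pod
metadata:
  name: iface-check
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: busybox
      # Exits non-zero if the expected secondary interface is absent.
      # "net1" is a placeholder; the driver chooses the real name.
      command: ["sh", "-c", "ip link show net1"]
  resourceClaims:
    - name: net
      resourceClaimTemplateName: rdma-nic-template   # hypothetical
```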
### Automated Tests

Automated tests should verify the outcomes above:

- Deploy a pod to a node with high-performance network hardware and confirm that the pod's network namespace contains the expected additional network interface(s).
> **Reviewer:** I'm wondering if we want to lean on the "high performance" aspect or the "additional interface" aspect of these connections. "High performance" is tricky because 10 Gbps might be high performance in an edge cluster but would be unacceptably slow in a training cluster. I think the key functionality we want to describe is that some clusters can offer more bandwidth, lower latency, less jitter, etc. to workloads that request it. The way that is exposed to the pod is not by configuring the primary interface, but by configuring a secondary interface in the pod. I agree that the goal is higher performance today, but the objective difference is that there is a second interface with different properties from the "pod network" (at least IIUC). As to whether we want to "bake in" the complexity of requiring a second interface, I don't know.

> **Reviewer:** I was thinking in a similar direction with the comment above. IMO, defining "high performance" will be quite impractical, as it boils down to "it depends". Somewhat related: we also discussed whether we should lean mainly on the "use DRA for this" aspect, as Janet mentioned above, which would help us avoid "baking it in", right?

> **Author:** +1. Updated a few places, s/high performance/secondary network interface. PTAL.
- Query the cluster for published network resource characteristics and validate that they accurately describe the available high-performance network capabilities.
- Deploy a workload requesting a specific network capability and verify it is scheduled on an appropriate node.
- Deploy two pods with access to high-performance network interfaces and verify successful pod-to-pod data transfer over those interfaces.
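The last bullet could be exercised with a pair of pods like the following sketch. Everything here is assumed rather than prescribed: the `networkstatic/iperf3` image is a placeholder, the claim template name is hypothetical, and how the server's secondary-interface address is discovered is driver-specific:

```yaml
# Hypothetical pod-to-pod transfer check over the secondary interface.
apiVersion: v1
kind: Pod
metadata:
  name: xfer-server
spec:
  containers:
    - name: iperf
      image: networkstatic/iperf3   # placeholder image
      args: ["-s"]                  # run as server
  resourceClaims:
    - name: net
      resourceClaimTemplateName: rdma-nic-template   # hypothetical
---
apiVersion: v1
kind: Pod
metadata:
  name: xfer-client
spec:
  restartPolicy: Never
  containers:
    - name: iperf
      image: networkstatic/iperf3   # placeholder image
      # Target the server's *secondary* interface address, not its
      # cluster-network IP; address discovery is driver-specific.
      args: ["-c", "SECONDARY_IP_OF_SERVER"]
  resourceClaims:
    - name: net
      resourceClaimTemplateName: rdma-nic-template   # hypothetical
```

Note that this only confirms the transfer works; it deliberately sets no throughput threshold.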
> **Reviewer:** Just for clarification: do we want to accept a successful transfer as sufficient, or do we want to see certain throughput measurements too? IMHO it's fine to just test that it works and trust vendors on actual performance. Setting thresholds for when something counts as high-performance feels like too much, as we would need to consider typical values for current hardware implementations and differentiate same-node vs. node-to-node solutions.

> **Author:** +1. This only talks about testing that the transfer works, with no mention of testing performance. I think we are good here?
## Implementation History

2026-02-22: KAR created

## Related KARs

<!--
List KARs that are related. This is in case additional requirements come up after a KAR has already graduated to "implemented".
-->
---

KAR metadata (second file in this change):

title: "High-Performance Pod-to-Pod Communication"
kar-number: 10
participating-sigs:
  - sig-network
  - sig-node
# The current status of this KAR.
# Implementable: Part of a Kubernetes release as a SHOULD or MUST; implementation is ongoing.
# Implemented: Has been part of one or more Kubernetes releases and has graduated to MUST.
#   Implementation is complete. Further changes should be made via new KARs.
status: implementable
creation-date: "2026-02-22"

# see-also:
#   - "/kars/1234-another-kar"
# replaces:
#   - "/kars/3456-replaced-kar"

# The target maturity stage in the current dev cycle for this KAR.
# If the purpose of this KAR is to deprecate an existing requirement,
# then the stage should be deprecated|disabled|removed.
stage: should

# The most recent milestone for which work toward delivery of this KAR has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.36"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  should: "v1.36"
> **Reviewer:** We have DRANET today, so it's not just forward-looking. Also, as @aojea highlighted in #10 (comment), the upstream networking community has already decided to "use DRA for anything multi network so we can standardize the ecosystem using common APIs". We should make standardizing on DRA the primary recommendation now, rather than an eventual goal.

> **Author:** This "Forward-looking" section mirrors other KARs that mention using DRA as the forward-looking mechanism. This was feedback from users who have expressed that many vendors are still catching up and may not have a supported DRA implementation yet, e.g. https://github.com/kubernetes-sigs/wg-ai-conformance/blob/f8773d3f2ffed4aa23442df8413f76c642412e8b/kars/0003-gpu-sharing/README.md?plain=1#L7

> **Reviewer:** SHOULD itself signals the direction, so unlike in MUSTs we don't need to say "forward-looking" here if it's available today. We will only graduate it to MUST after it has met the criteria, so vendors will still have time to catch up.

> **Author:** Both https://github.com/kubernetes-sigs/wg-ai-conformance/blob/f8773d3f2ffed4aa23442df8413f76c642412e8b/kars/0003-gpu-sharing/README.md?plain=1#L7 and https://github.com/kubernetes-sigs/wg-ai-conformance/blob/f8773d3f2ffed4aa23442df8413f76c642412e8b/kars/0004-virtualized-accelerators/README.md?plain=1#L7 are SHOULDs, and similar to the concerns raised in those, not all secondary network interfaces have been integrated with DRANET, right?

> **@aojea:** DRANET is agnostic to network interfaces, so it is true that all network interfaces can work with DRANET. But there are InfiniBand interfaces that are presented as Linux devices (not as network interfaces) and are also used for RDMA; these are not yet supported by DRANET. There was also a PR to implement it (google/dranet#151), but it didn't merge because it could not be tested. If I get access to the hardware, I can add support very quickly.

> **Author:** Thanks @aojea for the clarification! This is helpful. It confirms that DRANET is broadly applicable today, with the specific exception of InfiniBand interfaces exposed as Linux devices rather than network interfaces, which is being worked on. Given this, I've updated the KAR to make DRA the primary recommendation and removed the "forward-looking" framing.