Support LACP bonded network interfaces #239

Description

@Bowser1704

Problem

On many GPU training nodes, dual-port NICs (e.g., ConnectX-7 with 2 physical ports) are bonded together using LACP (802.3ad) at the infrastructure level. This is not a vendor-specific setup — CX7 is a single PCIe device with two ports, and in many network environments (e.g., rail-optimized fabrics), the two ports are configured as an LACP bond rather than exposed as two independent interfaces. The bonding happens at the OS/network level, not at the hardware level.

A typical H20 GPU node topology:

PCIe 0000:7f:00.0 (CX7, 2 ports) → bond0 (LACP 802.3ad)
PCIe 0000:c7:00.0 (CX7, 2 ports) → bond1 (LACP 802.3ad)
PCIe 0001:08:00.0 (CX7, 2 ports) → bond2 (LACP 802.3ad)
PCIe 0001:a2:00.0 (CX7, 2 ports) → bond3 (LACP 802.3ad)

dranet discovers these bond interfaces and publishes them in the ResourceSlice. However, when a pod claims one, Prepare fails because bond devices with LACP cannot be moved into a pod network namespace:

  • Moving the bond breaks LACP negotiation with the switch
  • Moving individual slave ports out of the bond is not supported by the kernel
  • The bond must remain in the host namespace to maintain link aggregation

Relationship to #63

#63 proposes using IPvlan to share a single NIC across multiple pods, using allowMultipleAllocations and consumable capacity to model the sharing.

This issue is different. We are not trying to share — the user still wants exclusive use of the NIC. The problem is purely that the bond cannot be moved. IPvlan here is a transport mechanism to work around the LACP constraint, not a sharing mechanism.

From the user's perspective, claiming a bonded NIC should feel the same as claiming a regular NIC — they should not need to know whether the underlying interface is a bond or not.

Proposal

When dranet detects that a network interface is a bond in LACP mode, it should automatically create an IPvlan child interface and move the child into the pod's namespace, instead of attempting to move the bond itself.

Host namespace:                    Pod namespace:
  bond0 (LACP, stays on host)       ipvlan0 ← child of bond0
  bond1 (LACP, stays on host)       ipvlan1 ← child of bond1
  bond2 (LACP, stays on host)       ipvlan2 ← child of bond2
  bond3 (LACP, stays on host)       ipvlan3 ← child of bond3

This should be transparent to the scheduler and to the user:

  • The bond is published in the ResourceSlice as a normal device (no allowMultipleAllocations, no capacity)
  • The user requests it with an ordinary allocationMode (All or ExactCount)
  • The scheduler allocates it exclusively as usual
  • During Prepare / RunPodSandbox, dranet detects the bond + LACP and creates an IPvlan child instead of calling ip link set netns
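
As a rough illustration of the last bullet, the sketch below builds the iproute2 commands equivalent to what Prepare would do for a bonded device. It is not dranet's implementation, which would more likely issue the equivalent netlink calls directly. The function name ipvlanCommands, the child name ipvlan0, the namespace name pod-netns, and the choice of IPvlan L2 mode are all assumptions made for the example; the issue does not specify a mode.

```go
package main

import (
	"fmt"
	"strings"
)

// ipvlanCommands (hypothetical helper) returns the iproute2 argument lists
// that create an IPvlan child on top of parent and then move only the child
// into the named network namespace. The parent bond is never moved.
// Assumption: IPvlan L2 mode; the issue does not say which mode to use.
func ipvlanCommands(parent, child, netns string) [][]string {
	return [][]string{
		// Create the child in the host namespace, attached to the bond.
		{"ip", "link", "add", "link", parent, "name", child, "type", "ipvlan", "mode", "l2"},
		// Move the child (not the bond) into the pod namespace.
		{"ip", "link", "set", "dev", child, "netns", netns},
	}
}

func main() {
	// "pod-netns" is a placeholder for the pod's network namespace name.
	for _, args := range ipvlanCommands("bond0", "ipvlan0", "pod-netns") {
		fmt.Println(strings.Join(args, " "))
	}
}
```

Building the argument lists separately from executing them keeps the sketch runnable without root privileges; actually running the commands requires CAP_NET_ADMIN and an existing bond.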

The detection logic could be:

  1. Check whether the interface is a bond by reading /sys/class/net/<ifname>/bonding/mode, which exists only for bond devices
  2. If the bond mode is 802.3ad (LACP), or another mode in which moving the bond is unsafe, create an IPvlan child on top of it
  3. Move the IPvlan child into the pod namespace and configure it (IP, routes, etc.)
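
Steps 1 and 2 could be sketched roughly as follows. This is a minimal sketch rather than dranet code: the names parseBondMode and needsIPvlanChild are hypothetical, and it treats only 802.3ad as unsafe to move, leaving the "other modes" question open. It relies on the kernel formatting the bonding/mode file as a mode name followed by its number (for example "802.3ad 4").

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// parseBondMode (hypothetical helper) extracts the mode name from the
// contents of a bonding/mode sysfs file, e.g. "802.3ad 4\n" -> "802.3ad".
func parseBondMode(content string) (string, bool) {
	fields := strings.Fields(content)
	if len(fields) == 0 {
		return "", false
	}
	return fields[0], true
}

// needsIPvlanChild (hypothetical helper) reports whether ifname is a bond in
// 802.3ad mode. sysfsRoot is normally "/sys"; it is a parameter so the
// function can be pointed at a fake tree in tests.
// Assumption: only 802.3ad is treated as unsafe to move.
func needsIPvlanChild(sysfsRoot, ifname string) (bool, error) {
	p := filepath.Join(sysfsRoot, "class", "net", ifname, "bonding", "mode")
	data, err := os.ReadFile(p)
	if errors.Is(err, fs.ErrNotExist) {
		// No bonding/mode file: the interface is not a bond (or does not exist).
		return false, nil
	}
	if err != nil {
		return false, fmt.Errorf("reading %s: %w", p, err)
	}
	mode, ok := parseBondMode(string(data))
	if !ok {
		return false, fmt.Errorf("empty bonding mode in %s", p)
	}
	return mode == "802.3ad", nil
}

func main() {
	// Usage: pass interface names as arguments, e.g. "bond0 bond1".
	for _, ifname := range os.Args[1:] {
		lacp, err := needsIPvlanChild("/sys", ifname)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: needs IPvlan child: %t\n", ifname, lacp)
	}
}
```

A missing bonding/mode file is treated as "not a bond" rather than as an error, so regular NICs fall through to the existing move path.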

RDMA char devices (/dev/infiniband/uverbs*, rdma_cm) would be injected alongside as in the existing IB-only path.
