Description
What happened?
On Ubuntu 24.04 (Noble) with Cilium networking, cloud-init's network hotplug feature detects when Cilium dynamically attaches secondary ENIs and regenerates /etc/netplan/*.yaml with full policy-based routing (PBR). This adds routes for secondary interfaces to the main routing table, breaking Cilium's BPF masquerade functionality.
This issue does not occur on Ubuntu 22.04 because hotplug is disabled by default on that image.
Environment
- kOps version: 1.34.1
- Kubernetes version: 1.34.3
- Cilium version: 1.18.2
- Cloud provider: AWS
- OS Image: 099720109477/ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-arm64-server-20251212
- Instance type: m8g.large (ARM64)
Steps to reproduce
- Create a kOps cluster with Cilium networking on Ubuntu 24.04
- Wait for nodes to be ready and Cilium to attach secondary ENIs
- SSH into a node and check:
# Check netplan - will show secondary ENI with full PBR
cat /etc/netplan/*.yaml
# Check routes - will show multiple default routes
ip route
# Check cloud-init logs - will show hotplug triggered
grep -i hotplug /var/log/cloud-init.log
Expected behavior
Netplan should only contain the primary ENI configuration. Cilium manages secondary ENIs directly and does not require (and is broken by) OS-level route management.
Expected ip route output:
default via 10.20.96.1 dev ens5 proto dhcp src 10.20.99.245 metric 100
Actual behavior
Cloud-init hotplug handler detects the ENI attachment and reconfigures netplan:
Actual /etc/netplan/*.yaml:
network:
  version: 2
  ethernets:
    ens5:
      dhcp4: true
      dhcp4-overrides:
        route-metric: 100
    ens6:
      dhcp4: true
      dhcp4-overrides:
        route-metric: 200
      routes:
        - table: 101
          to: "0.0.0.0/0"
          via: "10.20.32.1"
      routing-policy:
        - table: 101
          from: "10.20.63.212"
Actual ip route output:
default via 10.20.32.1 dev ens5 proto dhcp src 10.20.41.26 metric 100
default via 10.20.32.1 dev ens6 proto dhcp src 10.20.63.212 metric 200 # <-- breaks masquerade
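The symptom can be checked mechanically by counting default routes in the main table. The following is a minimal sketch, assuming `ip route` output in the format shown above; the helper name `count_default_routes` is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Count default routes in text formatted like `ip route` output (read from stdin).
# count_default_routes is an illustrative helper name, not part of any tool.
count_default_routes() {
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so "|| true" keeps the function's exit status at 0.
    grep -c '^default ' || true
}
```

On a node, `ip route show table main | count_default_routes` printing a value greater than 1 would match the broken state described here.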
cloud-init.log shows:
stages.py[DEBUG]: Event Allowed: scope=network EventType=hotplug
cc_install_hotplug.py[INFO]: Installing hotplug.
hotplug-hook called with: {subsystem: net, udevaction: add, devpath: .../net/ens6}
Why Ubuntu 22.04 works
On Ubuntu 22.04, cloud-init logs show:
stages.py[DEBUG]: Event Denied: scopes=['network'] EventType=hotplug
cc_install_hotplug.py[DEBUG]: Skipping hotplug install, not enabled
The Ubuntu 22.04 cloud image has network hotplug disabled by default.
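To tell the two cases apart on a given node, the log lines quoted above can be matched directly. This is a rough sketch, assuming the exact message strings shown in this issue (they may differ across cloud-init versions); the helper name `hotplug_state` is hypothetical.

```shell
#!/bin/sh
# Classify the network hotplug state from cloud-init log text on stdin.
# hotplug_state is an illustrative helper name; the patterns are the log
# lines quoted in this issue and may differ across cloud-init versions.
hotplug_state() {
    log=$(cat)
    # grep -F matches fixed strings, so the brackets and quotes are literal.
    # If both messages appear (e.g. across boots), "allowed" takes precedence.
    if printf '%s\n' "$log" | grep -qF 'Event Allowed: scope=network EventType=hotplug'; then
        echo allowed
    elif printf '%s\n' "$log" | grep -qF "Event Denied: scopes=['network'] EventType=hotplug"; then
        echo denied
    else
        echo unknown
    fi
}
```

Usage on a node would be `hotplug_state < /var/log/cloud-init.log`.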
Root cause
Ubuntu 24.04 cloud images enable cloud-init network hotplug by default. This was introduced in cloud-init PR #4799 (Feb 2024) to add automatic PBR for EC2 instances with multiple NICs.
However, this conflicts with CNI plugins like Cilium that manage secondary ENIs directly. The Cilium ENI documentation explicitly states:
"The IP address and routes on ENIs attached to the instance will be managed by the Cilium agent. Therefore, any system service trying to manage newly attached network interfaces will interfere with Cilium's configuration."
Current workaround
Users can disable hotplug via additionalUserData in each InstanceGroup:
spec:
  additionalUserData:
    - content: |
        #cloud-config
        updates:
          network:
            when:
              - boot-new-instance
      name: 00-disable-hotplug.cfg
      type: text/cloud-config
Proposed fix
kOps should automatically disable cloud-init network hotplug when using Cilium (or Amazon VPC CNI) on Ubuntu 24.04+. This is similar to PR #17438, which added systemd-networkd configuration to prevent route removal.
Suggested implementation:
- When networking.cilium or networking.amazonvpc is configured
- And the OS image is Ubuntu 24.04+
- Automatically add cloud-init config to disable network hotplug
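The conditions above can be sketched as a small predicate. This is only an illustration of the decision logic, not how kOps would implement it; the function name `should_disable_hotplug` and its three arguments are hypothetical, and the version check assumes Ubuntu's `YY.MM` version format.

```shell
#!/bin/sh
# Sketch of the proposed decision: disable cloud-init network hotplug only for
# the affected CNIs on Ubuntu 24.04 or later. The function name and arguments
# are hypothetical; kOps itself would implement this in its own codebase.
should_disable_hotplug() {
    cni=$1        # e.g. "cilium" or "amazonvpc"
    distro=$2     # e.g. "ubuntu" (ID from /etc/os-release)
    version=$3    # e.g. "24.04" (VERSION_ID from /etc/os-release)

    # Only the CNIs named in this issue are considered affected.
    case "$cni" in
        cilium|amazonvpc) ;;
        *) return 1 ;;
    esac
    [ "$distro" = "ubuntu" ] || return 1

    # Split "YY.MM" into its numeric parts.
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}

    if [ "$major" -gt 24 ]; then
        return 0
    fi
    # Succeeds only for 24.04 and later 24.x releases.
    [ "$major" -eq 24 ] && [ "$minor" -ge 4 ]
}
```

For example, `should_disable_hotplug cilium ubuntu 24.04` succeeds, while the same call with `22.04` fails, matching the observation that Ubuntu 22.04 is unaffected.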
Example cloud-init config to add:
#cloud-config
updates:
  network:
    when:
      - boot-new-instance
Related issues
- cloud-init PR #4799 - Added automatic PBR for secondary ENIs
- cloud-init Issue #5249 - Reports this breaks production services
- kOps PR #17438 - Similar fix for systemd-networkd route removal
- kOps Issue #17433 - systemd-networkd flushing routes
- Cilium Issue #20554 - Masquerading issues with secondary interfaces
- Datadog March 2023 Outage - Similar OS-level interference with Cilium
Additional context
This issue will affect all kOps users who:
- Use Ubuntu 24.04 (Noble) images
- Use Cilium or Amazon VPC CNI
- Have instances that receive secondary ENIs
/kind bug
/area networking
/area provider/aws