Skip to content

Commit 0074ae3

Browse files
oilbeaterclaude
andauthored
fix(controller): defer subnet processing until vlan is fully ready (#6527)
* fix(controller): clear subnet IP status when vlan conflict is detected When two VLANs with the same ID are created on the same provider network, a race condition causes the conflict subnet to retain its IP range in status. This happens because: 1. The subnet handler can process the subnet before the vlan handler marks the vlan as conflicting (informer cache propagation delay of ~3ms), allowing IPAM allocation and IP status to be set. 2. When the vlan handler detects a conflict, it returns early without re-enqueuing associated subnets, so the subnet is never re-validated. 3. Even when the subnet is re-processed and detects the vlan conflict, patchSubnetStatus serializes the full status (including the stale IP range) without clearing IP fields. Fix all three issues: - In handleAddVlan/handleUpdateVlan: re-enqueue subnets referencing the conflicting vlan so they can re-validate. - In handleAddOrUpdateSubnet: when vlan validation fails, remove the subnet from IPAM and clear all IP status fields before patching. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> * fix(controller): address review feedback for vlan conflict handling Address Copilot review comments: 1. Prevent queue churn: only re-enqueue subnets when vlan conflict status transitions from false to true (not on every retry). Save wasConflict before checkVlanConflict and gate the re-enqueue. 2. Scope IPAM cleanup: introduce errVlanConflict sentinel error in validateSubnetVlan and only clear IPAM/IP status fields when errors.Is(err, errVlanConflict) is true. Transient lister errors no longer trigger IPAM deletion. 3. Add unit tests: Test_validateSubnetVlan_conflict verifies the sentinel error is correctly returned for conflict vs normal vs missing vlans. Test_handleAddOrUpdateSubnet_clearsIPStatusOnVlanConflict verifies IPAM removal and status field clearing on conflict. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> * refactor: revert vlan.go changes, keep minimal subnet-only fix Remove vlan handler re-enqueue logic — controller logs confirm subnet2 is already re-processed through the normal error retry mechanism (the reconcileVlan check at subnet.go:1687 detects conflict and returns error, triggering requeue). The only fix needed is clearing IP status fields when the subnet handler detects a vlan conflict on re-processing. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> * fix(controller): defer subnet processing until vlan is fully ready Instead of cleaning up stale IPAM state after a race condition, prevent the race entirely: do not process a subnet until its vlan has been fully handled by the vlan controller. A newly created vlan has Status.Conflict=false (zero value), which is indistinguishable from a processed non-conflicting vlan. Use pn.Status.Vlans as a "vlan ready" signal — it is only populated after handleAddVlan completes successfully (including the conflict check). In validateSubnetVlan, after confirming the vlan is not conflicting, verify it appears in the provider network's Status.Vlans. If not, the vlan has not been fully processed yet, so return an error to defer subnet processing until the vlan handler re-enqueues it. In handleAddVlan, re-enqueue subnets referencing a conflicting vlan so they can see the updated conflict status and be properly rejected. This replaces the previous sentinel-error + IPAM-cleanup approach with a simpler ordering guarantee that eliminates the race window entirely. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> * fix(controller): address review - silent requeue for unready vlan - Introduce errVlanNotReady sentinel: when the vlan hasn't been fully processed yet (empty provider, provider network not found, or vlan not in pn.Status.Vlans), return errVlanNotReady instead of a generic error. handleAddOrUpdateSubnet checks for this and requeues silently without patching subnet status as ValidateSubnetVlanFailed, avoiding misleading Warning events for a normal transient condition. - Treat empty vlan.Spec.Provider as "not ready" (vlan handler hasn't run yet to default the provider) rather than falling back to DefaultProviderName. - Add unit test for empty-provider vlan case. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> * fix(controller): maintain pn.Status.Vlans in handleUpdateVlan handleUpdateVlan did not update the provider network's Status.Vlans list after a successful conflict check. This caused a regression where validateSubnetVlan's new pn.Status.Vlans readiness check blocked subnet processing after a vlan provider change — the vlan was not registered in the new provider network's Vlans list, so all associated subnets were stuck in "vlan not ready" state. Add the same pn.Status.Vlans maintenance logic that handleAddVlan already has: after checkVlanConflict passes, ensure the vlan appears in its provider network's Status.Vlans. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> --------- Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent edd3de2 commit 0074ae3

File tree

4 files changed

+159
-1
lines changed

4 files changed

+159
-1
lines changed

pkg/controller/controller_test.go

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,7 @@ type FakeControllerOptions struct {
8080
VpcNatGateways []*kubeovnv1.VpcNatGateway
8181
IPs []*kubeovnv1.IP
8282
Vlans []*kubeovnv1.Vlan
83+
ProviderNetworks []*kubeovnv1.ProviderNetwork
8384
NetworkAttachments []*nadv1.NetworkAttachmentDefinition
8485
Pods []*corev1.Pod
8586
Nodes []*corev1.Node
@@ -158,6 +159,13 @@ func newFakeControllerWithOptions(t *testing.T, opts *FakeControllerOptions) (*f
158159
return nil, err
159160
}
160161
}
162+
for _, pn := range opts.ProviderNetworks {
163+
_, err := kubeovnClient.KubeovnV1().ProviderNetworks().Create(
164+
context.Background(), pn, metav1.CreateOptions{})
165+
if err != nil {
166+
return nil, err
167+
}
168+
}
161169

162170
// Create informer factories
163171
kubeInformerFactory := informers.NewSharedInformerFactoryWithOptions(kubeClient, 0,
@@ -192,6 +200,7 @@ func newFakeControllerWithOptions(t *testing.T, opts *FakeControllerOptions) (*f
192200
ipInformer := kubeovnInformerFactory.Kubeovn().V1().IPs()
193201
vpcNatGwInformer := kubeovnInformerFactory.Kubeovn().V1().VpcNatGateways()
194202
vlanInformer := kubeovnInformerFactory.Kubeovn().V1().Vlans()
203+
providerNetworkInformer := kubeovnInformerFactory.Kubeovn().V1().ProviderNetworks()
195204

196205
fakeInformers := &fakeControllerInformers{
197206
vpcInformer: vpcInformer,
@@ -223,6 +232,7 @@ func newFakeControllerWithOptions(t *testing.T, opts *FakeControllerOptions) (*f
223232
ipsLister: ipInformer.Lister(),
224233
ipSynced: alwaysReady,
225234
vlansLister: vlanInformer.Lister(),
235+
providerNetworksLister: providerNetworkInformer.Lister(),
226236
netAttachLister: nadInformer.Lister(),
227237
netAttachSynced: alwaysReady,
228238
OVNNbClient: mockOvnClient,

pkg/controller/subnet.go

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -163,6 +163,11 @@ func (c *Controller) formatSubnet(subnet *kubeovnv1.Subnet) (*kubeovnv1.Subnet,
163163
return subnet, nil
164164
}
165165

166+
// errVlanNotReady is returned when the vlan has not been fully processed by
167+
// the vlan handler yet. The caller should requeue silently without patching
168+
// the subnet status as failed.
169+
var errVlanNotReady = errors.New("vlan not ready")
170+
166171
func (c *Controller) validateSubnetVlan(subnet *kubeovnv1.Subnet) error {
167172
if subnet.Spec.Vlan == "" {
168173
return nil
@@ -180,6 +185,25 @@ func (c *Controller) validateSubnetVlan(subnet *kubeovnv1.Subnet) error {
180185
klog.Error(err)
181186
return err
182187
}
188+
189+
// Ensure the vlan has been fully processed by the vlan handler before
190+
// allowing the subnet to proceed. A newly created vlan has Conflict=false
191+
// (zero value) which is indistinguishable from a processed non-conflicting
192+
// vlan. Use pn.Status.Vlans as a "vlan ready" signal — only set after
193+
// the vlan handler completes successfully (including conflict check).
194+
// This prevents a race where the subnet allocates IPAM before the vlan's
195+
// conflict status is determined.
196+
if vlan.Spec.Provider == "" {
197+
return fmt.Errorf("vlan %s provider not yet set, deferring subnet %s: %w", vlan.Name, subnet.Name, errVlanNotReady)
198+
}
199+
pn, err := c.providerNetworksLister.Get(vlan.Spec.Provider)
200+
if err != nil {
201+
return fmt.Errorf("vlan %s not yet ready (provider network %s not found): %w", vlan.Name, vlan.Spec.Provider, errVlanNotReady)
202+
}
203+
if !slices.Contains(pn.Status.Vlans, vlan.Name) {
204+
return fmt.Errorf("vlan %s not yet processed by vlan handler, deferring subnet %s: %w", vlan.Name, subnet.Name, errVlanNotReady)
205+
}
206+
183207
return nil
184208
}
185209

@@ -491,6 +515,12 @@ func (c *Controller) handleAddOrUpdateSubnet(key string) error {
491515

492516
err = c.validateSubnetVlan(subnet)
493517
if err != nil {
518+
if errors.Is(err, errVlanNotReady) {
519+
// vlan hasn't been processed yet, requeue silently without
520+
// marking the subnet as failed
521+
klog.Infof("deferring subnet %s: %v", key, err)
522+
return err
523+
}
494524
err = fmt.Errorf("failed to validate vlan for subnet %s, %w", key, err)
495525
klog.Error(err)
496526
if patchErr := c.patchSubnetStatus(subnet, "ValidateSubnetVlanFailed", err.Error()); patchErr != nil {

pkg/controller/subnet_test.go

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -604,3 +604,97 @@ func Test_checkSubnetConflict(t *testing.T) {
604604
require.NoError(t, err)
605605
})
606606
}
607+
608+
func Test_validateSubnetVlan(t *testing.T) {
609+
t.Parallel()
610+
611+
pn := &kubeovnv1.ProviderNetwork{
612+
ObjectMeta: metav1.ObjectMeta{Name: "test-pn"},
613+
Status: kubeovnv1.ProviderNetworkStatus{
614+
Vlans: []string{"ready-vlan"},
615+
},
616+
}
617+
conflictVlan := &kubeovnv1.Vlan{
618+
ObjectMeta: metav1.ObjectMeta{Name: "conflict-vlan"},
619+
Spec: kubeovnv1.VlanSpec{ID: 100, Provider: "test-pn"},
620+
Status: kubeovnv1.VlanStatus{Conflict: true},
621+
}
622+
readyVlan := &kubeovnv1.Vlan{
623+
ObjectMeta: metav1.ObjectMeta{Name: "ready-vlan"},
624+
Spec: kubeovnv1.VlanSpec{ID: 200, Provider: "test-pn"},
625+
Status: kubeovnv1.VlanStatus{Conflict: false},
626+
}
627+
unprocessedVlan := &kubeovnv1.Vlan{
628+
ObjectMeta: metav1.ObjectMeta{Name: "unprocessed-vlan"},
629+
Spec: kubeovnv1.VlanSpec{ID: 300, Provider: "test-pn"},
630+
Status: kubeovnv1.VlanStatus{Conflict: false}, // same as ready, but NOT in pn.Status.Vlans
631+
}
632+
emptyProviderVlan := &kubeovnv1.Vlan{
633+
ObjectMeta: metav1.ObjectMeta{Name: "empty-provider-vlan"},
634+
Spec: kubeovnv1.VlanSpec{ID: 400, Provider: ""}, // not yet defaulted by vlan handler
635+
}
636+
637+
fakeCtrl, err := newFakeControllerWithOptions(t, &FakeControllerOptions{
638+
Vlans: []*kubeovnv1.Vlan{conflictVlan, readyVlan, unprocessedVlan, emptyProviderVlan},
639+
ProviderNetworks: []*kubeovnv1.ProviderNetwork{pn},
640+
})
641+
require.NoError(t, err)
642+
ctrl := fakeCtrl.fakeController
643+
644+
t.Run("conflict vlan is rejected", func(t *testing.T) {
645+
subnet := &kubeovnv1.Subnet{
646+
ObjectMeta: metav1.ObjectMeta{Name: "test-subnet"},
647+
Spec: kubeovnv1.SubnetSpec{Vlan: "conflict-vlan"},
648+
}
649+
err := ctrl.validateSubnetVlan(subnet)
650+
require.Error(t, err)
651+
require.Contains(t, err.Error(), "conflict")
652+
})
653+
654+
t.Run("ready vlan passes validation", func(t *testing.T) {
655+
subnet := &kubeovnv1.Subnet{
656+
ObjectMeta: metav1.ObjectMeta{Name: "test-subnet"},
657+
Spec: kubeovnv1.SubnetSpec{Vlan: "ready-vlan"},
658+
}
659+
err := ctrl.validateSubnetVlan(subnet)
660+
require.NoError(t, err)
661+
})
662+
663+
t.Run("unprocessed vlan defers subnet processing", func(t *testing.T) {
664+
subnet := &kubeovnv1.Subnet{
665+
ObjectMeta: metav1.ObjectMeta{Name: "test-subnet"},
666+
Spec: kubeovnv1.SubnetSpec{Vlan: "unprocessed-vlan"},
667+
}
668+
err := ctrl.validateSubnetVlan(subnet)
669+
require.Error(t, err)
670+
require.ErrorIs(t, err, errVlanNotReady)
671+
})
672+
673+
t.Run("empty provider vlan is not ready", func(t *testing.T) {
674+
subnet := &kubeovnv1.Subnet{
675+
ObjectMeta: metav1.ObjectMeta{Name: "test-subnet"},
676+
Spec: kubeovnv1.SubnetSpec{Vlan: "empty-provider-vlan"},
677+
}
678+
err := ctrl.validateSubnetVlan(subnet)
679+
require.Error(t, err)
680+
require.ErrorIs(t, err, errVlanNotReady)
681+
})
682+
683+
t.Run("missing vlan returns error", func(t *testing.T) {
684+
subnet := &kubeovnv1.Subnet{
685+
ObjectMeta: metav1.ObjectMeta{Name: "test-subnet"},
686+
Spec: kubeovnv1.SubnetSpec{Vlan: "nonexistent-vlan"},
687+
}
688+
err := ctrl.validateSubnetVlan(subnet)
689+
require.Error(t, err)
690+
})
691+
692+
t.Run("empty vlan passes validation", func(t *testing.T) {
693+
subnet := &kubeovnv1.Subnet{
694+
ObjectMeta: metav1.ObjectMeta{Name: "test-subnet"},
695+
Spec: kubeovnv1.SubnetSpec{Vlan: ""},
696+
}
697+
err := ctrl.validateSubnetVlan(subnet)
698+
require.NoError(t, err)
699+
})
700+
}

pkg/controller/vlan.go

Lines changed: 25 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -122,6 +122,12 @@ func (c *Controller) handleAddVlan(key string) error {
122122

123123
if err = c.checkVlanConflict(vlan); err != nil {
124124
klog.Errorf("failed to check vlan %s: %v", vlan.Name, err)
125+
// re-enqueue subnets so they can see the conflict status and reject
126+
for _, subnet := range subnets {
127+
if subnet.Spec.Vlan == vlan.Name {
128+
c.addOrUpdateSubnetQueue.Add(subnet.Name)
129+
}
130+
}
125131
return err
126132
}
127133

@@ -136,7 +142,7 @@ func (c *Controller) handleAddVlan(key string) error {
136142
}
137143

138144
// re-enqueue subnets that reference this vlan, so they can proceed
139-
// if they were previously blocked by the vlan not being ready
145+
// now that the vlan is fully processed
140146
for _, subnet := range subnets {
141147
if subnet.Spec.Vlan == vlan.Name {
142148
c.addOrUpdateSubnetQueue.Add(subnet.Name)
@@ -207,6 +213,24 @@ func (c *Controller) handleUpdateVlan(key string) error {
207213
return err
208214
}
209215

216+
// ensure the vlan is registered in the provider network's Status.Vlans,
217+
// which is used by validateSubnetVlan as a "vlan ready" signal.
218+
// This is especially important after a provider change, where the vlan
219+
// needs to appear in the new provider network's Vlans list.
220+
pn, err := c.providerNetworksLister.Get(vlan.Spec.Provider)
221+
if err != nil {
222+
klog.Errorf("failed to get provider network %s: %v", vlan.Spec.Provider, err)
223+
return err
224+
}
225+
if !slices.Contains(pn.Status.Vlans, vlan.Name) {
226+
newPn := pn.DeepCopy()
227+
newPn.Status.Vlans = append(newPn.Status.Vlans, vlan.Name)
228+
if _, err = c.config.KubeOvnClient.KubeovnV1().ProviderNetworks().UpdateStatus(context.Background(), newPn, metav1.UpdateOptions{}); err != nil {
229+
klog.Errorf("failed to update status of provider network %s: %v", pn.Name, err)
230+
return err
231+
}
232+
}
233+
210234
subnets, err := c.subnetsLister.List(labels.Everything())
211235
if err != nil {
212236
klog.Errorf("failed to list subnets: %v", err)

0 commit comments

Comments
 (0)