fix: prevent subnet from getting permanently stuck when VLAN is not ready#6352

Merged: oilbeater merged 1 commit into master from fix-underlay-subnet-stuck-on-vlan-not-ready on Feb 28, 2026

@oilbeater
Collaborator

Summary

  • Fix variable shadowing/overwriting in handleAddOrUpdateSubnet that caused VLAN and subnet validation errors to be swallowed, preventing work queue retries
  • Add subnet re-enqueue logic in handleAddVlan so subnets blocked by a missing VLAN get reprocessed once the VLAN is created
  • Add unit test to verify handleAddOrUpdateSubnet correctly returns errors when VLAN validation fails

Details

Bug 1 (subnet.go:496-514): When validateSubnetVlan or ValidateSubnet failed, the error was passed to patchSubnetStatus. If the patch succeeded, the original validation error was overwritten with nil (due to err := shadowing and err = reuse), causing the handler to return nil. The work queue then called Forget(item) instead of AddRateLimited(item), so the subnet was never retried.

Bug 2 (vlan.go:53-120): handleAddVlan updated the VLAN's status with associated subnets but did not re-enqueue those subnets. If a subnet had already been forgotten by the queue due to Bug 1, no event would trigger it to be reprocessed after the VLAN became available.
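
The re-enqueue step for Bug 2 amounts to a filter over the known subnets. Below is a minimal sketch using only the standard library: the subnet and queue types and the requeueSubnetsForVlan helper are hypothetical simplifications, and the real controller works with its own listers and addOrUpdateSubnetQueue rather than these toy types.

```go
package main

import "fmt"

// subnet is a simplified, hypothetical view of a Subnet: its key and the VLAN it references.
type subnet struct {
	name string
	vlan string
}

// queue is a toy stand-in for addOrUpdateSubnetQueue; it only records added keys.
type queue struct {
	keys []string
}

// Add appends a key to the queue.
func (q *queue) Add(key string) { q.keys = append(q.keys, key) }

// requeueSubnetsForVlan adds every subnet that references vlanName back to the
// queue so that it gets reprocessed now that the VLAN exists.
func requeueSubnetsForVlan(vlanName string, subnets []subnet, q *queue) {
	for _, s := range subnets {
		if s.vlan == vlanName {
			q.Add(s.name)
		}
	}
}

func main() {
	subnets := []subnet{
		{name: "subnet-a", vlan: "vlan10"},
		{name: "subnet-b", vlan: "vlan20"},
		{name: "subnet-c", vlan: "vlan10"},
	}
	q := &queue{}
	requeueSubnetsForVlan("vlan10", subnets, q)
	fmt.Println(q.keys)
}
```

Re-adding a key that is already being retried should be harmless in a deduplicating work queue, so this step acts as a safety net and does not replace the error propagation fix.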

Combined effect: During controller startup in underlay mode, if the subnet was processed before its VLAN was created, the subnet would get permanently stuck, causing allSubnetReady() to never return true and the controller to hang at "wait for subnets ready".
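
The startup ordering problem can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the controller type, its fields, and the run function are hypothetical, the queue is reduced to a single pending flag, and a returned error is treated as "retry" while nil is treated as "forget".

```go
package main

import (
	"errors"
	"fmt"
)

// controller is a hypothetical, heavily simplified model of the subnet controller.
type controller struct {
	vlanReady     bool
	subnetReady   bool
	swallowErrors bool // true models the Bug 1 behavior
}

// handleSubnet fails while the VLAN is missing; with swallowErrors set it
// reports success anyway, as the buggy handler did.
func (c *controller) handleSubnet() error {
	if !c.vlanReady {
		if c.swallowErrors {
			return nil
		}
		return errors.New("vlan not ready")
	}
	c.subnetReady = true
	return nil
}

// run processes the single subnet key for maxRounds rounds. The VLAN appears
// at round vlanAt. It reports whether the subnet ever became ready.
func run(c *controller, vlanAt, maxRounds int) bool {
	pending := true // the subnet key starts in the queue
	for round := 0; round < maxRounds; round++ {
		if round == vlanAt {
			c.vlanReady = true
		}
		if !pending {
			continue
		}
		// A non-nil error keeps the key queued (retry); nil drops it (forget).
		pending = c.handleSubnet() != nil
	}
	return c.subnetReady
}

func main() {
	fmt.Println("errors swallowed:", run(&controller{swallowErrors: true}, 2, 5))
	fmt.Println("errors returned: ", run(&controller{}, 2, 5))
}
```

With errors swallowed, the key is dropped in the first round and the subnet never becomes ready, which corresponds to the hang at "wait for subnets ready". With errors returned, the key is retried until the VLAN exists.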

Test plan

  • make lint passes (0 issues)
  • go test ./pkg/controller/ passes — all existing + new tests pass
  • New unit test Test_handleAddOrUpdateSubnet_vlanValidationError verifies the error is correctly returned
  • Verify in Talos Installation Test (ipv4, underlay) that the controller no longer hangs

🤖 Generated with Claude Code

@dosubot (bot) added the size:M (This PR changes 30-99 lines, ignoring generated files), bug (Something isn't working), and test (automation tests) labels on Feb 26, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where subnets could become permanently stuck if their associated VLAN was not ready, particularly during controller startup. By fixing variable shadowing that swallowed validation errors and implementing a mechanism to re-enqueue subnets once their VLAN becomes available, the controller's robustness and reliability are significantly improved, preventing hangs related to subnet readiness.

Highlights

  • Error Handling Improvement: Resolved variable shadowing in handleAddOrUpdateSubnet to correctly propagate VLAN and subnet validation errors, ensuring work queue retries instead of silently dropping errors.
  • Subnet Re-enqueue Logic: Implemented logic in handleAddVlan to re-enqueue subnets associated with a newly ready VLAN, preventing them from being permanently stuck if they were processed before their VLAN was available.
  • New Unit Test: Introduced a new unit test, Test_handleAddOrUpdateSubnet_vlanValidationError, to confirm handleAddOrUpdateSubnet properly returns errors when VLAN validation fails.

Changelog

  • pkg/controller/controller_test.go
    • Imported record and keymutex packages.
    • Added vlanInformer to the fakeControllerInformers struct.
    • Included a Vlans field in the FakeControllerOptions struct.
    • Modified newFakeControllerWithOptions to create fake VLANs and initialize the vlanInformer.
    • Initialized vlansLister, recorder, subnetKeyMutex, and addOrUpdateSubnetQueue in the fake controller setup.
  • pkg/controller/subnet.go
    • Corrected error assignment in handleAddOrUpdateSubnet to prevent variable shadowing when calling patchSubnetStatus, ensuring validation errors are properly returned.
  • pkg/controller/subnet_test.go
    • Added Test_handleAddOrUpdateSubnet_vlanValidationError to verify that handleAddOrUpdateSubnet returns an error when a subnet references a non-existent VLAN.
  • pkg/controller/vlan.go
    • Added a loop in handleAddVlan to re-enqueue subnets that reference the processed VLAN, allowing them to be re-evaluated.

Activity

  • Identified and fixed variable shadowing in handleAddOrUpdateSubnet that caused validation errors to be suppressed.
  • Implemented a mechanism to re-enqueue subnets when their associated VLAN becomes ready, addressing a scenario where subnets could be forgotten by the work queue.
  • Added a new unit test, Test_handleAddOrUpdateSubnet_vlanValidationError, to validate the improved error handling in handleAddOrUpdateSubnet.
  • Confirmed that make lint and existing go test ./pkg/controller/ pass.
  • Pending verification in a Talos Installation Test for ipv4 underlay.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a solid fix for a critical bug where subnets could get permanently stuck if their associated VLAN was not ready. The changes correctly address the root causes: swallowed validation errors in handleAddOrUpdateSubnet and the missing re-enqueuing of subnets in handleAddVlan. The new unit test is a valuable addition that verifies the fix. I have one minor suggestion to improve error handling consistency.

klog.Error(patchErr)
return patchErr
}
return err
Contributor

Priority: medium

For consistency with the VLAN validation error handling above, and to provide better context to the caller, consider wrapping this error before returning it. This will help in debugging by providing a more complete error trace in the logs of the calling worker function.

Suggested change
- return err
+ return fmt.Errorf("failed to validate subnet %s: %w", subnet.Name, err)

@oilbeater force-pushed the fix-underlay-subnet-stuck-on-vlan-not-ready branch 2 times, most recently from 8cd67f2 to 7b0213b, on February 27, 2026 01:58
…eady

Fix two bugs that combine to cause underlay subnets to get permanently
stuck during controller startup when the VLAN is created after the subnet.

Bug 1: In handleAddOrUpdateSubnet, variable shadowing (err :=) and
overwriting (err =) in the VLAN/subnet validation error paths caused
patchSubnetStatus success to zero out the original validation error.
The handler returned nil, making the work queue forget the item instead
of retrying it. Fix by using a separate patchErr variable for the patch
call and using = instead of := for the error wrapping.

Bug 2: handleAddVlan did not re-enqueue subnets that reference the
newly created VLAN. Once a subnet's validation failed and was forgotten
by the queue, no event would trigger it to be reprocessed. Fix by
iterating over subnets at the end of handleAddVlan and adding those
referencing the VLAN back to the addOrUpdateSubnetQueue.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
@oilbeater force-pushed the fix-underlay-subnet-stuck-on-vlan-not-ready branch from 7b0213b to ac1ff02 on February 27, 2026 09:23
@oilbeater oilbeater merged commit e9b65ce into master Feb 28, 2026
74 of 76 checks passed
@oilbeater oilbeater deleted the fix-underlay-subnet-stuck-on-vlan-not-ready branch February 28, 2026 14:46
oilbeater added a commit that referenced this pull request Feb 28, 2026
…eady (#6352)

Fix two bugs that combine to cause underlay subnets to get permanently
stuck during controller startup when the VLAN is created after the subnet.

Bug 1: In handleAddOrUpdateSubnet, variable shadowing (err :=) and
overwriting (err =) in the VLAN/subnet validation error paths caused
patchSubnetStatus success to zero out the original validation error.
The handler returned nil, making the work queue forget the item instead
of retrying it. Fix by using a separate patchErr variable for the patch
call and using = instead of := for the error wrapping.

Bug 2: handleAddVlan did not re-enqueue subnets that reference the
newly created VLAN. Once a subnet's validation failed and was forgotten
by the queue, no event would trigger it to be reprocessed. Fix by
iterating over subnets at the end of handleAddVlan and adding those
referencing the VLAN back to the addOrUpdateSubnetQueue.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit e9b65ce)