Conversation
Summary of ChangesHello @oilbeater, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where subnets could become permanently stuck if their associated VLAN was not ready, particularly during controller startup. By fixing variable shadowing that swallowed validation errors and implementing a mechanism to re-enqueue subnets once their VLAN becomes available, the controller's robustness and reliability are significantly improved, preventing hangs related to subnet readiness. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request provides a solid fix for a critical bug where subnets could get permanently stuck if their associated VLAN was not ready. The changes correctly address the root causes: swallowed validation errors in handleAddOrUpdateSubnet and the missing re-enqueuing of subnets in handleAddVlan. The new unit test is a valuable addition that verifies the fix. I have one minor suggestion to improve error handling consistency.
| klog.Error(patchErr) | ||
| return patchErr | ||
| } | ||
| return err |
There was a problem hiding this comment.
For consistency with the VLAN validation error handling above, and to provide better context to the caller, consider wrapping this error before returning it. This will help in debugging by providing a more complete error trace in the logs of the calling worker function.
| return err | |
| return fmt.Errorf("failed to validate subnet %s: %w", subnet.Name, err) |
8cd67f2 to
7b0213b
Compare
…eady Fix two bugs that combine to cause underlay subnets to get permanently stuck during controller startup when the VLAN is created after the subnet. Bug 1: In handleAddOrUpdateSubnet, variable shadowing (err :=) and overwriting (err =) in the VLAN/subnet validation error paths caused patchSubnetStatus success to zero out the original validation error. The handler returned nil, making the work queue forget the item instead of retrying it. Fix by using a separate patchErr variable for the patch call and using = instead of := for the error wrapping. Bug 2: handleAddVlan did not re-enqueue subnets that reference the newly created VLAN. Once a subnet's validation failed and was forgotten by the queue, no event would trigger it to be reprocessed. Fix by iterating over subnets at the end of handleAddVlan and adding those referencing the VLAN back to the addOrUpdateSubnetQueue. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
7b0213b to
ac1ff02
Compare
…eady (#6352) Fix two bugs that combine to cause underlay subnets to get permanently stuck during controller startup when the VLAN is created after the subnet. Bug 1: In handleAddOrUpdateSubnet, variable shadowing (err :=) and overwriting (err =) in the VLAN/subnet validation error paths caused patchSubnetStatus success to zero out the original validation error. The handler returned nil, making the work queue forget the item instead of retrying it. Fix by using a separate patchErr variable for the patch call and using = instead of := for the error wrapping. Bug 2: handleAddVlan did not re-enqueue subnets that reference the newly created VLAN. Once a subnet's validation failed and was forgotten by the queue, no event would trigger it to be reprocessed. Fix by iterating over subnets at the end of handleAddVlan and adding those referencing the VLAN back to the addOrUpdateSubnetQueue. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> (cherry picked from commit e9b65ce)
Summary
handleAddOrUpdateSubnetthat caused VLAN and subnet validation errors to be swallowed, preventing work queue retrieshandleAddVlanso subnets blocked by a missing VLAN get reprocessed once the VLAN is createdhandleAddOrUpdateSubnetcorrectly returns errors when VLAN validation failsDetails
Bug 1 (subnet.go:496-514): When
validateSubnetVlanorValidateSubnetfailed, the error was passed topatchSubnetStatus. If the patch succeeded, the original validation error was overwritten withnil(due toerr :=shadowing anderr =reuse), causing the handler to returnnil. The work queue then calledForget(item)instead ofAddRateLimited(item), so the subnet was never retried.Bug 2 (vlan.go:53-120):
handleAddVlanupdated the VLAN's status with associated subnets but did not re-enqueue those subnets. If a subnet had already been forgotten by the queue due to Bug 1, no event would trigger it to be reprocessed after the VLAN became available.Combined effect: During controller startup in underlay mode, if the subnet was processed before its VLAN was created, the subnet would get permanently stuck, causing
allSubnetReady()to never return true and the controller to hang at "wait for subnets ready".Test plan
make lintpasses (0 issues)go test ./pkg/controller/passes — all existing + new tests passTest_handleAddOrUpdateSubnet_vlanValidationErrorverifies the error is correctly returned🤖 Generated with Claude Code