Conversation
…n in metallb test The metallb e2e test "should support metallb and underlay combine" fails intermittently with curl exit code 56 (connection reset) when checking the second service's reachability immediately after deleting the first service. This happens because the OVN LoadBalancer data plane update and the daemon's syncFlows (ovs-ofctl replace-flows) have not yet completed, causing a brief disruption to the remaining service's data path. Add a waitUnderlayServiceFlowCleaned helper that polls all nodes until the deleted service's OpenFlow rules are fully removed from the external bridge, ensuring the data plane has stabilized before checking the second service. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a persistent flakiness in a metallb end-to-end test by introducing a crucial synchronization mechanism. It ensures that the network data plane has fully stabilized after a service deletion, specifically by waiting for associated OpenFlow rules to be cleared, before validating the reachability of other services. This prevents race conditions and improves the reliability of the test suite. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request aims to fix a flaky e2e test by waiting for OpenFlow rules to be cleaned up after a service deletion. The approach is sound. However, the implementation of the check in the new waitUnderlayServiceFlowCleaned function could be more robust. The current use of grep -w with an IP address is not reliable. I've provided a suggestion to improve the check by using grep -F with more specific patterns, which will make the test more stable and accurate.
| func waitUnderlayServiceFlowCleaned(nodeNames []string, providerNetworkName, serviceIP string, servicePort int32, timeout time.Duration) { | ||
| ginkgo.GinkgoHelper() | ||
|
|
||
| bridgeName := util.ExternalBridgeName(providerNetworkName) | ||
| matchPort := fmt.Sprintf("tp_dst=%d", servicePort) | ||
|
|
||
| framework.WaitUntil(1*time.Second, timeout, func(_ context.Context) (bool, error) { | ||
| for _, nodeName := range nodeNames { | ||
| cmd := fmt.Sprintf("kubectl ko ofctl %s dump-flows %s | grep -w %s | grep -w %s", | ||
| nodeName, bridgeName, serviceIP, matchPort) | ||
| if _, err := exec.Command("bash", "-c", cmd).CombinedOutput(); err == nil { | ||
| return false, nil // flow still exists on this node | ||
| } | ||
| } | ||
| return true, nil // flow cleaned from all nodes | ||
| }, fmt.Sprintf("underlay service flow for %s should be cleaned up", serviceIP)) | ||
| } |
There was a problem hiding this comment.
The use of grep -w to match an IP address is not reliable. The grep utility considers words to be sequences of letters, digits, and underscores. An IP address string containing dots is not treated as a single word, so grep -w will not match it as intended. This can cause the check to fail incorrectly.
Additionally, without the -F flag, grep treats the pattern as a regular expression, where . is a wildcard for any character. This could lead to incorrect matches (e.g., 1.2.3.4 could match 1a2b3c4).
To make the check more robust and correct, I suggest using grep -F for fixed-string matching and constructing a more specific pattern that includes the OpenFlow field name, such as nw_dst=<IP> or ipv6_dst=<IP>. This avoids both the word-boundary issue and the regex wildcard issue.
func waitUnderlayServiceFlowCleaned(nodeNames []string, providerNetworkName, serviceIP string, servicePort int32, timeout time.Duration) {
ginkgo.GinkgoHelper()
bridgeName := util.ExternalBridgeName(providerNetworkName)
matchPort := fmt.Sprintf("tp_dst=%d", servicePort)
var ipMatch string
if strings.Contains(serviceIP, ":") {
ipMatch = fmt.Sprintf("ipv6_dst=%s", serviceIP)
} else {
ipMatch = fmt.Sprintf("nw_dst=%s", serviceIP)
}
framework.WaitUntil(1*time.Second, timeout, func(_ context.Context) (bool, error) {
for _, nodeName := range nodeNames {
cmd := fmt.Sprintf("kubectl ko ofctl %s dump-flows %s | grep -F -- %q | grep -F -- %q",
nodeName, bridgeName, ipMatch, matchPort)
if _, err := exec.Command("bash", "-c", cmd).CombinedOutput(); err == nil {
return false, nil // flow still exists on this node
}
}
return true, nil // flow cleaned from all nodes
}, fmt.Sprintf("underlay service flow for %s should be cleaned up", serviceIP))
}
Summary
should support metallb and underlay combinefails intermittently withcurl exit code 56(connection reset) when checking the second service's reachability immediately after deleting the first servicesyncFlows(ovs-ofctl replace-flows) have not yet completed when the test checks the second service, causing a brief disruption to the remaining service's data pathwaitUnderlayServiceFlowCleanedhelper that polls all nodes until the deleted service's OpenFlow rules are fully removed from the external bridge, ensuring the data plane has stabilized before connectivity checksTest plan
🤖 Generated with Claude Code