fix(e2e): wait for flow cleanup before checking service after deletion in metallb test #6359

Merged
oilbeater merged 1 commit into master from fix/e2e-metallb-wait-flow-cleanup
Feb 27, 2026

@oilbeater
Collaborator

Summary

  • The metallb e2e test "should support metallb and underlay combine" fails intermittently with curl exit code 56 (connection reset) when it checks the second service's reachability immediately after deleting the first service
  • Root cause: the OVN LoadBalancer data plane update and the daemon's syncFlows (ovs-ofctl replace-flows) have not yet completed when the test checks the second service, causing a brief disruption to the remaining service's data path
  • Added waitUnderlayServiceFlowCleaned helper that polls all nodes until the deleted service's OpenFlow rules are fully removed from the external bridge, ensuring the data plane has stabilized before connectivity checks

Test plan

  • Run the metallb e2e test suite multiple times to verify the flaky test no longer fails
  • Verify the new wait logic correctly detects flow rule cleanup within the 15s timeout

🤖 Generated with Claude Code

…n in metallb test

The metallb e2e test "should support metallb and underlay combine" fails
intermittently with curl exit code 56 (connection reset) when checking
the second service's reachability immediately after deleting the first
service. This happens because the OVN LoadBalancer data plane update and
the daemon's syncFlows (ovs-ofctl replace-flows) have not yet completed,
causing a brief disruption to the remaining service's data path.

Add a waitUnderlayServiceFlowCleaned helper that polls all nodes until
the deleted service's OpenFlow rules are fully removed from the external
bridge, ensuring the data plane has stabilized before checking the
second service.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dosubot dosubot bot added the size:S label (This PR changes 10-29 lines, ignoring generated files.) Feb 27, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a persistent flakiness in a metallb end-to-end test by introducing a crucial synchronization mechanism. It ensures that the network data plane has fully stabilized after a service deletion, specifically by waiting for associated OpenFlow rules to be cleared, before validating the reachability of other services. This prevents race conditions and improves the reliability of the test suite.

Highlights

  • Flaky Test Fix: Resolved an intermittent curl exit code 56 failure in the metallb e2e test "should support metallb and underlay combine".
  • Root Cause Addressed: The issue stemmed from OVN LoadBalancer data plane updates and syncFlows not completing before subsequent service reachability checks, causing temporary data path disruptions.
  • New Helper Function: Introduced waitUnderlayServiceFlowCleaned to poll all nodes, ensuring that the deleted service's OpenFlow rules are fully removed from the external bridge before proceeding with connectivity checks.

Changelog

  • test/e2e/metallb/e2e_test.go
    • Integrated a call to waitUnderlayServiceFlowCleaned after deleting the first service to ensure OpenFlow rules are cleaned up.
    • Implemented the new helper function waitUnderlayServiceFlowCleaned which polls nodes to verify the removal of specific OpenFlow rules.

Activity

  • The pull request description indicates it was generated with Claude Code.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@dosubot dosubot bot added the test label (automation tests) Feb 27, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a flaky e2e test by waiting for OpenFlow rules to be cleaned up after a service deletion. The approach is sound. However, the implementation of the check in the new waitUnderlayServiceFlowCleaned function could be more robust. The current use of grep -w with an IP address is not reliable. I've provided a suggestion to improve the check by using grep -F with more specific patterns, which will make the test more stable and accurate.

Comment on lines +623 to +639
func waitUnderlayServiceFlowCleaned(nodeNames []string, providerNetworkName, serviceIP string, servicePort int32, timeout time.Duration) {
	ginkgo.GinkgoHelper()

	bridgeName := util.ExternalBridgeName(providerNetworkName)
	matchPort := fmt.Sprintf("tp_dst=%d", servicePort)

	framework.WaitUntil(1*time.Second, timeout, func(_ context.Context) (bool, error) {
		for _, nodeName := range nodeNames {
			cmd := fmt.Sprintf("kubectl ko ofctl %s dump-flows %s | grep -w %s | grep -w %s",
				nodeName, bridgeName, serviceIP, matchPort)
			if _, err := exec.Command("bash", "-c", cmd).CombinedOutput(); err == nil {
				return false, nil // flow still exists on this node
			}
		}
		return true, nil // flow cleaned from all nodes
	}, fmt.Sprintf("underlay service flow for %s should be cleaned up", serviceIP))
}

medium

The use of grep -w to match an IP address is not reliable. The grep utility considers words to be sequences of letters, digits, and underscores. An IP address string containing dots is not treated as a single word, so grep -w will not match it as intended. This can cause the check to fail incorrectly.

Additionally, without the -F flag, grep treats the pattern as a regular expression, where . is a wildcard for any character. This could lead to incorrect matches (e.g., 1.2.3.4 could match 1a2b3c4).
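
The wildcard behavior described above can be reproduced in a few lines. This is an illustrative sketch only: Go's regexp package stands in for grep (an unescaped "." means "any single character" in both), and the helper names regexMatch and literalMatch are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// regexMatch treats pattern as a regular expression, as grep does without -F.
func regexMatch(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

// literalMatch escapes regex metacharacters first, so "." matches only ".".
func literalMatch(pattern, s string) bool {
	return regexp.MustCompile(regexp.QuoteMeta(pattern)).MatchString(s)
}

func main() {
	fmt.Println(regexMatch("1.2.3.4", "1a2b3c4"))   // true: each "." matched a letter
	fmt.Println(literalMatch("1.2.3.4", "1a2b3c4")) // false
	fmt.Println(literalMatch("1.2.3.4", "1.2.3.4")) // true
}
```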

To make the check more robust and correct, I suggest using grep -F for fixed-string matching and constructing a more specific pattern that includes the OpenFlow field name, such as nw_dst=<IP> or ipv6_dst=<IP>. This avoids both the word-boundary issue and the regex wildcard issue.

func waitUnderlayServiceFlowCleaned(nodeNames []string, providerNetworkName, serviceIP string, servicePort int32, timeout time.Duration) {
	ginkgo.GinkgoHelper()

	bridgeName := util.ExternalBridgeName(providerNetworkName)
	matchPort := fmt.Sprintf("tp_dst=%d", servicePort)

	var ipMatch string
	if strings.Contains(serviceIP, ":") {
		ipMatch = fmt.Sprintf("ipv6_dst=%s", serviceIP)
	} else {
		ipMatch = fmt.Sprintf("nw_dst=%s", serviceIP)
	}

	framework.WaitUntil(1*time.Second, timeout, func(_ context.Context) (bool, error) {
		for _, nodeName := range nodeNames {
			cmd := fmt.Sprintf("kubectl ko ofctl %s dump-flows %s | grep -F -- %q | grep -F -- %q",
				nodeName, bridgeName, ipMatch, matchPort)
			if _, err := exec.Command("bash", "-c", cmd).CombinedOutput(); err == nil {
				return false, nil // flow still exists on this node
			}
		}
		return true, nil // flow cleaned from all nodes
	}, fmt.Sprintf("underlay service flow for %s should be cleaned up", serviceIP))
}
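
One caveat with fixed-string matching is that it is still a substring search: nw_dst=10.0.0.1 also occurs inside nw_dst=10.0.0.10, and tp_dst=80 inside tp_dst=8080. A possible alternative, sketched below, is to capture the dump-flows output and compare match fields exactly in Go instead of piping through grep. This is not the merged implementation; hasField and flowPresent are hypothetical names, and the sample flow line in main is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// hasField reports whether a dump-flows line contains a key=value token
// whose key and value both match exactly.
func hasField(line, key, value string) bool {
	// Fields in a flow line are separated by commas or spaces.
	tokens := strings.FieldsFunc(line, func(r rune) bool { return r == ',' || r == ' ' })
	for _, tok := range tokens {
		k, v, ok := strings.Cut(tok, "=")
		if ok && k == key && v == value {
			return true
		}
	}
	return false
}

// flowPresent reports whether any line of the dump matches both the
// service IP (nw_dst or ipv6_dst) and the service port (tp_dst).
func flowPresent(dump, serviceIP string, servicePort int32) bool {
	ipKey := "nw_dst"
	if strings.Contains(serviceIP, ":") {
		ipKey = "ipv6_dst"
	}
	port := fmt.Sprint(servicePort)
	for _, line := range strings.Split(dump, "\n") {
		if hasField(line, ipKey, serviceIP) && hasField(line, "tp_dst", port) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up dump-flows output; real flows carry more fields.
	dump := " cookie=0x0, table=0, priority=110,tcp,nw_dst=10.0.0.10,tp_dst=8080 actions=NORMAL\n"
	fmt.Println(flowPresent(dump, "10.0.0.1", 80))    // false: no prefix match
	fmt.Println(flowPresent(dump, "10.0.0.10", 8080)) // true
}
```

Parsing in Go would also let the helper tell a failed kubectl invocation apart from "no matching flow", which a shell pipeline ending in grep reports the same way.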

@oilbeater oilbeater merged commit f9865af into master Feb 27, 2026
74 of 76 checks passed
@oilbeater oilbeater deleted the fix/e2e-metallb-wait-flow-cleanup branch February 27, 2026 06:46