fix RT-7.5 flakiness #5258

Open
aks03dev wants to merge 3 commits into openconfig:main from b4firex:fix/rt-7.5

@aks03dev

Fixes for the flakiness seen on RT-7.5

Proposed changes

1. validateRouteCommunityV4Prefix / validateRouteCommunityV6Prefix — WatchAll + LookupAll hybrid

What changed. The WatchAll predicate is now scoped to the specific prefix under test and evaluates expected community content (standard 100:100, link-bandwidth cases, or "none" including rejection of stray link-bandwidth extended communities where required), not merely “any prefix appeared.” If WatchAll times out, the test logs and continues to final validation instead of returning early; the authoritative pass/fail is the subsequent phase. Final validation uses getV4Prefixes / getV6Prefixes, which call gnmi.LookupAll (empty slice when nothing is present) instead of gnmi.GetAll (which fatals on empty).

Why: the core problem. The original flow was effectively: wait with WatchAll until something shows up, then GetAll every prefix and assert. That mixes change detection with the authoritative RIB snapshot. Three failure modes drove the flakes: (a) ExtendedCommunity entries accumulated across multiple notifications caused intermittent bandwidth mismatches; (b) for "none", a predicate like len(prefix.Community) == 0 could never succeed while accumulated communities lingered in the STREAM cache; (c) a transient withdrawal between “saw a prefix” and GetAll produced an empty list and an immediate test failure. The fix uses WatchAll only as a convergence signal and LookupAll (a ONCE-style read) for assertions, so the final verdict never depends on the accumulator.
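The two-phase idea can be sketched without the test harness. The following is a minimal, self-contained illustration, not the code in this PR: prefixState, sameStrings, and validateCommunity are hypothetical names, the convergence wait is modelled as a plain func() bool, and the final read is modelled as a function returning a fresh slice (standing in for the LookupAll-based helpers).

```go
package main

import "fmt"

// prefixState is a hypothetical stand-in for one BGP prefix entry on the OTG peer.
type prefixState struct {
	Address     string
	Communities []string
}

// sameStrings reports whether a and b hold the same elements in the same order.
func sameStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// validateCommunity treats the convergence wait as advisory and bases the
// verdict only on a fresh snapshot (the LookupAll-style read).
func validateCommunity(converged func() bool, snapshot func() []prefixState, addr string, want []string) error {
	if !converged() {
		// A timeout is logged, not fatal: the snapshot below decides.
		fmt.Println("convergence wait timed out; continuing to final validation")
	}
	for _, p := range snapshot() {
		if p.Address != addr {
			continue
		}
		if !sameStrings(p.Communities, want) {
			return fmt.Errorf("prefix %s: got communities %v, want %v", addr, p.Communities, want)
		}
		return nil
	}
	return fmt.Errorf("prefix %s not present in final snapshot", addr)
}

func main() {
	snapshot := func() []prefixState {
		return []prefixState{{Address: "203.0.113.0", Communities: []string{"100:100"}}}
	}
	timedOut := func() bool { return false } // simulate a WatchAll timeout
	err := validateCommunity(timedOut, snapshot, "203.0.113.0", []string{"100:100"})
	fmt.Println("result:", err)
}
```

In this sketch a timed-out wait followed by a correct snapshot still passes, while a successful wait followed by a wrong or empty snapshot fails, which is the behaviour the description attributes to the new flow.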

2. enableExtCommunityCLIConfig — remove unnecessary sleep

3. validateImportPolicyDut — consolidated prefix counting

What changed. Replaced the chain WatchAll (any prefix) → GetAll → per-prefix Watch with subnet checks with a single WatchAll over UnicastIpv4PrefixAny().State() whose predicate keeps a map[string]bool of distinct addresses inside the expected subnet (parseV4) and succeeds when three distinct matching prefixes have been seen; on timeout the failure message includes how many were observed. The gate uses a single 2-minute timeout instead of a mix of shorter waits.

Why. The old sequence raced: the first WatchAll only proved that at least one prefix existed, not that the set of three stayed stable between GetAll and the per-prefix watches. Counting inside one subscription ties “three prefixes in subnet” to one continuous stream, which removes that race.
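As a rough illustration of the in-predicate counting, the sketch below keeps a map[string]bool of distinct addresses inside a subnet and reports success once enough have been seen. It is an assumption-laden model rather than the PR's code: newSubnetCounter is a hypothetical helper, the stream of updates is a plain slice of strings, and the subnet is a documentation range chosen for the example.

```go
package main

import (
	"fmt"
	"net/netip"
)

// newSubnetCounter returns a predicate that records each distinct address
// inside cidr and reports true once want distinct addresses have been seen.
// seen exposes the running count so a timeout message can include it.
func newSubnetCounter(cidr string, want int) (pred func(addr string) bool, seen func() int) {
	subnet := netip.MustParsePrefix(cidr)
	distinct := map[string]bool{}
	pred = func(addr string) bool {
		ip, err := netip.ParseAddr(addr)
		if err == nil && subnet.Contains(ip) {
			distinct[ip.String()] = true // repeats of the same address do not add to the count
		}
		return len(distinct) >= want
	}
	seen = func() int { return len(distinct) }
	return pred, seen
}

func main() {
	pred, seen := newSubnetCounter("203.0.113.0/24", 3)
	// Simulated updates: one repeat and one address outside the subnet.
	updates := []string{"203.0.113.1", "203.0.113.1", "198.51.100.7", "203.0.113.2", "203.0.113.3"}
	for _, u := range updates {
		if pred(u) {
			fmt.Printf("converged with %d distinct prefixes\n", seen())
			return
		}
	}
	fmt.Printf("timed out: saw %d of 3 prefixes\n", seen())
}
```

Because the map lives in the closure, the count persists across every update delivered to the same predicate, which is what lets a single subscription answer "have three distinct prefixes appeared?" without a separate snapshot step.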

4. validateImportRoutingPolicyAllowAll — Watch instead of Get for policy verification

What changed. For IPv4 and IPv6 apply-policy state, gnmi.Get was replaced with gnmi.Watch (30s timeout) while keeping the same expectation: exactly one import policy named allow-all.

Why. After removeImportAndExportPolicy and applyImportPolicyDut, OpenConfig state can lag config briefly; an immediate Get may see empty or stale policy names. Watch waits until the predicate holds or the timeout fires, matching operational “ready when state matches.”
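The difference between a one-shot read and a wait-until-true check can be shown with a small polling model. This is only a sketch under stated assumptions: awaitState is a hypothetical helper, and it polls a read function, whereas gnmi.Watch is subscription-based; the point illustrated is the "predicate or timeout" contract, not the transport.

```go
package main

import (
	"fmt"
	"time"
)

// awaitState re-reads until ok holds or the timeout elapses. It returns the
// last value read and an error if the predicate never held.
func awaitState[T any](read func() T, ok func(T) bool, timeout, interval time.Duration) (T, error) {
	deadline := time.Now().Add(timeout)
	for {
		v := read()
		if ok(v) {
			return v, nil
		}
		if !time.Now().Before(deadline) {
			return v, fmt.Errorf("state did not converge within %v", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// The first two reads return no policy, modelling state that lags config.
	read := func() []string {
		calls++
		if calls < 3 {
			return nil
		}
		return []string{"allow-all"}
	}
	onlyAllowAll := func(p []string) bool { return len(p) == 1 && p[0] == "allow-all" }

	got, err := awaitState(read, onlyAllowAll, time.Second, 10*time.Millisecond)
	fmt.Println(got, err, "after", calls, "reads")
}
```

A single immediate read in this model would have returned the empty list and failed, even though the expected policy appears two reads later.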

5. checkTraffic — single retry on packet loss

What changed. Traffic start → 30s run → stop → metrics is wrapped so that if loss exceeds 1% on the first attempt, the test logs once and repeats the whole measurement; only the second attempt’s outcome can produce the final failure. The 1% threshold is unchanged for the attempt that matters.
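The retry shape described here is small enough to sketch in isolation. In the example below, lossWithOneRetry is a hypothetical name and measure stands in for the whole start, run, stop, and collect-metrics cycle; the loss figures are made up for the illustration.

```go
package main

import "fmt"

// lossWithOneRetry runs measure once and, only if loss exceeds thresholdPct,
// logs and runs it one more time. It returns the loss from the attempt that
// counts and how many attempts were made; the caller applies the threshold.
func lossWithOneRetry(measure func() float64, thresholdPct float64) (loss float64, attempts int) {
	loss = measure()
	if loss <= thresholdPct {
		return loss, 1
	}
	fmt.Printf("loss %.2f%% exceeds %.2f%% on first attempt; retrying once\n", loss, thresholdPct)
	return measure(), 2
}

func main() {
	// Simulated measurements: a lossy first run, then a clean one.
	results := []float64{5.0, 0.2}
	i := 0
	measure := func() float64 {
		v := results[i]
		i++
		return v
	}

	loss, attempts := lossWithOneRetry(measure, 1.0)
	if loss > 1.0 {
		fmt.Printf("FAIL: loss %.2f%% after %d attempts\n", loss, attempts)
		return
	}
	fmt.Printf("PASS: loss %.2f%% after %d attempts\n", loss, attempts)
}
```

Keeping the threshold check with the caller mirrors the statement that the 1% limit is unchanged: the retry only decides which measurement is judged, not how strictly.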

6. baseSetupConfigAndVerification — explicit prefix waits before traffic

What changed. After base BGP setup (and import-policy validation when not skipped by deviations), added WatchAll waits (2 minutes) for IPv4 and IPv6 unicast prefixes on the OTG peer before creating traffic flows.

Why. Flows assume destinations are already learned on OTG port2; starting traffic earlier can fail for control-plane timing unrelated to link-bandwidth policy correctness.

@aks03dev aks03dev requested a review from a team as a code owner March 25, 2026 10:29
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on addressing flakiness in BGP link-bandwidth tests by refining how network state and policy convergence are observed and validated. The changes introduce more robust waiting mechanisms, consolidate prefix counting logic, and improve the reliability of traffic checks, leading to more stable and accurate test results.

Highlights

  • Enhanced Route Community Validation: The validateRouteCommunityV4Prefix and validateRouteCommunityV6Prefix functions were refactored to use a 'WatchAll + LookupAll' hybrid approach. The WatchAll predicate is now scoped to the specific prefix and evaluates expected community content, acting as a convergence signal. Final validation uses gnmi.LookupAll to get an authoritative RIB snapshot, avoiding issues with accumulated communities or transient withdrawals.
  • Optimized Extended Community CLI Configuration: An unnecessary time.Sleep was removed from enableExtCommunityCLIConfig, and the configuration application for IPv4 and IPv6 neighbors was streamlined into a loop.
  • Consolidated Import Policy Prefix Counting: The validateImportPolicyDut function was improved by replacing a multi-step WatchAll -> GetAll -> per-prefix Watch sequence with a single WatchAll that uses an in-predicate map to count distinct prefixes within the expected subnet, ensuring more reliable convergence detection.
  • Improved Import Routing Policy Verification: In validateImportRoutingPolicyAllowAll, gnmi.Get calls for verifying IPv4 and IPv6 apply-policy states were replaced with gnmi.Watch with a 30-second timeout. This change accounts for potential lag between configuration and state updates, ensuring the test waits for the policy to be correctly applied.
  • Traffic Check Retry Mechanism: The checkTraffic function now includes a single retry attempt if packet loss exceeds 1% on the initial measurement, allowing for transient network conditions without immediate test failure.
  • Explicit Prefix Waits Before Traffic: Explicit WatchAll waits (2 minutes) for IPv4 and IPv6 unicast prefixes on the OTG peer were added to baseSetupConfigAndVerification before initiating traffic flows. This ensures that destinations are learned before traffic starts, preventing failures due to control-plane timing.

@OpenConfigBot

OpenConfigBot commented Mar 25, 2026

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors several BGP policy and traffic validation functions to improve test reliability, reduce code duplication, and enhance clarity. Key changes include using loops for CLI configuration, streamlining prefix validation with a single gnmi.WatchAll and a map, introducing helper functions for link bandwidth community checks, and implementing gnmi.Watch for policy convergence. Additionally, a retry mechanism was added for traffic validation, and explicit waits for prefix advertisement were introduced before traffic generation. The review suggests further refactoring of the getV4Prefixes and getV6Prefixes functions, as well as the prefix waiting logic in baseSetupConfigAndVerification, into generic helpers to further reduce code duplication.

Comment on lines +462 to +510
func getV4Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv4Prefix {
	t.Helper()
	var result []*otgtelemetry.BgpPeer_UnicastIpv4Prefix
	vals := gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State())
	for _, v := range vals {
		if val, ok := v.Val(); ok {
			result = append(result, val)
		}
	}
	if len(result) == 0 {
		t.Logf("V4 prefixes not present, waiting for re-advertisement...")
		gnmi.WatchAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State(), 30*time.Second, func(v *ygnmi.Value[*otgtelemetry.BgpPeer_UnicastIpv4Prefix]) bool {
			_, present := v.Val()
			return present
		}).Await(t)
		vals = gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State())
		for _, v := range vals {
			if val, ok := v.Val(); ok {
				result = append(result, val)
			}
		}
	}
	return result
}

func getV6Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv6Prefix {
	t.Helper()
	var result []*otgtelemetry.BgpPeer_UnicastIpv6Prefix
	vals := gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State())
	for _, v := range vals {
		if val, ok := v.Val(); ok {
			result = append(result, val)
		}
	}
	if len(result) == 0 {
		t.Logf("V6 prefixes not present, waiting for re-advertisement...")
		gnmi.WatchAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State(), 30*time.Second, func(v *ygnmi.Value[*otgtelemetry.BgpPeer_UnicastIpv6Prefix]) bool {
			_, present := v.Val()
			return present
		}).Await(t)
		vals = gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State())
		for _, v := range vals {
			if val, ok := v.Val(); ok {
				result = append(result, val)
			}
		}
	}
	return result
}

medium

The functions getV4Prefixes and getV6Prefixes contain nearly identical logic. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider refactoring this into a single generic function.

A similar pattern of duplication exists in baseSetupConfigAndVerification where you wait for IPv4 and IPv6 prefixes (lines 1297-1304). This could also be extracted into a generic helper.

Here are some examples of how you could implement these generic helpers:

For getting prefixes:

func getPrefixes[T any](t *testing.T, otg *otg.OTG, query ygnmi.WildcardQuery[T], logName string) []T {
	t.Helper()
	var result []T
	vals := gnmi.LookupAll(t, otg, query)
	for _, v := range vals {
		if val, ok := v.Val(); ok {
			result = append(result, val)
		}
	}
	if len(result) == 0 {
		t.Logf("%s prefixes not present, waiting for re-advertisement...", logName)
		gnmi.WatchAll(t, otg, query, 30*time.Second, func(v *ygnmi.Value[T]) bool {
			_, present := v.Val()
			return present
		}).Await(t)
		vals = gnmi.LookupAll(t, otg, query)
		for _, v := range vals {
			if val, ok := v.Val(); ok {
				result = append(result, val)
			}
		}
	}
	return result
}

// You could then refactor getV4Prefixes and getV6Prefixes:
func getV4Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv4Prefix {
	t.Helper()
	query := gnmi.OTG().BgpPeer(td.otgP2.Name() + ".BGP4.peer").UnicastIpv4PrefixAny().State()
	return getPrefixes(t, td.ate.OTG(), query, "V4")
}

func getV6Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv6Prefix {
	t.Helper()
	query := gnmi.OTG().BgpPeer(td.otgP2.Name() + ".BGP6.peer").UnicastIpv6PrefixAny().State()
	return getPrefixes(t, td.ate.OTG(), query, "V6")
}

For awaiting prefixes:

func awaitPrefixes[T any](t *testing.T, otg *otg.OTG, query ygnmi.WildcardQuery[T], afi string) {
	t.Helper()
	t.Logf("Waiting for %s prefixes...", afi)
	gnmi.WatchAll(t, otg, query, 2*time.Minute, func(v *ygnmi.Value[T]) bool {
		_, present := v.Val()
		return present
	}).Await(t)
}

// Then call it in baseSetupConfigAndVerification:
// awaitPrefixes(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State(), "IPv4")
// awaitPrefixes(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State(), "IPv6")

This will make the code cleaner and easier to maintain.
