fix RT-7.5 flakiness by aks03dev · Pull Request #5258 · openconfig/featureprofiles

aks03dev · 2026-03-25T10:29:50Z

Fixes for the flakiness seen on RT-7.5

Proposed changes

1. `validateRouteCommunityV4Prefix` / `validateRouteCommunityV6Prefix` — WatchAll + LookupAll hybrid

What changed. The WatchAll predicate is now scoped to the specific prefix under test and evaluates expected community content (standard 100:100, link-bandwidth cases, or "none" including rejection of stray link-bandwidth extended communities where required), not merely “any prefix appeared.” If WatchAll times out, the test logs and continues to final validation instead of returning early; the authoritative pass/fail is the subsequent phase. Final validation uses getV4Prefixes / getV6Prefixes, which call gnmi.LookupAll (empty slice when nothing is present) instead of gnmi.GetAll (which fatals on empty).

Why: the core problem. The original flow was effectively: wait until something shows up with WatchAll, then GetAll all prefixes and assert. That mixes change detection with an authoritative RIB snapshot. Three failure modes drove flakes: (a) accumulated ExtendedCommunity entries from multiple notifications caused intermittent bandwidth mismatches; (b) for "none", a predicate like len(prefix.Community) == 0 could never succeed if accumulated communities lingered in the STREAM cache; (c) a transient withdrawal between “saw a prefix” and GetAll produced an empty list and an immediate test failure. The fix uses WatchAll only as a convergence signal and LookupAll (a ONCE-style read) for assertions, so the final verdict never depends on the stream accumulator.
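The two-phase flow described above can be sketched without any test-framework dependency. This is a minimal illustration, not the code in this PR: `prefix`, `validateCommunity`, and the `converged` and `snapshot` callbacks are hypothetical stand-ins for the WatchAll wait and the LookupAll-based read.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
)

// prefix is a hypothetical stand-in for one learned BGP route.
type prefix struct {
	Address     string
	Communities []string
}

// validateCommunity runs the two phases.
// converged reports whether the streaming wait saw the expected state before
// its timeout. snapshot returns a fresh point-in-time read and yields an
// empty slice (never a fatal error) when no routes are present.
func validateCommunity(converged func() bool, snapshot func() []prefix, addr string, want []string) error {
	// Phase 1: convergence signal only. A timeout is logged, not fatal.
	if !converged() {
		fmt.Printf("watch for %s timed out; continuing to final validation\n", addr)
	}
	// Phase 2: the authoritative check runs against the snapshot.
	for _, p := range snapshot() {
		if p.Address != addr {
			continue
		}
		// slices.Equal treats nil and empty as equal, which covers "none".
		if !slices.Equal(p.Communities, want) {
			return fmt.Errorf("prefix %s: got communities %v, want %v", addr, p.Communities, want)
		}
		return nil
	}
	return errors.New("prefix " + addr + " not present in snapshot")
}

func main() {
	snap := func() []prefix {
		return []prefix{{Address: "203.0.113.0", Communities: []string{"100:100"}}}
	}
	timedOut := func() bool { return false }
	// The watch timed out, but the snapshot is correct, so validation passes.
	fmt.Println(validateCommunity(timedOut, snap, "203.0.113.0", []string{"100:100"}))
}
```

The point of the split is that a missed or late notification in phase 1 cannot fail the test by itself; only the snapshot decides.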

2. `enableExtCommunityCLIConfig` — remove unnecessary sleep

3. `validateImportPolicyDut` — consolidated prefix counting

What changed. The previous chain (WatchAll for any prefix → GetAll → per-prefix Watch with subnet checks) is replaced by a single WatchAll over UnicastIpv4PrefixAny().State(). Its predicate keeps a map[string]bool of distinct addresses inside the expected subnet (parseV4) and succeeds once three distinct matching prefixes have been seen; on timeout, the failure message reports how many were observed. The gate uses a single 2-minute timeout instead of a mix of shorter waits.

Why. The old sequence raced: the first WatchAll only proved that at least one prefix existed, not that the set of three was stable between GetAll and the per-prefix watches. Counting inside one subscription ties “three prefixes in subnet” to one continuous stream, which removes that source of flakiness.
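The counting predicate can be illustrated on its own. The sketch below is an assumption-laden model, not the PR's code: `subnetCounter`, `observe`, and `seen` are hypothetical names, and the stream of updates is a plain slice rather than a gNMI subscription.

```go
package main

import (
	"fmt"
	"net/netip"
)

// subnetCounter remembers distinct addresses seen inside one subnet.
type subnetCounter struct {
	subnet netip.Prefix
	want   int
	found  map[string]bool
}

// newSubnetCounter parses cidr and returns a counter that is satisfied once
// want distinct addresses inside that subnet have been observed.
func newSubnetCounter(cidr string, want int) (*subnetCounter, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	return &subnetCounter{subnet: p, want: want, found: map[string]bool{}}, nil
}

// observe is meant to be called once per streamed update. Unparseable or
// out-of-subnet addresses are ignored; duplicates are counted once.
func (c *subnetCounter) observe(addr string) bool {
	if ip, err := netip.ParseAddr(addr); err == nil && c.subnet.Contains(ip) {
		c.found[ip.String()] = true
	}
	return len(c.found) >= c.want
}

// seen returns how many distinct matching addresses were observed so far,
// which is what a timeout message would report.
func (c *subnetCounter) seen() int { return len(c.found) }

func main() {
	c, err := newSubnetCounter("198.51.100.0/24", 3)
	if err != nil {
		panic(err)
	}
	updates := []string{"198.51.100.0", "198.51.100.0", "192.0.2.0", "198.51.100.4", "198.51.100.8"}
	for _, u := range updates {
		if c.observe(u) {
			fmt.Printf("converged after seeing %d distinct prefixes\n", c.seen())
			return
		}
	}
	fmt.Printf("timed out: saw %d of 3 expected prefixes\n", c.seen())
}
```

Because the state lives inside the predicate, the "three prefixes" condition is evaluated against one continuous sequence of updates rather than against separate reads.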

4. `validateImportRoutingPolicyAllowAll` — Watch instead of Get for policy verification

What changed. For IPv4 and IPv6 apply-policy state, gnmi.Get was replaced with gnmi.Watch (30s timeout) while keeping the same expectation: exactly one import policy named allow-all.

Why. After removeImportAndExportPolicy and applyImportPolicyDut, OpenConfig state can lag config briefly; an immediate Get may see empty or stale policy names. Watch waits until the predicate holds or the timeout fires, matching operational “ready when state matches.”
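The difference between a one-shot read and a wait-until-true read can be shown with a small dependency-free model. Everything here is hypothetical (`awaitState`, `readImportPolicy`), and it polls on an interval, whereas a real gNMI watch is driven by subscription updates; only the "predicate holds or timeout fires" contract is the same.

```go
package main

import (
	"fmt"
	"time"
)

// awaitState polls read until pred accepts the value or timeout elapses.
// It returns the last value read and whether pred was satisfied.
func awaitState[T any](read func() T, pred func(T) bool, timeout, interval time.Duration) (T, bool) {
	deadline := time.Now().Add(timeout)
	for {
		v := read()
		if pred(v) {
			return v, true
		}
		// Stop if another sleep would carry us past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return v, false
		}
		time.Sleep(interval)
	}
}

func main() {
	// Hypothetical state source: the policy list lags config and is empty
	// for the first two reads.
	reads := 0
	readImportPolicy := func() []string {
		reads++
		if reads < 3 {
			return nil
		}
		return []string{"allow-all"}
	}
	onlyAllowAll := func(p []string) bool { return len(p) == 1 && p[0] == "allow-all" }

	got, ok := awaitState(readImportPolicy, onlyAllowAll, 30*time.Second, 10*time.Millisecond)
	fmt.Println(got, ok, reads)
}
```

An immediate read would have returned the empty list from the first call; the waiting form returns as soon as the expectation is met and only fails after the full timeout.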

5. `checkTraffic` — single retry on packet loss

What changed. Traffic start → 30s run → stop → metrics is wrapped so that if loss exceeds 1% on the first attempt, the test logs once and repeats the whole measurement; only the second attempt’s outcome can produce the final failure. The 1% threshold is unchanged for the attempt that matters.
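The retry rule can be expressed as a small helper. This is a sketch under stated assumptions: `measureWithRetry` is a hypothetical name, and `measure` stands for one full start, run, stop, and metrics cycle that returns loss as a percentage.

```go
package main

import "fmt"

// measureWithRetry calls measure up to attempts times. An attempt whose loss
// is at or below thresholdPct passes immediately. Earlier failures are only
// logged; the error is built from the last attempt's result.
func measureWithRetry(measure func() float64, thresholdPct float64, attempts int) (float64, error) {
	if attempts < 1 {
		attempts = 1
	}
	var loss float64
	for i := 1; i <= attempts; i++ {
		loss = measure()
		if loss <= thresholdPct {
			return loss, nil
		}
		if i < attempts {
			fmt.Printf("attempt %d: loss %.2f%% exceeds %.2f%%, retrying\n", i, loss, thresholdPct)
		}
	}
	return loss, fmt.Errorf("loss %.2f%% exceeds %.2f%% after %d attempts", loss, thresholdPct, attempts)
}

func main() {
	// Simulated measurements: a transient loss on the first run, clean on the second.
	results := []float64{2.5, 0.0}
	next := 0
	measure := func() float64 {
		r := results[next]
		next++
		return r
	}
	loss, err := measureWithRetry(measure, 1.0, 2)
	fmt.Println(loss, err)
}
```

With two attempts and a 1% threshold this matches the behaviour described above: the threshold itself is not relaxed, the measurement is simply given one more chance.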

6. `baseSetupConfigAndVerification` — explicit prefix waits before traffic

What changed. After base BGP setup (and import-policy validation when not skipped by deviations), added WatchAll waits (2 minutes) for IPv4 and IPv6 unicast prefixes on the OTG peer before creating traffic flows.

Why. Flows assume destinations are already learned on OTG port2; starting traffic earlier can fail for control-plane timing unrelated to link-bandwidth policy correctness.

gemini-code-assist · 2026-03-25T10:30:11Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on addressing flakiness in BGP link-bandwidth tests by refining how network state and policy convergence are observed and validated. The changes introduce more robust waiting mechanisms, consolidate prefix counting logic, and improve the reliability of traffic checks, leading to more stable and accurate test results.

Highlights

Enhanced Route Community Validation: The validateRouteCommunityV4Prefix and validateRouteCommunityV6Prefix functions were refactored to use a 'WatchAll + LookupAll' hybrid approach. The WatchAll predicate is now scoped to the specific prefix and evaluates expected community content, acting as a convergence signal. Final validation uses gnmi.LookupAll to get an authoritative RIB snapshot, avoiding issues with accumulated communities or transient withdrawals.
Optimized Extended Community CLI Configuration: An unnecessary time.Sleep was removed from enableExtCommunityCLIConfig, and the configuration application for IPv4 and IPv6 neighbors was streamlined into a loop.
Consolidated Import Policy Prefix Counting: The validateImportPolicyDut function was improved by replacing a multi-step WatchAll -> GetAll -> per-prefix Watch sequence with a single WatchAll that uses an in-predicate map to count distinct prefixes within the expected subnet, ensuring more reliable convergence detection.
Improved Import Routing Policy Verification: In validateImportRoutingPolicyAllowAll, gnmi.Get calls for verifying IPv4 and IPv6 apply-policy states were replaced with gnmi.Watch with a 30-second timeout. This change accounts for potential lag between configuration and state updates, ensuring the test waits for the policy to be correctly applied.
Traffic Check Retry Mechanism: The checkTraffic function now includes a single retry attempt if packet loss exceeds 1% on the initial measurement, allowing for transient network conditions without immediate test failure.
Explicit Prefix Waits Before Traffic: Explicit WatchAll waits (2 minutes) for IPv4 and IPv6 unicast prefixes on the OTG peer were added to baseSetupConfigAndVerification before initiating traffic flows. This ensures that destinations are learned before traffic starts, preventing failures due to control-plane timing.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

OpenConfigBot · 2026-03-25T10:30:14Z

Pull Request Functional Test Report for #5258 / `36c712a`

Virtual Devices

Device	Test	Test Documentation	Job	Raw Log
Arista cEOS		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Cisco 8000E		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Cisco XRd		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Juniper ncPTX		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Nokia SR Linux		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Openconfig Lemming		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community

Hardware Devices

Device	Test	Test Documentation	Raw Log
Arista 7808		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Cisco 8808		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Juniper PTX10008		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community
Nokia 7250 IXR-10e		RT-7.5: BGP Policy - Match and Set Link Bandwidth Community


gemini-code-assist

Code Review

This pull request refactors several BGP policy and traffic validation functions to improve test reliability, reduce code duplication, and enhance clarity. Key changes include using loops for CLI configuration, streamlining prefix validation with a single gnmi.WatchAll and a map, introducing helper functions for link bandwidth community checks, and implementing gnmi.Watch for policy convergence. Additionally, a retry mechanism was added for traffic validation, and explicit waits for prefix advertisement were introduced before traffic generation. The review suggests further refactoring of the getV4Prefixes and getV6Prefixes functions, as well as the prefix waiting logic in baseSetupConfigAndVerification, into generic helpers to further reduce code duplication.

gemini-code-assist · 2026-03-25T10:32:42Z

feature/bgp/policybase/otg_tests/link_bandwidth_test/link_bandwidth_test.go

+func getV4Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv4Prefix {
+	t.Helper()
+	var result []*otgtelemetry.BgpPeer_UnicastIpv4Prefix
+	vals := gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State())
+	for _, v := range vals {
+		if val, ok := v.Val(); ok {
+			result = append(result, val)
+		}
+	}
+	if len(result) == 0 {
+		t.Logf("V4 prefixes not present, waiting for re-advertisement...")
+		gnmi.WatchAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State(), 30*time.Second, func(v *ygnmi.Value[*otgtelemetry.BgpPeer_UnicastIpv4Prefix]) bool {
+			_, present := v.Val()
+			return present
+		}).Await(t)
+		vals = gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State())
+		for _, v := range vals {
+			if val, ok := v.Val(); ok {
+				result = append(result, val)
+			}
+		}
+	}
+	return result
+}
+
+func getV6Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv6Prefix {
+	t.Helper()
+	var result []*otgtelemetry.BgpPeer_UnicastIpv6Prefix
+	vals := gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State())
+	for _, v := range vals {
+		if val, ok := v.Val(); ok {
+			result = append(result, val)
+		}
+	}
+	if len(result) == 0 {
+		t.Logf("V6 prefixes not present, waiting for re-advertisement...")
+		gnmi.WatchAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State(), 30*time.Second, func(v *ygnmi.Value[*otgtelemetry.BgpPeer_UnicastIpv6Prefix]) bool {
+			_, present := v.Val()
+			return present
+		}).Await(t)
+		vals = gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State())
+		for _, v := range vals {
+			if val, ok := v.Val(); ok {
+				result = append(result, val)
+			}
+		}
+	}
+	return result
+}


The functions getV4Prefixes and getV6Prefixes contain nearly identical logic. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider refactoring this into a single generic function.

A similar pattern of duplication exists in baseSetupConfigAndVerification where you wait for IPv4 and IPv6 prefixes (lines 1297-1304). This could also be extracted into a generic helper.

Here are some examples of how you could implement these generic helpers:

For getting prefixes:

func getPrefixes[T any](t *testing.T, otg *otg.OTG, query ygnmi.WildcardQuery[T], logName string) []T {
	t.Helper()
	var result []T
	vals := gnmi.LookupAll(t, otg, query)
	for _, v := range vals {
		if val, ok := v.Val(); ok {
			result = append(result, val)
		}
	}
	if len(result) == 0 {
		t.Logf("%s prefixes not present, waiting for re-advertisement...", logName)
		gnmi.WatchAll(t, otg, query, 30*time.Second, func(v *ygnmi.Value[T]) bool {
			_, present := v.Val()
			return present
		}).Await(t)
		vals = gnmi.LookupAll(t, otg, query)
		for _, v := range vals {
			if val, ok := v.Val(); ok {
				result = append(result, val)
			}
		}
	}
	return result
}

// You could then refactor getV4Prefixes and getV6Prefixes:
func getV4Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv4Prefix {
	t.Helper()
	query := gnmi.OTG().BgpPeer(td.otgP2.Name() + ".BGP4.peer").UnicastIpv4PrefixAny().State()
	return getPrefixes(t, td.ate.OTG(), query, "V4")
}

func getV6Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv6Prefix {
	t.Helper()
	query := gnmi.OTG().BgpPeer(td.otgP2.Name() + ".BGP6.peer").UnicastIpv6PrefixAny().State()
	return getPrefixes(t, td.ate.OTG(), query, "V6")
}

For awaiting prefixes:

func awaitPrefixes[T any](t *testing.T, otg *otg.OTG, query ygnmi.WildcardQuery[T], afi string) {
	t.Helper()
	t.Logf("Waiting for %s prefixes...", afi)
	gnmi.WatchAll(t, otg, query, 2*time.Minute, func(v *ygnmi.Value[T]) bool {
		_, present := v.Val()
		return present
	}).Await(t)
}

// Then call it in baseSetupConfigAndVerification:
// awaitPrefixes(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State(), "IPv4")
// awaitPrefixes(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State(), "IPv6")

This will make the code cleaner and easier to maintain.

fix RT-7.5 flakiness

b5c2e0d

aks03dev requested a review from a team as a code owner March 25, 2026 10:29

gemini-code-assist bot reviewed Mar 25, 2026

Merge branch 'main' into fix/rt-7.5

3c9fc2e

karthikeya-remilla added the cisco-pr label Mar 25, 2026

Merge branch 'main' into fix/rt-7.5

36c712a

aks03dev wants to merge 3 commits into openconfig:main from b4firex:fix/rt-7.5

4 participants
