Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily focuses on addressing flakiness in BGP link-bandwidth tests by refining how network state and policy convergence are observed and validated. The changes introduce more robust waiting mechanisms, consolidate prefix counting logic, and improve the reliability of traffic checks, leading to more stable and accurate test results. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
Pull Request Functional Test Report for #5258 / 36c712aVirtual Devices
Hardware Devices
|
There was a problem hiding this comment.
Code Review
This pull request refactors several BGP policy and traffic validation functions to improve test reliability, reduce code duplication, and enhance clarity. Key changes include using loops for CLI configuration, streamlining prefix validation with a single gnmi.WatchAll and a map, introducing helper functions for link bandwidth community checks, and implementing gnmi.Watch for policy convergence. Additionally, a retry mechanism was added for traffic validation, and explicit waits for prefix advertisement were introduced before traffic generation. The review suggests further refactoring of the getV4Prefixes and getV6Prefixes functions, as well as the prefix waiting logic in baseSetupConfigAndVerification, into generic helpers to further reduce code duplication.
| func getV4Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv4Prefix { | ||
| t.Helper() | ||
| var result []*otgtelemetry.BgpPeer_UnicastIpv4Prefix | ||
| vals := gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State()) | ||
| for _, v := range vals { | ||
| if val, ok := v.Val(); ok { | ||
| result = append(result, val) | ||
| } | ||
| } | ||
| if len(result) == 0 { | ||
| t.Logf("V4 prefixes not present, waiting for re-advertisement...") | ||
| gnmi.WatchAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State(), 30*time.Second, func(v *ygnmi.Value[*otgtelemetry.BgpPeer_UnicastIpv4Prefix]) bool { | ||
| _, present := v.Val() | ||
| return present | ||
| }).Await(t) | ||
| vals = gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State()) | ||
| for _, v := range vals { | ||
| if val, ok := v.Val(); ok { | ||
| result = append(result, val) | ||
| } | ||
| } | ||
| } | ||
| return result | ||
| } | ||
|
|
||
| func getV6Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv6Prefix { | ||
| t.Helper() | ||
| var result []*otgtelemetry.BgpPeer_UnicastIpv6Prefix | ||
| vals := gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State()) | ||
| for _, v := range vals { | ||
| if val, ok := v.Val(); ok { | ||
| result = append(result, val) | ||
| } | ||
| } | ||
| if len(result) == 0 { | ||
| t.Logf("V6 prefixes not present, waiting for re-advertisement...") | ||
| gnmi.WatchAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State(), 30*time.Second, func(v *ygnmi.Value[*otgtelemetry.BgpPeer_UnicastIpv6Prefix]) bool { | ||
| _, present := v.Val() | ||
| return present | ||
| }).Await(t) | ||
| vals = gnmi.LookupAll(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State()) | ||
| for _, v := range vals { | ||
| if val, ok := v.Val(); ok { | ||
| result = append(result, val) | ||
| } | ||
| } | ||
| } | ||
| return result | ||
| } |
There was a problem hiding this comment.
The functions getV4Prefixes and getV6Prefixes contain nearly identical logic. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, consider refactoring this into a single generic function.
A similar pattern of duplication exists in baseSetupConfigAndVerification where you wait for IPv4 and IPv6 prefixes (lines 1297-1304). This could also be extracted into a generic helper.
Here are some examples of how you could implement these generic helpers:
For getting prefixes:
func getPrefixes[T any](t *testing.T, otg *otg.OTG, query ygnmi.WildcardQuery[T], logName string) []T {
t.Helper()
var result []T
vals := gnmi.LookupAll(t, otg, query)
for _, v := range vals {
if val, ok := v.Val(); ok {
result = append(result, val)
}
}
if len(result) == 0 {
t.Logf("%s prefixes not present, waiting for re-advertisement...", logName)
gnmi.WatchAll(t, otg, query, 30*time.Second, func(v *ygnmi.Value[T]) bool {
_, present := v.Val()
return present
}).Await(t)
vals = gnmi.LookupAll(t, otg, query)
for _, v := range vals {
if val, ok := v.Val(); ok {
result = append(result, val)
}
}
}
return result
}
// You could then refactor getV4Prefixes and getV6Prefixes:
func getV4Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv4Prefix {
t.Helper()
query := gnmi.OTG().BgpPeer(td.otgP2.Name() + ".BGP4.peer").UnicastIpv4PrefixAny().State()
return getPrefixes(t, td.ate.OTG(), query, "V4")
}
func getV6Prefixes(t *testing.T, td testData) []*otgtelemetry.BgpPeer_UnicastIpv6Prefix {
t.Helper()
query := gnmi.OTG().BgpPeer(td.otgP2.Name() + ".BGP6.peer").UnicastIpv6PrefixAny().State()
return getPrefixes(t, td.ate.OTG(), query, "V6")
}For awaiting prefixes:
func awaitPrefixes[T any](t *testing.T, otg *otg.OTG, query ygnmi.WildcardQuery[T], afi string) {
t.Helper()
t.Logf("Waiting for %s prefixes...", afi)
gnmi.WatchAll(t, otg, query, 2*time.Minute, func(v *ygnmi.Value[T]) bool {
_, present := v.Val()
return present
}).Await(t)
}
// Then call it in baseSetupConfigAndVerification:
// awaitPrefixes(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP4.peer").UnicastIpv4PrefixAny().State(), "IPv4")
// awaitPrefixes(t, td.ate.OTG(), gnmi.OTG().BgpPeer(td.otgP2.Name()+".BGP6.peer").UnicastIpv6PrefixAny().State(), "IPv6")This will make the code cleaner and easier to maintain.
Fixes for the flakiness seen on RT-7.5
Proposed changes
1.
validateRouteCommunityV4Prefix/validateRouteCommunityV6Prefix— WatchAll + LookupAll hybridWhat changed. The
WatchAllpredicate is now scoped to the specific prefix under test and evaluates expected community content (standard100:100, link-bandwidth cases, or"none"including rejection of stray link-bandwidth extended communities where required), not merely “any prefix appeared.” IfWatchAlltimes out, the test logs and continues to final validation instead of returning early; the authoritative pass/fail is the subsequent phase. Final validation usesgetV4Prefixes/getV6Prefixes, which callgnmi.LookupAll(empty slice when nothing is present) instead ofgnmi.GetAll(which fatals on empty).Why — the core problem. The original flow was effectively: wait until something shows up with
WatchAll, thenGetAllall prefixes and assert. That mixes change detection with authoritative RIB snapshot. Three failure modes drove flakes: (a) accumulatedExtendedCommunityentries from multiple notifications caused intermittent bandwidth mismatches; (b) for"none", a predicate likelen(prefix.Community) == 0could never succeed if accumulated communities lingered in the STREAM cache; (c) a transient withdrawal between “saw a prefix” andGetAllproduced an empty list and immediate test failure. The fix usesWatchAllonly as convergence signaling andLookupAll(ONCE-style read) for assertions, avoiding the accumulator for final truth2.
enableExtCommunityCLIConfig— remove unnecessary sleep3.
validateImportPolicyDut— consolidated prefix countingWhat changed. Replaced the chain WatchAll (any prefix) →
GetAll→ per-prefixWatchwith subnet checks with a singleWatchAlloverUnicastIpv4PrefixAny().State()whose predicate keeps amap[string]boolof distinct addresses inside the expected subnet (parseV4) and succeeds when three distinct matching prefixes have been seen; on timeout the failure message includes how many were observed. The gate uses a single 2-minute timeout instead of a mix of shorter waits.Why. The old sequence raced: the first
WatchAllonly proved at least one prefix existed, not that the set of three was stable betweenGetAlland per-prefix watches. Counting inside one subscription ties “three prefixes in subnet” to one continuous stream and removes the flakiness4.
validateImportRoutingPolicyAllowAll— Watch instead of Get for policy verificationWhat changed. For IPv4 and IPv6 apply-policy state,
gnmi.Getwas replaced withgnmi.Watch(30s timeout) while keeping the same expectation: exactly one import policy namedallow-all.Why. After
removeImportAndExportPolicyandapplyImportPolicyDut, OpenConfig state can lag config briefly; an immediateGetmay see empty or stale policy names.Watchwaits until the predicate holds or the timeout fires, matching operational “ready when state matches.”5.
checkTraffic— single retry on packet lossWhat changed. Traffic start → 30s run → stop → metrics is wrapped so that if loss exceeds 1% on the first attempt, the test logs once and repeats the whole measurement; only the second attempt’s outcome can produce the final failure. The 1% threshold is unchanged for the attempt that matters.
6.
baseSetupConfigAndVerification— explicit prefix waits before trafficWhat changed. After base BGP setup (and import-policy validation when not skipped by deviations), added
WatchAllwaits (2 minutes) for IPv4 and IPv6 unicast prefixes on the OTG peer before creating traffic flows.Why. Flows assume destinations are already learned on OTG port2; starting traffic earlier can fail for control-plane timing unrelated to link-bandwidth policy correctness.