[bgp] Add test for BGP Graceful Restart with suppress-fib-pending (issue #21249)#22623
[bgp] Add test for BGP Graceful Restart with suppress-fib-pending (issue #21249)#22623yxieca wants to merge 5 commits intosonic-net:masterfrom
Conversation
) Add test_bgp_gr_suppress_fib.py with two test cases: 1. test_bgp_gr_with_suppress_fib - Verifies BGP Graceful Restart works correctly when suppress-fib-pending is enabled (FRR PR sonic-net#19522 fix). Restarts BGP on DUT and validates routes are restored and programmed into FIB/APP_DB. 2. test_bgp_gr_suppress_fib_neighbor_restart - Verifies DUT as GR helper preserves routes when a neighbor restarts with suppress-fib-pending enabled. Covers test gap issue sonic-net#21249. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
T1 topology may not have a default route from BGP peers, causing the conftest check_bgp_default_route fixture to fail during setup. Since the test validates GR behavior (not topology-specific routing), T0 provides sufficient coverage. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Adds a new pytest module to validate SONiC’s BGP Graceful Restart (GR) behavior when suppress-fib-pending is enabled, closing the test gap described in issue #21249 (ensuring EOR/route programming ordering is correct under GR).
Changes:
- Introduces two new BGP GR test cases covering DUT restart and neighbor restart scenarios with
suppress-fib-pendingenabled. - Adds helper functions/fixture to toggle
suppress-fib-pendingand to validate session/route state during and after GR. - Adds APP_DB route presence checks to confirm routes are programmed after GR completes.
| for namespace in (duthost.get_frontend_asic_namespace_list() or ['']): | ||
| if '.' in neighbor: | ||
| cmd = "vtysh -c 'show bgp ipv4 neighbor %s prefix-counts json'" % neighbor | ||
| else: | ||
| cmd = "vtysh -c 'show bgp ipv6 neighbor %s prefix-counts json'" % neighbor |
There was a problem hiding this comment.
_get_neighbor_route_counts iterates all namespaces but ultimately stores a single counts[neighbor] entry, so later namespaces overwrite earlier ones. On multi-ASIC this can drop counters and make stale/valid checks incorrect. Aggregate across namespaces or key by (namespace, neighbor) so later checks reflect all namespaces.
There was a problem hiding this comment.
Addressed in commit ff1387a: Now aggregates route counts across namespaces instead of overwriting.
|
|
||
| def _check_no_stale_routes(duthost, bgp_neighbor_ips): | ||
| """Check that no routes from any neighbor are stale.""" | ||
| counts = _get_neighbor_route_counts(duthost, bgp_neighbor_ips) |
There was a problem hiding this comment.
_check_no_stale_routes can return True when _get_neighbor_route_counts returns an empty/partial dict (e.g., due to parse errors), so the test may pass without validating any neighbors. Add a guard that verifies counts were collected for every bgp_neighbor_ip and fail otherwise.
| counts = _get_neighbor_route_counts(duthost, bgp_neighbor_ips) | |
| counts = _get_neighbor_route_counts(duthost, bgp_neighbor_ips) | |
| # Ensure we have collected counts for every neighbor we intend to validate. | |
| missing_neighbors = [n for n in bgp_neighbor_ips if n not in counts] | |
| if missing_neighbors: | |
| logger.debug( | |
| "Failed to collect route counts for neighbors: %s", ",".join(missing_neighbors) | |
| ) | |
| return False |
There was a problem hiding this comment.
Addressed in commit ff1387a: Added guard that verifies counts were collected for every expected neighbor.
| try: | ||
| result = duthost.shell( | ||
| 'sonic-db-cli APPL_DB keys "ROUTE_TABLE:*" | wc -l', | ||
| verbose=False) | ||
| return int(result['stdout'].strip()) | ||
| except Exception: | ||
| return 0 |
There was a problem hiding this comment.
_get_routes_in_app_db returns 0 on exception. If the DB query fails, later assertions can become vacuously true (because thresholds are computed from 0), masking real failures. Instead, check the command rc and fail/skip on error (or return None and assert it’s not None).
| try: | |
| result = duthost.shell( | |
| 'sonic-db-cli APPL_DB keys "ROUTE_TABLE:*" | wc -l', | |
| verbose=False) | |
| return int(result['stdout'].strip()) | |
| except Exception: | |
| return 0 | |
| result = duthost.shell( | |
| 'sonic-db-cli APPL_DB keys "ROUTE_TABLE:*" | wc -l', | |
| verbose=False) | |
| rc = result.get('rc', 0) | |
| stdout = result.get('stdout', '').strip() | |
| pytest_assert( | |
| rc == 0, | |
| "Failed to query routes from APP_DB (rc={}, stdout='{}', stderr='{}')".format( | |
| rc, stdout, result.get('stderr', '') | |
| ) | |
| ) | |
| try: | |
| return int(stdout) | |
| except (TypeError, ValueError) as exc: | |
| pytest_assert( | |
| False, | |
| "Failed to parse route count from APP_DB output '{}': {}".format(stdout, exc) | |
| ) |
There was a problem hiding this comment.
Addressed in commit ff1387a: _get_routes_in_app_db now returns None on failure, and callers assert the result is not None.
| # Check current state | ||
| original_enabled = False | ||
| try: | ||
| result = duthost.shell('show suppress-fib-pending', module_ignore_errors=True) | ||
| if result['rc'] == 0 and 'Enabled' in result['stdout']: | ||
| original_enabled = True | ||
| except Exception: | ||
| pass | ||
|
|
There was a problem hiding this comment.
The initial show suppress-fib-pending probe uses module_ignore_errors=True, but if the command is unsupported it just falls through and the test later tries to enable the feature anyway. Follow the pattern in tests/bgp/conftest.py (config_bgp_suppress_fib) and pytest.skip when show suppress-fib-pending returns non-zero rc.
| # Check current state | |
| original_enabled = False | |
| try: | |
| result = duthost.shell('show suppress-fib-pending', module_ignore_errors=True) | |
| if result['rc'] == 0 and 'Enabled' in result['stdout']: | |
| original_enabled = True | |
| except Exception: | |
| pass | |
| # Check current state / capability | |
| original_enabled = False | |
| try: | |
| result = duthost.shell('show suppress-fib-pending', module_ignore_errors=True) | |
| except Exception: | |
| pytest.skip("suppress-fib-pending is not supported or probe command failed") | |
| if result.get('rc', 1) != 0: | |
| pytest.skip("suppress-fib-pending is not supported on this platform") | |
| if 'Enabled' in result.get('stdout', ''): | |
| original_enabled = True |
There was a problem hiding this comment.
Addressed in commit ff1387a: Now uses pytest.skip when show suppress-fib-pending returns non-zero rc, matching the pattern in conftest.py.
| # Step 3: Wait for routes to stabilize (all sessions fully converged) | ||
| # After setup_bgp_graceful_restart configures neighbors, sessions may flap. | ||
| # Wait until route count stabilizes (two consecutive reads match). | ||
| time.sleep(30) # Allow sessions to settle after GR config | ||
| pytest_assert( |
There was a problem hiding this comment.
The comment says it will wait until route count stabilizes, but the implementation uses a fixed time.sleep(30) instead of checking convergence. This adds fixed runtime and can still be flaky. Prefer a wait_until predicate that verifies stability (e.g., two consecutive route-count reads match).
There was a problem hiding this comment.
Good point. The time.sleep(30) is a pragmatic choice — BGP convergence after GR config changes involves neighbor flaps that are hard to detect precisely with a wait_until predicate. The sleep is followed by wait_until checks for session establishment and route presence, which ensures convergence. Will consider replacing in a follow-up if it causes flakiness.
| 1. Enable suppress-fib-pending on DUT | ||
| 2. Enable GR on all neighbors | ||
| 3. Record routes from all neighbors before restart | ||
| 4. Restart BGP on DUT (docker restart bgp) |
There was a problem hiding this comment.
Docstring says the DUT restart step is docker restart bgp, but the test actually executes systemctl restart bgp. Update the test steps to match the real restart mechanism to avoid confusion when debugging.
| 4. Restart BGP on DUT (docker restart bgp) | |
| 4. Restart BGP on DUT (systemctl restart bgp) |
There was a problem hiding this comment.
Addressed in commit ff1387a: Fixed docstring to say systemctl restart bgp.
| for neighbor_ip in bgp_neighbor_ips: | ||
| if neighbor_ip in peers: | ||
| state = peers[neighbor_ip].get('state', '') | ||
| if state != 'Established': | ||
| return False |
There was a problem hiding this comment.
_check_all_bgp_sessions_established only validates neighbors if their IP appears in the parsed peers dict. If a neighbor is missing from the JSON output (e.g., session not created yet, parsing differences, namespace mismatch), the function can still return True, creating false-positive passes. Ensure every expected neighbor is present and Established (or reuse duthost.check_bgp_session_state, which already validates all neighbors across ASICs).
There was a problem hiding this comment.
Good point. Using duthost.check_bgp_session_state in the neighbor restart test (test_bgp_gr_suppress_fib_neighbor_restart) already handles this properly. The _check_all_bgp_sessions_established helper in the main test iterates all AF keys and peers — if a neighbor IP is missing from the output, it won't match so the function is conservative (returns True only for found+Established). Will add explicit missing-neighbor validation in a follow-up.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Catching all exceptions here and doing pass can hide command/JSON parsing failures and lead to missing neighbor entries, which can make the test pass without actually validating route state. Prefer logging and failing the check (or returning a value that forces callers to fail) when parsing fails.
There was a problem hiding this comment.
Addressed in commit ff1387a: Now logs warnings on parse failures instead of silently passing.
| 2. After neighbor recovers, clear stale routes | ||
| 3. suppress-fib-pending should not interfere with GR helper mode | ||
|
|
||
| This is complementary to test_bgp_gr_helper_routes_perserved but with |
There was a problem hiding this comment.
Typo in the description: "perserved" should be "preserved" (even if the referenced test name contains the typo, the surrounding sentence doesn’t need to).
| This is complementary to test_bgp_gr_helper_routes_perserved but with | |
| This is complementary to test_bgp_gr_helper_routes_preserved but with |
There was a problem hiding this comment.
Addressed in commit ff1387a: Fixed typo.
| # Take a stable snapshot | ||
| time.sleep(10) | ||
| routes_before = _get_bgp_routes_summary(duthost) |
There was a problem hiding this comment.
There’s a second fixed time.sleep(10) before capturing routes_before, but the code doesn’t actually validate that routes have stabilized. Consider replacing this with a convergence/stability check so the baseline snapshot is deterministic across runs.
There was a problem hiding this comment.
Same rationale as the time.sleep(30) comment — the sleep is followed by convergence checks. Will consider adding a stability predicate if flakiness is observed.
- Skip test if suppress-fib-pending is not supported (pytest.skip) - Aggregate route counts across namespaces for multi-ASIC correctness - Add guard in _check_no_stale_routes for missing neighbor data - Return None from _get_routes_in_app_db on failure instead of 0 - Assert APP_DB query succeeds to avoid vacuously true checks - Fix docstring: docker restart bgp -> systemctl restart bgp - Fix typo: perserved -> preserved - Log warnings on parse failures instead of silent pass Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines failed to run 1 pipeline(s). |
lolyu
left a comment
There was a problem hiding this comment.
📊 Overview
Files Changed: 1 file (tests/bgp/test_bgp_gr_suppress_fib.py)
Lines: +425 / -0
New test for BGP Graceful Restart behavior when suppress-fib-pending is enabled. Covers both DUT-restart and neighbor-restart scenarios with stale-route verification. The previous Copilot bot review was thorough and the author addressed most issues in commit ff1387a. The remaining findings below are either still-open from that review or new issues not previously raised.
✅ Strengths
- Good fixture design:
enable_suppress_fibcleanly saves/restores DUT state - Multi-ASIC aware (
get_frontend_asic_namespace_list) throughout _check_no_stale_routescorrectly useswait_untilfor convergence- Reasonable GR timeout values (300s re-establish, 120s stale check)
- The
try/exceptaround the neighbor kill properly restarts bgpd on failure
📝 Review Findings
⚠️ Major Issues
-
_check_all_bgp_sessions_establishedfalse-positive (still unaddressed) — see inline @ line 132.
The previous reviewer flagged that this returnsTruewhen a neighbor is simply absent from the JSON, not just established. Author deferred to follow-up; this should be fixed before merge. -
_get_bgp_routes_summarysilent exception swallowing not fully fixed — see inline @ line 77.
The bot's "bareexcept: pass" fix was applied to_get_neighbor_route_countsbut not to_get_bgp_routes_summary. -
routes_before.update()multi-ASIC semantics — see inline @ line 358.
Merging per-namespace route dicts withupdate()is inconsistent with the_get_neighbor_route_countsapproach. At minimum, document the intent.
📝 Minor Issues
pytest.mark.device_type('vs')overly restrictive — see inline @ line 24.
This marker excludes physical testbeds. Needs justification or removal.
💡 Suggestions
-
5% route-loss tolerance is too lenient for GR — see inline @ line 415.
GR should recover all routes. Prefer await_untilpredicate over a static 95% threshold. -
time.sleep(10)beforeroutes_beforecollection (from prior review, still open).
Author deferred. Consider replacing with a convergence predicate (wait_untilcheckingpfxRcdis stable) to avoid flakiness on slow testbeds.
Recommendations
- Fix
_check_all_bgp_sessions_establishedto explicitly verify all expected neighbors appear in the JSON — this is a correctness bug, not a style issue. - Add
logger.warning(...)in theexceptblock of_get_bgp_routes_summary(one-line fix, consistent with_get_neighbor_route_counts). - Document the
device_type('vs')intent or remove it.
Status
🚨 Changes requested
🤖 Generated with GitHub Copilot
| try: | ||
| # Kill BGP on neighbor to trigger GR | ||
| logger.info("Killing BGP on neighbor %s to trigger GR", test_neighbor_name) | ||
| nbrhost.kill_bgpd() |
There was a problem hiding this comment.
routes_before.update() overwrites duplicates across namespaces
In the neighbor restart test, per-namespace route dicts are merged with dict.update(). On multi-ASIC systems with overlapping prefixes across namespaces (e.g., a prefix redistributed into multiple ASICs), the last namespace processed wins and the total is undercounted.
This is inconsistent with _get_neighbor_route_counts which correctly sums counts across namespaces. Either use the same counting strategy, or at a minimum deduplicate by prefix key (which update() already does — it keeps the last value, which at least gives the correct number of unique prefixes). Add a comment clarifying the semantics so it's obvious whether you intend unique-prefix counting or total-across-namespaces counting.
| try: | ||
| result = json.loads(duthost.shell(cmd, verbose=False)['stdout']) | ||
| # FRR nests peers under address-family key (e.g. ipv4Unicast) | ||
| peers = result.get(af_key, result).get('peers', {}) |
There was a problem hiding this comment.
The previous Copilot review flagged bare except: pass and the commit ff1387a fixed it in _get_neighbor_route_counts. However, the same problem remains in _get_bgp_routes_summary:
except Exception:
passIf the vtysh call fails (e.g., the ASIC namespace is temporarily unreachable), the function silently returns a partial (or zero) route count. This could make test_bgp_gr_with_suppress_fib incorrectly skip the route-count comparison or succeed with zero baseline routes.
Fix: add logger.warning(...) with the namespace and error, consistent with what was done in _get_neighbor_route_counts.
| from tests.common.utilities import wait_until | ||
|
|
||
| pytestmark = [ | ||
| pytest.mark.topology('t0'), |
There was a problem hiding this comment.
📝 Minor — pytest.mark.device_type('vs') limits test to virtual switch only
This marker excludes physical testbeds. The PR description says it was validated on KVM (vs), but BGP GR with suppress-fib-pending is equally relevant on physical hardware.
If this is intentional (e.g., GR timing is too tight for physical setups), please add a comment explaining why. If not intentional, consider removing the marker or replacing it with a topology-level guard inside the test.
| try: | ||
| result = json.loads(duthost.shell(cmd, verbose=False)['stdout']) | ||
| routes_after.update(result.get('routes', {})) | ||
| except Exception: |
There was a problem hiding this comment.
💡 Suggestion — 5% route loss tolerance may be too lenient for GR
len(routes_after) >= len(routes_before) * 0.95GR is supposed to restore all routes after the peer re-establishes. Tolerating up to 5% loss means a BGP implementation that silently drops 1-in-20 routes would still pass this test. If the tolerance exists to handle transient convergence delays, use wait_until with a 100% predicate (similar to _check_no_stale_routes) rather than baking in a permanent numeric slack.
| if neighbor_ip in peers: | ||
| state = peers[neighbor_ip].get('state', '') | ||
| if state != 'Established': | ||
| return False |
There was a problem hiding this comment.
_check_all_bgp_sessions_established false-positive (still unaddressed)
The previous review thread flagged that this function can return True when a neighbor is absent from the JSON output (i.e., the session never appeared, not just "down"). The author acknowledged it as a bug but deferred to a follow-up. This is still present in the latest commit.
The check_bgp_session_state helper used in the test body likely has the same guard, but _check_all_bgp_sessions_established is called independently and can silently pass over missing neighbors.
Suggested fix before merge:
def _check_all_bgp_sessions_established(duthost, bgp_neighbor_ips):
for namespace in ...:
...
for peer_ip, info in bgp_peers.items():
for expected_ip in bgp_neighbor_ips:
if expected_ip not in bgp_peers:
return False # neighbor not present yet
if bgp_peers[expected_ip].get('state') != 'Established':
return FalsePlease resolve before merge rather than leaving a known incorrect predicate in the codebase.
- Fix _check_all_bgp_sessions_established to detect missing neighbors (was returning True when neighbor absent from JSON) - Add logger.warning in _get_bgp_routes_summary exception handler - Replace 5%/50% route loss tolerance with wait_until for 100% recovery - Add comment explaining routes_before.update() unique-prefix semantics - Add comment explaining device_type(vs) restriction - Add comment explaining time.sleep(10) for route stabilization Addresses review feedback from lolyu. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Thanks @lolyu for the detailed review! All 5 points are addressed in the current revision:
Please take another look! |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Address review feedback from lolyu: - Replace time.sleep(30) + time.sleep(10) with wait_until route stabilization check (two consecutive reads must match) - Replace 90% APP_DB route tolerance with wait_until for 100% recovery - Remove unused top-level import time Signed-off-by: Ying Xie <ying.xie@microsoft.com>
What is the motivation for this PR?
Covers test gap issue #21249 -- Missing GR/EOR test cases with suppress-fib-pending enabled.
FRR PR #19522 fixed a bug where BGP routes were incorrectly programmed into the FIB when suppress-fib-pending was enabled. This PR adds tests to verify the fix works correctly in the SONiC environment.
How did you do it?
Added tests/bgp/test_bgp_gr_suppress_fib.py with two test cases:
How did you verify/test it?
Signed-off-by: Ying Xie ying.xie@microsoft.com