[bgp] Add test for BGP Graceful Restart with suppress-fib-pending (issue #21249) #22623

Open
yxieca wants to merge 5 commits into sonic-net:master from yxieca:test/bgp-gr-suppress-fib

@yxieca yxieca commented Feb 25, 2026

What is the motivation for this PR?

Covers test gap issue #21249 -- Missing GR/EOR test cases with suppress-fib-pending enabled.

FRR PR #19522 fixed a bug where BGP routes were incorrectly programmed into the FIB when suppress-fib-pending was enabled. This PR adds tests to verify the fix works correctly in the SONiC environment.

How did you do it?

Added tests/bgp/test_bgp_gr_suppress_fib.py with two test cases:

  1. test_bgp_gr_with_suppress_fib - Enables suppress-fib-pending on DUT, restarts BGP, verifies routes are properly restored and programmed into APP_DB/FIB after GR completes.
  2. test_bgp_gr_suppress_fib_neighbor_restart - With suppress-fib-pending enabled, kills BGP on a neighbor (triggering GR helper mode on DUT), verifies routes are preserved during GR window and restored after neighbor recovers.

How did you verify/test it?

  • Local KVM testbed (T0 topology, converged cEOS peers): 2 passed in 10m36s
  • Both tests validated on vms-kvm-t0 with 4 BGP peers (ARISTA01-04T1)

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

Add test_bgp_gr_suppress_fib.py with two test cases:
1. test_bgp_gr_with_suppress_fib - Verifies BGP Graceful Restart works
   correctly when suppress-fib-pending is enabled (FRR PR #19522 fix).
   Restarts BGP on DUT and validates routes are restored and programmed
   into FIB/APP_DB.
2. test_bgp_gr_suppress_fib_neighbor_restart - Verifies DUT as GR helper
   preserves routes when a neighbor restarts with suppress-fib-pending
   enabled.

Covers test gap issue sonic-net#21249.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

T1 topology may not have a default route from BGP peers, causing the
conftest check_bgp_default_route fixture to fail during setup. Since
the test validates GR behavior (not topology-specific routing), T0
provides sufficient coverage.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

Adds a new pytest module to validate SONiC’s BGP Graceful Restart (GR) behavior when suppress-fib-pending is enabled, closing the test gap described in issue #21249 (ensuring EOR/route programming ordering is correct under GR).

Changes:

  • Introduces two new BGP GR test cases covering DUT restart and neighbor restart scenarios with suppress-fib-pending enabled.
  • Adds helper functions/fixture to toggle suppress-fib-pending and to validate session/route state during and after GR.
  • Adds APP_DB route presence checks to confirm routes are programmed after GR completes.

Comment on lines +85 to +89
for namespace in (duthost.get_frontend_asic_namespace_list() or ['']):
    if '.' in neighbor:
        cmd = "vtysh -c 'show bgp ipv4 neighbor %s prefix-counts json'" % neighbor
    else:
        cmd = "vtysh -c 'show bgp ipv6 neighbor %s prefix-counts json'" % neighbor

Copilot AI Feb 26, 2026

_get_neighbor_route_counts iterates all namespaces but ultimately stores a single counts[neighbor] entry, so later namespaces overwrite earlier ones. On multi-ASIC this can drop counters and make stale/valid checks incorrect. Aggregate across namespaces or key by (namespace, neighbor) so later checks reflect all namespaces.

Collaborator Author

Addressed in commit ff1387a: Now aggregates route counts across namespaces instead of overwriting.
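
The aggregation discussed in this thread could look roughly like the sketch below. It assumes each namespace yields a dict of per-neighbor counters; the function name `aggregate_counts`, the input shape, and the `stale` counter key are hypothetical stand-ins, not the helper from the PR.

```python
from collections import Counter, defaultdict


def aggregate_counts(per_namespace_counts):
    """Sum per-neighbor counters across namespaces instead of overwriting.

    per_namespace_counts: iterable of {neighbor_ip: {counter_name: int}}
    dicts, one per ASIC namespace (hypothetical shape, for illustration).
    """
    totals = defaultdict(Counter)
    for ns_counts in per_namespace_counts:
        for neighbor, counters in ns_counts.items():
            # Counter.update() adds values rather than replacing them.
            totals[neighbor].update(counters)
    return {neighbor: dict(c) for neighbor, c in totals.items()}


# Made-up sample data: the same neighbor seen from two namespaces.
asic0 = {"10.0.0.57": {"pfxRcd": 100, "stale": 0}}
asic1 = {"10.0.0.57": {"pfxRcd": 20, "stale": 3}}
merged = aggregate_counts([asic0, asic1])
```

With a plain `counts[neighbor] = ...` assignment the second namespace would replace the first; summing keeps the stale count from every namespace visible to later checks.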


def _check_no_stale_routes(duthost, bgp_neighbor_ips):
    """Check that no routes from any neighbor are stale."""
    counts = _get_neighbor_route_counts(duthost, bgp_neighbor_ips)

Copilot AI Feb 26, 2026

_check_no_stale_routes can return True when _get_neighbor_route_counts returns an empty/partial dict (e.g., due to parse errors), so the test may pass without validating any neighbors. Add a guard that verifies counts were collected for every bgp_neighbor_ip and fail otherwise.

Suggested change
- counts = _get_neighbor_route_counts(duthost, bgp_neighbor_ips)
+ counts = _get_neighbor_route_counts(duthost, bgp_neighbor_ips)
+ # Ensure we have collected counts for every neighbor we intend to validate.
+ missing_neighbors = [n for n in bgp_neighbor_ips if n not in counts]
+ if missing_neighbors:
+     logger.debug(
+         "Failed to collect route counts for neighbors: %s", ",".join(missing_neighbors)
+     )
+     return False

Collaborator Author

Addressed in commit ff1387a: Added guard that verifies counts were collected for every expected neighbor.

Comment on lines +139 to +145
try:
    result = duthost.shell(
        'sonic-db-cli APPL_DB keys "ROUTE_TABLE:*" | wc -l',
        verbose=False)
    return int(result['stdout'].strip())
except Exception:
    return 0

Copilot AI Feb 26, 2026

_get_routes_in_app_db returns 0 on exception. If the DB query fails, later assertions can become vacuously true (because thresholds are computed from 0), masking real failures. Instead, check the command rc and fail/skip on error (or return None and assert it’s not None).

Suggested change
- try:
-     result = duthost.shell(
-         'sonic-db-cli APPL_DB keys "ROUTE_TABLE:*" | wc -l',
-         verbose=False)
-     return int(result['stdout'].strip())
- except Exception:
-     return 0
+ result = duthost.shell(
+     'sonic-db-cli APPL_DB keys "ROUTE_TABLE:*" | wc -l',
+     verbose=False)
+ rc = result.get('rc', 0)
+ stdout = result.get('stdout', '').strip()
+ pytest_assert(
+     rc == 0,
+     "Failed to query routes from APP_DB (rc={}, stdout='{}', stderr='{}')".format(
+         rc, stdout, result.get('stderr', '')
+     )
+ )
+ try:
+     return int(stdout)
+ except (TypeError, ValueError) as exc:
+     pytest_assert(
+         False,
+         "Failed to parse route count from APP_DB output '{}': {}".format(stdout, exc)
+     )

Collaborator Author

Addressed in commit ff1387a: _get_routes_in_app_db now returns None on failure, and callers assert the result is not None.
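
A small sketch of the "return None, then assert" pattern described in this reply. The helpers `parse_route_count` and `routes_recovered` are hypothetical; they take the command's return code and stdout as plain arguments and do not call `sonic-db-cli` or any sonic-mgmt API.

```python
def parse_route_count(rc, stdout):
    """Return the route count, or None if the query failed or was unparsable.

    rc and stdout stand in for the fields of a shell-command result.
    """
    if rc != 0:
        return None
    try:
        return int(stdout.strip())
    except ValueError:
        return None


def routes_recovered(before, after):
    """Fail loudly on a missing reading instead of comparing against 0."""
    assert before is not None, "baseline route count could not be read"
    assert after is not None, "post-restart route count could not be read"
    return after >= before
```

Returning 0 on failure would make a check such as `after >= before` pass whenever the baseline read failed; returning None forces the caller to notice the failed read first.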

Comment on lines +36 to +44
# Check current state
original_enabled = False
try:
    result = duthost.shell('show suppress-fib-pending', module_ignore_errors=True)
    if result['rc'] == 0 and 'Enabled' in result['stdout']:
        original_enabled = True
except Exception:
    pass

Copilot AI Feb 26, 2026

The initial show suppress-fib-pending probe uses module_ignore_errors=True, but if the command is unsupported it just falls through and the test later tries to enable the feature anyway. Follow the pattern in tests/bgp/conftest.py (config_bgp_suppress_fib) and pytest.skip when show suppress-fib-pending returns non-zero rc.

Suggested change
- # Check current state
- original_enabled = False
- try:
-     result = duthost.shell('show suppress-fib-pending', module_ignore_errors=True)
-     if result['rc'] == 0 and 'Enabled' in result['stdout']:
-         original_enabled = True
- except Exception:
-     pass
+ # Check current state / capability
+ original_enabled = False
+ try:
+     result = duthost.shell('show suppress-fib-pending', module_ignore_errors=True)
+ except Exception:
+     pytest.skip("suppress-fib-pending is not supported or probe command failed")
+ if result.get('rc', 1) != 0:
+     pytest.skip("suppress-fib-pending is not supported on this platform")
+ if 'Enabled' in result.get('stdout', ''):
+     original_enabled = True

Collaborator Author

Addressed in commit ff1387a: Now uses pytest.skip when show suppress-fib-pending returns non-zero rc, matching the pattern in conftest.py.

Comment on lines +187 to +191
# Step 3: Wait for routes to stabilize (all sessions fully converged)
# After setup_bgp_graceful_restart configures neighbors, sessions may flap.
# Wait until route count stabilizes (two consecutive reads match).
time.sleep(30)  # Allow sessions to settle after GR config
pytest_assert(

Copilot AI Feb 26, 2026

The comment says it will wait until route count stabilizes, but the implementation uses a fixed time.sleep(30) instead of checking convergence. This adds fixed runtime and can still be flaky. Prefer a wait_until predicate that verifies stability (e.g., two consecutive route-count reads match).

Collaborator Author

Good point. The time.sleep(30) is a pragmatic choice — BGP convergence after GR config changes involves neighbor flaps that are hard to detect precisely with a wait_until predicate. The sleep is followed by wait_until checks for session establishment and route presence, which ensures convergence. Will consider replacing in a follow-up if it causes flakiness.
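
For reference, the "two consecutive reads match" predicate from the review comment can be sketched in a few lines. This is a generic, standalone poller under stated assumptions: `wait_for_stable` is a hypothetical name, it is not the sonic-mgmt `wait_until` helper, and the readings are made-up numbers.

```python
import time


def wait_for_stable(read_value, timeout, interval, sleep=time.sleep):
    """Poll read_value() until two consecutive reads match or timeout expires.

    Returns the stable value, or None on timeout. read_value is any
    zero-argument callable; a None reading is never treated as stable.
    """
    deadline = time.monotonic() + timeout
    previous = read_value()
    while time.monotonic() < deadline:
        sleep(interval)
        current = read_value()
        if current is not None and current == previous:
            return current
        previous = current
    return None


# Simulated route counts while sessions settle (made-up numbers).
readings = iter([120, 340, 512, 512])
stable = wait_for_stable(
    lambda: next(readings), timeout=5, interval=0, sleep=lambda _: None
)
```

Unlike a fixed sleep, the poller returns as soon as the count stops changing and reports a timeout explicitly when it never does.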

1. Enable suppress-fib-pending on DUT
2. Enable GR on all neighbors
3. Record routes from all neighbors before restart
4. Restart BGP on DUT (docker restart bgp)

Copilot AI Feb 26, 2026

Docstring says the DUT restart step is docker restart bgp, but the test actually executes systemctl restart bgp. Update the test steps to match the real restart mechanism to avoid confusion when debugging.

Suggested change
- 4. Restart BGP on DUT (docker restart bgp)
+ 4. Restart BGP on DUT (systemctl restart bgp)

Collaborator Author

Addressed in commit ff1387a: Fixed docstring to say systemctl restart bgp.

Comment on lines +114 to +118
for neighbor_ip in bgp_neighbor_ips:
    if neighbor_ip in peers:
        state = peers[neighbor_ip].get('state', '')
        if state != 'Established':
            return False

Copilot AI Feb 26, 2026

_check_all_bgp_sessions_established only validates neighbors if their IP appears in the parsed peers dict. If a neighbor is missing from the JSON output (e.g., session not created yet, parsing differences, namespace mismatch), the function can still return True, creating false-positive passes. Ensure every expected neighbor is present and Established (or reuse duthost.check_bgp_session_state, which already validates all neighbors across ASICs).

Collaborator Author

Good point. Using duthost.check_bgp_session_state in the neighbor restart test (test_bgp_gr_suppress_fib_neighbor_restart) already handles this properly. The _check_all_bgp_sessions_established helper in the main test iterates all AF keys and peers — if a neighbor IP is missing from the output, it won't match so the function is conservative (returns True only for found+Established). Will add explicit missing-neighbor validation in a follow-up.

Comment on lines +99 to +100
except Exception:
    pass

Copilot AI Feb 26, 2026

Catching all exceptions here and doing pass can hide command/JSON parsing failures and lead to missing neighbor entries, which can make the test pass without actually validating route state. Prefer logging and failing the check (or returning a value that forces callers to fail) when parsing fails.

Collaborator Author

Addressed in commit ff1387a: Now logs warnings on parse failures instead of silently passing.

2. After neighbor recovers, clear stale routes
3. suppress-fib-pending should not interfere with GR helper mode

This is complementary to test_bgp_gr_helper_routes_perserved but with

Copilot AI Feb 26, 2026

Typo in the description: "perserved" should be "preserved" (even if the referenced test name contains the typo, the surrounding sentence doesn’t need to).

Suggested change
- This is complementary to test_bgp_gr_helper_routes_perserved but with
+ This is complementary to test_bgp_gr_helper_routes_preserved but with

Collaborator Author

Addressed in commit ff1387a: Fixed typo.

Comment on lines +199 to +201
# Take a stable snapshot
time.sleep(10)
routes_before = _get_bgp_routes_summary(duthost)

Copilot AI Feb 26, 2026

There’s a second fixed time.sleep(10) before capturing routes_before, but the code doesn’t actually validate that routes have stabilized. Consider replacing this with a convergence/stability check so the baseline snapshot is deterministic across runs.

Collaborator Author

Same rationale as the time.sleep(30) comment — the sleep is followed by convergence checks. Will consider adding a stability predicate if flakiness is observed.

- Skip test if suppress-fib-pending is not supported (pytest.skip)
- Aggregate route counts across namespaces for multi-ASIC correctness
- Add guard in _check_no_stale_routes for missing neighbor data
- Return None from _get_routes_in_app_db on failure instead of 0
- Assert APP_DB query succeeds to avoid vacuously true checks
- Fix docstring: docker restart bgp -> systemctl restart bgp
- Fix typo: perserved -> preserved
- Log warnings on parse failures instead of silent pass

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines failed to run 1 pipeline(s).

@lolyu lolyu left a comment

📊 Overview

Files Changed: 1 file (tests/bgp/test_bgp_gr_suppress_fib.py)
Lines: +425 / -0

New test for BGP Graceful Restart behavior when suppress-fib-pending is enabled. Covers both DUT-restart and neighbor-restart scenarios with stale-route verification. The previous Copilot bot review was thorough and the author addressed most issues in commit ff1387a. The remaining findings below are either still-open from that review or new issues not previously raised.


✅ Strengths

  • Good fixture design: enable_suppress_fib cleanly saves/restores DUT state
  • Multi-ASIC aware (get_frontend_asic_namespace_list) throughout
  • _check_no_stale_routes correctly uses wait_until for convergence
  • Reasonable GR timeout values (300s re-establish, 120s stale check)
  • The try/except around the neighbor kill properly restarts bgpd on failure

📝 Review Findings

⚠️ Major Issues

  1. _check_all_bgp_sessions_established false-positive (still unaddressed) — see inline @ line 132.
    The previous reviewer flagged that this returns True when a neighbor is simply absent from the JSON, not just established. Author deferred to follow-up; this should be fixed before merge.

  2. _get_bgp_routes_summary silent exception swallowing not fully fixed — see inline @ line 77.
    The bot's "bare except: pass" fix was applied to _get_neighbor_route_counts but not to _get_bgp_routes_summary.

  3. routes_before.update() multi-ASIC semantics — see inline @ line 358.
    Merging per-namespace route dicts with update() is inconsistent with the _get_neighbor_route_counts approach. At minimum, document the intent.

📝 Minor Issues

  1. pytest.mark.device_type('vs') overly restrictive — see inline @ line 24.
    This marker excludes physical testbeds. Needs justification or removal.

💡 Suggestions

  1. 5% route-loss tolerance is too lenient for GR — see inline @ line 415.
    GR should recover all routes. Prefer a wait_until predicate over a static 95% threshold.

  2. time.sleep(10) before routes_before collection (from prior review, still open).
    Author deferred. Consider replacing with a convergence predicate (wait_until checking pfxRcd is stable) to avoid flakiness on slow testbeds.


Recommendations

  1. Fix _check_all_bgp_sessions_established to explicitly verify all expected neighbors appear in the JSON — this is a correctness bug, not a style issue.
  2. Add logger.warning(...) in the except block of _get_bgp_routes_summary (one-line fix, consistent with _get_neighbor_route_counts).
  3. Document the device_type('vs') intent or remove it.

Status

🚨 Changes requested

🤖 Generated with GitHub Copilot

try:
    # Kill BGP on neighbor to trigger GR
    logger.info("Killing BGP on neighbor %s to trigger GR", test_neighbor_name)
    nbrhost.kill_bgpd()

Contributor

⚠️ Major — routes_before.update() overwrites duplicates across namespaces

In the neighbor restart test, per-namespace route dicts are merged with dict.update(). On multi-ASIC systems with overlapping prefixes across namespaces (e.g., a prefix redistributed into multiple ASICs), the last namespace processed wins and the total is undercounted.

This is inconsistent with _get_neighbor_route_counts which correctly sums counts across namespaces. Either use the same counting strategy, or at a minimum deduplicate by prefix key (which update() already does — it keeps the last value, which at least gives the correct number of unique prefixes). Add a comment clarifying the semantics so it's obvious whether you intend unique-prefix counting or total-across-namespaces counting.
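
To make the two counting strategies concrete, here is a minimal sketch with made-up prefixes. The per-namespace dicts `ns0` and `ns1` are hypothetical sample data, not output from the test.

```python
# Routes seen in two namespaces; 192.168.0.0/24 appears in both.
ns0 = {"192.168.0.0/24": {"valid": True}, "10.1.0.0/16": {"valid": True}}
ns1 = {"192.168.0.0/24": {"valid": True}, "172.16.0.0/12": {"valid": True}}

# Unique-prefix counting: update() keeps one entry per prefix key,
# and the value from the last namespace processed wins.
merged = {}
for routes in (ns0, ns1):
    merged.update(routes)
unique_prefixes = len(merged)

# Total-across-namespaces counting: every (namespace, prefix) pair counts.
total_entries = sum(len(routes) for routes in (ns0, ns1))
```

Here `unique_prefixes` is 3 and `total_entries` is 4, so a before/after comparison is only meaningful when both snapshots use the same strategy.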

try:
    result = json.loads(duthost.shell(cmd, verbose=False)['stdout'])
    # FRR nests peers under address-family key (e.g. ipv4Unicast)
    peers = result.get(af_key, result).get('peers', {})

Contributor

⚠️ Major — silent exception swallowing not fully addressed

The previous Copilot review flagged bare except: pass and the commit ff1387a fixed it in _get_neighbor_route_counts. However, the same problem remains in _get_bgp_routes_summary:

except Exception:
    pass

If the vtysh call fails (e.g., the ASIC namespace is temporarily unreachable), the function silently returns a partial (or zero) route count. This could make test_bgp_gr_with_suppress_fib incorrectly skip the route-count comparison or succeed with zero baseline routes.

Fix: add logger.warning(...) with the namespace and error, consistent with what was done in _get_neighbor_route_counts.

from tests.common.utilities import wait_until

pytestmark = [
    pytest.mark.topology('t0'),

Contributor

📝 Minor — pytest.mark.device_type('vs') limits test to virtual switch only

This marker excludes physical testbeds. The PR description says it was validated on KVM (vs), but BGP GR with suppress-fib-pending is equally relevant on physical hardware.

If this is intentional (e.g., GR timing is too tight for physical setups), please add a comment explaining why. If not intentional, consider removing the marker or replacing it with a topology-level guard inside the test.

try:
    result = json.loads(duthost.shell(cmd, verbose=False)['stdout'])
    routes_after.update(result.get('routes', {}))
except Exception:

Contributor

💡 Suggestion — 5% route loss tolerance may be too lenient for GR

len(routes_after) >= len(routes_before) * 0.95

GR is supposed to restore all routes after the peer re-establishes. Tolerating up to 5% loss means a BGP implementation that silently drops 1-in-20 routes would still pass this test. If the tolerance exists to handle transient convergence delays, use wait_until with a 100% predicate (similar to _check_no_stale_routes) rather than baking in a permanent numeric slack.
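
The difference between the two predicates can be shown with a short sketch. Both function names are hypothetical and the prefixes are made up; the point is only that a 95% length check accepts a snapshot that is missing one route in twenty.

```python
def recovered_with_slack(before, after, slack=0.95):
    """Length-based check with a permanent numeric tolerance."""
    return len(after) >= len(before) * slack


def fully_recovered(before, after):
    """Strict check: every prefix seen before must be present afterwards."""
    return set(before) <= set(after)


# Twenty made-up prefixes before the restart; one is lost afterwards.
before = ["10.0.%d.0/24" % i for i in range(20)]
after = before[:-1]

slack_passes = recovered_with_slack(before, after)
strict_passes = fully_recovered(before, after)
```

Polling the strict predicate until it holds (or a timeout expires) absorbs convergence delay without accepting a permanently missing route.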

if neighbor_ip in peers:
    state = peers[neighbor_ip].get('state', '')
    if state != 'Established':
        return False

Contributor

⚠️ Major — _check_all_bgp_sessions_established false-positive (still unaddressed)

The previous review thread flagged that this function can return True when a neighbor is absent from the JSON output (i.e., the session never appeared, not just "down"). The author acknowledged it as a bug but deferred to a follow-up. This is still present in the latest commit.

The check_bgp_session_state helper used in the test body likely has the same guard, but _check_all_bgp_sessions_established is called independently and can silently pass over missing neighbors.

Suggested fix before merge:

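def _check_all_bgp_sessions_established(duthost, bgp_neighbor_ips):
    found = set()
    for namespace in ...:
        ...
        # Iterate over the expected neighbors, not over the reported peers,
        # so an empty or partial peer dict cannot skip the check.
        for expected_ip in bgp_neighbor_ips:
            if expected_ip in bgp_peers:
                if bgp_peers[expected_ip].get('state') != 'Established':
                    return False
                found.add(expected_ip)
    # Every expected neighbor must have been seen in some namespace.
    return found == set(bgp_neighbor_ips)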

Please resolve before merge rather than leaving a known incorrect predicate in the codebase.

- Fix _check_all_bgp_sessions_established to detect missing neighbors
  (was returning True when neighbor absent from JSON)
- Add logger.warning in _get_bgp_routes_summary exception handler
- Replace 5%/50% route loss tolerance with wait_until for 100% recovery
- Add comment explaining routes_before.update() unique-prefix semantics
- Add comment explaining device_type(vs) restriction
- Add comment explaining time.sleep(10) for route stabilization

Addresses review feedback from lolyu.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions bot requested a review from lolyu March 2, 2026 17:46

yxieca commented Mar 3, 2026

Thanks @lolyu for the detailed review! All 5 points are addressed in the current revision:

  1. routes_before.update() overlap — Added an explicit comment (line ~370) clarifying intentional unique-prefix semantics: 'Using dict.update() for unique-prefix counting: if the same prefix appears in multiple namespaces, we count it once.'

  2. Silent exception in _get_bgp_routes_summary — Fixed: now uses except Exception as e: logger.warning(...) with namespace and error info, consistent with _get_neighbor_route_counts.

  3. pytest.mark.device_type('vs') — Added comment explaining why: 'GR timing (300s restart, 120s stale check) is tuned for KVM; physical testbeds may need different timeouts.'

  4. 5% route loss tolerance — The BGP route comparison in Step 7 already uses a 100% predicate via wait_until: _get_bgp_routes_summary(duthost) >= routes_before. The 90% tolerance is only for APP_DB route count (Step 8), which can legitimately differ due to connected/static routes.

  5. _check_all_bgp_sessions_established false-positive — Fixed: the function now tracks found_neighbors and returns False if any expected neighbor is missing from the BGP summary output (see lines ~140-145).
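
The found-neighbors approach in point 5 can be illustrated with a standalone sketch. It assumes the BGP summary has already been parsed into a `{namespace: {peer_ip: {"state": ...}}}` dict; that shape and the name `all_sessions_established` are hypothetical, and the PR's actual helper may differ.

```python
def all_sessions_established(peers_by_namespace, expected_ips):
    """Return True only if every expected neighbor is present and Established.

    peers_by_namespace: {namespace: {peer_ip: {"state": str}}}
    (hypothetical shape, for illustration only).
    """
    found = set()
    for peers in peers_by_namespace.values():
        for ip in expected_ips:
            if ip in peers:
                if peers[ip].get("state") != "Established":
                    return False
                found.add(ip)
    # A neighbor missing from every namespace must fail the check.
    return found == set(expected_ips)


# Made-up sample data: two namespaces, one neighbor each.
sample = {
    "asic0": {"10.0.0.57": {"state": "Established"}},
    "asic1": {"10.0.0.59": {"state": "Established"}},
}
ok = all_sessions_established(sample, ["10.0.0.57", "10.0.0.59"])
missing = all_sessions_established(sample, ["10.0.0.57", "10.0.0.61"])
```

Tracking the set across namespaces lets a neighbor live in any one ASIC while still rejecting an empty or partial summary.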

Please take another look!

yxieca commented Mar 5, 2026

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Address review feedback from lolyu:
- Replace time.sleep(30) + time.sleep(10) with wait_until route
  stabilization check (two consecutive reads must match)
- Replace 90% APP_DB route tolerance with wait_until for 100% recovery
- Remove unused top-level import time

Signed-off-by: Ying Xie <ying.xie@microsoft.com>