[bgp] Add test for BGPv6 peering over link-local addresses (unnumbered)#22600
[bgp] Add test for BGPv6 peering over link-local addresses (unnumbered)#22600yxieca wants to merge 8 commits intosonic-net:masterfrom
Conversation
Add test to verify BGP peering over IPv6 link-local addresses (unnumbered BGP) can be established via PortChannel interfaces. Test steps: 1. Select a PortChannel with an existing BGP neighbor 2. Remove global-IP BGP sessions on both DUT and cEOS peer 3. Configure unnumbered BGP (neighbor <interface> interface remote-as) 4. Configure link-local neighbor on EOS peer 5. Verify session establishes and routes are exchanged 6. Restore original config via config reload Addresses test gap issue sonic-net#18431. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
- Restrict to T0 only (T1 may lack default route, failing conftest) - Reduce WAIT_TIMEOUT from 180s to 120s - Reduce config_reload wait from 360s to 120s - Reduce route exchange sleep from 15s to 10s - Total test time reduced from 8m20s to 4m15s Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
- Remove device_type('vs') so test runs on physical testbeds too
- Skip instead of fail when unnumbered session doesn't establish
(prevents sanity check failures for subsequent test modules)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Move BGP configuration changes and cleanup into a configure_unnumbered_bgp fixture with yield, separating test logic from setup/teardown. This ensures cleanup always runs via pytest's fixture teardown mechanism. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The failure message now points out the possibility of a neighbor compatibility issue, since the peer may not support BGP peering over IPv6 link-local addresses. The fixture teardown restores the original configuration and verifies normal BGP sessions recover before the failure is reported. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Adds a new pytest-based BGP test to validate IPv6 link-local (unnumbered) BGP peering behavior in SONiC, closing a known coverage gap in sonic-mgmt’s BGP test suite.
Changes:
- Introduce
test_bgp_link_local_ipv6to reconfigure an existing PortChannel BGP peer into unnumbered/link-local peering and verify session + route exchange. - Add helper logic to identify a suitable PortChannel neighbor and to detect established unnumbered sessions via
vtyshJSON.
tests/bgp/test_bgp_link_local.py
Outdated
| logger.info("Remove DUT neighbors on EOS peer") | ||
| remove_lines = ["no neighbor {}".format(dut_ipv4)] | ||
| if dut_ipv6: | ||
| remove_lines.append("no neighbor {}".format(dut_ipv6)) | ||
| neigh_host.eos_config(lines=remove_lines, | ||
| parents="router bgp {}".format(neigh_asn)) |
There was a problem hiding this comment.
remove_lines always includes no neighbor {dut_ipv4} even when dut_ipv4 wasn't found (it can be None if the PortChannel has only IPv6 configured in config DB). That would push an invalid EOS config line (no neighbor None) and fail the setup. Build remove_lines conditionally (only add the IPv4/IPv6 neighbor lines when the corresponding IP is present).
There was a problem hiding this comment.
Addressed in commit c89a93e:
- Conditional remove_lines — now only adds
no neighborlines when dut_ipv4/dut_ipv6 are not None. - EOS backup/restore — good point about losing pre-existing per-neighbor settings. The DUT side uses
config_reloadwhich fully restores. The EOS side currently restores justneighbor <ip> remote-aswhich is sufficient for the KVM/vEOS testbed topology (no custom timers/route-maps). I'll note this as a future improvement. - Teardown hard failure — changed from
logger.errortopytest_assert(False, ...)so it surfaces as a test failure. - Non-EOS skip — added neighbor_type check at the top of setup_info, skips gracefully instead of AttributeError.
- Multi-ASIC — this test is currently t0-only topology which is single-ASIC. Will add ASIC namespace awareness if we expand to t1/t2.
Thanks for the review!
| # Remove link-local neighbor from EOS and restore originals | ||
| try: | ||
| cleanup_lines = ["no neighbor {}".format(eos_neighbor)] | ||
| if dut_ipv4: | ||
| cleanup_lines.append( | ||
| "neighbor {} remote-as {}".format(dut_ipv4, dut_asn)) | ||
| if dut_ipv6: | ||
| cleanup_lines.append( | ||
| "neighbor {} remote-as {}".format(dut_ipv6, dut_asn)) | ||
| neigh_host.eos_config(lines=cleanup_lines, | ||
| parents="router bgp {}".format(neigh_asn)) |
There was a problem hiding this comment.
For EOS neighbors, this test removes the existing neighbor stanzas but only restores them with neighbor <ip> remote-as <asn>. Any pre-existing per-neighbor settings (timers, route-maps, password, maximum-routes, etc.) would be lost, which can impact subsequent tests. Other BGP tests typically eos_config(backup=True, ...) before changes and load_configuration(...) in teardown; consider using the same backup/restore approach here.
| if not wait_until(WAIT_TIMEOUT, POLL_INTERVAL, 0, | ||
| duthost.check_bgp_session_state, original_neighbors): | ||
| logger.error("BGP sessions did not re-establish after config reload") | ||
| else: | ||
| logger.info("Original configuration restored, all sessions up") |
There was a problem hiding this comment.
In teardown, if BGP sessions fail to re-establish after config_reload, the fixture only logs an error and continues. That can leave the DUT in a bad state while still reporting the test as passed, and may break later tests in the run. Please make this a hard failure (e.g., pytest_assert(...) / raise) when the sessions do not come back up.
| if not wait_until(WAIT_TIMEOUT, POLL_INTERVAL, 0, | |
| duthost.check_bgp_session_state, original_neighbors): | |
| logger.error("BGP sessions did not re-establish after config reload") | |
| else: | |
| logger.info("Original configuration restored, all sessions up") | |
| sessions_restored = wait_until( | |
| WAIT_TIMEOUT, | |
| POLL_INTERVAL, | |
| 0, | |
| duthost.check_bgp_session_state, | |
| original_neighbors | |
| ) | |
| pytest_assert( | |
| sessions_restored, | |
| "BGP sessions did not re-establish after config reload" | |
| ) | |
| logger.info("Original configuration restored, all sessions up") |
tests/bgp/test_bgp_link_local.py
Outdated
| def setup_info(duthosts, rand_one_dut_hostname, nbrhosts, tbinfo): | ||
| """Gather setup information for the link-local BGP test.""" | ||
| duthost = duthosts[rand_one_dut_hostname] | ||
| dut_asn = tbinfo['topo']['properties']['configuration_properties']['common']['dut_asn'] |
There was a problem hiding this comment.
The test assumes the selected neighbor is an EOS/cEOS host (uses eos_command/eos_config). When the testbed is run with neighbor_type=sonic (or any non-EOS neighbor), this fixture will raise an AttributeError instead of skipping. Consider taking request in this fixture and skipping unless neighbor_type is "eos" (or isinstance(neigh_host, EosHost)), similar to other BGP tests that support multiple neighbor types.
| def setup_info(duthosts, rand_one_dut_hostname, nbrhosts, tbinfo): | |
| """Gather setup information for the link-local BGP test.""" | |
| duthost = duthosts[rand_one_dut_hostname] | |
| dut_asn = tbinfo['topo']['properties']['configuration_properties']['common']['dut_asn'] | |
| def setup_info(duthosts, rand_one_dut_hostname, nbrhosts, tbinfo, request): | |
| """Gather setup information for the link-local BGP test.""" | |
| # This test currently assumes EOS/cEOS neighbors because it uses eos_command/eos_config | |
| # on the neighbor host. Skip gracefully when running with non-EOS neighbors (e.g. SONiC). | |
| common_props = tbinfo.get('topo', {}).get('properties', {}).get('configuration_properties', {}).get('common', {}) | |
| neighbor_type = common_props.get('neighbor_type', 'eos') | |
| if neighbor_type.lower() != 'eos': | |
| pytest.skip("BGP link-local test requires EOS neighbors; current neighbor_type is '{}'".format(neighbor_type)) | |
| duthost = duthosts[rand_one_dut_hostname] | |
| dut_asn = common_props['dut_asn'] |
| # Get DUT's link-local on this PortChannel | ||
| result = duthost.shell("ip -6 addr show dev {} scope link".format(selected['pc'])) | ||
| dut_link_local = None | ||
| for line in result['stdout'].split('\n'): | ||
| if 'inet6 fe80' in line.strip(): | ||
| dut_link_local = line.strip().split()[1].split('/')[0] |
There was a problem hiding this comment.
On multi-ASIC DUTs, commands like ip -6 addr ... and subsequent vtysh calls may need to run in the correct ASIC namespace for the selected PortChannel. Right now everything runs via duthost.shell(...) without selecting an ASIC, which can cause the test to fail on multi-ASIC platforms. Consider determining the ASIC for selected['pc'] (e.g., via duthost.get_asic_index_for_portchannel() / get_port_asic_instance()) and storing an asichost in setup_info, then use asichost.command/shell and asichost.run_vtysh(...) consistently for interface and BGP operations.
- Skip gracefully for non-EOS neighbor types instead of AttributeError - Build remove_lines conditionally (dut_ipv4 can be None on IPv6-only) - Hard-fail teardown if BGP sessions don't re-establish after config reload Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines failed to run 1 pipeline(s). |
lolyu
left a comment
There was a problem hiding this comment.
📊 Overview
Files Changed: 1 (tests/bgp/test_bgp_link_local.py)
Lines: +435
This fills a real coverage gap — unnumbered BGP over link-local is a supported SONiC/FRR feature with no prior test. The structure is solid: fixture-based setup/teardown, graceful skips for non-EOS neighbors, wait_until for session establishment, and @pytest.mark.disable_loganalyzer. The author has addressed the previous Copilot bot comments (conditional remove_lines, non-EOS skip, teardown pytest_assert). This review focuses on remaining open issues and newly identified ones.
✅ Strengths
- Good test structure:
setup_infogathers facts cleanly,configure_unnumbered_bgphandles setup/teardown separation,bgp_unnumbered_establishedis a clean polling predicate - Graceful
pytest.skip()for missing neighbor, non-EOS testbed, and no PortChannel found - Comprehensive debug output on failure (BGP summary + neighbor detail from both DUT and EOS)
wait_untilused consistently for session establishment — no baretime.sleepfor the critical pathconfig_reloadon DUT teardown is the right approach for complete state restoration
📝 Review Findings
(Not repeating existing open comments on EOS backup/restore or multi-ASIC — those are already tracked)
⚠️ Major Issues
-
Route verification can false-positive on any existing fe80 peer —
'fe80' in peer.lower()matches all link-local BGP peers, not just the selected PortChannel's. On multi-PortChannel T0s this can pass even when the configured session has no routes. (See inline comment at line 395) -
EOS cleanup exception is silently swallowed — if
eos_configfails during teardown, EOS remains in a partial state (link-local neighbor still configured, original neighbors not restored), but code continues toconfig_reloadand then thecheck_bgp_session_statewait times out with a confusing error. (See inline comment at line 320)
📝 Minor Issues
-
dut_asn = common_props['dut_asn']raises an unguardedKeyError— should use.get()with apytest.skip()for a clean error message. (See inline comment at line 84) -
pc_intfsis assigned but never used — dead variable fromconfig_facts. (See inline comment at line 93) -
time.sleep(10)in test body — arbitrary sleep for route convergence with no comment or named constant. (See inline comment at line 382)
💡 Suggestions
'Established' in stdoutsubstring match is fragile — consider usingvtysh -c 'show bgp neighbors <intf> json'and checkingbgpStatefield programmatically for FRR-version resilience. (See inline comment at line 430)
🧪 Testing
- Tested on KVM (500s runtime), which is appropriate for a slow BGP convergence test
- CI pipeline eventually failed — worth confirming whether this is infrastructure flakiness or a code issue
- The open EOS backup/restore thread remains unresolved — acceptable for KVM but should be tracked as a follow-up before running on physical testbeds with richer EOS configs
Recommendations
- Fix the fe80 false-positive route match (major — correctness issue)
- Harden EOS cleanup failure handling in teardown (major — can mask real failures)
- Guard
dut_asnaccess with.get()+pytest.skip()(minor) - Remove unused
pc_intfs(minor, one line)
Status
🤖 Generated with GitHub Copilot
| .format(portchannel, WAIT_TIMEOUT)) | ||
|
|
||
| logger.info("Unnumbered BGP session established!") | ||
|
|
There was a problem hiding this comment.
if (portchannel.lower() in peer.lower()
or 'fe80' in peer.lower()):The 'fe80' in peer.lower() branch matches any link-local peer in the BGP summary, not just the one on the selected PortChannel. On a T0 with multiple PortChannels that already have link-local sessions (e.g., IPv6 ECMP), this assertion can pass even if the specific unnumbered session you just configured has no routes.
Fix: only match on the specific interface or derive the expected fe80 address from the bgp_unnumbered_established result and compare exactly.
# More precise: only match peers identified by this portchannel
if portchannel.lower() in peer.lower():
...| # --- Teardown --- | ||
| logger.info("Teardown: Restoring original configuration") | ||
|
|
||
| # Remove link-local neighbor from EOS and restore originals |
There was a problem hiding this comment.
except Exception as e:
logger.warning("EOS cleanup failed: %s", e)If eos_config fails (network error, EOS refuses the config, etc.), teardown logs a warning and continues to config_reload. The DUT is then restored, but EOS still has the link-local neighbor stanza. The subsequent wait_until(duthost.check_bgp_session_state, original_neighbors) will likely time out because EOS hasn't re-added the original neighbor <ip> remote-as lines — causing a confusing fixture teardown error that masks the real root cause.
At minimum, re-raise or use pytest.fail() after logging so the failure is visible. Alternatively, add a verification step after eos_config to confirm the cleanup succeeded before proceeding to config_reload.
|
|
||
| duthost = duthosts[rand_one_dut_hostname] | ||
| dut_asn = common_props['dut_asn'] | ||
|
|
There was a problem hiding this comment.
📝 Minor: dut_asn access will raise an unguarded KeyError
dut_asn = common_props['dut_asn']If dut_asn is absent from common_props (e.g., testbed YAML missing the key), this raises a KeyError instead of a clean skip or informative error. Use .get() and skip explicitly:
dut_asn = common_props.get('dut_asn')
if not dut_asn:
pytest.skip("dut_asn not found in testbed configuration_properties")| pc_intfs = config_facts.get('PORTCHANNEL_INTERFACE', {}) | ||
|
|
||
| # Build map: neighbor name -> PortChannel | ||
| pc_to_neighbor = {} |
There was a problem hiding this comment.
📝 Minor: pc_intfs is assigned but never used
pc_intfs = config_facts.get('PORTCHANNEL_INTERFACE', {})This variable is fetched from config_facts but never referenced anywhere in setup_info. Remove it to avoid confusion about whether it's intended to be used.
| neigh_host = setup_info['neigh_host'] | ||
| eos_bgp = neigh_host.eos_command( | ||
| commands=["show ip bgp summary"]) | ||
| eos_out = eos_bgp['stdout'][0] if isinstance( |
There was a problem hiding this comment.
📝 Minor: Hardcoded time.sleep(10) with no justification
time.sleep(10)An arbitrary 10-second sleep after confirming the BGP session is established has no documented rationale. If the goal is to allow route convergence, wait_until with a route-count check is more reliable. If 10s is empirically needed on slow testbeds, add a comment explaining why and consider making it a named constant.
# Allow time for route convergence after session establishment
ROUTE_CONVERGENCE_WAIT = 10
time.sleep(ROUTE_CONVERGENCE_WAIT)|
|
||
| # Verify session details | ||
| logger.info("Verify session details") | ||
| detail = duthost.shell( |
There was a problem hiding this comment.
💡 Suggestion: show bgp neighbors <intf> string match is fragile
pytest_assert('Established' in detail.get('stdout', ''),
"BGP neighbor detail does not show Established state")Matching 'Established' as a substring of the raw text output of vtysh works today, but text formats change across FRR versions. A more robust approach is to use the JSON form and check the state field:
detail_json = duthost.shell(
"vtysh -c 'show bgp neighbors {} json'".format(portchannel),
module_ignore_errors=True)
if detail_json['rc'] == 0:
nbr_data = json.loads(detail_json['stdout'])
# JSON key is the fe80 address or interface name
for nbr_key, nbr_info in nbr_data.items():
state = nbr_info.get('bgpState', '')
pytest_assert(state == 'Established', ...)- Fix fe80 false-positive: only match peers by PortChannel interface name, not any link-local peer in BGP summary - Harden EOS cleanup: fail visibly instead of silently swallowing errors - Guard dut_asn access with .get() and pytest.skip() - Remove unused pc_intfs variable - Add comment explaining time.sleep(10) for route convergence Addresses review feedback from lolyu. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Thanks @lolyu for the thorough review! Here's the status on each point:
Please take another look! |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Replace two time.sleep calls with proper wait_until polling: - sleep(3) after removing old neighbor -> wait_until checking neighbor removed - sleep(10) for route convergence -> wait_until checking pfxRcd > 0 Remove unused time import. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
What is the motivation for this PR?
Currently there is no test for BGP peering over IPv6 link-local addresses (unnumbered BGP). This is a supported feature in FRR/SONiC but has no test coverage in sonic-mgmt.
Addresses test gap issue #18431.
How did you do it?
Added tests/bgp/test_bgp_link_local.py with one test case (test_bgp_link_local_ipv6) that:
Uses vtysh JSON output to detect unnumbered peers since the Ansible bgp_facts module may not parse them.
How did you verify/test it?
Tested on KVM testbed (vms-kvm-t0, T0 topology with converged cEOS peers):
bgp/test_bgp_link_local.py::test_bgp_link_local_ipv6 PASSED
1 passed in 500.13s (8:20)
Signed-off-by: Ying Xie ying.xie@microsoft.com