Skip to content

[bgp] Add test for BGPv6 peering over link-local addresses (unnumbered)#22600

Open
yxieca wants to merge 8 commits intosonic-net:masterfrom
yxieca:test/bgp-link-local
Open

[bgp] Add test for BGPv6 peering over link-local addresses (unnumbered)#22600
yxieca wants to merge 8 commits intosonic-net:masterfrom
yxieca:test/bgp-link-local

Conversation

@yxieca
Copy link
Collaborator

@yxieca yxieca commented Feb 25, 2026

What is the motivation for this PR?

Currently there is no test for BGP peering over IPv6 link-local addresses (unnumbered BGP). This is a supported feature in FRR/SONiC but has no test coverage in sonic-mgmt.

Addresses test gap issue #18431.

How did you do it?

Added tests/bgp/test_bgp_link_local.py with one test case (test_bgp_link_local_ipv6) that:

  1. Selects a PortChannel with an existing BGP neighbor
  2. Removes global-IP BGP sessions on both DUT and cEOS peer
  3. Configures unnumbered BGP on DUT (neighbor interface remote-as)
  4. Configures link-local BGP neighbor on EOS peer (fe80 address with interface scope)
  5. Verifies the BGP session establishes via IPv6 link-local
  6. Verifies routes are exchanged over the link-local session
  7. Restores original config via config reload

Uses vtysh JSON output to detect unnumbered peers since the Ansible bgp_facts module may not parse them.

How did you verify/test it?

Tested on KVM testbed (vms-kvm-t0, T0 topology with converged cEOS peers):

bgp/test_bgp_link_local.py::test_bgp_link_local_ipv6 PASSED
1 passed in 500.13s (8:20)

Signed-off-by: Ying Xie ying.xie@microsoft.com

Add test to verify BGP peering over IPv6 link-local addresses (unnumbered
BGP) can be established via PortChannel interfaces.

Test steps:
1. Select a PortChannel with an existing BGP neighbor
2. Remove global-IP BGP sessions on both DUT and cEOS peer
3. Configure unnumbered BGP (neighbor <interface> interface remote-as)
4. Configure link-local neighbor on EOS peer
5. Verify session establishes and routes are exchanged
6. Restore original config via config reload

Addresses test gap issue sonic-net#18431.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

- Restrict to T0 only (T1 may lack default route, failing conftest)
- Reduce WAIT_TIMEOUT from 180s to 120s
- Reduce config_reload wait from 360s to 120s
- Reduce route exchange sleep from 15s to 10s
- Total test time reduced from 8m20s to 4m15s

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

- Remove device_type('vs') so test runs on physical testbeds too
- Skip instead of fail when unnumbered session doesn't establish
  (prevents sanity check failures for subsequent test modules)

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Move BGP configuration changes and cleanup into a
configure_unnumbered_bgp fixture with yield, separating test
logic from setup/teardown. This ensures cleanup always runs
via pytest's fixture teardown mechanism.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

The failure message now points out the possibility of a neighbor
compatibility issue, since the peer may not support BGP peering
over IPv6 link-local addresses. The fixture teardown restores the
original configuration and verifies normal BGP sessions recover
before the failure is reported.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new pytest-based BGP test to validate IPv6 link-local (unnumbered) BGP peering behavior in SONiC, closing a known coverage gap in sonic-mgmt’s BGP test suite.

Changes:

  • Introduce test_bgp_link_local_ipv6 to reconfigure an existing PortChannel BGP peer into unnumbered/link-local peering and verify session + route exchange.
  • Add helper logic to identify a suitable PortChannel neighbor and to detect established unnumbered sessions via vtysh JSON.

Comment on lines +265 to +270
logger.info("Remove DUT neighbors on EOS peer")
remove_lines = ["no neighbor {}".format(dut_ipv4)]
if dut_ipv6:
remove_lines.append("no neighbor {}".format(dut_ipv6))
neigh_host.eos_config(lines=remove_lines,
parents="router bgp {}".format(neigh_asn))
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove_lines always includes no neighbor {dut_ipv4} even when dut_ipv4 wasn't found (it can be None if the PortChannel has only IPv6 configured in config DB). That would push an invalid EOS config line (no neighbor None) and fail the setup. Build remove_lines conditionally (only add the IPv4/IPv6 neighbor lines when the corresponding IP is present).

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit c89a93e:

  1. Conditional remove_lines — now only adds no neighbor lines when dut_ipv4/dut_ipv6 are not None.
  2. EOS backup/restore — good point about losing pre-existing per-neighbor settings. The DUT side uses config_reload which fully restores. The EOS side currently restores just neighbor <ip> remote-as which is sufficient for the KVM/vEOS testbed topology (no custom timers/route-maps). I'll note this as a future improvement.
  3. Teardown hard failure — changed from logger.error to pytest_assert(False, ...) so it surfaces as a test failure.
  4. Non-EOS skip — added neighbor_type check at the top of setup_info, skips gracefully instead of AttributeError.
  5. Multi-ASIC — this test is currently t0-only topology which is single-ASIC. Will add ASIC namespace awareness if we expand to t1/t2.

Thanks for the review!

Comment on lines +308 to +318
# Remove link-local neighbor from EOS and restore originals
try:
cleanup_lines = ["no neighbor {}".format(eos_neighbor)]
if dut_ipv4:
cleanup_lines.append(
"neighbor {} remote-as {}".format(dut_ipv4, dut_asn))
if dut_ipv6:
cleanup_lines.append(
"neighbor {} remote-as {}".format(dut_ipv6, dut_asn))
neigh_host.eos_config(lines=cleanup_lines,
parents="router bgp {}".format(neigh_asn))
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For EOS neighbors, this test removes the existing neighbor stanzas but only restores them with neighbor <ip> remote-as <asn>. Any pre-existing per-neighbor settings (timers, route-maps, password, maximum-routes, etc.) would be lost, which can impact subsequent tests. Other BGP tests typically eos_config(backup=True, ...) before changes and load_configuration(...) in teardown; consider using the same backup/restore approach here.

Copilot uses AI. Check for mistakes.
Comment on lines +329 to +333
if not wait_until(WAIT_TIMEOUT, POLL_INTERVAL, 0,
duthost.check_bgp_session_state, original_neighbors):
logger.error("BGP sessions did not re-establish after config reload")
else:
logger.info("Original configuration restored, all sessions up")
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In teardown, if BGP sessions fail to re-establish after config_reload, the fixture only logs an error and continues. That can leave the DUT in a bad state while still reporting the test as passed, and may break later tests in the run. Please make this a hard failure (e.g., pytest_assert(...) / raise) when the sessions do not come back up.

Suggested change
if not wait_until(WAIT_TIMEOUT, POLL_INTERVAL, 0,
duthost.check_bgp_session_state, original_neighbors):
logger.error("BGP sessions did not re-establish after config reload")
else:
logger.info("Original configuration restored, all sessions up")
sessions_restored = wait_until(
WAIT_TIMEOUT,
POLL_INTERVAL,
0,
duthost.check_bgp_session_state,
original_neighbors
)
pytest_assert(
sessions_restored,
"BGP sessions did not re-establish after config reload"
)
logger.info("Original configuration restored, all sessions up")

Copilot uses AI. Check for mistakes.
Comment on lines +71 to +74
def setup_info(duthosts, rand_one_dut_hostname, nbrhosts, tbinfo):
"""Gather setup information for the link-local BGP test."""
duthost = duthosts[rand_one_dut_hostname]
dut_asn = tbinfo['topo']['properties']['configuration_properties']['common']['dut_asn']
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test assumes the selected neighbor is an EOS/cEOS host (uses eos_command/eos_config). When the testbed is run with neighbor_type=sonic (or any non-EOS neighbor), this fixture will raise an AttributeError instead of skipping. Consider taking request in this fixture and skipping unless neighbor_type is "eos" (or isinstance(neigh_host, EosHost)), similar to other BGP tests that support multiple neighbor types.

Suggested change
def setup_info(duthosts, rand_one_dut_hostname, nbrhosts, tbinfo):
"""Gather setup information for the link-local BGP test."""
duthost = duthosts[rand_one_dut_hostname]
dut_asn = tbinfo['topo']['properties']['configuration_properties']['common']['dut_asn']
def setup_info(duthosts, rand_one_dut_hostname, nbrhosts, tbinfo, request):
"""Gather setup information for the link-local BGP test."""
# This test currently assumes EOS/cEOS neighbors because it uses eos_command/eos_config
# on the neighbor host. Skip gracefully when running with non-EOS neighbors (e.g. SONiC).
common_props = tbinfo.get('topo', {}).get('properties', {}).get('configuration_properties', {}).get('common', {})
neighbor_type = common_props.get('neighbor_type', 'eos')
if neighbor_type.lower() != 'eos':
pytest.skip("BGP link-local test requires EOS neighbors; current neighbor_type is '{}'".format(neighbor_type))
duthost = duthosts[rand_one_dut_hostname]
dut_asn = common_props['dut_asn']

Copilot uses AI. Check for mistakes.
Comment on lines +132 to +137
# Get DUT's link-local on this PortChannel
result = duthost.shell("ip -6 addr show dev {} scope link".format(selected['pc']))
dut_link_local = None
for line in result['stdout'].split('\n'):
if 'inet6 fe80' in line.strip():
dut_link_local = line.strip().split()[1].split('/')[0]
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On multi-ASIC DUTs, commands like ip -6 addr ... and subsequent vtysh calls may need to run in the correct ASIC namespace for the selected PortChannel. Right now everything runs via duthost.shell(...) without selecting an ASIC, which can cause the test to fail on multi-ASIC platforms. Consider determining the ASIC for selected['pc'] (e.g., via duthost.get_asic_index_for_portchannel() / get_port_asic_instance()) and storing an asichost in setup_info, then use asichost.command/shell and asichost.run_vtysh(...) consistently for interface and BGP operations.

Copilot uses AI. Check for mistakes.
- Skip gracefully for non-EOS neighbor types instead of AttributeError
- Build remove_lines conditionally (dut_ipv4 can be None on IPv6-only)
- Hard-fail teardown if BGP sessions don't re-establish after config reload

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines failed to run 1 pipeline(s).

Copy link
Contributor

@lolyu lolyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📊 Overview

Files Changed: 1 (tests/bgp/test_bgp_link_local.py)
Lines: +435

This fills a real coverage gap — unnumbered BGP over link-local is a supported SONiC/FRR feature with no prior test. The structure is solid: fixture-based setup/teardown, graceful skips for non-EOS neighbors, wait_until for session establishment, and @pytest.mark.disable_loganalyzer. The author has addressed the previous Copilot bot comments (conditional remove_lines, non-EOS skip, teardown pytest_assert). This review focuses on remaining open issues and newly identified ones.


✅ Strengths

  • Good test structure: setup_info gathers facts cleanly, configure_unnumbered_bgp handles setup/teardown separation, bgp_unnumbered_established is a clean polling predicate
  • Graceful pytest.skip() for missing neighbor, non-EOS testbed, and no PortChannel found
  • Comprehensive debug output on failure (BGP summary + neighbor detail from both DUT and EOS)
  • wait_until used consistently for session establishment — no bare time.sleep for the critical path
  • config_reload on DUT teardown is the right approach for complete state restoration

📝 Review Findings

(Not repeating existing open comments on EOS backup/restore or multi-ASIC — those are already tracked)

⚠️ Major Issues

  1. Route verification can false-positive on any existing fe80 peer'fe80' in peer.lower() matches all link-local BGP peers, not just the selected PortChannel's. On multi-PortChannel T0s this can pass even when the configured session has no routes. (See inline comment at line 395)

  2. EOS cleanup exception is silently swallowed — if eos_config fails during teardown, EOS remains in a partial state (link-local neighbor still configured, original neighbors not restored), but code continues to config_reload and then the check_bgp_session_state wait times out with a confusing error. (See inline comment at line 320)

📝 Minor Issues

  1. dut_asn = common_props['dut_asn'] raises an unguarded KeyError — should use .get() with a pytest.skip() for a clean error message. (See inline comment at line 84)

  2. pc_intfs is assigned but never used — dead variable from config_facts. (See inline comment at line 93)

  3. time.sleep(10) in test body — arbitrary sleep for route convergence with no comment or named constant. (See inline comment at line 382)

💡 Suggestions

  1. 'Established' in stdout substring match is fragile — consider using vtysh -c 'show bgp neighbors <intf> json' and checking bgpState field programmatically for FRR-version resilience. (See inline comment at line 430)

🧪 Testing

  • Tested on KVM (500s runtime), which is appropriate for a slow BGP convergence test
  • CI pipeline eventually failed — worth confirming whether this is infrastructure flakiness or a code issue
  • The open EOS backup/restore thread remains unresolved — acceptable for KVM but should be tracked as a follow-up before running on physical testbeds with richer EOS configs

Recommendations

  1. Fix the fe80 false-positive route match (major — correctness issue)
  2. Harden EOS cleanup failure handling in teardown (major — can mask real failures)
  3. Guard dut_asn access with .get() + pytest.skip() (minor)
  4. Remove unused pc_intfs (minor, one line)

Status

⚠️ Minor suggestions / requests for changes — the fe80 route false-positive is the main correctness concern; everything else is minor or already tracked

🤖 Generated with GitHub Copilot

.format(portchannel, WAIT_TIMEOUT))

logger.info("Unnumbered BGP session established!")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Major: Route verification can false-positive on any existing fe80 peer

if (portchannel.lower() in peer.lower()
        or 'fe80' in peer.lower()):

The 'fe80' in peer.lower() branch matches any link-local peer in the BGP summary, not just the one on the selected PortChannel. On a T0 with multiple PortChannels that already have link-local sessions (e.g., IPv6 ECMP), this assertion can pass even if the specific unnumbered session you just configured has no routes.

Fix: only match on the specific interface or derive the expected fe80 address from the bgp_unnumbered_established result and compare exactly.

# More precise: only match peers identified by this portchannel
if portchannel.lower() in peer.lower():
    ...

# --- Teardown ---
logger.info("Teardown: Restoring original configuration")

# Remove link-local neighbor from EOS and restore originals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Major: EOS cleanup failure is silently swallowed — leaves EOS in bad state

except Exception as e:
    logger.warning("EOS cleanup failed: %s", e)

If eos_config fails (network error, EOS refuses the config, etc.), teardown logs a warning and continues to config_reload. The DUT is then restored, but EOS still has the link-local neighbor stanza. The subsequent wait_until(duthost.check_bgp_session_state, original_neighbors) will likely time out because EOS hasn't re-added the original neighbor <ip> remote-as lines — causing a confusing fixture teardown error that masks the real root cause.

At minimum, re-raise or use pytest.fail() after logging so the failure is visible. Alternatively, add a verification step after eos_config to confirm the cleanup succeeded before proceeding to config_reload.


duthost = duthosts[rand_one_dut_hostname]
dut_asn = common_props['dut_asn']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Minor: dut_asn access will raise an unguarded KeyError

dut_asn = common_props['dut_asn']

If dut_asn is absent from common_props (e.g., testbed YAML missing the key), this raises a KeyError instead of a clean skip or informative error. Use .get() and skip explicitly:

dut_asn = common_props.get('dut_asn')
if not dut_asn:
    pytest.skip("dut_asn not found in testbed configuration_properties")

pc_intfs = config_facts.get('PORTCHANNEL_INTERFACE', {})

# Build map: neighbor name -> PortChannel
pc_to_neighbor = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Minor: pc_intfs is assigned but never used

pc_intfs = config_facts.get('PORTCHANNEL_INTERFACE', {})

This variable is fetched from config_facts but never referenced anywhere in setup_info. Remove it to avoid confusion about whether it's intended to be used.

neigh_host = setup_info['neigh_host']
eos_bgp = neigh_host.eos_command(
commands=["show ip bgp summary"])
eos_out = eos_bgp['stdout'][0] if isinstance(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Minor: Hardcoded time.sleep(10) with no justification

time.sleep(10)

An arbitrary 10-second sleep after confirming the BGP session is established has no documented rationale. If the goal is to allow route convergence, wait_until with a route-count check is more reliable. If 10s is empirically needed on slow testbeds, add a comment explaining why and consider making it a named constant.

# Allow time for route convergence after session establishment
ROUTE_CONVERGENCE_WAIT = 10
time.sleep(ROUTE_CONVERGENCE_WAIT)


# Verify session details
logger.info("Verify session details")
detail = duthost.shell(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: show bgp neighbors <intf> string match is fragile

pytest_assert('Established' in detail.get('stdout', ''),
              "BGP neighbor detail does not show Established state")

Matching 'Established' as a substring of the raw text output of vtysh works today, but text formats change across FRR versions. A more robust approach is to use the JSON form and check the state field:

detail_json = duthost.shell(
    "vtysh -c 'show bgp neighbors {} json'".format(portchannel),
    module_ignore_errors=True)
if detail_json['rc'] == 0:
    nbr_data = json.loads(detail_json['stdout'])
    # JSON key is the fe80 address or interface name
    for nbr_key, nbr_info in nbr_data.items():
        state = nbr_info.get('bgpState', '')
        pytest_assert(state == 'Established', ...)

- Fix fe80 false-positive: only match peers by PortChannel interface name,
  not any link-local peer in BGP summary
- Harden EOS cleanup: fail visibly instead of silently swallowing errors
- Guard dut_asn access with .get() and pytest.skip()
- Remove unused pc_intfs variable
- Add comment explaining time.sleep(10) for route convergence

Addresses review feedback from lolyu.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions bot requested a review from lolyu March 2, 2026 17:47
@yxieca
Copy link
Collaborator Author

yxieca commented Mar 3, 2026

Thanks @lolyu for the thorough review! Here's the status on each point:

  1. fe80 false-positive in route verification — Fixed: the route verification now only matches portchannel.lower() in peer.lower() (line ~419), no longer matches any arbitrary fe80 peer.

  2. EOS cleanup failure silently swallowed — Fixed: the exception is now caught, logged with logger.error, and a eos_cleanup_failed flag is set. After config_reload completes, pytest.fail() is called if the flag is set (line ~354), so the failure is always visible.

  3. dut_asn KeyError — Fixed: uses .get() with explicit pytest.skip() if not found (lines ~83-85).

  4. Unused pc_intfs — Removed.

  5. Hardcoded time.sleep(10) — Added comment explaining rationale: 'Allow time for route convergence after session establishment' (line ~408). The 10s sleep is a convergence buffer after wait_until confirms the session is Established — replacing with wait_until on route count would add complexity for minimal benefit since we already assert route_count > 0 right after.

  6. Fragile string match for Established — Valid point. This is a secondary confirmation after the wait_until check already verified the session via JSON summary. I'll keep this as a minor enhancement for a follow-up since the primary check is already robust.

Please take another look!

@yxieca
Copy link
Collaborator Author

yxieca commented Mar 5, 2026

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Replace two time.sleep calls with proper wait_until polling:
- sleep(3) after removing old neighbor -> wait_until checking neighbor removed
- sleep(10) for route convergence -> wait_until checking pfxRcd > 0
Remove unused time import.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants