Link Flap: Adding interface and drop counter checks for continuous link flap test.#22581
rraghav-cisco wants to merge 80 commits into sonic-net:master
Conversation
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
AI agent on behalf of Ying: Review notes:
Otherwise the approach looks reasonable.
@rraghav-cisco can you commit with the -s option so that the commits are signed (avoid the DCO failure)?
…lid syntax" (sonic-net#22189)
sonic-net/sonic-buildimage#24691 upgraded the sflowtool version last week. The new version includes the change sflow/sflowtool@6805817, which sets the socket buffer size (in case the default is too low). That change causes sflowtool to print messages such as "requestSocketBuffer(sflow/v6 in): already have 31457280" to stderr, and the PTF test redirects stderr to stdout (stderr=subprocess.STDOUT). These non-JSON lines get mixed into the output file, which is expected to contain only JSON, and the parser crashes when it tries to parse such a line as a Python literal.
Fix: redirect stderr to /dev/null instead of stdout (these are informational messages, not critical errors).
Signed-off-by: Vinod <vkjammala@arista.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
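The effect of the redirect can be reproduced in isolation. The sketch below is not the PTF test code: the child process is a hypothetical stand-in for sflowtool, and `run_and_parse` is an illustrative helper name. It only shows that discarding stderr keeps the captured output parseable, while merging stderr into stdout does not.

```python
import ast
import subprocess
import sys

# Hypothetical child process standing in for sflowtool: it writes one
# informational line to stderr and one dict-like record to stdout.
CHILD = (
    "import sys;"
    "sys.stderr.write('requestSocketBuffer(sflow/v6 in): already have 31457280\\n');"
    "print({'datagramSourceIP': '10.0.0.1'})"
)


def run_and_parse(stderr_target):
    """Run the child, capture stdout, and parse each line as a Python literal."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        stderr=stderr_target,
        text=True,
        check=True,
    )
    return [ast.literal_eval(line) for line in proc.stdout.splitlines() if line.strip()]


# stderr discarded: only the record reaches the parser.
records = run_and_parse(subprocess.DEVNULL)

# stderr merged into stdout: the informational line cannot be parsed.
try:
    run_and_parse(subprocess.STDOUT)
    merged_parse_ok = True
except (SyntaxError, ValueError):
    merged_parse_ok = False
```

`subprocess.DEVNULL` is the portable equivalent of redirecting to /dev/null, so no file handle needs to be opened or closed by the caller.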
…ic-net#22562)
What is the motivation for this PR?
Add a new test case to verify LLDP neighbors are fully restored after config reload. Addresses test gap issue sonic-net#22376. Related fix PR: sonic-net/sonic-buildimage#25436
How did you do it?
Added test_lldp_after_config_reload to tests/lldp/test_lldp.py that:
1. Records LLDP neighbors before config reload
2. Performs config reload (safe_reload)
3. Waits for LLDP neighbors to be restored
4. Verifies all neighbors present with matching names
5. Verifies Chassis ID type is MAC
6. Verifies Chassis MAC matches management interface MAC
7. Checks lldpcli show interfaces and syslog for errors
How did you verify/test it?
lldp/test_lldp.py::test_lldp_after_config_reload[vlab-01-None] PASSED
1 passed, 83 warnings in 221.39s (0:03:41)
Tested on KVM testbed (T0, converged peers).
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…nic-net#22361)
What is the motivation for this PR?
SONiC requires the hardware watchdog to be enabled on all platforms to ensure system recovery from hangs. This test catches platforms where the watchdog is missing or disabled. Fixes: sonic-net#21686
How did you do it?
Added test_hw_watchdog.py with two test cases:
1. test_hw_watchdog_supported - Verifies watchdogutil is available
2. test_hw_watchdog_armed - Verifies the hardware watchdog is armed
By default, an unarmed watchdog produces a warning and skips. Pass --strict_watchdog to treat it as a test failure.
How did you verify/test it?
platform_tests/test_hw_watchdog.py::test_hw_watchdog_supported[strtk5-7260-01] PASSED
platform_tests/test_hw_watchdog.py::test_hw_watchdog_armed[strtk5-7260-01] SKIPPED (Watchdog is not armed, use --strict_watchdog to enforce)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Signed-off-by: Tejaswini Chadaga <tchadaga@microsoft.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
What is the motivation for this PR Replace .semgrepignore with targeted inline nosemgrep comments for 28 legacy infrastructure files (ansible, spytest). This addresses the semgrep findings without blanket directory-level suppression. How did you do it Added inline nosemgrep annotations to 15 files and reformatted to keep lines within 120 chars. No functional code changes. How did you verify/test it Not provided in PR description. Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…a… (sonic-net#22390) What is the motivation for this PR Not provided in PR description. How did you do it Added new pytest fixtures. How did you verify/test it Ran the tests and verified the DPU Tables HA Scope and HA Set. Signed-off-by: nnelluri-cisco <nnelluri@cisco.com> Signed-off-by: nnelluri <nnelluri@cisco.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…onic-net#22570)
Description of PR
The parameter "hostname" in variables.override.yml is defined as "host_name". This causes a KeyError in the function get_snappi_ports_for_rdma in snappi_fixtures.py, in the following code:
    for port in snappi_port_list:
        for var_rx_port in var_rx_ports:
            if port['peer_port'] == var_rx_port['port_name'] and port['peer_device'] == var_rx_port['hostname']:
                rx_snappi_ports.append(port)
        for var_tx_port in var_tx_ports:
            if port['peer_port'] == var_tx_port['port_name'] and port['peer_device'] == var_tx_port['hostname']:
                tx_snappi_ports.append(port)
Summary: Fixes sonic-net#22360
Type of change
Bug fix
Testbed and Framework(new/improvement)
New Test case
Skipped for non-supported platforms
Test case improvement
Approach
What is the motivation for this PR?
Fixed a minor typo.
How did you do it?
Corrected the variable "host_name" to "hostname" in the yml file.
How did you verify/test it?
Ran it on a local branch with the fixed parameter.
Signed-off-by: amitpawa <amit.2.pawar@nokia.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
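The failure mode is easy to reproduce outside the fixture. The following is a reduced sketch, not the code in snappi_fixtures.py: `match_ports` is an illustrative helper, and the port and device names are made-up sample data. It shows that an entry keyed `host_name` raises a KeyError on the `hostname` lookup, and that the corrected key matches.

```python
def match_ports(snappi_port_list, var_ports):
    """Return the snappi ports whose peer matches an entry in var_ports."""
    matched = []
    for port in snappi_port_list:
        for var_port in var_ports:
            if (port['peer_port'] == var_port['port_name']
                    and port['peer_device'] == var_port['hostname']):
                matched.append(port)
    return matched


# Sample data (hypothetical names).
snappi_port_list = [{'peer_port': 'Ethernet0', 'peer_device': 'dut-1'}]

# Entry as written before the fix: the key is misspelled.
broken_entry = [{'port_name': 'Ethernet0', 'host_name': 'dut-1'}]
# Entry after the fix.
fixed_entry = [{'port_name': 'Ethernet0', 'hostname': 'dut-1'}]

try:
    match_ports(snappi_port_list, broken_entry)
    raised = None
except KeyError as exc:
    raised = exc.args[0]

matched = match_ports(snappi_port_list, fixed_entry)
```

Note that the KeyError only fires when the port name already matches, because `and` short-circuits; that can make the typo look intermittent depending on the port list.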
Summary: decap/test_decap.py failed for the IPv4inIPv6 and IPv6inIPv6 encapsulation combinations on cisco-8000. IP-in-IPv6 decap with dscp=pipe is not supported on Cisco Q200 and was blocked via sonic-mgmt/tests/common/plugins/conditional_mark/tests_mark_conditions.yaml at 7dabd91 · sonic-net/sonic-mgmt. The blocking stopped working after "[test-decap] Fix test decap for pipe mode" by developfast (Pull Request sonic-net#20304 · sonic-net/sonic-mgmt) removed test parametrization from conftest.py: the skip conditions in tests_mark_conditions.yaml for test_decap[ttl=pipe, dscp=pipe, vxlan=disable] no longer match the test name. As a result, test_decap runs in dscp=pipe mode on cisco-8000, because ipinip.json.j2 has dscp_mode=pipe configured.
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…sts (sonic-net#22582)
What is the motivation for this PR?
SmartSwitch testbeds contain DPU hosts that require NAT port forwarding through their associated NPU hosts for SSH access. Without this configuration, the testbed health checker cannot reach DPU hosts to verify their status.
How did you do it?
Added _get_testbed_dut_names() to read DUT hostnames from testbed file (no SSH required)
Refactored init_hosts() to:
- Separate NPU and DPU hostnames based on 'dpu' in name
- Initialize NPU hosts first (directly reachable)
- Enable NAT on NPU hosts using sonic-dpu-mgmt-traffic.sh
- Then initialize DPU hosts (now reachable via NAT)
Updated _get_dpu_name_ssh_port_dict() to accept dpu_hostnames parameter
Updated enable_nat_for_dpuhosts() to accept optional dpu_hostnames parameter
How did you verify/test it?
Tested on SmartSwitch testbed with DPU hosts - verified NAT configuration is applied and health checks complete successfully on both NPU and DPU hosts.
Any platform specific information?
Specific to SmartSwitch platforms with DPU hosts.
Supported testbed topology if it's a new test case?
N/A - This is a testbed infrastructure improvement.
Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
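The ordering described for init_hosts() can be sketched as a plan of steps. This is only an illustration of the sequencing, not the health checker's code: `split_hosts` and `init_order` are hypothetical helpers, the hostnames are sample data, and treating the 'dpu' match as case-insensitive is an assumption.

```python
def split_hosts(dut_names):
    """Separate NPU and DPU hostnames based on 'dpu' appearing in the name."""
    npu_hosts = [name for name in dut_names if 'dpu' not in name.lower()]
    dpu_hosts = [name for name in dut_names if 'dpu' in name.lower()]
    return npu_hosts, dpu_hosts


def init_order(dut_names):
    """Return the steps in the order the description lists them."""
    npu_hosts, dpu_hosts = split_hosts(dut_names)
    steps = [('init', name) for name in npu_hosts]         # directly reachable
    steps += [('enable_nat', name) for name in npu_hosts]  # opens the path to the DPUs
    steps += [('init', name) for name in dpu_hosts]        # now reachable via NAT
    return steps


# Sample hostnames (hypothetical).
plan = init_order(['smartswitch-01', 'smartswitch-01-dpu-0', 'smartswitch-01-dpu-1'])
```

Keeping the DPU initialization last matters because any SSH attempt to a DPU before NAT is enabled would fail.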
…opologies (sonic-net#21820) What is the motivation for this PR? This test triggers warm reboot which is broken on dualtor and can cause orchagent crash etc. How did you do it? Skipping the warm-reboot part of the test for dualtor How did you verify/test it? ran the test Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
The problem was introduced by PR sonic-net#20444, which added a cleanup of the ARP cache on both the active and standby TORs. That exposed a known issue, sonic-net/sonic-swss#2579, which causes a redundant route entry delete and reports an ERR log. This error can be safely ignored on the standby; verified manually.
Signed-off-by: dhanasekar-arista <dhanasekar@arista.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…onic-net#21760) * Fix acl [ipv6-ingress-uplink->downlink-*] cases for v6 topo Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…fined (sonic-net#22369)
What is the motivation for this PR?
PR sonic-net#21636 imported GLOBAL_PARAMS to refactor pr_test_template.yml, which caused the downstream template run-test-elastictest-template.yml to fail to receive the parameters correctly. While the top-level YAML could pass GLOBAL_PARAMS successfully to the first-level template, nested dynamic expansion inside the template does not reliably propagate key/value pairs, especially when variables or non-static values are involved. This PR aims to make parameter passing stable and predictable by replacing the dynamic each expansion with explicit parameter definitions.
[image]
How did you do it?
Revert dynamic GLOBAL_PARAMS expansion to explicit parameters for reliable template passing.
How did you verify/test it?
[image]
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
* [drop_packets] Replace time.sleep with wait_until in test_configurable_drop_counters Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> * [crm] Replace time.sleep with wait_until in test_crm Replace time.sleep calls with wait_until for CRM counter and config convergence. Improves reliability on KVM/VS platforms. Fixes sonic-net#20256 Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> --------- Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
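The idea behind swapping a fixed sleep for a poll can be sketched with a small helper. sonic-mgmt ships its own wait_until; the version below is a simplified stand-in written for this example, and its argument order (timeout, interval, delay, condition) is an assumption rather than a statement about the real helper.

```python
import time


def wait_until(timeout, interval, delay, condition, *args, **kwargs):
    """Poll condition() until it returns a truthy value or timeout expires.

    Simplified stand-in: sleeps `delay` seconds first, then checks the
    condition every `interval` seconds for at most `timeout` seconds.
    """
    if delay > 0:
        time.sleep(delay)
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical check that becomes true on the third poll, standing in
# for "the counter has converged".
polls = {"count": 0}


def counter_converged():
    polls["count"] += 1
    return polls["count"] >= 3


converged = wait_until(1, 0.01, 0, counter_converged)
timed_out = wait_until(0.05, 0.01, 0, lambda: False)
```

Compared with `time.sleep(N)`, the poll returns as soon as the condition holds and reports failure explicitly, which is why it tends to behave better on slow KVM/VS platforms.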
…rade (sonic-net#22354) What is the motivation for this PR The GNMI container is being hardened in sonic-buildimage PR #25089, replacing --privileged with specific Linux capabilities. The container_upgrade test suite needs to use the correct docker run parameters and include gNOI tests to verify functionality with the new config. How did you do it Updated parameters.json to add required flags (pid/userns/uts host, needed caps, and apparmor/seccomp unconfined) for docker-sonic-gnmi, and added gnxi gNOI system/file tests to testcases.json. How did you verify/test it Aligned with the tested config in PR #25089, verified on vlab-01 with: - docker exec gnmi nsenter -t 1 -m -n -p -u -i sonic-installer list - docker exec gnmi nsenter -t 1 -m -n -p -u -i reboot Signed-off-by: Dawei Huang <daweihuang@microsoft.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Approach
What is the motivation for this PR?
After upgrading scapy in the ptf container, a bug was introduced: BFDResponder generates a BFD packet with an auth field even when the auth flag is not enabled. The authentication field is appended to the end of the BFD packet without adjusting the UDP header length, which causes UDP checksum verification to fail. Here is the packet from PTF:
18:16:27.682014 IP6 fddd:a100:a0::a37:10.49157 > fc00:1::32.4784: UDP, bad length 35 > 24
0x0000: 225d a77e b78e 1e44 8b06 c367 86dd 6000
0x0010: 0000 0020 11ff fddd a100 00a0 0000 0000
0x0020: 0000 0a37 0010 fc00 0001 0000 0000 0000
0x0030: 0000 0000 0032 c005 12b0 002b 9c68 2080
0x0040: 0a18 cdba 0001 c349 ff6a 000f 4240 000f
0x0050: 4240 0000 0001 010b 0170 6173 7377 6f72
0x0060: 64
The scapy BFD issue is tracked in secdev/scapy#4937.
How did you do it?
Set optional_auth to None to get around the bug.
How did you verify/test it?
Verified with sonic-mgmt test.
Signed-off-by: Yue Gao <yuega2@cisco.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Update the qos sai test for the dualtor scenario. Previously we used the ptf32 topo to mock the t0 and t1 roles in the dualtor scenario by enabling DSCP remapping and configuring the corresponding neighbors. After this update, ptf32 is no longer needed: the dualtor scenario test can run directly on a real t1 topo or the dualtor topo. We still need to specify --qos_dual_tor=True in the pytest command to enable the dualtor scenario.
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…#22252) We should check that the DPUs are offline before checking that they are online after a kernel panic or memory exhaustion. Otherwise the DPU online check could pass even before the DPUs have rebooted, and the later critical services check will fail.
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
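The ordering requirement can be expressed as a small state check. This is a minimal sketch under stated assumptions: `wait_for_reboot` and the 'online'/'offline' strings are illustrative, and `poll_state` stands in for whatever call reports DPU status in the real test.

```python
def wait_for_reboot(poll_state, max_polls):
    """Return True only if 'offline' is observed before a later 'online'.

    poll_state is a hypothetical callable returning 'online' or 'offline'.
    """
    seen_offline = False
    for _ in range(max_polls):
        state = poll_state()
        if state == 'offline':
            seen_offline = True
        elif state == 'online' and seen_offline:
            return True
    return False


# DPU that actually reboots: online, then offline, then online again.
rebooting = iter(['online', 'offline', 'offline', 'online'])
rebooted = wait_for_reboot(lambda: next(rebooting), max_polls=4)

# DPU that never went down: a plain "is it online?" check would pass
# here, but this check does not.
stale = iter(['online', 'online', 'online'])
never_rebooted = wait_for_reboot(lambda: next(stale), max_polls=3)
```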
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…-net#21841) * Fix expected destination nexthop list for everflow ECMP test Signed-off-by: venu-nexthop <venu@nexthop.ai> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…nic-net#22340)
What is the motivation for this PR?
The previous implementation of check_config_applied() validated the mux state change by comparing the total number of entries in MUX_CABLE_TABLE with the number of interfaces being updated. However, MUX_CABLE_TABLE contains entries for all mux ports in the system (e.g., 24 ports in T0 KVM), and that number always remains the same. When only a subset of interfaces is updated, the check fails consistently and wait_until retries until the 2-minute timeout.
How did you do it?
By checking the state of each interface instead of comparing the total number of table entries.
How did you verify/test it?
Before the improvement: https://elastictest.org/scheduler/testplan/698bf70cbb7d1dad4c804782?searchTestCase=test_orchagent_active_tor_downstream&testcase=dualtor%2Ftest_orchagent_active_tor_downstream.py&type=log&leftSideViewMode=detail
Signed-off-by: yawenni <yawenni@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
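The difference between the two checks can be shown on a toy table. This sketch is not the test's implementation: the function names, the 'state' field name, and the port names are assumptions chosen for illustration, and the table is a plain dict rather than a database query.

```python
def count_based_check(mux_cable_table, interfaces):
    """Old-style check: compares table size with the number of interfaces."""
    return len(mux_cable_table) == len(interfaces)


def per_interface_check(mux_cable_table, interfaces, expected_state):
    """New-style check: every updated interface must report expected_state."""
    return all(
        mux_cable_table.get(intf, {}).get('state') == expected_state
        for intf in interfaces
    )


# 24 mux ports in the table, all standby except the two being toggled.
table = {'Ethernet{}'.format(i * 4): {'state': 'standby'} for i in range(24)}
updated = ['Ethernet0', 'Ethernet4']
for intf in updated:
    table[intf]['state'] = 'active'

old_result = count_based_check(table, updated)
new_result = per_interface_check(table, updated, 'active')
```

With a subset update the count-based check can never succeed (24 is not 2), which matches the described behaviour of polling until the timeout.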
…ic-net#22589) * [acl] Replace time.sleep with wait_until in test_src_mac_rewrite Replace 10 time.sleep calls with wait_until using proper check functions for ACL table, rule, VXLAN tunnel, and counter convergence in CONFIG_DB and STATE_DB. Improves test reliability on KVM/VS platforms. Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> * Address review: add wait after configure_vxlan_switch Add a wait_until check after configure_vxlan_switch() to verify the SWITCH_TABLE config has propagated, rather than leaving no delay. Note: There is no config_reload in create_vxlan_vnet_config() -- config is applied via sonic-cfggen --write-to-db, and the VXLAN tunnel presence is already verified with wait_until before proceeding. Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> * [acl] Assert wait_until for VXLAN switch config and clarify no config_reload in path Address CHANGES_REQUESTED from StormLiangMS: - Assert the wait_until after configure_vxlan_switch to catch silent failures - Note: There is no config_reload in create_vxlan_vnet_config — config is applied via sonic-cfggen --write-to-db, and VXLAN tunnel presence is already verified before calling configure_vxlan_switch. Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> --------- Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Signed-off-by: SRAVANI KANASANI <kanasanis@google.com> Co-authored-by: kishanps <kishanps@google.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
) * Adding l3_admit_test and smoke_test to sonic-mgmt. Signed-off-by: SRAVANI KANASANI <kanasanis@google.com> * CGO interface for testhelper through thinkit. Signed-off-by: SRAVANI KANASANI <kanasanis@google.com> --------- Signed-off-by: SRAVANI KANASANI <kanasanis@google.com> Co-authored-by: kishanps <kishanps@google.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…onic-net#22509) What is the motivation for this PR Pytest 9.0.2 fails to determine rootdir/conftest when no explicit test path is provided; pretest/posttest/bsl invocations need to avoid unrecognized arguments errors. How did you do it Added to pytest commands that lacked it (pretest, posttest, bsl). How did you verify/test it Not provided in PR description. Signed-off-by: markxiao <markxiao@arista.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…onic-net#22460) What is the motivation for this PR The previous test didn’t reliably detect privileged containers; it only checked partition block devices and could misclassify containers. How did you do it Check each running container’s docker config for privileged status, and extend the mount check to include raw block devices in addition to partitions. How did you verify/test it Verified on a device with privileged and unprivileged containers. Signed-off-by: Nate White <nate@nexthop.ai> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
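Checking the docker config for privileged status can be sketched against `docker inspect` output, which is a JSON array whose entries carry a `HostConfig.Privileged` boolean. The sample below is made-up data and `privileged_containers` is an illustrative helper, not the test's code; it parses a string instead of invoking docker so that it runs anywhere.

```python
import json

# Made-up sample shaped like `docker inspect <id> <id>` output, trimmed
# to the fields used here.
SAMPLE_INSPECT = json.dumps([
    {"Name": "/swss", "HostConfig": {"Privileged": True}},
    {"Name": "/gnmi", "HostConfig": {"Privileged": False}},
])


def privileged_containers(inspect_json):
    """Return the sorted names of containers whose HostConfig marks them privileged."""
    containers = json.loads(inspect_json)
    return sorted(
        entry["Name"].lstrip("/")
        for entry in containers
        if entry.get("HostConfig", {}).get("Privileged", False)
    )


result = privileged_containers(SAMPLE_INSPECT)
```

Reading the flag from the container config avoids inferring privilege from which block devices happen to be mounted, which is the misclassification the description mentions.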
What is the motivation for this PR Use VM_VNI as the outer VNI for PL outbound packets in the VM (non‑floating NIC) scenario. How did you do it Set the PL outbound packet outer VNI to VM_VNI for non‑floating NIC cases. How did you verify/test it Ran DASH PL test on SN4280 light mode testbed; passed. Signed-off-by: Cong Hou <congh@nvidia.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Added a utility to compare flows from the primary and secondary DPUs in an HA configuration.
Tested in PDB:
DUT1:
Flow-table-10
27955202 1 H 10.0.0.11 10.2.0.100 UDP 6789 4567 I A NA NA
27955202 1 U 2603:10e1:100:2::3401:203 fd41:108:20:d107:64:ff71:a00:b UDP 4567 6789 R A NA NA
No. of flows: 2
(Pdb) p flow_tables
{'Flow-table-10': [
{'Session': '27955202', 'LookupId': '1', 'Dir': 'H', 'SIP': '10.0.0.11', 'DIP': '10.2.0.100', 'Proto': 'UDP', 'Sport': '6789', 'Dport': '4567', 'Role': 'I', 'Action': 'A'},
{'Session': '27955202', 'LookupId': '1', 'Dir': 'U', 'SIP': '2603:10e1:100:2::3401:203', 'DIP': 'fd41:108:20:d107:64:ff71:a00:b', 'Proto': 'UDP', 'Sport': '4567', 'Dport': '6789', 'Role': 'R', 'Action': 'A'}]}
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
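A parser producing the structure shown in the PDB output can be sketched as follows. This is not the utility added by the commit: `parse_flow_tables` and `diff_flows` are hypothetical names, the column order is inferred from the sample above, and comparing flows on every field except the session ID is an assumption about what "same flow" means across DPUs.

```python
FIELDS = ["Session", "LookupId", "Dir", "SIP", "DIP",
          "Proto", "Sport", "Dport", "Role", "Action"]


def parse_flow_tables(text):
    """Parse flow-table text into {table_name: [flow_dict, ...]}."""
    tables = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("Flow-table-"):
            current = tables.setdefault(tokens[0], [])
        elif current is not None and tokens[0].isdigit() and len(tokens) >= len(FIELDS):
            # zip() stops at the last named field, dropping trailing NA columns.
            current.append(dict(zip(FIELDS, tokens)))
    return tables


def diff_flows(primary, secondary):
    """Return flow keys present on only one side (session ID ignored)."""
    def keys(flows):
        return {tuple(f[name] for name in FIELDS[1:]) for f in flows}
    return keys(primary) - keys(secondary), keys(secondary) - keys(primary)


SAMPLE = """Flow-table-10
27955202 1 H 10.0.0.11 10.2.0.100 UDP 6789 4567 I A NA NA
27955202 1 U 2603:10e1:100:2::3401:203 fd41:108:20:d107:64:ff71:a00:b UDP 4567 6789 R A NA NA
No. of flows: 2
"""

flow_tables = parse_flow_tables(SAMPLE)
only_primary, only_secondary = diff_flows(
    flow_tables["Flow-table-10"], flow_tables["Flow-table-10"][:1])
```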
…d_timer_accuracy (sonic-net#22250)
What is the motivation for this PR?
The error message “IndexError: list index out of range” is unclear and potentially misleading.
How did you do it?
Modify the output error message, check the timestamp sample before accessing the list
How did you verify/test it?
run elastic test
202505 https://elastictest.org/scheduler/testplan/69841fb948d58f009f2c7154
202511 https://elastictest.org/scheduler/testplan/69842120bb7d1dad4c803d16
inject failure https://elastictest.org/scheduler/testplan/698482c1bb7d1dad4c803dc8
> pytest.fail(
      "Too many iterations failed to collect PFCWD timestamps. "
      "Detect time samples: {}/{} (failures: {}), Restore time samples: {}/{} (failures: {}). "
      "Required at least {} samples. This may indicate environment or timing issues.".format(
          detect_count, ITERATION_NUM, detect_failures,
          restore_count, ITERATION_NUM, restore_failures,
          required_samples))
E Failed: Too many iterations failed to collect PFCWD timestamps. Detect time samples: 0/20 (failures: 20)
Signed-off-by: xuliping <xuliping@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
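Checking the sample count before indexing can be sketched like this. It is an illustration, not the test's code: `validate_samples` and `TooFewSamples` are hypothetical names, and a plain exception is used in place of pytest.fail so the snippet runs without pytest. The message text follows the one quoted above.

```python
class TooFewSamples(Exception):
    """Raised instead of letting a later list index fail with IndexError."""


def validate_samples(detect_times, restore_times, iteration_num, required_samples):
    """Fail with a descriptive message when too few timestamps were collected."""
    detect_count = len(detect_times)
    restore_count = len(restore_times)
    if detect_count < required_samples or restore_count < required_samples:
        raise TooFewSamples(
            "Too many iterations failed to collect PFCWD timestamps. "
            "Detect time samples: {}/{} (failures: {}), "
            "Restore time samples: {}/{} (failures: {}). "
            "Required at least {} samples.".format(
                detect_count, iteration_num, iteration_num - detect_count,
                restore_count, iteration_num, iteration_num - restore_count,
                required_samples))


# No samples collected in 20 iterations (sample numbers are made up).
try:
    validate_samples([], [], iteration_num=20, required_samples=10)
    message = None
except TooFewSamples as exc:
    message = str(exc)
```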
…-net#22494)
What is the motivation for this PR?
autorestart/test_container_autorestart.py::test_containers_autorestart: this case is related to container restart, so the memory checker needs to be skipped during the test.
simulate_small_var_log_partition: the memory checker collects memory before and after the case, so the test environment must be stable before the case starts. It is better to add the interface-up check and wait for BGP to come up in the config reload of the simulate_small_var_log_partition fixture:
config_reload(duthost, safe_reload=True)
config_reload(duthost, safe_reload=True, check_intf_up_ports=True, wait_for_bgp=True)
How did you do it?
Skip the checker in autorestart/test_container_autorestart.py.
Wait for the device to be stable before testing.
Signed-off-by: xuliping <xuliping@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
Signed-off-by: SRAVANI KANASANI <kanasanis@google.com> Co-authored-by: kishanps <kishanps@google.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
… cisco_hwskus (sonic-net#22437) What is the motivation for this PR: Fix the failure Cannot identify DUT ASIC type in test_qos_sai.py How did you do it: Added Cisco-8101-32FH-O to ansible variable cisco-8000_gr2_hwskus and cisco_hwskus How did you verify/test it: Run test_qos_sai.py. Signed-off-by: Eduard Yakubchyk <eyakubch@cisco.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…2482)
What is the motivation for this PR:
The SAI spec states that the DIP_LINK_LOCAL/SIP_LINK_LOCAL drop reasons are only for IPv4, so we need to skip the tests for V6 topos.
How did you do it:
Revert the V6 topo changes and skip the test for everyone by removing the Broadcom ASIC condition.
How did you verify/test it:
Tests are skipped correctly on V6 topos now.
Signed-off-by: markxiao <markxiao@arista.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…onic-net#22431) What is the motivation for this PR: The test ospf/test_ospf_bfd.py was failing during teardown with a YANG validation error because a stale COPP_TRAPospf entry remained in CONFIG_DB when the test was skipped. How did you do it: Add config_reload to trap_copp_ospf teardown in tests/ospf/conftest.py so CONFIG_DB is restored to a clean state even when downstream fixtures skip. How did you verify/test it: The test was skipped as expected and the YANG validation failure was resolved. Signed-off-by: manish1 <manish1@arista.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
b15859c to d1b43c4
My actions made a mess of my PR. I will re-raise a different PR. Closing this.
/azp run
Reraised this PR as: #22711
Description of PR
Summary:
Adds checks for both interface counters and drop counters to the continuous link flap test. The goal is to catch any problem that makes the counters rise drastically while interfaces are flapping or going down.
Type of change
Back port request
Approach
What is the motivation for this PR?
We saw issues where ports were flapping and the drop counters were rising drastically. This PR modifies the existing test case to catch this condition.
How did you do it?
Added drop and interface counter checks to the test_cont_link_flap test.
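One way such a check could look is a before/after comparison of counter snapshots. The sketch below is an assumption about the shape of the check, not the code in this PR: the helper names, the counter field names (RX_ERR, RX_DRP, TX_DRP), the sample values, and the threshold are all hypothetical.

```python
def counter_deltas(before, after, fields):
    """Per-port increase of each counter field between two snapshots."""
    deltas = {}
    for port, after_values in after.items():
        base = before.get(port, {})
        deltas[port] = {
            field: after_values.get(field, 0) - base.get(field, 0)
            for field in fields
        }
    return deltas


def ports_over_threshold(deltas, threshold):
    """Ports where any tracked counter grew by more than threshold."""
    return sorted(
        port for port, values in deltas.items()
        if any(value > threshold for value in values.values())
    )


FIELDS = ["RX_ERR", "RX_DRP", "TX_DRP"]  # hypothetical field names

# Snapshots taken before and after the flap loop (made-up values).
before = {
    "Ethernet0": {"RX_ERR": 0, "RX_DRP": 10, "TX_DRP": 0},
    "Ethernet4": {"RX_ERR": 0, "RX_DRP": 5, "TX_DRP": 0},
}
after = {
    "Ethernet0": {"RX_ERR": 0, "RX_DRP": 12, "TX_DRP": 0},
    "Ethernet4": {"RX_ERR": 0, "RX_DRP": 50005, "TX_DRP": 0},
}

deltas = counter_deltas(before, after, FIELDS)
suspect_ports = ports_over_threshold(deltas, threshold=1000)
```

Comparing deltas rather than absolute values keeps the check independent of whatever the counters held before the test started.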
How did you verify/test it?
Ran it on our platform: