[platform] Add test_hw_watchdog_remaining_time to validate timeout range (#22491) #22784

Open
yxieca wants to merge 6 commits into sonic-net:master from yxieca:test/hw-watchdog-remaining-time

yxieca (Collaborator) commented Mar 6, 2026

Description of PR

Summary: Add test_hw_watchdog_remaining_time to verify that the hardware watchdog remaining timeout falls within a sane range (30-300 seconds). Platforms occasionally misconfigure absurdly short or long watchdog timeouts, which can cause either premature reboots (<30s) or ineffective watchdog protection (>300s).

Fixes #22491

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Issue #22491 identified a test gap: there is no validation that the hardware watchdog timeout is within a reasonable range. A too-short timeout (<30s) can cause premature reboots during normal load spikes, while a too-long timeout (>300s) renders the watchdog ineffective at recovering from hangs.

This gap was identified during review of PR #22361, which added test_hw_watchdog_supported and test_hw_watchdog_armed.

How did you do it?

Added test_hw_watchdog_remaining_time to tests/platform_tests/test_hw_watchdog.py:

  • Runs watchdogutil status and parses the "Time remaining: N seconds" output
  • Asserts the remaining time is within 30-300 seconds
  • Skips gracefully when watchdog is unarmed or unsupported
  • Extracted _parse_remaining_time() helper for clean parsing with regex
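
The parsing and range check described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function and constant names are hypothetical, the sample output is made up for the example, and the 30-300 second window is assumed to be inclusive at both ends, since the description only flags values below 30 or above 300.

```python
import re

# Bounds taken from the PR description; the constant names are illustrative.
MIN_REMAINING_SECONDS = 30
MAX_REMAINING_SECONDS = 300

# Matches the "Time remaining: N seconds" line quoted in the description.
_REMAINING_RE = re.compile(r"Time remaining:\s*(\d+)\s*seconds")


def parse_remaining_time(status_output):
    """Return the remaining seconds as an int, or None if the line is absent."""
    match = _REMAINING_RE.search(status_output)
    return int(match.group(1)) if match else None


def remaining_time_in_range(seconds):
    """True when the value lies inside the assumed inclusive 30-300 s window."""
    return MIN_REMAINING_SECONDS <= seconds <= MAX_REMAINING_SECONDS


# Made-up sample output, used only to exercise the helpers.
sample = "Status: Armed\nTime remaining: 178 seconds\n"
remaining = parse_remaining_time(sample)
assert remaining is not None and remaining_time_in_range(remaining)
```

Returning None instead of raising when the line is missing lets the caller decide whether that case is a skip or a failure.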

How did you verify/test it?

This test requires a physical platform with hardware watchdog support. Syntax and lint checks pass locally.

platform_tests/test_hw_watchdog.py::test_hw_watchdog_remaining_time[str2-8101c1-09] PASSED [100%]
DEBUG:tests.conftest:[log_custom_msg] item: <Function test_hw_watchdog_remaining_time

Any platform specific information?

Test applies to all physical platforms with watchdogutil support. Marked device_type('physical').

Supported testbed topology if it's a new test case?

Any. This test only interacts with the DUT via watchdogutil status.

…nge (sonic-net#22491)

Add test to verify the hardware watchdog remaining timeout falls within
a sane range of 30-300 seconds. Platforms occasionally misconfigure
absurdly short or long watchdog timeouts, causing premature reboots
(<30s) or ineffective watchdog protection (>300s).

- Parse "Time remaining: N seconds" from watchdogutil status output
- Skip gracefully when watchdog is unarmed or unsupported
- Extract parsing logic to _parse_remaining_time helper for clarity

Fixes sonic-net#22491

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
yxieca (Collaborator, Author) commented Mar 6, 2026

This PR was raised by an AI agent on behalf of Ying Xie.

mssonicbld (Collaborator) commented:

/azp run

azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

Add fixture that temporarily arms the watchdog if unarmed, and
restores the original state after the test. This ensures the
remaining time test runs on platforms where watchdog is not
armed by default.

- ensure_watchdog_armed: arms if needed, yields was_armed, disarms in cleanup
- test_hw_watchdog_remaining_time now uses the fixture instead of skipping

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
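
The arm-then-restore behaviour described in this commit might look roughly like the sketch below. It is written as a plain context manager so it runs without a testbed; the actual change is a pytest fixture whose code is not shown here. The `run` and `is_armed` hooks and the `FakeDut` class are hypothetical stand-ins, and disarming only when the helper itself armed the watchdog is an assumption based on "restores the original state".

```python
from contextlib import contextmanager


@contextmanager
def ensure_watchdog_armed(run, is_armed):
    """Arm the watchdog if needed and restore the original state on exit.

    `run` executes a command on the DUT and returns its stdout;
    `is_armed` interprets the status text. Both are hypothetical hooks.
    """
    was_armed = is_armed(run("watchdogutil status"))
    if not was_armed:
        run("watchdogutil arm")
    try:
        yield was_armed
    finally:
        # Assumption: only disarm when this helper did the arming.
        if not was_armed:
            run("watchdogutil disarm")


class FakeDut:
    """In-memory stand-in for a device, used only to exercise the sketch."""

    def __init__(self, armed):
        self.armed = armed
        self.commands = []

    def run(self, cmd):
        self.commands.append(cmd)
        if cmd == "watchdogutil arm":
            self.armed = True
        elif cmd == "watchdogutil disarm":
            self.armed = False
        return "Status: Armed" if self.armed else "Status: Unarmed"


def is_armed(output):
    return "Status: Armed" in output.splitlines()


dut = FakeDut(armed=False)
with ensure_watchdog_armed(dut.run, is_armed) as was_armed:
    assert dut.armed and not was_armed
assert not dut.armed  # original unarmed state restored
```

Putting the restore step in `finally` means the original state comes back even when the body raises, which is the same guarantee the teardown half of a yield-style fixture gives.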
mssonicbld (Collaborator) commented:

/azp run

azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

Replace fragile substring check ("armed" in / "unarmed" not in)
with _is_watchdog_armed() that matches "Status: Armed" as a full
line. The old check was confusing because "armed" is a substring
of "unarmed".

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
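
To illustrate why the substring check was fragile and what a full-line match looks like, here is a small sketch. The function name and regex are illustrative, not copied from the PR, and the status strings are assumed to follow the "Status: Armed" / "Status: Unarmed" form mentioned in the commit message.

```python
import re

# Anchored on both ends so only an exact "Status: Armed" line matches.
_ARMED_LINE_RE = re.compile(r"^Status:\s*Armed\s*$", re.MULTILINE)


def is_watchdog_armed(status_output):
    """True only when some line of the output reads 'Status: Armed'."""
    return _ARMED_LINE_RE.search(status_output) is not None


unarmed = "Status: Unarmed\n"
armed = "Status: Armed\nTime remaining: 120 seconds\n"

# The substring pitfall: "armed" is contained in "unarmed".
assert "armed" in unarmed.lower()

# The anchored match tells the two states apart.
assert is_watchdog_armed(armed)
assert not is_watchdog_armed(unarmed)
```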
mssonicbld (Collaborator) commented:

/azp run

azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

Some platform drivers always return 0 from get_remaining_time()
even though the watchdog is armed and functional (confirmed on
Arista 7260). Treat remaining_time==0 as a skip with warning
rather than a test failure, since the watchdog itself works.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
mssonicbld (Collaborator) commented:

/azp run

azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld (Collaborator) commented:

/azp run

azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

…g 0s"

This reverts commit 72461bd.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
yxieca force-pushed the test/hw-watchdog-remaining-time branch from 4e99c19 to 111c473 on March 7, 2026 00:28
mssonicbld (Collaborator) commented:

/azp run

azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

On VS and other platforms without hardware watchdog, watchdogutil
returns rc=1. Skip with a clear message instead of failing, since
the test is only meaningful on physical hardware.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
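
The skip-on-failure decision in this commit can be sketched as a small pure function. The function name and the message wording are hypothetical, and treating every non-zero return code as "no hardware watchdog" is an assumption; the commit message itself only mentions rc=1.

```python
def watchdog_skip_reason(rc, stdout):
    """Return a skip message when `watchdogutil status` failed, else None.

    Assumption: any non-zero return code means the platform has no usable
    hardware watchdog (the commit message only mentions rc=1 on VS).
    """
    if rc != 0:
        return (
            "watchdogutil status returned rc={}; "
            "hardware watchdog not available on this platform".format(rc)
        )
    return None


# A virtual-switch style result leads to a skip rather than a failure.
assert watchdog_skip_reason(1, "") is not None
# A successful status call lets the test continue.
assert watchdog_skip_reason(0, "Status: Armed") is None
```

In a pytest test, a non-None result would typically be passed to `pytest.skip(...)` before any assertions on the output run.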
mssonicbld (Collaborator) commented:

/azp run

azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

Development

Successfully merging this pull request may close these issues.

test gap: Add test_hw_watchdog_remaining_time to validate watchdog timeout range

2 participants