Issue 21168: [sflow] wait_for() instead of time.sleep() #21195
anders-nexthop wants to merge 3 commits into sonic-net:master
Conversation
New conflicts came in just this morning; I've updated the PR accordingly.
Thanks for the review @yxieca. I have addressed your review comments and fixed the merge conflicts. The test is passing in CI/CD, but there appear to be unrelated test failures cropping up in other areas. I will re-run the tests later to see if I can get everything passing.
@anders-nexthop can you address the merge conflict?
I have fixed the merge conflicts. I see that the test is passing now, although other tests are not (they look unrelated to me).
test_sflow.py uses sleep() to orchestrate test steps, which is flaky and prone to timing issues. Replace sleep() coordination with wait_for() where relevant. Use thread events to wait for collector threads to be ready. Signed-off-by: Anders Linn <anders@nexthop.ai>
Description of PR
Summary:
test_sflow.py uses sleep() to orchestrate test steps, which is flaky and prone to timing issues. Replace sleep() coordination with wait_for() where relevant. Use thread events to wait for collector threads to be ready.
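The core of the change is a polling pattern along these lines. This is a minimal sketch of the idea; the actual helper used by sonic-mgmt may differ in name and signature:

```python
import time

def wait_until(timeout, interval, condition, *args, **kwargs):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout,
    instead of blindly sleeping for a fixed duration.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Unlike a fixed sleep(), the wait returns the moment the condition holds, so the test is both faster in the common case and tolerant of slow environments.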
Closes #21168
Also, there's an issue with the SYSTEM_READY status not being set correctly, which causes hsflowd to wait 180 s before sending any samples. This causes intermittent failures in the first few test cases, and is presumably the reason this test is being skipped in Issue 21701. This PR addresses that problem by adding an explicit wait_until for the condition, so that if it is never met we simply match the timeout used by hsflowd and the test runs as normal.
Closes #21701
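A sketch of that wait, assuming the status is readable from STATE_DB via sonic-db-cli; the exact key and the duthost fixture shape are assumptions for illustration, not taken from the PR diff:

```python
import time

def wait_for_system_ready(duthost, timeout=180, interval=5):
    """Poll until the DUT reports SYSTEM_READY as UP.

    The timeout matches the 180 s grace period hsflowd applies, so if
    the status is never set the test degrades to hsflowd's own wait
    instead of failing outright.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = duthost.shell(
            "sonic-db-cli STATE_DB hget 'SYSTEM_READY|SYSTEM_STATE' Status")
        if out["stdout"].strip() == "UP":
            return True
        time.sleep(interval)
    return False
```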
Type of change
Back port request
Approach
What is the motivation for this PR?
I was having problems getting this test to pass on our testing infrastructure. I ended up having to increase several sleep() call durations to get it working, which was not something we wanted to upstream and was also just kicking the can down the road. Due to the nature of the feature under test, there will always be some variability, but a lot of that can be reduced by figuring out what actual conditions we need to wait on. The result should be a little faster too, since we don't end up waiting when we don't have to.
How did you do it?
I changed the main test cases to wait_for the results of the PTF script, so that if the sample count falls outside the accepted bounds the check has a chance to retry (since sampling is non-deterministic, we expect occasional test runs to fall outside the accepted bounds).
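A retry wrapper for that check might look like the following; run_ptf and within_bounds are hypothetical placeholders for the PTF invocation and the bounds check:

```python
def check_with_retry(run_ptf, within_bounds, attempts=3):
    """Re-run the sampling check a few times before failing.

    sFlow sampling is probabilistic, so a single run can legitimately
    land outside the accepted bounds; retrying distinguishes statistical
    noise from a real failure.
    """
    last = None
    for _ in range(attempts):
        last = run_ptf()
        if within_bounds(last):
            return last
    raise AssertionError(
        "sample count out of bounds after %d runs: %r" % (attempts, last))
```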
I added logic to verify that the sflow container is up, and that hsflowd is up, after configuring the feature (so that the PTF script is guaranteed to run only after the sflow feature on the test device is up and running). I added thread synchronization to the collector threads, so that the main thread will wait for the collectors to start up before sending traffic.
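The collector-side synchronization can be sketched with threading.Event. Binding to port 0 is purely illustrative; the real test listens on the configured collector ports:

```python
import socket
import threading

def collector(ready_evt, stop_evt, samples, port=0):
    """UDP collector that signals readiness only after its socket is bound."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.settimeout(0.2)
    ready_evt.set()                      # safe for the main thread to send now
    while not stop_evt.is_set():
        try:
            samples.append(sock.recvfrom(2048)[0])
        except socket.timeout:
            continue
    sock.close()

# Main thread: start the collector, then block until it is actually listening.
ready, stop, samples = threading.Event(), threading.Event(), []
t = threading.Thread(target=collector, args=(ready, stop, samples))
t.start()
assert ready.wait(timeout=10), "collector never became ready"
# ... send sampled traffic here ...
stop.set()
t.join()
```

Without the Event, traffic sent before the socket is bound is silently lost, which is exactly the kind of race a fixed sleep() only papers over.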
For the flow sample case, I added logic to count the number of samples seen so far and wait dynamically based on whether sample packets are arriving or not. If no packets ever arrive (if the device is not responding, for instance) the test times out and fails. As long as packets are arriving the collectors will keep waiting indefinitely. If the expected number of packets is seen, and more packets keep arriving, the test will wait for a short period to try and catch cases where too many packets are being sent.
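That dynamic wait can be sketched as follows; count_fn stands in for however the test reads the collector's running sample count:

```python
import time

def wait_for_sample_count(count_fn, expected, idle_timeout=30, settle=2,
                          poll=0.5):
    """Keep waiting while samples are still arriving.

    Fails only if samples stop arriving before `expected` is reached.
    Once reached, waits `settle` seconds more to catch cases where too
    many samples are being sent, then returns the final count.
    """
    last = count_fn()
    last_change = time.monotonic()
    while True:
        cur = count_fn()
        if cur != last:
            last, last_change = cur, time.monotonic()
        if cur >= expected:
            time.sleep(settle)           # window to observe over-sampling
            return count_fn()
        if time.monotonic() - last_change > idle_timeout:
            raise TimeoutError(
                "samples stopped arriving at %d/%d" % (cur, expected))
        time.sleep(poll)
```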
How did you verify/test it?
Verified that the test passes successfully with the changes, and that no unexpected log messages are seen. Stepped through the new logic in a debugger to manually verify that the codepaths are working as intended.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation