Skip to content

Issue 21168: [sflow] wait_for() instead of time.sleep()#21195

Open
anders-nexthop wants to merge 3 commits intosonic-net:masterfrom
nexthop-ai:anders.21168.sflow-wait-for
Open

Issue 21168: [sflow] wait_for() instead of time.sleep()#21195
anders-nexthop wants to merge 3 commits intosonic-net:masterfrom
nexthop-ai:anders.21168.sflow-wait-for

Conversation

@anders-nexthop
Copy link
Contributor

@anders-nexthop anders-nexthop commented Nov 4, 2025

Description of PR

Summary:
test_sflow.py uses sleep() to orchestrate test steps, which is flaky and prone to timing issues. Replace sleep() coordination with wait_for() where relevant. Use thread events to wait for collector threads to be ready.

Closes #21168

Also, there's an issue with SYSTEM_READY status not being set correctly, which is causing hsflowd to wait around for 180s before sending any samples. This causes intermittent failures in the first few test cases, and is presumably the issue for which this test is being skipped in Issue 21701. This PR addresses that problem by adding an explicit wait_until for that condition, so that if the condition is never met we will just match the timeout being used by hsflowd and the test will run as normal.

Closes #21701

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

I was having problems getting this test to pass on our testing infrastructure. I ended up having to increase several sleep() call durations to get it working, which was not something we wanted to upstream and was also just kicking the can down the road. Due to the nature of the feature under test, there will always be some variability, but a lot of that can be reduced by figuring out what actual conditions we need to wait on. The result should be a little faster too, since we don't end up waiting when we don't have to.

How did you do it?

I changed the main test cases to wait_for the results of the PTF script, so that if sample hits outside the accepted bounds it has a chance to retry (since the sampling is non-deterministic, we expect to occasionally have test runs which fall outside the accepted bounds).

I added logic to verify that the sflow container is up, and that hsflowd is up, after configuring the feature (so that the PTF script is guaranteed to run after the sflow feature on the test device is up and running).

I added thread synchronization to the collector threads, so that the main thread will wait for the collectors to start up before sending traffic.

For the flow sample case, I added logic to count the number of samples seen so far and wait dynamically based on whether sample packets are arriving or not. If no packets ever arrive (if the device is not responding, for instance) the test times out and fails. As long as packets are arriving the collectors will keep waiting indefinitely. If the expected number of packets is seen, and more packets keep arriving, the test will wait for a short period to try and catch cases where too many packets are being sent.

How did you verify/test it?

Verified that the test passes successfully with the changes, and that no unexpected log messages are seen. Stepped through the new logic in a debugger to manually verify that the codepaths are working as intended.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop anders-nexthop marked this pull request as ready for review November 4, 2025 17:55
@anders-nexthop anders-nexthop force-pushed the anders.21168.sflow-wait-for branch from c0be997 to fd45c63 Compare November 17, 2025 19:00
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@anders-nexthop anders-nexthop force-pushed the anders.21168.sflow-wait-for branch from fd45c63 to 4797038 Compare November 17, 2025 19:08
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop anders-nexthop force-pushed the anders.21168.sflow-wait-for branch from 4797038 to f408447 Compare November 18, 2025 22:33
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

Nits / questions:

  1. now uses . If this param is already a list (not a string), this may break. Can we confirm the type and guard accordingly?
  2. New logic assumes enabled_sflow_interfaces is present; otherwise it uses all interfaces. Looks OK but just confirming intent.

Also, PR currently has merge conflicts (mergeState=DIRTY). Please resolve the conflicts.

  1. In analyze_sflow_sample() the existing code already assumes the param is a list, the current changes are consistent with that assumption. But you're right, we should guard against this, I'll add a check.

  2. There is actually no defined behavior if enabled_sflow_interfaces is not present, and this would crash in analyze_sflow_sample() because self.enabled_intf would be undefined. This really only happens in the polling test case, which doesn't use any of the flow sample logic, but we should have a sane default anyway. I was defaulting to "all interfaces", but we can default to "no interfaces" instead, which is more inline with what the original code implied.

New conflicts came in just this morning, I've updated the PR acordingly.

@anders-nexthop
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

Nits / questions:

  1. now uses . If this param is already a list (not a string), this may break. Can we confirm the type and guard accordingly?
  2. New logic assumes enabled_sflow_interfaces is present; otherwise it uses all interfaces. Looks OK but just confirming intent.

Also, PR currently has merge conflicts (mergeState=DIRTY). Please resolve the conflicts.

Thanks for the review @yxieca I have addressed your review comments and fixed the merge conflicts. The test is passing in CI/CD, but it seems there are unrelated test failures cropping up in other areas. I will try to re-run the tests later to see if I can get everything passing.

@yxieca
Copy link
Collaborator

yxieca commented Feb 27, 2026

@anders-nexthop can you address the merge conflict?

@anders-nexthop anders-nexthop force-pushed the anders.21168.sflow-wait-for branch from e84c38e to 044784b Compare February 28, 2026 00:22
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

@anders-nexthop can you address the merge conflict?

I have fixed the merge conflicts. I see that the test is passing now, although other tests are not (they look unrelated to me).

@anders-nexthop anders-nexthop force-pushed the anders.21168.sflow-wait-for branch from 044784b to bdc176b Compare March 2, 2026 16:29
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@anders-nexthop
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

test_sflow.py uses sleep() to orchestrate test steps, which is flaky and
prone to timing issues. Replace sleep() coordination with wait_for()
where relevant. Use thread events to wait for collector threads to be
ready.

Signed-off-by: Anders Linn <anders@nexthop.ai>
Signed-off-by: Anders Linn <anders@nexthop.ai>
Signed-off-by: Anders Linn <anders@nexthop.ai>
@anders-nexthop anders-nexthop force-pushed the anders.21168.sflow-wait-for branch from bdc176b to 0040ae2 Compare March 6, 2026 20:04
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Request for 202511 branch Request to backport a change to 202511 branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Test gap] Need to fix the current failures for test_sflow.py Enhancement: [sflow] wait_for conditions rather than use sleep() for test orchestration

4 participants