Issue 21168: [sflow] wait_for() instead of time.sleep() #21195
anders-nexthop wants to merge 3 commits into sonic-net:master
Conversation
New conflicts came in just this morning; I've updated the PR accordingly.
Thanks for the review @yxieca. I have addressed your review comments and fixed the merge conflicts. The test is passing in CI/CD, but there appear to be unrelated test failures cropping up in other areas. I will re-run the tests later to see if I can get everything passing.
@anders-nexthop can you address the merge conflict?
I have fixed the merge conflicts. I see that the test is passing now, although other tests are not (they look unrelated to me).
test_sflow.py uses sleep() to orchestrate test steps, which is flaky and prone to timing issues. Replace sleep() coordination with wait_for() where relevant. Use thread events to wait for collector threads to be ready. Signed-off-by: Anders Linn <anders@nexthop.ai>
Description of PR
Summary:
test_sflow.py uses sleep() to orchestrate test steps, which is flaky and prone to timing issues. Replace sleep() coordination with wait_for() where relevant. Use thread events to wait for collector threads to be ready.
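The core of the change is a polling pattern along these lines. This is a minimal sketch of the idea; the actual helper used by sonic-mgmt may differ in name and signature:

```python
import time

def wait_until(timeout, interval, condition, *args, **kwargs):
    """Poll `condition` until it returns truthy or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout,
    instead of blindly sleeping for a fixed duration.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Unlike a fixed sleep(), the wait returns the moment the condition holds, so the test is both faster in the common case and tolerant of slow environments.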
Closes #21168
Also, there's an issue with the SYSTEM_READY status not being set correctly, which causes hsflowd to wait 180 s before sending any samples. This causes intermittent failures in the first few test cases, and is presumably the reason this test is being skipped in Issue 21701. This PR addresses that problem by adding an explicit wait_until for the condition, so that if it is never met we simply match the timeout used by hsflowd and the test runs as normal.
Closes #21701
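A sketch of that wait, assuming the status is readable from STATE_DB via sonic-db-cli; the exact key and the duthost fixture shape are assumptions for illustration, not taken from the PR diff:

```python
import time

def wait_for_system_ready(duthost, timeout=180, interval=5):
    """Poll until the DUT reports SYSTEM_READY as UP.

    The timeout matches the 180 s grace period hsflowd applies, so if
    the status is never set the test degrades to hsflowd's own wait
    instead of failing outright.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = duthost.shell(
            "sonic-db-cli STATE_DB hget 'SYSTEM_READY|SYSTEM_STATE' Status")
        if out["stdout"].strip() == "UP":
            return True
        time.sleep(interval)
    return False
```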
Type of change
Back port request
Approach
What is the motivation for this PR?
I was having problems getting this test to pass on our testing infrastructure. I ended up having to increase several sleep() call durations to get it working, which was not something we wanted to upstream and was also just kicking the can down the road. Due to the nature of the feature under test, there will always be some variability, but a lot of that can be reduced by figuring out what actual conditions we need to wait on. The result should be a little faster too, since we don't end up waiting when we don't have to.
How did you do it?
I changed the main test cases to wait_for the results of the PTF script, so that if the sample count falls outside the accepted bounds the check has a chance to retry (since sampling is non-deterministic, we expect occasional test runs to fall outside the accepted bounds).
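A retry wrapper for that check might look like the following; run_ptf and within_bounds are hypothetical placeholders for the PTF invocation and the bounds check:

```python
def check_with_retry(run_ptf, within_bounds, attempts=3):
    """Re-run the sampling check a few times before failing.

    sFlow sampling is probabilistic, so a single run can legitimately
    land outside the accepted bounds; retrying distinguishes statistical
    noise from a real failure.
    """
    last = None
    for _ in range(attempts):
        last = run_ptf()
        if within_bounds(last):
            return last
    raise AssertionError(
        "sample count out of bounds after %d runs: %r" % (attempts, last))
```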
I added logic to verify that the sflow container is up, and that hsflowd is up, after configuring the feature (so that the PTF script is guaranteed to run only after the sflow feature on the test device is up and running). I added thread synchronization to the collector threads, so that the main thread will wait for the collectors to start up before sending traffic.
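The collector-side synchronization can be sketched with threading.Event. Binding to port 0 is purely illustrative; the real test listens on the configured collector ports:

```python
import socket
import threading

def collector(ready_evt, stop_evt, samples, port=0):
    """UDP collector that signals readiness only after its socket is bound."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.settimeout(0.2)
    ready_evt.set()                      # safe for the main thread to send now
    while not stop_evt.is_set():
        try:
            samples.append(sock.recvfrom(2048)[0])
        except socket.timeout:
            continue
    sock.close()

# Main thread: start the collector, then block until it is actually listening.
ready, stop, samples = threading.Event(), threading.Event(), []
t = threading.Thread(target=collector, args=(ready, stop, samples))
t.start()
assert ready.wait(timeout=10), "collector never became ready"
# ... send sampled traffic here ...
stop.set()
t.join()
```

Without the Event, traffic sent before the socket is bound is silently lost, which is exactly the kind of race a fixed sleep() only papers over.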
For the flow sample case, I added logic to count the number of samples seen so far and wait dynamically based on whether sample packets are arriving or not. If no packets ever arrive (if the device is not responding, for instance) the test times out and fails. As long as packets are arriving the collectors will keep waiting indefinitely. If the expected number of packets is seen, and more packets keep arriving, the test will wait for a short period to try and catch cases where too many packets are being sent.
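That dynamic wait can be sketched as follows; count_fn stands in for however the test reads the collector's running sample count:

```python
import time

def wait_for_sample_count(count_fn, expected, idle_timeout=30, settle=2,
                          poll=0.5):
    """Keep waiting while samples are still arriving.

    Fails only if samples stop arriving before `expected` is reached.
    Once reached, waits `settle` seconds more to catch cases where too
    many samples are being sent, then returns the final count.
    """
    last = count_fn()
    last_change = time.monotonic()
    while True:
        cur = count_fn()
        if cur != last:
            last, last_change = cur, time.monotonic()
        if cur >= expected:
            time.sleep(settle)           # window to observe over-sampling
            return count_fn()
        if time.monotonic() - last_change > idle_timeout:
            raise TimeoutError(
                "samples stopped arriving at %d/%d" % (cur, expected))
        time.sleep(poll)
```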
How did you verify/test it?
Verified that the test passes successfully with the changes, and that no unexpected log messages are seen. Stepped through the new logic in a debugger to manually verify that the codepaths are working as intended.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation