[8.18] (backport #17028) Smoke test os retry event assertions #17066
Motivation/summary
Smoke tests intermittently fail when asserting events have been indexed to ES:
The failures were initially believed to be caused by a data race in which the smoke test read events before they were flushed and indexed. Retries and a wait were added to the event assertion code so that the tests do not fail right away. The `flush_interval` was also decreased to reduce the amount of time the smoke test has to wait and retry.

The retry function was adapted from the buildkite scripts found here.
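For reviewers who want a picture of the control flow, a retry-with-wait helper of the kind described above might look roughly like the sketch below. This is not the code from this PR or from the buildkite scripts: the function name `retry_with_wait`, its argument order, and the fixed wait between attempts are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's actual function): run a command until it
# succeeds or the attempts are exhausted, sleeping between attempts.
# Usage: retry_with_wait <max_attempts> <wait_seconds> <command> [args...]
retry_with_wait() {
  local max_attempts=$1
  local wait_seconds=$2
  shift 2
  local attempt=1
  local status=0
  while true; do
    # Succeed as soon as the wrapped command returns 0.
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "attempt ${attempt}/${max_attempts} failed with status ${status}, giving up" >&2
      return "$status"
    fi
    echo "attempt ${attempt}/${max_attempts} failed with status ${status}, retrying in ${wait_seconds}s" >&2
    sleep "$wait_seconds"
    attempt=$((attempt + 1))
  done
}
```

With a helper like this, an event assertion would be wrapped as `retry_with_wait 5 2 some_assertion`, where `some_assertion` stands in for whatever function checks that the events were indexed. The helper returns the last failing exit status, so a persistent failure still fails the test.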
Please let me know if you think we need to adjust the number of retries, wait times, or the `flush_interval`.

Checklist
For functional changes, consider:
How to test these changes
The below script was used to validate the retry control flow locally:
The smoke tests os workflow was run multiple times

Validated logs show missed events being retried.
example for one retry
example with multiple retries
Related issues
Closes #16905
This is an automatic backport of pull request #17028 done by Mergify.