[core][autoscaler][v1] deflaky test_autoscaler #52769
Open
Why are these changes needed?
From the logs provided by @kevin85421, test_autoscaler.py has 2 flaky tests.
They both overprovisioned worker nodes (AssertionError: 3 != 2) due to a race between autoscaler.update() and the background NodeLauncher. In particular, the pending_launches counter in the autoscaler is decreased asynchronously by the background NodeLauncher when it launches a pending node. That can cause the pending node to disappear from the view of autoscaler.update(), which then overprovisions a new node.
The previous solution was to add time.sleep(3) between autoscaler.update() calls:
ray/python/ray/tests/test_autoscaler.py
Lines 2245 to 2247 in 8561936
I think we can make it more reliable by using self.waitForNodes() instead.
This PR fixes these two flaky tests by adding self.waitForNodes() between autoscaler.update() calls. It also fixes errors (Runner deserialization error, event summary races) in the previous implementation of testDontScaleDownIdleTimeOutForPlacementGroups.
Before this PR, these 2 tests failed due to the race about once every 200 runs. After this PR, these 2 tests can pass 10000 runs without failures.
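The race and the effect of waiting can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyCluster, update, wait_for_nodes and run are hypothetical names for illustration and are not Ray's autoscaler, NodeLauncher, or test helpers. The toy launcher decrements a pending counter first and makes the node visible only later, so a second update in that window sees neither the pending launch nor the node.

```python
import threading
import time


class ToyCluster:
    """Hypothetical stand-in for the autoscaler plus launcher; not Ray's API."""

    def __init__(self, launch_delay_s=0.02):
        self.lock = threading.Lock()
        self.running = 0           # nodes that are visible as launched
        self.pending_launches = 0  # launches requested but not yet picked up
        self.launch_delay_s = launch_delay_s
        self.threads = []

    def _launch(self):
        # The counter drops before the node becomes visible. During the
        # sleep, the node is counted in neither field.
        with self.lock:
            self.pending_launches -= 1
        time.sleep(self.launch_delay_s)
        with self.lock:
            self.running += 1

    def update(self, target):
        # Request launches based on the current (possibly stale) view.
        with self.lock:
            missing = max(target - (self.running + self.pending_launches), 0)
            self.pending_launches += missing
        for _ in range(missing):
            t = threading.Thread(target=self._launch)
            t.start()
            self.threads.append(t)

    def wait_for_nodes(self, expected, timeout_s=5.0, poll_s=0.005):
        # Poll until the expected number of nodes is visible, or time out.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self.lock:
                if self.running == expected:
                    return
            time.sleep(poll_s)
        raise TimeoutError(f"expected {expected} nodes")

    def join(self):
        for t in self.threads:
            t.join()


def run(wait_between_updates):
    """Call update(2) twice and return the final node count."""
    cluster = ToyCluster()
    cluster.update(2)
    if wait_between_updates:
        cluster.wait_for_nodes(2)
    cluster.update(2)
    cluster.join()
    return cluster.running


if __name__ == "__main__":
    # Without waiting, the count depends on thread timing (2 to 4 nodes).
    print("no wait:", run(wait_between_updates=False))
    # With waiting, the second update sees both nodes and launches nothing.
    print("with wait:", run(wait_between_updates=True))
```

Compared with a fixed sleep, polling for the expected state returns as soon as the nodes are visible and does not depend on guessing how long the launcher takes.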
Related issue number
#52768
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.