[core][autoscaler][v1] deflaky test_autoscaler #52769

Open · rueian wants to merge 2 commits into master from deflaky-autoscaler-v1

Conversation

@rueian (Contributor) commented May 3, 2025

Why are these changes needed?

From the logs provided by @kevin85421, test_autoscaler.py has 2 flaky tests:

[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] =================================== FAILURES ===================================
[2025-04-29T20:28:44Z] ____________________ AutoscalingTest.testConfiguresNewNodes ____________________
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testConfiguresNewNodes>
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z]     def testConfiguresNewNodes(self):
[2025-04-29T20:28:44Z]         config = copy.deepcopy(SMALL_CLUSTER)
[2025-04-29T20:28:44Z]         config["available_node_types"]["worker"]["min_workers"] = 1
[2025-04-29T20:28:44Z]         config_path = self.write_config(config)
[2025-04-29T20:28:44Z]         self.provider = MockProvider()
[2025-04-29T20:28:44Z]         runner = MockProcessRunner()
[2025-04-29T20:28:44Z]         runner.respond_to_call("json .Config.Env", ["[]" for i in range(2)])
[2025-04-29T20:28:44Z]         self.provider.create_node(
[2025-04-29T20:28:44Z]             {},
[2025-04-29T20:28:44Z]             {
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_KIND: NODE_KIND_HEAD,
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE,
[2025-04-29T20:28:44Z]                 TAG_RAY_USER_NODE_TYPE: "head",
[2025-04-29T20:28:44Z]             },
[2025-04-29T20:28:44Z]             1,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]         autoscaler = MockAutoscaler(
[2025-04-29T20:28:44Z]             config_path,
[2025-04-29T20:28:44Z]             LoadMetrics(),
[2025-04-29T20:28:44Z]             MockGcsClient(),
[2025-04-29T20:28:44Z]             max_failures=0,
[2025-04-29T20:28:44Z]             process_runner=runner,
[2025-04-29T20:28:44Z]             update_interval_s=0,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         self.waitForNodes(2)
[2025-04-29T20:28:44Z]         self.provider.finish_starting_nodes()
[2025-04-29T20:28:44Z]         # TODO(rickyx): This is a hack to avoid running into race conditions
[2025-04-29T20:28:44Z]         # within v1 autoscaler. These should no longer be relevant in v2.
[2025-04-29T20:28:44Z]         time.sleep(3)
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         time.sleep(3)
[2025-04-29T20:28:44Z] >       self.waitForNodes(2, tag_filters={TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE})
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:2250: 
[2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes
[2025-04-29T20:28:44Z]     comparison(n, expected, msg="Unexpected node quantity.")
[2025-04-29T20:28:44Z] E   AssertionError: 3 != 2 : Unexpected node quantity.

and

[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] =================================== FAILURES ===================================
[2025-04-29T20:28:44Z] ________ AutoscalingTest.testDontScaleDownIdleTimeOutForPlacementGroups ________
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testDontScaleDownIdleTimeOutForPlacementGroups>
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z]     def testDontScaleDownIdleTimeOutForPlacementGroups(self):
[2025-04-29T20:28:44Z]         config = copy.deepcopy(SMALL_CLUSTER)
[2025-04-29T20:28:44Z]         config["available_node_types"]["head"]["resources"][
[2025-04-29T20:28:44Z]             "CPU"
[2025-04-29T20:28:44Z]         ] = 0  # make the head node not consume any resources.
[2025-04-29T20:28:44Z]         config["available_node_types"]["worker"][
[2025-04-29T20:28:44Z]             "min_workers"
[2025-04-29T20:28:44Z]         ] = 1  # prepare 1 worker upfront.
[2025-04-29T20:28:44Z]         config["idle_timeout_minutes"] = 0.1
[2025-04-29T20:28:44Z]         config_path = self.write_config(config)
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         self.provider = MockProvider()
[2025-04-29T20:28:44Z]         self.provider.create_node(
[2025-04-29T20:28:44Z]             {},
[2025-04-29T20:28:44Z]             {
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_KIND: NODE_KIND_HEAD,
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE,
[2025-04-29T20:28:44Z]                 TAG_RAY_USER_NODE_TYPE: "head",
[2025-04-29T20:28:44Z]             },
[2025-04-29T20:28:44Z]             1,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         runner = MockProcessRunner()
[2025-04-29T20:28:44Z]         lm = LoadMetrics()
[2025-04-29T20:28:44Z]         mock_gcs_client = MockGcsClient()
[2025-04-29T20:28:44Z]         autoscaler = MockAutoscaler(
[2025-04-29T20:28:44Z]             config_path,
[2025-04-29T20:28:44Z]             lm,
[2025-04-29T20:28:44Z]             mock_gcs_client,
[2025-04-29T20:28:44Z]             max_failures=0,
[2025-04-29T20:28:44Z]             process_runner=runner,
[2025-04-29T20:28:44Z]             update_interval_s=0,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         # 1 worker is ready upfront.
[2025-04-29T20:28:44Z]         self.waitForNodes(1, tag_filters=WORKER_FILTER)
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         # Restore min_workers to allow scaling down to 0.
[2025-04-29T20:28:44Z]         config["available_node_types"]["worker"]["min_workers"] = 0
[2025-04-29T20:28:44Z]         self.write_config(config)
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         # Create a placement group with 2 bundles that require 2 workers.
[2025-04-29T20:28:44Z]         placement_group_table_data = gcs_pb2.PlacementGroupTableData(
[2025-04-29T20:28:44Z]             placement_group_id=b"\000",
[2025-04-29T20:28:44Z]             strategy=common_pb2.PlacementStrategy.SPREAD,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]         for i in range(2):
[2025-04-29T20:28:44Z]             bundle = common_pb2.Bundle()
[2025-04-29T20:28:44Z]             bundle.bundle_id.placement_group_id = (
[2025-04-29T20:28:44Z]                 placement_group_table_data.placement_group_id
[2025-04-29T20:28:44Z]             )
[2025-04-29T20:28:44Z]             bundle.bundle_id.bundle_index = i
[2025-04-29T20:28:44Z]             bundle.unit_resources["CPU"] = 1
[2025-04-29T20:28:44Z]             placement_group_table_data.bundles.append(bundle)
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         # Mark the first worker as idle, but it should not be scaled down by the autoscaler because it will be used by the placement group.
[2025-04-29T20:28:44Z]         worker_ip = self.provider.non_terminated_node_ips(WORKER_FILTER)[0]
[2025-04-29T20:28:44Z]         lm.update(
[2025-04-29T20:28:44Z]             worker_ip,
[2025-04-29T20:28:44Z]             mock_raylet_id(),
[2025-04-29T20:28:44Z]             {"CPU": 1},
[2025-04-29T20:28:44Z]             {"CPU": 1},
[2025-04-29T20:28:44Z]             20,  # idle for 20 seconds, which is longer than the idle_timeout_minutes.
[2025-04-29T20:28:44Z]             None,
[2025-04-29T20:28:44Z]             None,
[2025-04-29T20:28:44Z]             [placement_group_table_data],
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         events = autoscaler.event_summarizer.summary()
[2025-04-29T20:28:44Z]         assert "Removing 1 nodes of type worker (idle)." not in events, events
[2025-04-29T20:28:44Z]         assert "Adding 1 node(s) of type worker." in events, events
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z] >       self.waitForNodes(2, tag_filters=WORKER_FILTER)
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:3708: 
[2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes
[2025-04-29T20:28:44Z]     comparison(n, expected, msg="Unexpected node quantity.")
[2025-04-29T20:28:44Z] E   AssertionError: 3 != 2 : Unexpected node quantity.

They both over-provisioned worker nodes (AssertionError: 3 != 2) because of a race between autoscaler.update() and the background NodeLauncher. In particular, the autoscaler's pending_launches counter is decremented asynchronously by the background NodeLauncher when it launches a pending node. That decrement can make the pending node disappear from the view of an in-flight autoscaler.update(), which then over-provisions an extra node.
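To make the race concrete, here is a minimal, self-contained sketch (toy names only, not Ray's actual autoscaler or NodeLauncher code) of how an update that snapshots the node list before a launch completes, but reads the pending counter after it has been decremented, ends up launching one extra node:

# Toy reproduction of the race; the names below are illustrative, not Ray APIs.
import threading
import time

class ToyCounter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, n=1):
        with self._lock:
            self._value += n

    def dec(self, n=1):
        with self._lock:
            self._value -= n

    def value(self):
        with self._lock:
            return self._value

nodes = []                  # stands in for the provider's non-terminated nodes
pending_launches = ToyCounter()

def node_launcher():
    # Background launcher: create the node, then drop it from the pending count.
    time.sleep(0.01)        # simulate slow node creation
    nodes.append("worker")
    pending_launches.dec()  # the pending node "disappears" here

def update(target=1):
    visible = len(nodes)    # snapshot taken before the launch completes
    time.sleep(0.02)        # update() is preempted; the launcher finishes now
    to_launch = target - visible - pending_launches.value()
    if to_launch > 0:       # 1 - 0 - 0 = 1: an extra node gets launched
        pending_launches.inc(to_launch)
        threading.Thread(target=node_launcher).start()

pending_launches.inc()
threading.Thread(target=node_launcher).start()  # first launch is in flight
update()                                        # races with the launcher
time.sleep(0.1)
print(len(nodes))                               # often prints 2 instead of 1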

The previous workaround was to add time.sleep(3) between autoscaler.update() calls:

# TODO(rickyx): This is a hack to avoid running into race conditions
# within v1 autoscaler. These should no longer be relevant in v2.
time.sleep(3)

I think we can make it more reliable by using self.waitForNodes() instead.

This PR fixes these two flaky tests by adding self.waitForNodes() between consecutive autoscaler.update() calls, so each update starts only after the previous launch is actually visible.
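The intended pattern looks roughly like this (an illustrative fragment assuming the test fixture shown in the logs above, i.e. self.waitForNodes, autoscaler, TAG_RAY_NODE_STATUS, and STATUS_UP_TO_DATE; it is not the exact diff):

# Before: sleep for a fixed duration and hope the background launch finished.
autoscaler.update()
time.sleep(3)
autoscaler.update()

# After: block until the provider actually reports the expected node count,
# so the next update() never sees a half-finished launch.
autoscaler.update()
self.waitForNodes(2)
autoscaler.update()
self.waitForNodes(2, tag_filters={TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE})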

It also fixes errors in the previous implementation of testDontScaleDownIdleTimeOutForPlacementGroups (a Runner deserialization error and races on the event summary).

Before this PR, these two tests failed due to the race roughly once every 200 runs. After this PR, they pass 10,000 consecutive runs without a failure.
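For reference, a hypothetical local stress loop (not part of this PR or Ray's CI) for reproducing or verifying the flake could look like the following, assuming pytest is available and the command is run from the Ray repository root:

# Rerun the two affected tests repeatedly and stop at the first failure.
import subprocess
import sys

TESTS = "testConfiguresNewNodes or testDontScaleDownIdleTimeOutForPlacementGroups"

for i in range(1, 1001):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "python/ray/tests/test_autoscaler.py",
         "-k", TESTS, "-x", "-q"],
    )
    if result.returncode != 0:
        print(f"flake reproduced on run {i}")
        break
else:
    print("no failures in 1000 runs")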

Related issue number

#52768

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian marked this pull request as ready for review May 4, 2025 04:52
@rueian rueian requested a review from kevin85421 May 4, 2025 05:45
@rueian rueian force-pushed the deflaky-autoscaler-v1 branch from 126b569 to c3c2ebd May 4, 2025 16:05