[core][autoscaler][v1] deflaky test_autoscaler #52769

Open · rueian wants to merge 2 commits into master from deflaky-autoscaler-v1

Conversation

@rueian (Contributor) commented May 3, 2025

Why are these changes needed?

From the logs provided by @kevin85421, test_autoscaler.py has 2 flaky tests:

[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] =================================== FAILURES ===================================
[2025-04-29T20:28:44Z] ____________________ AutoscalingTest.testConfiguresNewNodes ____________________
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testConfiguresNewNodes>
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z]     def testConfiguresNewNodes(self):
[2025-04-29T20:28:44Z]         config = copy.deepcopy(SMALL_CLUSTER)
[2025-04-29T20:28:44Z]         config["available_node_types"]["worker"]["min_workers"] = 1
[2025-04-29T20:28:44Z]         config_path = self.write_config(config)
[2025-04-29T20:28:44Z]         self.provider = MockProvider()
[2025-04-29T20:28:44Z]         runner = MockProcessRunner()
[2025-04-29T20:28:44Z]         runner.respond_to_call("json .Config.Env", ["[]" for i in range(2)])
[2025-04-29T20:28:44Z]         self.provider.create_node(
[2025-04-29T20:28:44Z]             {},
[2025-04-29T20:28:44Z]             {
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_KIND: NODE_KIND_HEAD,
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE,
[2025-04-29T20:28:44Z]                 TAG_RAY_USER_NODE_TYPE: "head",
[2025-04-29T20:28:44Z]             },
[2025-04-29T20:28:44Z]             1,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]         autoscaler = MockAutoscaler(
[2025-04-29T20:28:44Z]             config_path,
[2025-04-29T20:28:44Z]             LoadMetrics(),
[2025-04-29T20:28:44Z]             MockGcsClient(),
[2025-04-29T20:28:44Z]             max_failures=0,
[2025-04-29T20:28:44Z]             process_runner=runner,
[2025-04-29T20:28:44Z]             update_interval_s=0,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         self.waitForNodes(2)
[2025-04-29T20:28:44Z]         self.provider.finish_starting_nodes()
[2025-04-29T20:28:44Z]         # TODO(rickyx): This is a hack to avoid running into race conditions
[2025-04-29T20:28:44Z]         # within v1 autoscaler. These should no longer be relevant in v2.
[2025-04-29T20:28:44Z]         time.sleep(3)
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         time.sleep(3)
[2025-04-29T20:28:44Z] >       self.waitForNodes(2, tag_filters={TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE})
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:2250: 
[2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes
[2025-04-29T20:28:44Z]     comparison(n, expected, msg="Unexpected node quantity.")
[2025-04-29T20:28:44Z] E   AssertionError: 3 != 2 : Unexpected node quantity.

and

[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] =================================== FAILURES ===================================
[2025-04-29T20:28:44Z] ________ AutoscalingTest.testDontScaleDownIdleTimeOutForPlacementGroups ________
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] self = <python.ray.tests.test_autoscaler.AutoscalingTest testMethod=testDontScaleDownIdleTimeOutForPlacementGroups>
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z]     def testDontScaleDownIdleTimeOutForPlacementGroups(self):
[2025-04-29T20:28:44Z]         config = copy.deepcopy(SMALL_CLUSTER)
[2025-04-29T20:28:44Z]         config["available_node_types"]["head"]["resources"][
[2025-04-29T20:28:44Z]             "CPU"
[2025-04-29T20:28:44Z]         ] = 0  # make the head node not consume any resources.
[2025-04-29T20:28:44Z]         config["available_node_types"]["worker"][
[2025-04-29T20:28:44Z]             "min_workers"
[2025-04-29T20:28:44Z]         ] = 1  # prepare 1 worker upfront.
[2025-04-29T20:28:44Z]         config["idle_timeout_minutes"] = 0.1
[2025-04-29T20:28:44Z]         config_path = self.write_config(config)
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         self.provider = MockProvider()
[2025-04-29T20:28:44Z]         self.provider.create_node(
[2025-04-29T20:28:44Z]             {},
[2025-04-29T20:28:44Z]             {
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_KIND: NODE_KIND_HEAD,
[2025-04-29T20:28:44Z]                 TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE,
[2025-04-29T20:28:44Z]                 TAG_RAY_USER_NODE_TYPE: "head",
[2025-04-29T20:28:44Z]             },
[2025-04-29T20:28:44Z]             1,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         runner = MockProcessRunner()
[2025-04-29T20:28:44Z]         lm = LoadMetrics()
[2025-04-29T20:28:44Z]         mock_gcs_client = MockGcsClient()
[2025-04-29T20:28:44Z]         autoscaler = MockAutoscaler(
[2025-04-29T20:28:44Z]             config_path,
[2025-04-29T20:28:44Z]             lm,
[2025-04-29T20:28:44Z]             mock_gcs_client,
[2025-04-29T20:28:44Z]             max_failures=0,
[2025-04-29T20:28:44Z]             process_runner=runner,
[2025-04-29T20:28:44Z]             update_interval_s=0,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]         # 1 worker is ready upfront.
[2025-04-29T20:28:44Z]         self.waitForNodes(1, tag_filters=WORKER_FILTER)
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         # Restore min_workers to allow scaling down to 0.
[2025-04-29T20:28:44Z]         config["available_node_types"]["worker"]["min_workers"] = 0
[2025-04-29T20:28:44Z]         self.write_config(config)
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         # Create a placement group with 2 bundles that require 2 workers.
[2025-04-29T20:28:44Z]         placement_group_table_data = gcs_pb2.PlacementGroupTableData(
[2025-04-29T20:28:44Z]             placement_group_id=b"\000",
[2025-04-29T20:28:44Z]             strategy=common_pb2.PlacementStrategy.SPREAD,
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]         for i in range(2):
[2025-04-29T20:28:44Z]             bundle = common_pb2.Bundle()
[2025-04-29T20:28:44Z]             bundle.bundle_id.placement_group_id = (
[2025-04-29T20:28:44Z]                 placement_group_table_data.placement_group_id
[2025-04-29T20:28:44Z]             )
[2025-04-29T20:28:44Z]             bundle.bundle_id.bundle_index = i
[2025-04-29T20:28:44Z]             bundle.unit_resources["CPU"] = 1
[2025-04-29T20:28:44Z]             placement_group_table_data.bundles.append(bundle)
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         # Mark the first worker as idle, but it should not be scaled down by the autoscaler because it will be used by the placement group.
[2025-04-29T20:28:44Z]         worker_ip = self.provider.non_terminated_node_ips(WORKER_FILTER)[0]
[2025-04-29T20:28:44Z]         lm.update(
[2025-04-29T20:28:44Z]             worker_ip,
[2025-04-29T20:28:44Z]             mock_raylet_id(),
[2025-04-29T20:28:44Z]             {"CPU": 1},
[2025-04-29T20:28:44Z]             {"CPU": 1},
[2025-04-29T20:28:44Z]             20,  # idle for 20 seconds, which is longer than the idle_timeout_minutes.
[2025-04-29T20:28:44Z]             None,
[2025-04-29T20:28:44Z]             None,
[2025-04-29T20:28:44Z]             [placement_group_table_data],
[2025-04-29T20:28:44Z]         )
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         events = autoscaler.event_summarizer.summary()
[2025-04-29T20:28:44Z]         assert "Removing 1 nodes of type worker (idle)." not in events, events
[2025-04-29T20:28:44Z]         assert "Adding 1 node(s) of type worker." in events, events
[2025-04-29T20:28:44Z]     
[2025-04-29T20:28:44Z]         autoscaler.update()
[2025-04-29T20:28:44Z] >       self.waitForNodes(2, tag_filters=WORKER_FILTER)
[2025-04-29T20:28:44Z] 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:3708: 
[2025-04-29T20:28:44Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2025-04-29T20:28:44Z] python/ray/tests/test_autoscaler.py:414: in waitForNodes
[2025-04-29T20:28:44Z]     comparison(n, expected, msg="Unexpected node quantity.")
[2025-04-29T20:28:44Z] E   AssertionError: 3 != 2 : Unexpected node quantity.

They both over-provisioned worker nodes (AssertionError: 3 != 2) because of a race between autoscaler.update() and the background NodeLauncher. In particular, the autoscaler's pending_launches counter is decremented asynchronously by the background NodeLauncher when it launches a pending node. That decrement can make the pending node disappear from the view of an in-flight autoscaler.update(), which then over-provisions an extra node.
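To make the race concrete, here is a minimal, self-contained sketch (toy names only, not Ray's actual autoscaler or NodeLauncher code) of how an update that snapshots the node list before a launch completes, but reads the pending counter after it has been decremented, ends up launching one extra node:

# Toy reproduction of the race; the names below are illustrative, not Ray APIs.
import threading
import time

class ToyCounter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, n=1):
        with self._lock:
            self._value += n

    def dec(self, n=1):
        with self._lock:
            self._value -= n

    def value(self):
        with self._lock:
            return self._value

nodes = []                  # stands in for the provider's non-terminated nodes
pending_launches = ToyCounter()

def node_launcher():
    # Background launcher: create the node, then drop it from the pending count.
    time.sleep(0.01)        # simulate slow node creation
    nodes.append("worker")
    pending_launches.dec()  # the pending node "disappears" here

def update(target=1):
    visible = len(nodes)    # snapshot taken before the launch completes
    time.sleep(0.02)        # update() is preempted; the launcher finishes now
    to_launch = target - visible - pending_launches.value()
    if to_launch > 0:       # 1 - 0 - 0 = 1: an extra node gets launched
        pending_launches.inc(to_launch)
        threading.Thread(target=node_launcher).start()

pending_launches.inc()
threading.Thread(target=node_launcher).start()  # first launch is in flight
update()                                        # races with the launcher
time.sleep(0.1)
print(len(nodes))                               # often prints 2 instead of 1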

The previous workaround was to add time.sleep(3) between autoscaler.update() calls:

# TODO(rickyx): This is a hack to avoid running into race conditions
# within v1 autoscaler. These should no longer be relevant in v2.
time.sleep(3)

I think we can make it more reliable by using self.waitForNodes() instead.

This PR fixes these two flaky tests by adding self.waitForNodes() between consecutive autoscaler.update() calls, so each update starts only after the previous launch is actually visible.
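The intended pattern looks roughly like this (an illustrative fragment assuming the test fixture shown in the logs above, i.e. self.waitForNodes, autoscaler, TAG_RAY_NODE_STATUS, and STATUS_UP_TO_DATE; it is not the exact diff):

# Before: sleep for a fixed duration and hope the background launch finished.
autoscaler.update()
time.sleep(3)
autoscaler.update()

# After: block until the provider actually reports the expected node count,
# so the next update() never sees a half-finished launch.
autoscaler.update()
self.waitForNodes(2)
autoscaler.update()
self.waitForNodes(2, tag_filters={TAG_RAY_NODE_STATUS: STATUS_UP_TO_DATE})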

It also fixes errors in the previous implementation of testDontScaleDownIdleTimeOutForPlacementGroups (a Runner deserialization error and races on the event summary).

Before this PR, these two tests failed due to the race roughly once every 200 runs. After this PR, they pass 10,000 consecutive runs without a failure.
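For reference, a hypothetical local stress loop (not part of this PR or Ray's CI) for reproducing or verifying the flake could look like the following, assuming pytest is available and the command is run from the Ray repository root:

# Rerun the two affected tests repeatedly and stop at the first failure.
import subprocess
import sys

TESTS = "testConfiguresNewNodes or testDontScaleDownIdleTimeOutForPlacementGroups"

for i in range(1, 1001):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "python/ray/tests/test_autoscaler.py",
         "-k", TESTS, "-x", "-q"],
    )
    if result.returncode != 0:
        print(f"flake reproduced on run {i}")
        break
else:
    print("no failures in 1000 runs")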

Related issue number

#52768

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rueian rueian marked this pull request as ready for review May 4, 2025 04:52
@rueian rueian requested a review from kevin85421 May 4, 2025 05:45
@rueian rueian force-pushed the deflaky-autoscaler-v1 branch from 126b569 to c3c2ebd May 4, 2025 16:05