[pull] master from ray-project:master #140

Open · wants to merge 7,533 commits into base: master

Conversation

pull[bot]

@pull pull bot commented Jun 29, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

raulchen and others added 27 commits May 22, 2025 23:05
## Why are these changes needed?

* The on_exit hook was introduced to allow users to perform cleanup.
* However, it triggers a race condition in fault tolerance: after
on_exit is called and the UDF is deleted, but before the actor actually
exits, another retry task can be submitted to the actor.
* This PR disables the hook by default. Eventually this should be fixed in Ray
Core (#53169).

---------

Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: lkchen <[email protected]>
Co-authored-by: lkchen <[email protected]>
Convert the isort config in `.isort.cfg` to ruff config in `pyproject.toml`.

Conversion strategy:

- `known_local_folder` -> `known-local-folder`
- `known_third_party` -> `known-third-party`
- `known_afterray` -> created a new `afterray` section
- `sections` -> `section-order`
- `skip_glob` -> if the path already exists in `tool.ruff.extend-exclude`, do nothing; otherwise add a rule to `per-file-ignores` ignoring the `I` (isort) rules.
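Under that strategy, the resulting `pyproject.toml` sections could look roughly like the sketch below (the module lists and glob are illustrative, not Ray's actual values):

```toml
[tool.ruff.lint.isort]
known-local-folder = ["ray"]
known-third-party = ["numpy", "pandas"]
# Custom "afterray" section slotted between first-party and local-folder.
section-order = [
  "future", "standard-library", "third-party",
  "first-party", "afterray", "local-folder",
]

[tool.ruff.lint.isort.sections]
afterray = ["psutil", "setproctitle"]

[tool.ruff.lint.per-file-ignores]
# Former skip_glob entries that are not already in extend-exclude:
"python/ray/some_generated_dir/*" = ["I"]
```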

Signed-off-by: Chi-Sheng Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
The test is failing periodically in CI due to memory usage being above
the memory monitor threshold prior to running an actor:
https://buildkite.com/ray-project/postmerge/builds/10292#0196f473-d8f0-41fa-8041-addf42660c57/179-477

I'm attempting to deflake it by converting it to follow the same pattern
as its sister test and combining them using a fixture. This only
requires that we're below the memory monitor threshold to start, not
that we are below `0.3 * threshold`.

---------

Signed-off-by: Edward Oakes <[email protected]>
Follow-up to #53171, addressing review comments after merge.

Mainly just unifying the GCS PID key between Python and C++.

Signed-off-by: dayshah <[email protected]>
Timing out occasionally on premerge & postmerge due to running up
against the limit.

Speeding up some of the tests by reducing/removing sleeps.

---------

Signed-off-by: Edward Oakes <[email protected]>
…ort (#51032)" (#53263)

This reverts commit 2c7f6d4.

`test_torch_tensor_transport_gpu` is [failing on
postmerge](https://buildkite.com/ray-project/postmerge/builds/10332#0196fbb9-7c80-4513-96f7-0250e53fd671/177-959).
It appears this test does not run on premerge.

Signed-off-by: Edward Oakes <[email protected]>
The `autoscaler_cluster_resources` metric doesn't have the "instance"
label, so this graph is broken.
If I remove the "instance" filter from the query, then the graph won't
work with the instance dropdown variable.

Instead, we replace it with `ray_node_gpus_available`, which does have the
instance label.

Signed-off-by: Alan Guo <[email protected]>
Attempts to fix two sources of flakiness:
1. `TypeError: Cannot read properties of null (reading
'visibilityState')` error message. We do "visibility" checks not because
we want to verify the document is visible, but because we want to verify
the element is visible. JSDom sometimes doesn't support the
visibilityState field, but it's not something we actually need to verify, so
we mock it out.
2. Timeout for `ActorTable.component.test.tsx`. This is a long-running
test because it involves a lot of UI element interaction. We increase
the timeout.


Signed-off-by: Alan Guo <[email protected]>
to temporarily dodge the image tag limit

Signed-off-by: Lonnie Liu <[email protected]>
- This PR removes unused constants (`PROCESS_TYPE_REPORTER`,
`PROCESS_TYPE_WEB_UI`).
---------

Signed-off-by: Dongjun Na <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
… resources (#51978)

## Why are these changes needed?

This PR enhances the logic of DeploymentScheduler._best_fit_node() to
consider custom resource prioritization defined via the
RAY_SERVE_CUSTOM_RESOURCES environment variable. The updated logic
ensures that nodes are selected not just based on generic resources
(CPU, GPU, memory), but also takes into account the specified order of
custom resources when minimizing resource fragmentation.

This change addresses inefficient scheduling behavior where custom
resources like GRAM were ignored in prioritization, resulting in
unnecessary fragmentation (e.g., #51361).
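As a rough illustration of the idea (not Ray Serve's actual implementation; the function name, priority tuple, and data layout are assumptions), a best-fit pick that compares leftover capacity in a configured resource priority order might look like:

```python
# Hypothetical sketch of priority-aware best-fit scheduling; the
# resource names and priority order are illustrative assumptions.
def best_fit_node(available, required, priority=("GPU", "GRAM", "CPU", "memory")):
    """Return the id of the feasible node whose leftover resources are
    smallest, comparing resources in the given priority order."""
    feasible = {
        node_id: res
        for node_id, res in available.items()
        if all(res.get(r, 0.0) >= amt for r, amt in required.items())
    }
    if not feasible:
        return None  # no node can fit the request

    def leftover(item):
        # Lexicographic key: minimize leftover of the highest-priority
        # resource first, keeping scarce resources (e.g. GRAM) defragmented.
        _, res = item
        return tuple(res.get(r, 0.0) - required.get(r, 0.0) for r in priority)

    return min(feasible.items(), key=leftover)[0]
```

With this ordering, a request needing 16 GRAM prefers a node with exactly 16 GRAM free over one with 32, instead of treating both as equally good fits.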

## Related issue number
Closes #51361

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kitae <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]>
Some of the tests run much more slowly on mac/windows, so bumping them
back up.

Closes #43777

Signed-off-by: Edward Oakes <[email protected]>
Making node manager testable by injecting in
- client call manager
- worker rpc pool
- core worker subscriber
- object directory
- object manager
- plasma store client

Creating mocks based on interfaces for the above when necessary.

Also removing the plasma store client + actual Plasma Store Evict
message sending and handling codepath because it was unused and dead
code.

---------

Signed-off-by: dayshah <[email protected]>
Converts `test_reference_counting_*.py` to use shared Ray instance
fixtures where possible.

Those tests that require their own Ray instance are moved to
`test_reference_counting_standalone.py`.

---------

Signed-off-by: Edward Oakes <[email protected]>
…3241)

- Use `list_actors()` and `nodes()` public APIs
- Remove barely-used `new_port` utility

---------

Signed-off-by: Edward Oakes <[email protected]>

## Why are these changes needed?


This PR updates the actor pool operator to label actors with logical
actor IDs. This change is needed so we can disambiguate actors via
their labels.


---------

Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


Skip test_daft for pyarrow version >= 14

```
[2025-05-21T17:20:28Z] python/ray/data/tests/test_daft.py::test_daft_round_trip FAILED          [100%]
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] =================================== FAILURES ===================================
  | [2025-05-21T17:20:28Z] _____________________________ test_daft_round_trip _____________________________
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] ray_start = RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.21', ray_version='3.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}')
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]     def test_daft_round_trip(ray_start):
  | [2025-05-21T17:20:28Z]         import daft
  | [2025-05-21T17:20:28Z]         import numpy as np
  | [2025-05-21T17:20:28Z]         import pandas as pd
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         data = {
  | [2025-05-21T17:20:28Z]             "int_col": list(range(128)),
  | [2025-05-21T17:20:28Z]             "str_col": [str(i) for i in range(128)],
  | [2025-05-21T17:20:28Z]             "nested_list_col": [[i] * 3 for i in range(128)],
  | [2025-05-21T17:20:28Z]             "tensor_col": [np.array([[i] * 3] * 3) for i in range(128)],
  | [2025-05-21T17:20:28Z]         }
  | [2025-05-21T17:20:28Z] >       df = daft.from_pydict(data)
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] python/ray/data/tests/test_daft.py:27:
  | [2025-05-21T17:20:28Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/convert.py:80: in from_pydict
  | [2025-05-21T17:20:28Z]     return DataFrame._from_pydict(data)
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/dataframe/dataframe.py:520: in _from_pydict
  | [2025-05-21T17:20:28Z]     return cls._from_tables(data_micropartition)
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/dataframe/dataframe.py:572: in _from_tables
  | [2025-05-21T17:20:28Z]     df._populate_preview()
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/dataframe/dataframe.py:473: in _populate_preview
  | [2025-05-21T17:20:28Z]     preview_parts = self._result._get_preview_micropartitions(self._num_preview_rows)
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/runners/ray_runner.py:272: in _get_preview_micropartitions
  | [2025-05-21T17:20:28Z]     part: MicroPartition = ray.get(ref)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/auto_init_hook.py:21: in auto_init_wrapper
  | [2025-05-21T17:20:28Z]     return fn(*args, **kwargs)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/auto_init_hook.py:21: in auto_init_wrapper
  | [2025-05-21T17:20:28Z]     return fn(*args, **kwargs)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/client_mode_hook.py:103: in wrapper
  | [2025-05-21T17:20:28Z]     return func(*args, **kwargs)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/worker.py:2842: in get
  | [2025-05-21T17:20:28Z]     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  | [2025-05-21T17:20:28Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] self = <ray._private.worker.Worker object at 0x7ff6bcc8b220>
  | [2025-05-21T17:20:28Z] object_refs = [ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000002e1f505)]
  | [2025-05-21T17:20:28Z] timeout = None, return_exceptions = False, skip_deserialization = False
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]     def get_objects(
  | [2025-05-21T17:20:28Z]         self,
  | [2025-05-21T17:20:28Z]         object_refs: list,
  | [2025-05-21T17:20:28Z]         timeout: Optional[float] = None,
  | [2025-05-21T17:20:28Z]         return_exceptions: bool = False,
  | [2025-05-21T17:20:28Z]         skip_deserialization: bool = False,
  | [2025-05-21T17:20:28Z]     ):
  | [2025-05-21T17:20:28Z]         """Get the values in the object store associated with the IDs.
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         Return the values from the local object store for object_refs. This
  | [2025-05-21T17:20:28Z]         will block until all the values for object_refs have been written to
  | [2025-05-21T17:20:28Z]         the local object store.
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         Args:
  | [2025-05-21T17:20:28Z]             object_refs: A list of the object refs
  | [2025-05-21T17:20:28Z]                 whose values should be retrieved.
  | [2025-05-21T17:20:28Z]             timeout: The maximum amount of time in
  | [2025-05-21T17:20:28Z]                 seconds to wait before returning.
  | [2025-05-21T17:20:28Z]             return_exceptions: If any of the objects deserialize to an
  | [2025-05-21T17:20:28Z]                 Exception object, whether to return them as values in the
  | [2025-05-21T17:20:28Z]                 returned list. If False, then the first found exception will be
  | [2025-05-21T17:20:28Z]                 raised.
  | [2025-05-21T17:20:28Z]             skip_deserialization: If true, only the buffer will be released and
  | [2025-05-21T17:20:28Z]                 the object associated with the buffer will not be deserialized.
  | [2025-05-21T17:20:28Z]         Returns:
  | [2025-05-21T17:20:28Z]             list: List of deserialized objects or None if skip_deserialization is True.
  | [2025-05-21T17:20:28Z]             bytes: UUID of the debugger breakpoint we should drop
  | [2025-05-21T17:20:28Z]                 into or b"" if there is no breakpoint.
  | [2025-05-21T17:20:28Z]         """
  | [2025-05-21T17:20:28Z]         # Make sure that the values are object refs.
  | [2025-05-21T17:20:28Z]         for object_ref in object_refs:
  | [2025-05-21T17:20:28Z]             if not isinstance(object_ref, ObjectRef):
  | [2025-05-21T17:20:28Z]                 raise TypeError(
  | [2025-05-21T17:20:28Z]                     f"Attempting to call `get` on the value {object_ref}, "
  | [2025-05-21T17:20:28Z]                     "which is not an ray.ObjectRef."
  | [2025-05-21T17:20:28Z]                 )
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         timeout_ms = (
  | [2025-05-21T17:20:28Z]             int(timeout * 1000) if timeout is not None and timeout != -1 else -1
  | [2025-05-21T17:20:28Z]         )
  | [2025-05-21T17:20:28Z]         data_metadata_pairs: List[
  | [2025-05-21T17:20:28Z]             Tuple[ray._raylet.Buffer, bytes]
  | [2025-05-21T17:20:28Z]         ] = self.core_worker.get_objects(
  | [2025-05-21T17:20:28Z]             object_refs,
  | [2025-05-21T17:20:28Z]             timeout_ms,
  | [2025-05-21T17:20:28Z]         )
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         debugger_breakpoint = b""
  | [2025-05-21T17:20:28Z]         for data, metadata in data_metadata_pairs:
  | [2025-05-21T17:20:28Z]             if metadata:
  | [2025-05-21T17:20:28Z]                 metadata_fields = metadata.split(b",")
  | [2025-05-21T17:20:28Z]                 if len(metadata_fields) >= 2 and metadata_fields[1].startswith(
  | [2025-05-21T17:20:28Z]                     ray_constants.OBJECT_METADATA_DEBUG_PREFIX
  | [2025-05-21T17:20:28Z]                 ):
  | [2025-05-21T17:20:28Z]                     debugger_breakpoint = metadata_fields[1][
  | [2025-05-21T17:20:28Z]                         len(ray_constants.OBJECT_METADATA_DEBUG_PREFIX) :
  | [2025-05-21T17:20:28Z]                     ]
  | [2025-05-21T17:20:28Z]         if skip_deserialization:
  | [2025-05-21T17:20:28Z]             return None, debugger_breakpoint
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         values = self.deserialize_objects(data_metadata_pairs, object_refs)
  | [2025-05-21T17:20:28Z]         if not return_exceptions:
  | [2025-05-21T17:20:28Z]             # Raise exceptions instead of returning them to the user.
  | [2025-05-21T17:20:28Z]             for i, value in enumerate(values):
  | [2025-05-21T17:20:28Z]                 if isinstance(value, RayError):
  | [2025-05-21T17:20:28Z]                     if isinstance(value, ray.exceptions.ObjectLostError):
  | [2025-05-21T17:20:28Z]                         global_worker.core_worker.dump_object_store_memory_usage()
  | [2025-05-21T17:20:28Z]                     if isinstance(value, RayTaskError):
  | [2025-05-21T17:20:28Z]                         raise value.as_instanceof_cause()
  | [2025-05-21T17:20:28Z]                     else:
  | [2025-05-21T17:20:28Z] >                       raise value
  | [2025-05-21T17:20:28Z] E                       ray.exceptions.RaySystemError: System error: module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z] E                       traceback: Traceback (most recent call last):
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 460, in deserialize_objects
  | [2025-05-21T17:20:28Z] E                           obj = self._deserialize_object(data, metadata, object_ref)
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 317, in _deserialize_object
  | [2025-05-21T17:20:28Z] E                           return self._deserialize_msgpack_data(data, metadata_fields)
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
  | [2025-05-21T17:20:28Z] E                           python_objects = self._deserialize_pickle5_data(pickle5_data)
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 260, in _deserialize_pickle5_data
  | [2025-05-21T17:20:28Z] E                           obj = pickle.loads(in_band, buffers=buffers)
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/series.py", line 36, in from_arrow
  | [2025-05-21T17:20:28Z] E                           if DataType.from_arrow_type(array.type) == DataType.python():
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/datatype.py", line 457, in from_arrow_type
  | [2025-05-21T17:20:28Z] E                           elif isinstance(arrow_type, pa.PyExtensionType):
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 63, in __getattr__
  | [2025-05-21T17:20:28Z] E                           raise e
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 53, in __getattr__
  | [2025-05-21T17:20:28Z] E                           return getattr(self._load_module(), name)
  | [2025-05-21T17:20:28Z] E                       AttributeError: module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/worker.py:932: RaySystemError
  | [2025-05-21T17:20:28Z] ---------------------------- Captured stderr setup -----------------------------
  | [2025-05-21T17:20:28Z] 2025-05-21 17:19:46,973	WARNING services.py:2170 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2684354560 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=4.60gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
  | [2025-05-21T17:20:28Z] 2025-05-21 17:19:47,112	INFO worker.py:1901 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
  | [2025-05-21T17:20:28Z] ----------------------------- Captured stderr call -----------------------------
  | [2025-05-21T17:20:28Z] 2025-05-21 17:19:49,133	ERROR serialization.py:462 -- module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z] Traceback (most recent call last):
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 460, in deserialize_objects
  | [2025-05-21T17:20:28Z]     obj = self._deserialize_object(data, metadata, object_ref)
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 317, in _deserialize_object
  | [2025-05-21T17:20:28Z]     return self._deserialize_msgpack_data(data, metadata_fields)
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
  | [2025-05-21T17:20:28Z]     python_objects = self._deserialize_pickle5_data(pickle5_data)
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 260, in _deserialize_pickle5_data
  | [2025-05-21T17:20:28Z]     obj = pickle.loads(in_band, buffers=buffers)
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/series.py", line 36, in from_arrow
  | [2025-05-21T17:20:28Z]     if DataType.from_arrow_type(array.type) == DataType.python():
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/datatype.py", line 457, in from_arrow_type
  | [2025-05-21T17:20:28Z]     elif isinstance(arrow_type, pa.PyExtensionType):
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 63, in __getattr__
  | [2025-05-21T17:20:28Z]     raise e
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 53, in __getattr__
  | [2025-05-21T17:20:28Z]     return getattr(self._load_module(), name)
  | [2025-05-21T17:20:28Z] AttributeError: module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z] =========================== short test summary info ============================
  | [2025-05-21T17:20:28Z] FAILED python/ray/data/tests/test_daft.py::test_daft_round_trip - ray.excepti...
  | [2025-05-21T17:20:28Z] ============================== 1 failed in 10.92s ==============================
  | [2025-05-21T17:20:28Z] ================================================================================
  |  

```

## Related issue number

#53278


---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


`test_streaming_fault_tolerance` sometimes raises an `AssertionError`
because the actor state isn't `RESTARTING` as we expect.

This PR adds more information to the assertion so the test is easier
to debug.

```
[2025-05-23T16:40:02Z]   File "/rayci/python/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 424, in update_resource_usage
--
  | [2025-05-23T16:40:02Z]     assert actor_state == gcs_pb2.ActorTableData.ActorState.RESTARTING
  | [2025-05-23T16:40:02Z] AssertionError
  | [2025-05-23T16:40:02Z] =========================== short test summary info ============================
  | [2025-05-23T16:40:02Z] FAILED python/ray/data/tests/test_streaming_integration.py::test_streaming_fault_tolerance
  | [2025-05-23T16:40:02Z] ============= 1 failed, 15 passed, 1 skipped in 149.79s (0:02:29) ==============
```
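The general pattern is to attach context to the assertion message so a bare `AssertionError` in CI carries the state that violated the expectation (the helper and field names below are hypothetical, not the PR's exact code):

```python
# Hypothetical helper: the names and expected state are illustrative,
# not the actual actor_pool_map_operator.py code.
def check_restarting(actor_id: str, actor_state: str) -> None:
    # The second operand of assert becomes the AssertionError message,
    # so the failing state shows up directly in the CI traceback.
    assert actor_state == "RESTARTING", (
        f"Expected actor {actor_id} to be RESTARTING, "
        f"but its state is {actor_state!r}"
    )
```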

Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


`test_e2e_autoscaling_up` launches actor tasks that block until the
actor pool has launched 6 actors. If it takes longer than 10s for the
actor pool to launch 6 actors, the test fails.

Since the actor pool can sometimes launch new actors slowly, this test
can non-deterministically fail. To mitigate this issue, this PR
updates the test to launch the minimum number of actors (2).


---------

Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


`test_groupby_multi_agg_with_nans` occasionally fails due to slight
differences in floating point values. To make this unit test less
brittle, this PR updates the test to compare with `pytest.approx`.

```
[2025-05-23T21:19:46Z] >       assert _round_to_14_digits(expected_row) == _round_to_14_digits(result_row)
--
  | [2025-05-23T21:19:46Z] E       AssertionError: assert {'max_a': 49,...a': -0.5, ...} == {'max_a': 49,...a': -0.5, ...}
  | [2025-05-23T21:19:46Z] E         Omitting 5 identical items, use -vv to show
  | [2025-05-23T21:19:46Z] E         Differing items:
  | [2025-05-23T21:19:46Z] E         {'std_a': 29.01149197588202} != {'std_a': 29.01149197588201}
  | [2025-05-23T21:19:46Z] E         Full diff:
  | [2025-05-23T21:19:46Z] E           {
  | [2025-05-23T21:19:46Z] E            'max_a': 49,
  | [2025-05-23T21:19:46Z] E            'mean_a': -0.5,...
  | [2025-05-23T21:19:46Z] E
  | [2025-05-23T21:19:46Z] E         ...Full output truncated (8 lines hidden), use '-vv' to show
```
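`pytest.approx` compares numbers within a relative tolerance (1e-6 by default) and works element-wise on dicts, so the two `std_a` values from the failure above compare equal under it:

```python
import pytest

# Rows reproducing the flaky comparison from the CI failure above.
expected_row = {"max_a": 49, "mean_a": -0.5, "std_a": 29.01149197588202}
result_row = {"max_a": 49, "mean_a": -0.5, "std_a": 29.01149197588201}

# Exact comparison is brittle in the last digit...
assert expected_row != result_row
# ...but an approximate comparison tolerates the floating-point noise.
assert result_row == pytest.approx(expected_row)
```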

Signed-off-by: Balaji Veeramani <[email protected]>
and use bazel run to upload stuff

Signed-off-by: Lonnie Liu <[email protected]>
Follow-up for #52712

Signed-off-by: Chi-Sheng Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
crypdick and others added 30 commits June 9, 2025 17:06
we upgraded the machine and releases

Signed-off-by: Lonnie Liu <[email protected]>
Saw this flake on
[postmerge](https://buildkite.com/ray-project/postmerge/builds/10702#01975572-3a26-4e0b-b27d-0869ac5830fe/177-1191).

Cleaned up the test in general:

- Remove sleep conditions.
- Remove use of direct gRPC connection to raylet.
- Remove runtime_env test that was redundant with
`test_runtime_env_env_vars.py`.

Runtime decreased by ~50% locally, from `62.98s` to `33.48s`.

---------

Signed-off-by: Edward Oakes <[email protected]>
These have been deprecated/ignored for a long time and are polluting the
help string.

---------

Signed-off-by: Edward Oakes <[email protected]>
… Data (#53220)

---------

Signed-off-by: Timothy Seah <[email protected]>
Co-authored-by: angelinalg <[email protected]>
The option is only used for `manylinux1`; Ray no longer uses manylinux1 to build things.

---------

Signed-off-by: Gagandeep Singh <[email protected]>
Adding back the "Run on Anyscale" button after an Anyscale PR was merged to
show the button on Ray docs but not on Anyscale template previews.

Signed-off-by: Chris Zhang <[email protected]>
…fig (#53681)

The schema of the compute config that the KubeRay service takes in is currently
a bit different from the schema of the cluster compute in release tests.
This PR adds a helper function to convert the cluster compute into the
KubeRay compute config that is eventually sent to the KubeRay service.

---------

Signed-off-by: kevin <[email protected]>
…private to _common (#53652)

Fixes #53478

---------

Signed-off-by: abrar <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Don't know for sure, but it seems like a race condition between when the
cancel happens and the attempt to access the Ray task result causes a
`RayTaskCancelled` exception.

Used the repro script in the ticket (#53639) to confirm that the issue is
resolved.

---------

Signed-off-by: abrar <[email protected]>
- Add prefix `rayproject/ray` for all tags
- Authorize docker with credentials from SSM
- Mock authorize docker in unit test since it's not needed

---------

Signed-off-by: kevin <[email protected]>
Even when no spilling is happening, Ray still logs "trying to
spill." Add an early exit to avoid the confusing logs.

Closes #53086
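The fix is the classic early-exit guard; a pure-Python sketch of the pattern (the actual change lives in the raylet's spill path, not this code):

```python
import logging

logger = logging.getLogger("spill")


def spill_objects_if_needed(pending_objects):
    """Spill queued objects to external storage.

    Early-exit when there is nothing to spill, so no misleading
    "trying to spill" log line is emitted for an empty queue.
    """
    if not pending_objects:
        return 0  # nothing to do; skip the log entirely
    logger.info("Trying to spill %d objects.", len(pending_objects))
    spilled = len(pending_objects)
    pending_objects.clear()
    return spilled
```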

Signed-off-by: tianyi-ge <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
Unset, unused.

---------

Signed-off-by: Edward Oakes <[email protected]>
Was [deprecated](#51309) in Ray
2.44 along with Ray workflows.

---------

Signed-off-by: Edward Oakes <[email protected]>
…TaskId (#53695)

```cpp
task_spec.ParentTaskId().Binary()
```
* `ParentTaskId()` deserializes binary to `TaskId`.
* `Binary()` serializes `TaskId` to binary.
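In other words, the expression deserializes the parent task ID only to immediately re-serialize it. A toy Python stand-in (not Ray's real `TaskID` class) makes the wasted round-trip visible:

```python
class TaskId:
    """Toy stand-in for a binary-backed ID type; the real class differs."""

    def __init__(self, raw: bytes):
        self._raw = raw

    @classmethod
    def from_binary(cls, raw: bytes) -> "TaskId":
        # Deserialization: in real code this costs an allocation/copy.
        return cls(raw)

    def binary(self) -> bytes:
        # Serialization back to the exact same bytes.
        return self._raw


raw = b"\x01\x02\x03"
# Parse-then-reserialize yields the original bytes, so the intermediate
# TaskId object is pure overhead when only the bytes are needed.
assert TaskId.from_binary(raw).binary() == raw
```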

Signed-off-by: Kai-Hsun Chen <[email protected]>
…oo backend (#53319)

Adds single-controller APIs (APIs that can be called from the driver)
for creating collectives on a group of actors using
`ray.util.collective`. These APIs are currently under
`ray.experimental.collective` as they are experimental and to avoid
potential conflicts with `ray.util.collective`. See
test_experimental_collective::test_api_basic for API usage.
- create_collective_group
- destroy_collective_group
- get_collective_groups

Also adds a ray.util.collective backend based on torch.distributed gloo,
for convenient testing on CPUs. While ray.util.collective has a pygloo
backend, that backend requires pygloo to be installed, and pygloo
doesn't seem to be supported on the latest versions of Python.
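A minimal pure-Python sketch of the group-registry semantics these three APIs imply (the real implementation lives in `ray.experimental.collective` and manages actual actor handles and backends; this is only the bookkeeping shape):

```python
_groups = {}  # group name -> {"actors": [...], "backend": str}


def create_collective_group(actors, name="default", backend="gloo"):
    """Register a named collective group over a set of actors."""
    if name in _groups:
        raise ValueError(f"group {name!r} already exists")
    _groups[name] = {"actors": list(actors), "backend": backend}
    return name


def destroy_collective_group(name="default"):
    """Tear down a group; a no-op if it does not exist."""
    _groups.pop(name, None)


def get_collective_groups(actor=None):
    """Return group names, optionally filtered to those containing `actor`."""
    if actor is None:
        return list(_groups)
    return [n for n, g in _groups.items() if actor in g["actors"]]
```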

---------

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Remove some open telemetry code that was added in
#51077. These files were added to
test the compilation of the OpenTelemetry library, but we never ended up
using these files.

Test:
- CI

Signed-off-by: can <[email protected]>
…t_ref` for small and non-GPU objects (#53692)

This PR is based on #53630.

See #53623 for the issue. In this PR, we clear the object ref when the
arg's tensor transport is not OBJECT_STORE.

Closes #53623 
---------

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Stephanie wang <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Follow-up to #53652.

Signed-off-by: abrar <[email protected]>
Improve the string representation of `WorkerHealthCheckFailedError` to
also include the base reason why the health check failed.
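A hedged sketch of the idea, with an illustrative class body (not the actual Ray implementation):

```python
class WorkerHealthCheckFailedError(RuntimeError):
    """Raised when a worker's health check fails.

    Illustrative sketch: the string form includes the underlying cause,
    so logs show *why* the health check failed, not just that it did.
    """

    def __init__(self, worker_id: str, cause: Exception):
        self.worker_id = worker_id
        self.cause = cause
        super().__init__(worker_id)

    def __str__(self) -> str:
        return (
            f"Health check failed for worker {self.worker_id}: "
            f"{type(self.cause).__name__}: {self.cause}"
        )
```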

Signed-off-by: Matthew Deng <[email protected]>
```
REGRESSION 12.82%: tasks_per_second (THROUGHPUT) regresses from 221.2222291023174 to 192.87246715163326 in benchmarks/many_nodes.json
REGRESSION 12.73%: actors_per_second (THROUGHPUT) regresses from 634.2824761754516 to 553.5098466276525 in benchmarks/many_actors.json
REGRESSION 12.26%: client__get_calls (THROUGHPUT) regresses from 1160.5254002780266 to 1018.2939193917422 in microbenchmark.json
REGRESSION 5.15%: multi_client_put_gigabytes (THROUGHPUT) regresses from 39.896743394372585 to 37.84234603653026 in microbenchmark.json
REGRESSION 4.04%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.9480091293556955 to 0.909684480871914 in microbenchmark.json
REGRESSION 3.72%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8318.094433102775 to 8008.806358661164 in microbenchmark.json
REGRESSION 3.01%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 2020.4236901532247 to 1959.5608579309087 in microbenchmark.json
REGRESSION 2.80%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 23716.451989299432 to 23052.03512506016 in microbenchmark.json
REGRESSION 2.71%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.105537951105227 to 19.561225172916046 in microbenchmark.json
REGRESSION 2.69%: pgs_per_second (THROUGHPUT) regresses from 13.650631601393242 to 13.282795863244178 in benchmarks/many_pgs.json
REGRESSION 1.35%: single_client_tasks_async (THROUGHPUT) regresses from 8081.168521067462 to 7971.849053459262 in microbenchmark.json
REGRESSION 1.31%: n_n_actor_calls_async (THROUGHPUT) regresses from 27465.39608393524 to 27105.63998087682 in microbenchmark.json
REGRESSION 1.09%: client__tasks_and_put_batch (THROUGHPUT) regresses from 14569.862277318796 to 14411.155262801181 in microbenchmark.json
REGRESSION 1.05%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1483.660979687764 to 1468.0999827232097 in microbenchmark.json
REGRESSION 0.92%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 12.796724102063072 to 12.67868528378648 in microbenchmark.json
REGRESSION 0.88%: placement_group_create/removal (THROUGHPUT) regresses from 768.9082534403586 to 762.110356621388 in microbenchmark.json
REGRESSION 0.87%: single_client_tasks_sync (THROUGHPUT) regresses from 969.5757440611114 to 961.1131766783709 in microbenchmark.json
REGRESSION 0.35%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1069.1602586173547 to 1065.4228066614364 in microbenchmark.json
REGRESSION 0.23%: client__put_gigabytes (THROUGHPUT) regresses from 0.1529268174148042 to 0.1525808986433169 in microbenchmark.json
REGRESSION 0.05%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5113.112753017668 to 5110.344528620948 in microbenchmark.json
REGRESSION 49.81%: dashboard_p99_latency_ms (LATENCY) regresses from 275.082 to 412.087 in benchmarks/many_pgs.json
REGRESSION 37.19%: dashboard_p95_latency_ms (LATENCY) regresses from 6.696 to 9.186 in benchmarks/many_pgs.json
REGRESSION 36.35%: dashboard_p95_latency_ms (LATENCY) regresses from 2283.949 to 3114.217 in benchmarks/many_actors.json
REGRESSION 13.04%: dashboard_p99_latency_ms (LATENCY) regresses from 675.061 to 763.093 in benchmarks/many_tasks.json
REGRESSION 11.46%: dashboard_p50_latency_ms (LATENCY) regresses from 3.856 to 4.298 in benchmarks/many_pgs.json
REGRESSION 11.23%: dashboard_p95_latency_ms (LATENCY) regresses from 437.195 to 486.283 in benchmarks/many_tasks.json
REGRESSION 8.97%: 107374182400_large_object_time (LATENCY) regresses from 29.323037406000026 to 31.951921509999977 in scalability/single_node.json
REGRESSION 6.24%: avg_iteration_time (LATENCY) regresses from 1.1950538015365602 to 1.2696449542045594 in stress_tests/stress_test_dead_actors.json
REGRESSION 5.86%: dashboard_p50_latency_ms (LATENCY) regresses from 8.293 to 8.779 in benchmarks/many_actors.json
REGRESSION 2.91%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 12.241764013000008 to 12.597426240999994 in scalability/object_store.json
REGRESSION 1.02%: avg_pg_remove_time_ms (LATENCY) regresses from 1.2291068678679091 to 1.2416502777781075 in stress_tests/stress_test_placement_group.json
REGRESSION 0.57%: dashboard_p50_latency_ms (LATENCY) regresses from 5.658 to 5.69 in benchmarks/many_nodes.json
REGRESSION 0.34%: 10000_args_time (LATENCY) regresses from 18.764070391999994 to 18.828636121000002 in scalability/single_node.json
```

Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>