[pull] master from ray-project:master #140

Open · wants to merge 7,533 commits into base: master

Conversation

pull[bot]

@pull pull bot commented Jun 29, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

raulchen and others added 27 commits May 22, 2025 23:05
## Why are these changes needed?

* The on_exit hook was introduced to allow users to perform cleanup.
* However, it triggers a race condition in fault tolerance: after
on_exit is called and the UDF is deleted, but before the actor actually
exits, another retry task can be submitted to the actor.
* This PR disables the hook by default. Eventually this should be fixed in Ray
Core (#53169).

---------

Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: lkchen <[email protected]>
Co-authored-by: lkchen <[email protected]>
Convert the isort config in `.isort.cfg` to ruff config in `pyproject.toml`.

Conversion strategy:

- `known_local_folder` -> `known-local-folder`
- `known_third_party` -> `known-third-party`
- `known_afterray` -> created a new `afterray` section
- `sections` -> `section-order`
- `skip_glob` -> if the path already exists in `tool.ruff.extend-exclude`, do nothing; otherwise add a rule to `per-file-ignores` ignoring the `I` (isort) rules.
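Under that strategy, the resulting `pyproject.toml` sections could look roughly like the sketch below (the module lists and glob are illustrative, not Ray's actual values):

```toml
[tool.ruff.lint.isort]
known-local-folder = ["ray"]
known-third-party = ["numpy", "pandas"]
# Custom "afterray" section slotted between first-party and local-folder.
section-order = [
  "future", "standard-library", "third-party",
  "first-party", "afterray", "local-folder",
]

[tool.ruff.lint.isort.sections]
afterray = ["psutil", "setproctitle"]

[tool.ruff.lint.per-file-ignores]
# Former skip_glob entries that are not already in extend-exclude:
"python/ray/some_generated_dir/*" = ["I"]
```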

Signed-off-by: Chi-Sheng Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
The test is failing periodically in CI due to memory usage being above
the memory monitor threshold prior to running an actor:
https://buildkite.com/ray-project/postmerge/builds/10292#0196f473-d8f0-41fa-8041-addf42660c57/179-477

I'm attempting to deflake it by converting it to follow the same pattern
as its sister test and combining them using a fixture. This only
requires that we're below the memory monitor threshold to start, not
that we are below `0.3 * threshold`.

---------

Signed-off-by: Edward Oakes <[email protected]>
Follow-up to #53171, addressing review comments after merge.

Mainly just unifying the GCS PID key between Python and C++.

Signed-off-by: dayshah <[email protected]>
Timing out occasionally on premerge & postmerge due to running up
against the limit.

Speeding up some of the tests by reducing/removing sleeps.

---------

Signed-off-by: Edward Oakes <[email protected]>
…ort (#51032)" (#53263)

This reverts commit 2c7f6d4.

`test_torch_tensor_transport_gpu` is [failing on
postmerge](https://buildkite.com/ray-project/postmerge/builds/10332#0196fbb9-7c80-4513-96f7-0250e53fd671/177-959).
It appears this test does not run on premerge.

Signed-off-by: Edward Oakes <[email protected]>
The `autoscaler_cluster_resources` metric doesn't have the "instance"
label, so this graph is broken.
If I remove the "instance" filter from the query, then the graph won't
work with the instance dropdown variable.

Instead, we replace it with `ray_node_gpus_available`, which does have the
instance label.

Signed-off-by: Alan Guo <[email protected]>
Attempts to fix two sources of flakiness:
1. `TypeError: Cannot read properties of null (reading
'visibilityState')` error message. We do "visibility" checks not because
we want to verify the document is visible, but because we want to verify
the element is visible. JSDom sometimes doesn't support the
visibilityState field, but it's not something we actually need to verify, so
we mock it out.
2. Timeout for `ActorTable.component.test.tsx`. This is a long-running
test because it involves a lot of UI element interaction. We increase
the timeout.


Signed-off-by: Alan Guo <[email protected]>
to temporarily dodge the image tag limit

Signed-off-by: Lonnie Liu <[email protected]>
- This PR removes unused constants (`PROCESS_TYPE_REPORTER`,
`PROCESS_TYPE_WEB_UI`).
---------

Signed-off-by: Dongjun Na <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
… resources (#51978)

## Why are these changes needed?

This PR enhances the logic of DeploymentScheduler._best_fit_node() to
consider custom resource prioritization defined via the
RAY_SERVE_CUSTOM_RESOURCES environment variable. The updated logic
ensures that nodes are selected not just based on generic resources
(CPU, GPU, memory), but also takes into account the specified order of
custom resources when minimizing resource fragmentation.

This change addresses inefficient scheduling behavior where custom
resources like GRAM were ignored in prioritization, resulting in
unnecessary fragmentation (e.g., #51361).
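As a rough illustration of the idea (not Ray Serve's actual implementation; the function name, priority tuple, and data layout are assumptions), a best-fit pick that compares leftover capacity in a configured resource priority order might look like:

```python
# Hypothetical sketch of priority-aware best-fit scheduling; the
# resource names and priority order are illustrative assumptions.
def best_fit_node(available, required, priority=("GPU", "GRAM", "CPU", "memory")):
    """Return the id of the feasible node whose leftover resources are
    smallest, comparing resources in the given priority order."""
    feasible = {
        node_id: res
        for node_id, res in available.items()
        if all(res.get(r, 0.0) >= amt for r, amt in required.items())
    }
    if not feasible:
        return None  # no node can fit the request

    def leftover(item):
        # Lexicographic key: minimize leftover of the highest-priority
        # resource first, keeping scarce resources (e.g. GRAM) defragmented.
        _, res = item
        return tuple(res.get(r, 0.0) - required.get(r, 0.0) for r in priority)

    return min(feasible.items(), key=leftover)[0]
```

With this ordering, a request needing 16 GRAM prefers a node with exactly 16 GRAM free over one with 32, instead of treating both as equally good fits.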

## Related issue number
Closes #51361

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: kitae <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]>
Some of the tests run much more slowly on mac/windows, so bumping them
back up.

Closes #43777

Signed-off-by: Edward Oakes <[email protected]>
Making node manager testable by injecting in
- client call manager
- worker rpc pool
- core worker subscriber
- object directory
- object manager
- plasma store client

Creating mocks based on interfaces for the above when necessary.

Also removing the plasma store client + actual Plasma Store Evict
message sending and handling codepath because it was unused and dead
code.

---------

Signed-off-by: dayshah <[email protected]>
Converts `test_reference_counting_*.py` to use shared Ray instance
fixtures where possible.

Those tests that require their own Ray instance are moved to
`test_reference_counting_standalone.py`.

---------

Signed-off-by: Edward Oakes <[email protected]>
…3241)

- Use `list_actors()` and `nodes()` public APIs
- Remove barely-used `new_port` utility

---------

Signed-off-by: Edward Oakes <[email protected]>

## Why are these changes needed?


This PR updates the actor pool operator to label actors with logical
actor IDs. This change is needed so we can disambiguate actors via
their labels.


---------

Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


Skip test_daft for pyarrow version >= 14

```
[2025-05-21T17:20:28Z] python/ray/data/tests/test_daft.py::test_daft_round_trip FAILED          [100%]
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] =================================== FAILURES ===================================
  | [2025-05-21T17:20:28Z] _____________________________ test_daft_round_trip _____________________________
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] ray_start = RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.21', ray_version='3.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}')
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]     def test_daft_round_trip(ray_start):
  | [2025-05-21T17:20:28Z]         import daft
  | [2025-05-21T17:20:28Z]         import numpy as np
  | [2025-05-21T17:20:28Z]         import pandas as pd
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         data = {
  | [2025-05-21T17:20:28Z]             "int_col": list(range(128)),
  | [2025-05-21T17:20:28Z]             "str_col": [str(i) for i in range(128)],
  | [2025-05-21T17:20:28Z]             "nested_list_col": [[i] * 3 for i in range(128)],
  | [2025-05-21T17:20:28Z]             "tensor_col": [np.array([[i] * 3] * 3) for i in range(128)],
  | [2025-05-21T17:20:28Z]         }
  | [2025-05-21T17:20:28Z] >       df = daft.from_pydict(data)
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] python/ray/data/tests/test_daft.py:27:
  | [2025-05-21T17:20:28Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/convert.py:80: in from_pydict
  | [2025-05-21T17:20:28Z]     return DataFrame._from_pydict(data)
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/dataframe/dataframe.py:520: in _from_pydict
  | [2025-05-21T17:20:28Z]     return cls._from_tables(data_micropartition)
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/dataframe/dataframe.py:572: in _from_tables
  | [2025-05-21T17:20:28Z]     df._populate_preview()
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/dataframe/dataframe.py:473: in _populate_preview
  | [2025-05-21T17:20:28Z]     preview_parts = self._result._get_preview_micropartitions(self._num_preview_rows)
  | [2025-05-21T17:20:28Z] /opt/miniconda/lib/python3.9/site-packages/daft/runners/ray_runner.py:272: in _get_preview_micropartitions
  | [2025-05-21T17:20:28Z]     part: MicroPartition = ray.get(ref)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/auto_init_hook.py:21: in auto_init_wrapper
  | [2025-05-21T17:20:28Z]     return fn(*args, **kwargs)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/auto_init_hook.py:21: in auto_init_wrapper
  | [2025-05-21T17:20:28Z]     return fn(*args, **kwargs)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/client_mode_hook.py:103: in wrapper
  | [2025-05-21T17:20:28Z]     return func(*args, **kwargs)
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/worker.py:2842: in get
  | [2025-05-21T17:20:28Z]     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  | [2025-05-21T17:20:28Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] self = <ray._private.worker.Worker object at 0x7ff6bcc8b220>
  | [2025-05-21T17:20:28Z] object_refs = [ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000002e1f505)]
  | [2025-05-21T17:20:28Z] timeout = None, return_exceptions = False, skip_deserialization = False
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]     def get_objects(
  | [2025-05-21T17:20:28Z]         self,
  | [2025-05-21T17:20:28Z]         object_refs: list,
  | [2025-05-21T17:20:28Z]         timeout: Optional[float] = None,
  | [2025-05-21T17:20:28Z]         return_exceptions: bool = False,
  | [2025-05-21T17:20:28Z]         skip_deserialization: bool = False,
  | [2025-05-21T17:20:28Z]     ):
  | [2025-05-21T17:20:28Z]         """Get the values in the object store associated with the IDs.
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         Return the values from the local object store for object_refs. This
  | [2025-05-21T17:20:28Z]         will block until all the values for object_refs have been written to
  | [2025-05-21T17:20:28Z]         the local object store.
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         Args:
  | [2025-05-21T17:20:28Z]             object_refs: A list of the object refs
  | [2025-05-21T17:20:28Z]                 whose values should be retrieved.
  | [2025-05-21T17:20:28Z]             timeout: The maximum amount of time in
  | [2025-05-21T17:20:28Z]                 seconds to wait before returning.
  | [2025-05-21T17:20:28Z]             return_exceptions: If any of the objects deserialize to an
  | [2025-05-21T17:20:28Z]                 Exception object, whether to return them as values in the
  | [2025-05-21T17:20:28Z]                 returned list. If False, then the first found exception will be
  | [2025-05-21T17:20:28Z]                 raised.
  | [2025-05-21T17:20:28Z]             skip_deserialization: If true, only the buffer will be released and
  | [2025-05-21T17:20:28Z]                 the object associated with the buffer will not be deserialized.
  | [2025-05-21T17:20:28Z]         Returns:
  | [2025-05-21T17:20:28Z]             list: List of deserialized objects or None if skip_deserialization is True.
  | [2025-05-21T17:20:28Z]             bytes: UUID of the debugger breakpoint we should drop
  | [2025-05-21T17:20:28Z]                 into or b"" if there is no breakpoint.
  | [2025-05-21T17:20:28Z]         """
  | [2025-05-21T17:20:28Z]         # Make sure that the values are object refs.
  | [2025-05-21T17:20:28Z]         for object_ref in object_refs:
  | [2025-05-21T17:20:28Z]             if not isinstance(object_ref, ObjectRef):
  | [2025-05-21T17:20:28Z]                 raise TypeError(
  | [2025-05-21T17:20:28Z]                     f"Attempting to call `get` on the value {object_ref}, "
  | [2025-05-21T17:20:28Z]                     "which is not an ray.ObjectRef."
  | [2025-05-21T17:20:28Z]                 )
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         timeout_ms = (
  | [2025-05-21T17:20:28Z]             int(timeout * 1000) if timeout is not None and timeout != -1 else -1
  | [2025-05-21T17:20:28Z]         )
  | [2025-05-21T17:20:28Z]         data_metadata_pairs: List[
  | [2025-05-21T17:20:28Z]             Tuple[ray._raylet.Buffer, bytes]
  | [2025-05-21T17:20:28Z]         ] = self.core_worker.get_objects(
  | [2025-05-21T17:20:28Z]             object_refs,
  | [2025-05-21T17:20:28Z]             timeout_ms,
  | [2025-05-21T17:20:28Z]         )
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         debugger_breakpoint = b""
  | [2025-05-21T17:20:28Z]         for data, metadata in data_metadata_pairs:
  | [2025-05-21T17:20:28Z]             if metadata:
  | [2025-05-21T17:20:28Z]                 metadata_fields = metadata.split(b",")
  | [2025-05-21T17:20:28Z]                 if len(metadata_fields) >= 2 and metadata_fields[1].startswith(
  | [2025-05-21T17:20:28Z]                     ray_constants.OBJECT_METADATA_DEBUG_PREFIX
  | [2025-05-21T17:20:28Z]                 ):
  | [2025-05-21T17:20:28Z]                     debugger_breakpoint = metadata_fields[1][
  | [2025-05-21T17:20:28Z]                         len(ray_constants.OBJECT_METADATA_DEBUG_PREFIX) :
  | [2025-05-21T17:20:28Z]                     ]
  | [2025-05-21T17:20:28Z]         if skip_deserialization:
  | [2025-05-21T17:20:28Z]             return None, debugger_breakpoint
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z]         values = self.deserialize_objects(data_metadata_pairs, object_refs)
  | [2025-05-21T17:20:28Z]         if not return_exceptions:
  | [2025-05-21T17:20:28Z]             # Raise exceptions instead of returning them to the user.
  | [2025-05-21T17:20:28Z]             for i, value in enumerate(values):
  | [2025-05-21T17:20:28Z]                 if isinstance(value, RayError):
  | [2025-05-21T17:20:28Z]                     if isinstance(value, ray.exceptions.ObjectLostError):
  | [2025-05-21T17:20:28Z]                         global_worker.core_worker.dump_object_store_memory_usage()
  | [2025-05-21T17:20:28Z]                     if isinstance(value, RayTaskError):
  | [2025-05-21T17:20:28Z]                         raise value.as_instanceof_cause()
  | [2025-05-21T17:20:28Z]                     else:
  | [2025-05-21T17:20:28Z] >                       raise value
  | [2025-05-21T17:20:28Z] E                       ray.exceptions.RaySystemError: System error: module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z] E                       traceback: Traceback (most recent call last):
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 460, in deserialize_objects
  | [2025-05-21T17:20:28Z] E                           obj = self._deserialize_object(data, metadata, object_ref)
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 317, in _deserialize_object
  | [2025-05-21T17:20:28Z] E                           return self._deserialize_msgpack_data(data, metadata_fields)
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
  | [2025-05-21T17:20:28Z] E                           python_objects = self._deserialize_pickle5_data(pickle5_data)
  | [2025-05-21T17:20:28Z] E                         File "/rayci/python/ray/_private/serialization.py", line 260, in _deserialize_pickle5_data
  | [2025-05-21T17:20:28Z] E                           obj = pickle.loads(in_band, buffers=buffers)
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/series.py", line 36, in from_arrow
  | [2025-05-21T17:20:28Z] E                           if DataType.from_arrow_type(array.type) == DataType.python():
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/datatype.py", line 457, in from_arrow_type
  | [2025-05-21T17:20:28Z] E                           elif isinstance(arrow_type, pa.PyExtensionType):
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 63, in __getattr__
  | [2025-05-21T17:20:28Z] E                           raise e
  | [2025-05-21T17:20:28Z] E                         File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 53, in __getattr__
  | [2025-05-21T17:20:28Z] E                           return getattr(self._load_module(), name)
  | [2025-05-21T17:20:28Z] E                       AttributeError: module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z]
  | [2025-05-21T17:20:28Z] /rayci/python/ray/_private/worker.py:932: RaySystemError
  | [2025-05-21T17:20:28Z] ---------------------------- Captured stderr setup -----------------------------
  | [2025-05-21T17:20:28Z] 2025-05-21 17:19:46,973	WARNING services.py:2170 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2684354560 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=4.60gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
  | [2025-05-21T17:20:28Z] 2025-05-21 17:19:47,112	INFO worker.py:1901 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
  | [2025-05-21T17:20:28Z] ----------------------------- Captured stderr call -----------------------------
  | [2025-05-21T17:20:28Z] 2025-05-21 17:19:49,133	ERROR serialization.py:462 -- module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z] Traceback (most recent call last):
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 460, in deserialize_objects
  | [2025-05-21T17:20:28Z]     obj = self._deserialize_object(data, metadata, object_ref)
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 317, in _deserialize_object
  | [2025-05-21T17:20:28Z]     return self._deserialize_msgpack_data(data, metadata_fields)
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
  | [2025-05-21T17:20:28Z]     python_objects = self._deserialize_pickle5_data(pickle5_data)
  | [2025-05-21T17:20:28Z]   File "/rayci/python/ray/_private/serialization.py", line 260, in _deserialize_pickle5_data
  | [2025-05-21T17:20:28Z]     obj = pickle.loads(in_band, buffers=buffers)
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/series.py", line 36, in from_arrow
  | [2025-05-21T17:20:28Z]     if DataType.from_arrow_type(array.type) == DataType.python():
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/datatype.py", line 457, in from_arrow_type
  | [2025-05-21T17:20:28Z]     elif isinstance(arrow_type, pa.PyExtensionType):
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 63, in __getattr__
  | [2025-05-21T17:20:28Z]     raise e
  | [2025-05-21T17:20:28Z]   File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 53, in __getattr__
  | [2025-05-21T17:20:28Z]     return getattr(self._load_module(), name)
  | [2025-05-21T17:20:28Z] AttributeError: module 'pyarrow' has no attribute 'PyExtensionType'
  | [2025-05-21T17:20:28Z] =========================== short test summary info ============================
  | [2025-05-21T17:20:28Z] FAILED python/ray/data/tests/test_daft.py::test_daft_round_trip - ray.excepti...
  | [2025-05-21T17:20:28Z] ============================== 1 failed in 10.92s ==============================
  | [2025-05-21T17:20:28Z] ================================================================================
  |  

```

## Related issue number

#53278


---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


`test_streaming_fault_tolerance` sometimes raises an `AssertionError`
because the actor state isn't `RESTARTING` as we expect.

This PR adds more information to the assertion so the test is easier
to debug.

```
[2025-05-23T16:40:02Z]   File "/rayci/python/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 424, in update_resource_usage
--
  | [2025-05-23T16:40:02Z]     assert actor_state == gcs_pb2.ActorTableData.ActorState.RESTARTING
  | [2025-05-23T16:40:02Z] AssertionError
  | [2025-05-23T16:40:02Z] =========================== short test summary info ============================
  | [2025-05-23T16:40:02Z] FAILED python/ray/data/tests/test_streaming_integration.py::test_streaming_fault_tolerance
  | [2025-05-23T16:40:02Z] ============= 1 failed, 15 passed, 1 skipped in 149.79s (0:02:29) ==============
```
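The general pattern is to attach context to the assertion message so a bare `AssertionError` in CI carries the state that violated the expectation (the helper and field names below are hypothetical, not the PR's exact code):

```python
# Hypothetical helper: the names and expected state are illustrative,
# not the actual actor_pool_map_operator.py code.
def check_restarting(actor_id: str, actor_state: str) -> None:
    # The second operand of assert becomes the AssertionError message,
    # so the failing state shows up directly in the CI traceback.
    assert actor_state == "RESTARTING", (
        f"Expected actor {actor_id} to be RESTARTING, "
        f"but its state is {actor_state!r}"
    )
```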

Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


`test_e2e_autoscaling_up` launches actor tasks that block until the
actor pool has launched 6 actors. If it takes longer than 10s for the
actor pool to launch 6 actors, the test fails.

Since the actor pool can sometimes launch new actors slowly, this test
can non-deterministically fail. To mitigate this issue, this PR
updates the test to launch the minimum number of actors (2).


---------

Signed-off-by: Balaji Veeramani <[email protected]>

## Why are these changes needed?


`test_groupby_multi_agg_with_nans` occasionally fails due to slight
differences in floating point values. To make this unit test less
brittle, this PR updates the test to compare with `pytest.approx`.

```
[2025-05-23T21:19:46Z] >       assert _round_to_14_digits(expected_row) == _round_to_14_digits(result_row)
--
  | [2025-05-23T21:19:46Z] E       AssertionError: assert {'max_a': 49,...a': -0.5, ...} == {'max_a': 49,...a': -0.5, ...}
  | [2025-05-23T21:19:46Z] E         Omitting 5 identical items, use -vv to show
  | [2025-05-23T21:19:46Z] E         Differing items:
  | [2025-05-23T21:19:46Z] E         {'std_a': 29.01149197588202} != {'std_a': 29.01149197588201}
  | [2025-05-23T21:19:46Z] E         Full diff:
  | [2025-05-23T21:19:46Z] E           {
  | [2025-05-23T21:19:46Z] E            'max_a': 49,
  | [2025-05-23T21:19:46Z] E            'mean_a': -0.5,...
  | [2025-05-23T21:19:46Z] E
  | [2025-05-23T21:19:46Z] E         ...Full output truncated (8 lines hidden), use '-vv' to show
```
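`pytest.approx` compares numbers within a relative tolerance (1e-6 by default) and works element-wise on dicts, so the two `std_a` values from the failure above compare equal under it:

```python
import pytest

# Rows reproducing the flaky comparison from the CI failure above.
expected_row = {"max_a": 49, "mean_a": -0.5, "std_a": 29.01149197588202}
result_row = {"max_a": 49, "mean_a": -0.5, "std_a": 29.01149197588201}

# Exact comparison is brittle in the last digit...
assert expected_row != result_row
# ...but an approximate comparison tolerates the floating-point noise.
assert result_row == pytest.approx(expected_row)
```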

Signed-off-by: Balaji Veeramani <[email protected]>
and use bazel run to upload stuff

Signed-off-by: Lonnie Liu <[email protected]>
Follow-up for #52712

Signed-off-by: Chi-Sheng Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
crypdick and others added 30 commits June 9, 2025 17:06
we upgraded the machine and releases

Signed-off-by: Lonnie Liu <[email protected]>
Saw this flake on
[postmerge](https://buildkite.com/ray-project/postmerge/builds/10702#01975572-3a26-4e0b-b27d-0869ac5830fe/177-1191).

Cleaned up the test in general:

- Remove sleep conditions.
- Remove use of direct gRPC connection to raylet.
- Remove runtime_env test that was redundant with
`test_runtime_env_env_vars.py`.

Runtime decreased by ~50% locally, from `62.98s` to `33.48s`.

---------

Signed-off-by: Edward Oakes <[email protected]>
These have been deprecated/ignored for a long time and are polluting the
help string.

---------

Signed-off-by: Edward Oakes <[email protected]>
… Data (#53220)

---------

Signed-off-by: Timothy Seah <[email protected]>
Co-authored-by: angelinalg <[email protected]>
The option is only used for `manylinux1`; Ray no longer uses manylinux1 to build things.

---------

Signed-off-by: Gagandeep Singh <[email protected]>
Adding back the "Run on Anyscale" button after an Anyscale PR was merged to
show the button on Ray docs but not on Anyscale template previews.

Signed-off-by: Chris Zhang <[email protected]>
…fig (#53681)

The schema of the compute config that the KubeRay service takes in is currently
a bit different from the schema of the cluster compute in release tests.
This PR adds a helper function to convert the cluster compute into the
KubeRay compute config that is eventually sent to the KubeRay service.

---------

Signed-off-by: kevin <[email protected]>
…private to _common (#53652)

Fixes #53478

---------

Signed-off-by: abrar <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Don't know for sure, but it seems like a race condition between when the
cancel happens and the attempt to access the Ray task result causes a
`RayTaskCancelled` exception.

Used the repro script in the ticket (#53639) to confirm that the issue is
resolved.

---------

Signed-off-by: abrar <[email protected]>
- Add prefix `rayproject/ray` for all tags
- Authorize docker with credentials from SSM
- Mock authorize docker in unit test since it's not needed

---------

Signed-off-by: kevin <[email protected]>
Even when no spilling is happening, Ray still logs "trying to
spill." Add an early exit to avoid the confusing logs.

Closes #53086
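The fix is the classic early-exit guard; a pure-Python sketch of the pattern (the actual change lives in the raylet's spill path, not this code):

```python
import logging

logger = logging.getLogger("spill")


def spill_objects_if_needed(pending_objects):
    """Spill queued objects to external storage.

    Early-exit when there is nothing to spill, so no misleading
    "trying to spill" log line is emitted for an empty queue.
    """
    if not pending_objects:
        return 0  # nothing to do; skip the log entirely
    logger.info("Trying to spill %d objects.", len(pending_objects))
    spilled = len(pending_objects)
    pending_objects.clear()
    return spilled
```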

Signed-off-by: tianyi-ge <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
Unset, unused.

---------

Signed-off-by: Edward Oakes <[email protected]>
Was [deprecated](#51309) in Ray
2.44 along with Ray workflows.

---------

Signed-off-by: Edward Oakes <[email protected]>
…TaskId (#53695)

```cpp
task_spec.ParentTaskId().Binary()
```
* `ParentTaskId()` deserializes binary to `TaskId`.
* `Binary()` serializes `TaskId` to binary.
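In other words, the expression deserializes the parent task ID only to immediately re-serialize it. A toy Python stand-in (not Ray's real `TaskID` class) makes the wasted round-trip visible:

```python
class TaskId:
    """Toy stand-in for a binary-backed ID type; the real class differs."""

    def __init__(self, raw: bytes):
        self._raw = raw

    @classmethod
    def from_binary(cls, raw: bytes) -> "TaskId":
        # Deserialization: in real code this costs an allocation/copy.
        return cls(raw)

    def binary(self) -> bytes:
        # Serialization back to the exact same bytes.
        return self._raw


raw = b"\x01\x02\x03"
# Parse-then-reserialize yields the original bytes, so the intermediate
# TaskId object is pure overhead when only the bytes are needed.
assert TaskId.from_binary(raw).binary() == raw
```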

Signed-off-by: Kai-Hsun Chen <[email protected]>
…oo backend (#53319)

Adds single-controller APIs (APIs that can be called from the driver)
for creating collectives on a group of actors using
`ray.util.collective`. These APIs are currently under
`ray.experimental.collective` as they are experimental and to avoid
potential conflicts with `ray.util.collective`. See
test_experimental_collective::test_api_basic for API usage.
- create_collective_group
- destroy_collective_group
- get_collective_groups

Also adds a ray.util.collective backend based on torch.distributed gloo,
for convenient testing on CPUs. While ray.util.collective has a pygloo
backend, that backend requires pygloo to be installed, and pygloo
doesn't seem to be supported on the latest versions of Python.
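A minimal pure-Python sketch of the group-registry semantics these three APIs imply (the real implementation lives in `ray.experimental.collective` and manages actual actor handles and backends; this is only the bookkeeping shape):

```python
_groups = {}  # group name -> {"actors": [...], "backend": str}


def create_collective_group(actors, name="default", backend="gloo"):
    """Register a named collective group over a set of actors."""
    if name in _groups:
        raise ValueError(f"group {name!r} already exists")
    _groups[name] = {"actors": list(actors), "backend": backend}
    return name


def destroy_collective_group(name="default"):
    """Tear down a group; a no-op if it does not exist."""
    _groups.pop(name, None)


def get_collective_groups(actor=None):
    """Return group names, optionally filtered to those containing `actor`."""
    if actor is None:
        return list(_groups)
    return [n for n, g in _groups.items() if actor in g["actors"]]
```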

---------

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Remove some open telemetry code that was added in
#51077. These files were added to
test the compilation of the OpenTelemetry library, but we never ended up
using these files.

Test:
- CI

Signed-off-by: can <[email protected]>
…t_ref` for small and non-GPU objects (#53692)

This PR is based on #53630.

See #53623 for the issue. In this PR, we clear the object ref when the
arg's tensor transport is not OBJECT_STORE.

Closes #53623 
---------

Signed-off-by: Stephanie wang <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Stephanie wang <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Follow-up to #53652.

Signed-off-by: abrar <[email protected]>
Improve the string representation of `WorkerHealthCheckFailedError` to
also include the base reason why the health check failed.
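A hedged sketch of the idea, with an illustrative class body (not the actual Ray implementation):

```python
class WorkerHealthCheckFailedError(RuntimeError):
    """Raised when a worker's health check fails.

    Illustrative sketch: the string form includes the underlying cause,
    so logs show *why* the health check failed, not just that it did.
    """

    def __init__(self, worker_id: str, cause: Exception):
        self.worker_id = worker_id
        self.cause = cause
        super().__init__(worker_id)

    def __str__(self) -> str:
        return (
            f"Health check failed for worker {self.worker_id}: "
            f"{type(self.cause).__name__}: {self.cause}"
        )
```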

Signed-off-by: Matthew Deng <[email protected]>
```
REGRESSION 12.82%: tasks_per_second (THROUGHPUT) regresses from 221.2222291023174 to 192.87246715163326 in benchmarks/many_nodes.json
REGRESSION 12.73%: actors_per_second (THROUGHPUT) regresses from 634.2824761754516 to 553.5098466276525 in benchmarks/many_actors.json
REGRESSION 12.26%: client__get_calls (THROUGHPUT) regresses from 1160.5254002780266 to 1018.2939193917422 in microbenchmark.json
REGRESSION 5.15%: multi_client_put_gigabytes (THROUGHPUT) regresses from 39.896743394372585 to 37.84234603653026 in microbenchmark.json
REGRESSION 4.04%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.9480091293556955 to 0.909684480871914 in microbenchmark.json
REGRESSION 3.72%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8318.094433102775 to 8008.806358661164 in microbenchmark.json
REGRESSION 3.01%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 2020.4236901532247 to 1959.5608579309087 in microbenchmark.json
REGRESSION 2.80%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 23716.451989299432 to 23052.03512506016 in microbenchmark.json
REGRESSION 2.71%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.105537951105227 to 19.561225172916046 in microbenchmark.json
REGRESSION 2.69%: pgs_per_second (THROUGHPUT) regresses from 13.650631601393242 to 13.282795863244178 in benchmarks/many_pgs.json
REGRESSION 1.35%: single_client_tasks_async (THROUGHPUT) regresses from 8081.168521067462 to 7971.849053459262 in microbenchmark.json
REGRESSION 1.31%: n_n_actor_calls_async (THROUGHPUT) regresses from 27465.39608393524 to 27105.63998087682 in microbenchmark.json
REGRESSION 1.09%: client__tasks_and_put_batch (THROUGHPUT) regresses from 14569.862277318796 to 14411.155262801181 in microbenchmark.json
REGRESSION 1.05%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1483.660979687764 to 1468.0999827232097 in microbenchmark.json
REGRESSION 0.92%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 12.796724102063072 to 12.67868528378648 in microbenchmark.json
REGRESSION 0.88%: placement_group_create/removal (THROUGHPUT) regresses from 768.9082534403586 to 762.110356621388 in microbenchmark.json
REGRESSION 0.87%: single_client_tasks_sync (THROUGHPUT) regresses from 969.5757440611114 to 961.1131766783709 in microbenchmark.json
REGRESSION 0.35%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1069.1602586173547 to 1065.4228066614364 in microbenchmark.json
REGRESSION 0.23%: client__put_gigabytes (THROUGHPUT) regresses from 0.1529268174148042 to 0.1525808986433169 in microbenchmark.json
REGRESSION 0.05%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5113.112753017668 to 5110.344528620948 in microbenchmark.json
REGRESSION 49.81%: dashboard_p99_latency_ms (LATENCY) regresses from 275.082 to 412.087 in benchmarks/many_pgs.json
REGRESSION 37.19%: dashboard_p95_latency_ms (LATENCY) regresses from 6.696 to 9.186 in benchmarks/many_pgs.json
REGRESSION 36.35%: dashboard_p95_latency_ms (LATENCY) regresses from 2283.949 to 3114.217 in benchmarks/many_actors.json
REGRESSION 13.04%: dashboard_p99_latency_ms (LATENCY) regresses from 675.061 to 763.093 in benchmarks/many_tasks.json
REGRESSION 11.46%: dashboard_p50_latency_ms (LATENCY) regresses from 3.856 to 4.298 in benchmarks/many_pgs.json
REGRESSION 11.23%: dashboard_p95_latency_ms (LATENCY) regresses from 437.195 to 486.283 in benchmarks/many_tasks.json
REGRESSION 8.97%: 107374182400_large_object_time (LATENCY) regresses from 29.323037406000026 to 31.951921509999977 in scalability/single_node.json
REGRESSION 6.24%: avg_iteration_time (LATENCY) regresses from 1.1950538015365602 to 1.2696449542045594 in stress_tests/stress_test_dead_actors.json
REGRESSION 5.86%: dashboard_p50_latency_ms (LATENCY) regresses from 8.293 to 8.779 in benchmarks/many_actors.json
REGRESSION 2.91%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 12.241764013000008 to 12.597426240999994 in scalability/object_store.json
REGRESSION 1.02%: avg_pg_remove_time_ms (LATENCY) regresses from 1.2291068678679091 to 1.2416502777781075 in stress_tests/stress_test_placement_group.json
REGRESSION 0.57%: dashboard_p50_latency_ms (LATENCY) regresses from 5.658 to 5.69 in benchmarks/many_nodes.json
REGRESSION 0.34%: 10000_args_time (LATENCY) regresses from 18.764070391999994 to 18.828636121000002 in scalability/single_node.json
```

Signed-off-by: Lonnie Liu <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>