forked from ray-project/ray
[pull] master from ray-project:master #140
Open: pull wants to merge 7,533 commits into garymm:master from ray-project:master.
Conversation
## Why are these changes needed?
- The `on_exit` hook was introduced to allow users to perform cleanup.
- However, it triggers a race condition bug in fault tolerance: after `on_exit` is called and the UDF is deleted, but before the actor actually exits, another retry task is submitted to the actor.
- This PR disables it by default. Eventually this should be fixed in Ray Core (#53169).

--------- Signed-off-by: Hao Chen <[email protected]> Signed-off-by: lkchen <[email protected]> Co-authored-by: lkchen <[email protected]>
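A toy sketch of the race shape described above (illustrative only, not Ray internals): cleanup fires, the UDF is dropped, and a retry that lands in the window before the actor exits then fails.

```python
# Toy model of the race (illustrative, not Ray's actual actor machinery).
class ActorStub:
    def __init__(self, udf):
        self.udf = udf

    def run_task(self, x):
        # A retry submitted after on_exit() but before the actor exits
        # hits this path with self.udf already gone.
        if self.udf is None:
            raise RuntimeError("retry raced with on_exit: UDF already deleted")
        return self.udf(x)

    def on_exit(self):
        # User cleanup hook: after this, the actor is not safe to reuse.
        self.udf = None

actor = ActorStub(lambda x: x + 1)
assert actor.run_task(1) == 2
actor.on_exit()          # cleanup fires first...
# actor.run_task(2)      # ...a retry landing in the exit window would fail here
```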
Convert the isort config in `.isort.cfg` to ruff config in `pyproject.toml`. Conversion strategy:
- `known_local_folder` -> `known-local-folder`
- `known_third_party` -> `known-third-party`
- `known_afterray` -> created a new `afterray` section
- `sections` -> `section_order`
- `skip_glob` -> if the path already exists in `tool.ruff.extend-exclude`, do nothing; otherwise add a `per-file-ignores` rule to ignore the `I` rule.

Signed-off-by: Chi-Sheng Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
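The same mapping written out as a small Python reference table (illustrative only; key names mirror the commit message, and ruff's actual TOML keys may be spelled slightly differently):

```python
# Key-for-key mapping from .isort.cfg to ruff's pyproject.toml settings,
# as described in the commit message above (reference only, not the script).
ISORT_TO_RUFF = {
    "known_local_folder": "known-local-folder",
    "known_third_party": "known-third-party",
    "known_afterray": "sections.afterray",  # new custom "afterray" section
    "sections": "section_order",
    # skip_glob has no direct equivalent: reuse tool.ruff.extend-exclude if
    # the path is already there, otherwise add a per-file-ignores entry
    # that disables the "I" (isort) rule for those files.
}
```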
#53222) Signed-off-by: Arthur <[email protected]>
The test is failing periodically in CI due to memory usage being above the memory monitor threshold prior to running an actor: https://buildkite.com/ray-project/postmerge/builds/10292#0196f473-d8f0-41fa-8041-addf42660c57/179-477 I'm attempting to deflake it by converting it to follow the same pattern as its sister test and combining them using a fixture. This only requires that we're below the memory monitor threshold to start, not that we are below `0.3 * threshold`. --------- Signed-off-by: Edward Oakes <[email protected]>
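A minimal sketch of the shared-fixture pattern described above (names are illustrative, not the actual test code):

```python
import time

import pytest

@pytest.fixture(params=["task", "actor"])
def workload_kind(request):
    # One parametrized fixture drives both sister tests through the
    # same setup path, so they stay in sync.
    return request.param

def wait_below_threshold(get_usage_bytes, threshold_bytes, timeout_s=60):
    # Only require that usage is below the monitor threshold before starting,
    # not below a stricter fraction like 0.3 * threshold.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_usage_bytes() < threshold_bytes:
            return
        time.sleep(1)
    raise TimeoutError("memory usage never dropped below the monitor threshold")
```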
Follow up to #53171 for comments after merge. Mainly just unifying the gcs pid key between Python and C++. Signed-off-by: dayshah <[email protected]>
Timing out occasionally on premerge & postmerge due to running up against the limit. Speeding up some of the tests by reducing/removing sleeps. --------- Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Linkun Chen <[email protected]>
…ort (#51032)" (#53263) This reverts commit 2c7f6d4. `test_torch_tensor_transport_gpu` is [failing on postmerge](https://buildkite.com/ray-project/postmerge/builds/10332#0196fbb9-7c80-4513-96f7-0250e53fd671/177-959). It appears this test does not run on premerge. Signed-off-by: Edward Oakes <[email protected]>
…#53253) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
The `autoscaler_cluster_resources` metric doesn't have the "instance" label, so this graph is broken. If I remove the "instance" filter from the query, the graph won't work with the instance dropdown variable. Instead, we replace it with `ray_node_gpus_available`, which does have the instance label. Signed-off-by: Alan Guo <[email protected]>
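In PromQL terms, the change looks roughly like this (the expressions and label names are illustrative assumptions, shown here as the strings a panel definition would carry):

```python
# Broken: autoscaler_cluster_resources has no "instance" label, so filtering
# by the dashboard's instance dropdown matches nothing.
broken_expr = 'autoscaler_cluster_resources{resource="GPU", instance=~"$Instance"}'

# Fixed: ray_node_gpus_available carries the "instance" label, so the
# dropdown variable works as intended.
fixed_expr = 'ray_node_gpus_available{instance=~"$Instance"}'
```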
Attempts to fix two sources of flakiness: 1. `TypeError: Cannot read properties of null (reading 'visibilityState')` error message. We do "visibility" checks not because we want to verify the document is visible, but because we want to verify the element is visible. JSDom sometimes doesn't support the visibilityState field, but it's not something we really need to verify, so we mock it out. 2. Timeout for `ActorTable.component.test.tsx`. This is a long-running test because it involves a lot of UI element interaction. We increase the timeout. Signed-off-by: Alan Guo <[email protected]>
) Signed-off-by: Linkun <[email protected]>
to temporarily dodge the image tag limit Signed-off-by: Lonnie Liu <[email protected]>
- This PR removes unused constants (PROCESS_TYPE_REPORTER, PROCESS_TYPE_WEB_UI). --------- Signed-off-by: Dongjun Na <[email protected]> Co-authored-by: Dhyey Shah <[email protected]>
… resources (#51978)

## Why are these changes needed?
This PR enhances the logic of `DeploymentScheduler._best_fit_node()` to consider custom resource prioritization defined via the `RAY_SERVE_CUSTOM_RESOURCES` environment variable. The updated logic ensures that nodes are selected not just based on generic resources (CPU, GPU, memory), but also takes into account the specified order of custom resources when minimizing resource fragmentation. This change addresses inefficient scheduling behavior where custom resources like GRAM were ignored in prioritization, resulting in unnecessary fragmentation (e.g., #51361). A hedged sketch of the idea follows below.

## Related issue number
Closes #51361

--------- Signed-off-by: kitae <[email protected]> Co-authored-by: Cindy Zhang <[email protected]>
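A minimal sketch of the prioritization idea (illustrative only, not Ray Serve's actual implementation; the environment variable name comes from the commit, everything else is assumed):

```python
import os

def best_fit_node(nodes, required):
    """Pick the feasible node whose leftovers in prioritized resources are
    smallest, reducing fragmentation of resources like GRAM.

    nodes: mapping of node_id -> available resources, e.g. {"CPU": 4, "GRAM": 8}
    required: resources the deployment needs.
    """
    # Custom resources listed in RAY_SERVE_CUSTOM_RESOURCES are minimized
    # first, ahead of the generic resources.
    custom = [r for r in os.environ.get("RAY_SERVE_CUSTOM_RESOURCES", "").split(",") if r]
    order = custom + ["CPU", "GPU", "memory"]

    def leftover(node_id):
        avail = nodes[node_id]
        return tuple(avail.get(k, 0.0) - required.get(k, 0.0) for k in order)

    feasible = [
        node_id
        for node_id, avail in nodes.items()
        if all(avail.get(k, 0.0) >= v for k, v in required.items())
    ]
    return min(feasible, key=leftover, default=None)
```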
Some of the tests run much more slowly on mac/windows, so bumping them back up. Closes #43777 Signed-off-by: Edward Oakes <[email protected]>
Making node manager testable by injecting in:
- client call manager
- worker rpc pool
- core worker subscriber
- object directory
- object manager
- plasma store client

Creating mocks based on interfaces for the above when necessary. Also removing the plasma store client + actual Plasma Store Evict message sending and handling codepath because it was unused and dead code. --------- Signed-off-by: dayshah <[email protected]>
Converts `test_reference_counting_*.py` to use shared Ray instance fixtures where possible. Those tests that require their own Ray instance are moved to `test_reference_counting_standalone.py`. --------- Signed-off-by: Edward Oakes <[email protected]>
…3241)
- Use `list_actors()` and `nodes()` public APIs
- Remove barely-used `new_port` utility

--------- Signed-off-by: Edward Oakes <[email protected]>
## Why are these changes needed?
This PR updates the actor pool operator to label actors with logical actor IDs. This change is needed so we can disambiguate actors from their labels (a toy sketch of the idea follows below).

--------- Signed-off-by: Balaji Veeramani <[email protected]>
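A toy sketch of logical-ID labeling (illustrative; the real operator lives in Ray Data's `actor_pool_map_operator`, and the class and names below are assumptions):

```python
import itertools

class LabeledActorPool:
    """Toy sketch: track pool actors by a stable logical ID."""

    _counter = itertools.count()

    def __init__(self):
        self._actors = {}  # logical actor ID -> actor handle

    def add_actor(self, make_actor):
        # The logical ID stays stable across restarts, so two physical
        # actors filling the same slot can be told apart by their label.
        logical_id = f"actor_{next(self._counter)}"
        self._actors[logical_id] = make_actor()
        return logical_id
```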
## Why are these changes needed?
Skip test_daft for pyarrow version >= 14. The test fails at deserialization because daft still references `pa.PyExtensionType`, which newer pyarrow no longer exposes:

```
[2025-05-21T17:20:28Z] python/ray/data/tests/test_daft.py::test_daft_round_trip FAILED [100%]
ray.exceptions.RaySystemError: System error: module 'pyarrow' has no attribute 'PyExtensionType'
traceback: Traceback (most recent call last):
  File "/rayci/python/ray/_private/serialization.py", line 460, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/rayci/python/ray/_private/serialization.py", line 317, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/rayci/python/ray/_private/serialization.py", line 272, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/rayci/python/ray/_private/serialization.py", line 260, in _deserialize_pickle5_data
    obj = pickle.loads(in_band, buffers=buffers)
  File "/opt/miniconda/lib/python3.9/site-packages/daft/series.py", line 36, in from_arrow
    if DataType.from_arrow_type(array.type) == DataType.python():
  File "/opt/miniconda/lib/python3.9/site-packages/daft/datatype.py", line 457, in from_arrow_type
    elif isinstance(arrow_type, pa.PyExtensionType):
  File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 63, in __getattr__
    raise e
  File "/opt/miniconda/lib/python3.9/site-packages/daft/lazy_import.py", line 53, in __getattr__
    return getattr(self._load_module(), name)
AttributeError: module 'pyarrow' has no attribute 'PyExtensionType'
=========================== short test summary info ============================
FAILED python/ray/data/tests/test_daft.py::test_daft_round_trip - ray.excepti...
============================== 1 failed in 10.92s ==============================
```

## Related issue number
#53278

--------- Signed-off-by: Srinath Krishnamachari <[email protected]> Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: Balaji Veeramani <[email protected]>
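A hedged sketch of the skip condition (the marker in the actual test file may differ; the version bound comes from the commit title):

```python
import pytest
from packaging.version import parse as parse_version

import pyarrow as pa

# daft's DataType.from_arrow_type touches pa.PyExtensionType, which newer
# pyarrow no longer exposes (see the traceback above).
@pytest.mark.skipif(
    parse_version(pa.__version__) >= parse_version("14.0.0"),
    reason="daft references pa.PyExtensionType, unavailable in pyarrow >= 14",
)
def test_daft_round_trip(ray_start):
    ...
```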
## Why are these changes needed?
`test_streaming_fault_tolerance` sometimes raises an `AssertionError` because the actor state isn't RESTARTING as we expect. This PR adds more information to the assertion so the test is easier to debug.

```
[2025-05-23T16:40:02Z] File "/rayci/python/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 424, in update_resource_usage
  assert actor_state == gcs_pb2.ActorTableData.ActorState.RESTARTING
AssertionError
=========================== short test summary info ============================
FAILED python/ray/data/tests/test_streaming_integration.py::test_streaming_fault_tolerance
============= 1 failed, 15 passed, 1 skipped in 149.79s (0:02:29) ==============
```

Signed-off-by: Balaji Veeramani <[email protected]>
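A sketch of what a more informative assertion might look like (illustrative; the actual change may word the message differently):

```python
def assert_restarting(actor_state, actor_id, ActorState):
    """ActorState is the protobuf enum wrapper, e.g.
    gcs_pb2.ActorTableData.ActorState."""
    assert actor_state == ActorState.RESTARTING, (
        f"expected actor {actor_id} to be RESTARTING, "
        f"got {ActorState.Name(actor_state)}"  # enum name, not the raw int
    )
```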
## Why are these changes needed?
`test_e2e_autoscaling_up` launches actor tasks that block until the actor pool has launched 6 actors. If it takes longer than 10s for the actor pool to launch 6 actors, the test fails. Since the actor pool can sometimes launch new actors slowly, this test can fail non-deterministically. To mitigate this issue, this PR decreases the number of actors the test launches to the minimum (2).

--------- Signed-off-by: Balaji Veeramani <[email protected]>
## Why are these changes needed?
`test_groupby_multi_agg_with_nans` occasionally fails due to slight differences in floating point values. To make this unit test less brittle, this PR updates the test to compare with `pytest.approx` (see the sketch below).

```
> assert _round_to_14_digits(expected_row) == _round_to_14_digits(result_row)
E AssertionError: assert {'max_a': 49,...a': -0.5, ...} == {'max_a': 49,...a': -0.5, ...}
E   Omitting 5 identical items, use -vv to show
E   Differing items:
E   {'std_a': 29.01149197588202} != {'std_a': 29.01149197588201}
E   Full diff:
E   {
E    'max_a': 49,
E    'mean_a': -0.5,...
E
E   ...Full output truncated (8 lines hidden), use '-vv' to show
```

Signed-off-by: Balaji Veeramani <[email protected]>
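A minimal sketch of the tolerant comparison (row contents below are illustrative; `pytest.approx` compares numeric dict values within a relative tolerance):

```python
import pytest

expected_row = {"max_a": 49, "mean_a": -0.5, "std_a": 29.01149197588201}
result_row = {"max_a": 49, "mean_a": -0.5, "std_a": 29.01149197588202}

# The last-digit float difference above no longer fails the test.
assert result_row == pytest.approx(expected_row)
```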
and use bazel run to upload stuff Signed-off-by: Lonnie Liu <[email protected]>
Follow-up for #52712 Signed-off-by: Chi-Sheng Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
we upgraded the machine and releases Signed-off-by: Lonnie Liu <[email protected]>
Saw this flake on [postmerge](https://buildkite.com/ray-project/postmerge/builds/10702#01975572-3a26-4e0b-b27d-0869ac5830fe/177-1191). Cleaned up the test in general: - Remove sleep conditions. - Remove use of direct gRPC connection to raylet. - Remove runtime_env test that was redundant with `test_runtime_env_env_vars.py`. Runtime decreased by ~50% locally, from `62.98s` to `33.48s`. --------- Signed-off-by: Edward Oakes <[email protected]>
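The usual replacement for fixed sleeps is condition polling; a minimal sketch of the pattern (illustrative, not the test's actual helper):

```python
import time

def wait_for_condition(predicate, timeout_s=30, interval_s=0.1):
    # Poll until the predicate holds instead of sleeping a fixed amount:
    # faster on the happy path and less flaky on slow CI machines.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval_s)
    raise TimeoutError("condition not met within timeout")
```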
These have been deprecated/ignored for a long time and are polluting the help string. --------- Signed-off-by: Edward Oakes <[email protected]>
… Data (#53220) --------- Signed-off-by: Timothy Seah <[email protected]> Co-authored-by: angelinalg <[email protected]>
the option is only used for `manylinux1`. ray is not using manylinux1 to build things any more. --------- Signed-off-by: Gagandeep Singh <[email protected]>
) Signed-off-by: hipudding <[email protected]>
Adding back "Run on Anyscale" button after a Anyscale PR was merged to show the button on Ray docs but not Anyscale template previews Signed-off-by: Chris Zhang <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
…fig (#53681) The schema of the compute config that the KubeRay service takes in is currently a bit different from the schema of the cluster compute in release tests. This adds a helper function to convert the cluster compute into the KubeRay compute config that eventually gets sent to the KubeRay service. --------- Signed-off-by: kevin <[email protected]>
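A hedged sketch of the conversion's shape (the field names below are assumptions for illustration, not the real release-test or KubeRay schemas):

```python
def to_kuberay_compute_config(cluster_compute: dict) -> dict:
    """Map a release-test cluster compute dict onto a KubeRay-style config."""
    return {
        "head_node": {
            "instance_type": cluster_compute["head_node_type"]["instance_type"],
        },
        "worker_nodes": [
            {
                "instance_type": w["instance_type"],
                # Fall back to a fixed-size group when max_workers is absent.
                "min_nodes": w.get("min_workers", 0),
                "max_nodes": w.get("max_workers", w.get("min_workers", 0)),
            }
            for w in cluster_compute.get("worker_node_types", [])
        ],
    }
```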
…private to _common (#53652) Fixes #53478 --------- Signed-off-by: abrar <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: abrar <[email protected]>
Don't know for sure, but it seems like a race condition between when the cancel happens and the attempt to access the Ray task result causes a `RayTaskCancelled` exception. Used the repro script in the ticket to confirm that the issue is resolved (#53639). --------- Signed-off-by: abrar <[email protected]>
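The general shape of that race, as a minimal sketch (illustrative; the actual repro script is in the linked ticket):

```python
import time

import ray

@ray.remote
def long_task():
    time.sleep(60)

ref = long_task.remote()
ray.cancel(ref)
try:
    ray.get(ref)  # outcome depends on when the cancel lands
except ray.exceptions.TaskCancelledError:
    pass  # expected when the cancel wins the race
```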
- Add prefix `rayproject/ray` for all tags - Authorize docker with credentials from SSM - Mock authorize docker in unit test since it's not needed --------- Signed-off-by: kevin <[email protected]>
Even though there's no spilling happening, ray still logs "trying to spill." Add an early exit to avoid confusing logs. Closes #53086 Signed-off-by: tianyi-ge <[email protected]> Co-authored-by: Ibrahim Rabbani <[email protected]>
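A toy sketch of the early-exit guard (illustrative, not the actual spill manager code):

```python
def spill_objects_if_needed(bytes_pending_spill: int) -> None:
    if bytes_pending_spill <= 0:
        return  # nothing to spill; skip the confusing "trying to spill" log
    print(f"Trying to spill {bytes_pending_spill} bytes...")
```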
Unset, unused. --------- Signed-off-by: Edward Oakes <[email protected]>
The process exits ungracefully when you call `os._exit` on Windows: https://buildkite.com/ray-project/postmerge/builds/10736#01975889-d332-47e0-8290-284c63ec43b3/1834-1905 Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Nikhil Ghosh <[email protected]>
Was [deprecated](#51309) in Ray 2.44 along with Ray workflows. --------- Signed-off-by: Edward Oakes <[email protected]>
…TaskId (#53695)

```cpp
task_spec.ParentTaskId().Binary()
```

* `ParentTaskId()` deserializes binary to `TaskId`.
* `Binary()` serializes the `TaskId` back to binary, so the round trip is redundant.

Signed-off-by: Kai-Hsun Chen <[email protected]>
…53686) Fixes: #53478 --------- Signed-off-by: Nehil Jain <[email protected]>
…oo backend (#53319) Adds single-controller APIs (APIs that can be called from the driver) for creating collectives on a group of actors using `ray.util.collective`. These APIs are currently under `ray.experimental.collective` as they are experimental and to avoid potential conflicts with `ray.util.collective`. See `test_experimental_collective::test_api_basic` for API usage.
- create_collective_group
- destroy_collective_group
- get_collective_groups

Also adds a `ray.util.collective` backend based on torch.distributed gloo, for convenient testing on CPUs. While `ray.util.collective` has a pygloo backend, that backend requires pygloo to be installed, and pygloo doesn't seem to be supported on the latest versions of Python. --------- Signed-off-by: Stephanie wang <[email protected]> Signed-off-by: Stephanie Wang <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
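A hedged usage sketch pieced together from the API names above (the exact signatures live in `ray.experimental.collective` and `test_experimental_collective::test_api_basic`, and may differ; the backend name is an assumption):

```python
import ray
from ray.experimental import collective  # experimental module named in the PR

@ray.remote
class Worker:
    def ping(self):
        return "ok"

workers = [Worker.remote() for _ in range(2)]

# Driver-side ("single-controller") group management; signatures assumed.
group = collective.create_collective_group(workers, backend="torch_gloo")
# ... run ray.util.collective ops across the actors ...
collective.destroy_collective_group(group)
```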
Remove some OpenTelemetry code that was added in #51077. These files were added to test the compilation of the OpenTelemetry library, but we never ended up using them. Test: - CI Signed-off-by: can <[email protected]>
…t_ref` for small and non-GPU objects (#53692) This PR is based on #53630. See #53623 for the issue. In this PR, we clear the object ref when the arg's tensor transport is not OBJECT_STORE. Closes #53623 --------- Signed-off-by: Stephanie wang <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]> Signed-off-by: Kai-Hsun Chen <[email protected]> Co-authored-by: Stephanie wang <[email protected]> Co-authored-by: Stephanie Wang <[email protected]>
follow up of #53652 Signed-off-by: abrar <[email protected]>
Improve the string representation of `WorkerHealthCheckFailedError` to also include the base reason why the health check failed. Signed-off-by: Matthew Deng <[email protected]>
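A sketch of what including the base reason might look like (the class below is an illustrative stand-in, not Ray's actual implementation):

```python
class WorkerHealthCheckFailedError(Exception):
    """Illustrative stand-in for the real error class."""

    def __init__(self, message: str, failure: BaseException):
        self.failure = failure
        # Include the underlying reason so the string representation
        # explains *why* the health check failed, not just that it did.
        super().__init__(f"{message} Root cause: {failure!r}")
```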
…#53718) Signed-off-by: Timothy Seah <[email protected]> Co-authored-by: Timothy Seah <[email protected]>
``` REGRESSION 12.82%: tasks_per_second (THROUGHPUT) regresses from 221.2222291023174 to 192.87246715163326 in benchmarks/many_nodes.json REGRESSION 12.73%: actors_per_second (THROUGHPUT) regresses from 634.2824761754516 to 553.5098466276525 in benchmarks/many_actors.json REGRESSION 12.26%: client__get_calls (THROUGHPUT) regresses from 1160.5254002780266 to 1018.2939193917422 in microbenchmark.json REGRESSION 5.15%: multi_client_put_gigabytes (THROUGHPUT) regresses from 39.896743394372585 to 37.84234603653026 in microbenchmark.json REGRESSION 4.04%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.9480091293556955 to 0.909684480871914 in microbenchmark.json REGRESSION 3.72%: 1_n_actor_calls_async (THROUGHPUT) regresses from 8318.094433102775 to 8008.806358661164 in microbenchmark.json REGRESSION 3.01%: 1_1_actor_calls_sync (THROUGHPUT) regresses from 2020.4236901532247 to 1959.5608579309087 in microbenchmark.json REGRESSION 2.80%: n_n_async_actor_calls_async (THROUGHPUT) regresses from 23716.451989299432 to 23052.03512506016 in microbenchmark.json REGRESSION 2.71%: single_client_put_gigabytes (THROUGHPUT) regresses from 20.105537951105227 to 19.561225172916046 in microbenchmark.json REGRESSION 2.69%: pgs_per_second (THROUGHPUT) regresses from 13.650631601393242 to 13.282795863244178 in benchmarks/many_pgs.json REGRESSION 1.35%: single_client_tasks_async (THROUGHPUT) regresses from 8081.168521067462 to 7971.849053459262 in microbenchmark.json REGRESSION 1.31%: n_n_actor_calls_async (THROUGHPUT) regresses from 27465.39608393524 to 27105.63998087682 in microbenchmark.json REGRESSION 1.09%: client__tasks_and_put_batch (THROUGHPUT) regresses from 14569.862277318796 to 14411.155262801181 in microbenchmark.json REGRESSION 1.05%: 1_1_async_actor_calls_sync (THROUGHPUT) regresses from 1483.660979687764 to 1468.0999827232097 in microbenchmark.json REGRESSION 0.92%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 12.796724102063072 to 12.67868528378648 in microbenchmark.json REGRESSION 0.88%: placement_group_create/removal (THROUGHPUT) regresses from 768.9082534403586 to 762.110356621388 in microbenchmark.json REGRESSION 0.87%: single_client_tasks_sync (THROUGHPUT) regresses from 969.5757440611114 to 961.1131766783709 in microbenchmark.json REGRESSION 0.35%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1069.1602586173547 to 1065.4228066614364 in microbenchmark.json REGRESSION 0.23%: client__put_gigabytes (THROUGHPUT) regresses from 0.1529268174148042 to 0.1525808986433169 in microbenchmark.json REGRESSION 0.05%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 5113.112753017668 to 5110.344528620948 in microbenchmark.json REGRESSION 49.81%: dashboard_p99_latency_ms (LATENCY) regresses from 275.082 to 412.087 in benchmarks/many_pgs.json REGRESSION 37.19%: dashboard_p95_latency_ms (LATENCY) regresses from 6.696 to 9.186 in benchmarks/many_pgs.json REGRESSION 36.35%: dashboard_p95_latency_ms (LATENCY) regresses from 2283.949 to 3114.217 in benchmarks/many_actors.json REGRESSION 13.04%: dashboard_p99_latency_ms (LATENCY) regresses from 675.061 to 763.093 in benchmarks/many_tasks.json REGRESSION 11.46%: dashboard_p50_latency_ms (LATENCY) regresses from 3.856 to 4.298 in benchmarks/many_pgs.json REGRESSION 11.23%: dashboard_p95_latency_ms (LATENCY) regresses from 437.195 to 486.283 in benchmarks/many_tasks.json REGRESSION 8.97%: 107374182400_large_object_time (LATENCY) regresses from 29.323037406000026 to 31.951921509999977 in 
scalability/single_node.json REGRESSION 6.24%: avg_iteration_time (LATENCY) regresses from 1.1950538015365602 to 1.2696449542045594 in stress_tests/stress_test_dead_actors.json REGRESSION 5.86%: dashboard_p50_latency_ms (LATENCY) regresses from 8.293 to 8.779 in benchmarks/many_actors.json REGRESSION 2.91%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 12.241764013000008 to 12.597426240999994 in scalability/object_store.json REGRESSION 1.02%: avg_pg_remove_time_ms (LATENCY) regresses from 1.2291068678679091 to 1.2416502777781075 in stress_tests/stress_test_placement_group.json REGRESSION 0.57%: dashboard_p50_latency_ms (LATENCY) regresses from 5.658 to 5.69 in benchmarks/many_nodes.json REGRESSION 0.34%: 10000_args_time (LATENCY) regresses from 18.764070391999994 to 18.828636121000002 in scalability/single_node.json ``` Signed-off-by: Lonnie Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>
See Commits and Changes for more details.

Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor : )