Releases: ray-project/ray

Ray-2.8.1

01 Dec 23:08
3546c41

Release Highlights

The Ray 2.8.1 patch release contains fixes for the Ray Dashboard.

Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023

Ray Dashboard

🔨 Fixes:

[core][state][log] Cherry pick changes to prevent state API from reading files outside the Ray log directory (#41520)
[Dashboard] Migrate Logs page to use state api. (#41474) (#41522)
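
The first fix above restricts file reads to the Ray log directory. The following is a minimal sketch of that kind of containment check, not the code from #41520; the function name `resolve_inside` and the example paths are hypothetical.

```python
from pathlib import Path


def resolve_inside(log_dir: str, requested: str) -> Path:
    """Resolve `requested` against `log_dir` and reject anything that escapes it."""
    root = Path(log_dir).resolve()
    # Joining with an absolute `requested` discards `root`, and ".." segments
    # are collapsed by resolve(), so both cases are caught by the check below.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"{requested!r} is outside {root}")
    return candidate
```

A caller would open only the path returned by `resolve_inside`, so that inputs such as `../../etc/passwd` fail before any file is read.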

Ray-2.8.0

03 Nov 00:55
dd270c8

Release Highlights

This release features stability improvements and API clean-ups across the Ray libraries.

  • In Ray Serve, we are deprecating the previously experimental DAG API for deployment graphs. Model composition will be supported through deployment handles, providing more flexibility and stability. The previously deprecated Ray Serve 1.x APIs have also been removed. We’ve also added a new Java API that aligns with the Ray Serve 2.x APIs. More API changes are listed in the release notes below.
  • In RLlib, we’ve moved 24 algorithms into rllib_contrib (still available within RLlib for Ray 2.8).
  • We’ve added support for PyTorch-compatible input file shuffling for Ray Data. This allows users to randomly shuffle input files for better model training accuracy. This release also features new Ray Data datasources for Databricks and BigQuery.
  • On the Ray Dashboard, we’ve added new metrics for Ray Data in the Metrics tab. This allows users to monitor Ray Data workloads, including real-time metrics for cluster memory, CPU, GPU, output data size, etc. See the doc for more details.
  • Ray Core now supports profiling GPU tasks or actors using Nvidia Nsight. See the documentation for instructions.
  • We fixed 2 critical bugs raised by many KubeRay / ML library users: a child process leak from Ray workers that leaked GPU memory (#40182), and excessive job page loading time when a Ray HA cluster restarts a head node (#40742).
  • Python 3.7 support is officially deprecated from Ray.

Ray Libraries

Ray Data

🎉 New Features:

  • Add support for shuffling input files (#40154)
  • Support streaming read of PyTorch dataset (#39554)
  • Add BigQuery datasource (#37380)
  • Add Databricks table / SQL datasource (#39852)
  • Add inverse transform functionality to LabelEncoder (#37785)
  • Add function arg params to Dataset.map and Dataset.flat_map (#40010)

💫Enhancements:

  • Hard deprecate DatasetPipeline (#40129)
  • Remove BulkExecutor code path (#40200)
  • Deprecate extraneous Dataset parameters and methods (#40385)
  • Remove legacy iteration code path (#40013)
  • Implement streaming output backpressure (#40387)
  • Cap op concurrency with exponential ramp-up (#40275)
  • Store ray dashboard metrics in _StatsActor (#40118)
  • Slice output blocks to respect target block size (#40248)
  • Drop columns before grouping by in Dataset.unique() (#40016)
  • Standardize physical operator runtime metrics (#40173)
  • Estimate blocks for limit and union operator (#40072)
  • Store bytes spilled/restored after plan execution (#39361)
  • Optimize sample_boundaries in SortTaskSpec (#39581)
  • Optimization to reduce ArrowBlock building time for blocks of size 1 (#38833)
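
To illustrate the idea behind "Cap op concurrency with exponential ramp-up" (#40275), here is a minimal sketch of a per-operator concurrency cap that doubles as tasks complete. The class name `ConcurrencyCap` and the constants (initial cap 4, multiplier 2, threshold 0.5) are assumptions for illustration, not values taken from the Ray Data implementation.

```python
class ConcurrencyCap:
    """Hypothetical cap on in-flight tasks that ramps up exponentially."""

    def __init__(self, initial_cap: int = 4, multiplier: float = 2.0,
                 threshold: float = 0.5) -> None:
        self.cap = initial_cap
        self.multiplier = multiplier
        self.threshold = threshold
        self.running = 0
        self.finished_at_cap = 0  # tasks finished since the cap last changed

    def can_launch(self) -> bool:
        return self.running < self.cap

    def on_task_started(self) -> None:
        self.running += 1

    def on_task_finished(self) -> None:
        self.running -= 1
        self.finished_at_cap += 1
        # Once enough tasks have finished at the current level, raise the cap.
        if self.finished_at_cap >= self.cap * self.threshold:
            self.cap = int(self.cap * self.multiplier)
            self.finished_at_cap = 0
```

Starting low and doubling keeps a new operator from flooding the cluster before it has shown that its tasks complete, while still reaching high concurrency after a few rounds.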

🔨 Fixes:

  • Fix bug where _StatsActor errors with PandasBlock (#40481)
  • Remove deprecated do_write (#40422)
  • Improve error message when reading HTTP files (#40462)
  • Add flag to skip get_object_locations for metrics (#39884)
  • Fall back to fetch files info in parallel for multiple directories (#39592)
  • Replace deprecated .pieces with updated .fragments (#39523)
  • Backwards compatibility for Preprocessor that have been fit in older versions (#39173)
  • Removing unnecessary data copy in convert_udf_returns_to_numpy (#39188)
  • Do not eagerly free root RefBundles (#39016)

📖Documentation:

  • Remove out-of-date Data examples (#40127)
  • Remove unused and outdated source examples (#40271)

Ray Train

🎉 New Features:

  • Add initial support for scheduling workers on neuron_cores (#39091)

💫Enhancements:

  • Update PyTorch Lightning import path to support both pytorch_lightning and lightning (#39841, #40266)
  • Propagate driver DataContext to RayTrainWorkers (#40116)

🔨 Fixes:

  • Fix error propagation for as_directory if to_directory fails (#40025)

📖Documentation:

  • Update checkpoint hierarchy documentation for RayTrainReportCallbacks. (#40174)
  • Update Lightning RayDDPStrategy docstring (#40376)

🏗 Architecture refactoring:

  • Deprecate LightningTrainer, AccelerateTrainer, TransformersTrainer (#40163)
  • Clean up legacy persistence mode code paths (#39921, #40061, #40069, #40168)
  • Deprecate legacy DatasetConfig (#39963)
  • Remove references to DatasetPipeline (#40159)
  • Enable isort (#40172)

Ray Tune

💫Enhancements:

Ray Serve

💫Enhancements:

  • The single-app configuration format for the Serve Config (i.e. the Serve Config without the ‘applications’ field) has been deprecated in favor of the new configuration format.
    Both single-app configuration and DAG API will be removed in 2.9.
  • The Serve REST API is now accessible through the dashboard port, which defaults to 8265.
  • Accessing the Serve REST API through the dashboard agent port (default 52365) is deprecated. The support will be removed in a future version.
  • Ray job error tracebacks are now logged in the job driver log for easier access when jobs fail during start up.
  • Deprecated single-application config file
  • Deprecated DAG API: InputNode and DAGDriver
  • Removed deprecated Deployment 1.x APIs: Deployment.deploy(), Deployment.delete(), Deployment.get_handle()
  • Removed deprecated 1.x API: serve.get_deployment and serve.list_deployments
  • New Java API supported (aligns with Ray Serve 2.x API)
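
As a sketch of what the port change means for clients, the snippet below builds a Serve REST API URL against the dashboard port (default 8265) instead of the deprecated dashboard agent port (default 52365). The `/api/serve/applications/` path and the helper names are assumptions here; check the Serve REST API reference for your Ray version before relying on them.

```python
import json
from urllib.request import urlopen

DASHBOARD_PORT = 8265  # default dashboard port, per the notes above


def serve_applications_url(host: str = "localhost",
                           port: int = DASHBOARD_PORT) -> str:
    # The endpoint path is an assumption; confirm it in the Serve REST API docs.
    return f"http://{host}:{port}/api/serve/applications/"


def fetch_serve_status(host: str = "localhost", port: int = DASHBOARD_PORT,
                       timeout_s: float = 5.0) -> dict:
    """Fetch the Serve application status as a dict (requires a running cluster)."""
    with urlopen(serve_applications_url(host, port), timeout=timeout_s) as resp:
        return json.load(resp)
```

Scripts that still target port 52365 would only need the port argument changed, assuming the request path stays the same.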

🔨 Fixes:

  • The dedicated_cpu and detached options in serve.start() have been fully disallowed.
  • An error is now raised early when users pass invalid gRPC service functions.
  • The proxy’s readiness check now uses a linear backoff to avoid getting stuck in an infinite loop if it takes longer than usual to start.
  • grpc_options on serve.start() only allowed a gRPCOptions object in Ray 2.7.0. Dictionaries can now be passed as grpc_options in the serve.start() call.
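
The linear backoff mentioned for the proxy readiness check can be sketched as follows. This is an illustration of the general technique only; the function name, the delay values, and the bounded number of attempts are assumptions, not details of the Serve proxy code.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     initial_delay_s: float = 1.0,
                     step_s: float = 1.0,
                     max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check`, waiting a little longer (linearly) after each failure."""
    for attempt in range(max_attempts):
        if check():
            return True
        if attempt < max_attempts - 1:
            # Linear backoff: the delay grows by a fixed step on each attempt.
            sleep(initial_delay_s + attempt * step_s)
    return False
```

Because the number of attempts is bounded, a slow start ends in a `False` result that the caller can handle instead of looping forever.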

RLlib

💫Enhancements:

  • rllib_contrib algorithms (A2C, A3C, AlphaStar #36584, AlphaZero #36736, ApexDDPG #36596, ApexDQN #36591, ARS #36607, Bandits #36612, CRR #36616, DDPG, DDPPO #36620, Dreamer(V1), DT #36623, ES #36625, LeelaChessZero #36627, MA-DDPG #36628, MAML, MB-MPO #36662, PG #36666, QMix #36682, R2D2, SimpleQ #36688, SlateQ #36710, and TD3 #36726) all produce warnings now if used. See here for more information on the rllib_contrib efforts. (36620, 36628, 3
  • Provide msgpack checkpoint translation utility to convert checkpoint into msgpack format for being able to move in between python versions (#38825).

🔨 Fixes:

  • Issue 35440 (JSON output writer should include INFOS #39632)
  • Issue 39453 (PettingZoo wrappers should use correct multi-agent dict spaces #39459)
  • Issue 39421 (Multi-discrete action spaces not supported in new stack #39534)
  • Issue 39234 (Multi-categorical distribution bug #39464)
    #39654, #35975, #39552, #38555

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Python 3.7 support is officially deprecated from Ray.
  • Supports profiling GPU tasks or actors using Nvidia Nsight. See the doc for instructions.
  • Ray on spark autoscaling is officially supported from Ray 2.8. See the REP for more details.

💫Enhancements:

  • Detailed IDLE node information is available from ray status -v (#39638)
  • Adding a new accelerator to Ray is simplified with a new accelerator interface. See the in-flight REP for more details (#40286).
  • typing_extensions is no longer a required dependency because Python 3.7 support is deprecated. (#40336)
  • Ray state API supports case insensitive match. (#34577)
  • ray start --runtime-env-agent-port is officially supported. (#39919)
  • Driver exit code is available from job info (#39675)

🔨 Fixes:

  • Fixed a worker leak when Ray is used with placement group because Ray didn’t handle SIGTERM properly (#40182)
  • Fixed an issue where the job page takes a very long time to load when a Ray HA cluster restarts a head node (#40431)
  • [core] loosen the check on release object (#39570)
  • [Core] ray init sigterm (#39816)
  • [Core] Non Unit Instance fractional value fix (#39293)
  • [Core]: Enable get_actor_name for actor runtime context (#39347)
  • [core][streaming][python] Fix asyncio.wait coroutines args deprecated warnings #40292

📖Documentation:

Ray Clusters

💫Enhancements:

  • Enable GPU support for vSphere cluster launcher (#40667)

📖Documentation:

  • Setup RBAC by KubeRay Helm chart
  • KubeRay upgrade documentation
  • RayService high availability

🔨 Fixes:

Dashboard

🎉 New Features:

  • New metrics for Ray Data can be found in the Metrics tab.

🔨 Fixes:

  • Fix bug where download log button did not download all logs for actors.

Thanks

Many thanks to all who contribute...

Ray-2.7.1

09 Oct 17:47
9f07c12

Release Highlights

  • Ray Serve:
    • Added an application tag to the ray_serve_num_http_error_requests metric
    • Fixed a bug where no data shows up on the Error QPS per Application panel in the Ray Dashboard
  • RLlib:
    • DreamerV3: Bug fix enabling support for continuous actions.
  • Ray Train:
    • Fix a bug where setting a local storage path on Windows errors (#39951)
  • Ray Tune:
    • Fix a broken Trial.node_ip property (#40028)
  • Ray Core:
    • Fixed a segfault when a streaming generator and actor cancellation are used together
    • Fixed the autoscaler SDK accidentally initializing a Ray worker, which led to a leaked driver showing up in the dashboard.
    • Added a new user guide and fixes for the vSphere cluster launcher.
    • Fixed a bug where ray start would occasionally fail with ValueError: acceleratorType should match v(generation)-(cores/chips).
  • Dashboard:
    • Improvement on cluster page UI
    • Fixed a bug that caused the overview page UI to crash

Ray Libraries

Ray Serve

🔨 Fixes:

  • Fixed a bug where no data shows up on the Error QPS per Application panel in the Ray Dashboard

RLlib

🔨 Fixes:

  • DreamerV3: Bug fix enabling support for continuous actions (39751).

Ray Core and Ray Clusters

🔨 Fixes:

  • Fixed Ray cluster stability on a high latency environment

Thanks

Many thanks to all those who contributed to this release!

@chaowanggg, @allenwang28, @shrekris-anyscale, @GeneDer, @justinvyu, @can-anyscale, @edoakes, @architkulkarni, @rkooo567, @rynewang, @rickyyx, @sven1977

Ray-2.7.0

17 Sep 18:13
b4bba47

Release Highlights

The Ray 2.7 release brings important stability improvements and enhancements to Ray libraries, with Ray Train and Ray Serve becoming generally available. Ray 2.7 is accompanied by a GA release of KubeRay.

  • Following user feedback, we are rebranding “Ray AI Runtime (AIR)” to “Ray AI Libraries”. Without reducing any of the underlying functionality of the original Ray AI runtime vision as put forth in Ray 2.0, the underlying namespace (ray.air) is consolidated into ray.data, ray.train, and ray.tune. This change reduces the friction for new machine learning (ML) practitioners to quickly understand and leverage Ray for their production machine learning use cases.
  • With this release, Ray Serve and Ray Train’s PyTorch support are becoming Generally Available -- indicating that the core APIs have been marked stable and that both libraries have undergone significant production hardening.
  • In Ray Serve, we are introducing a new backwards-compatible DeploymentHandle API to unify various existing Handle APIs, a high-performance gRPC proxy to serve gRPC requests through Ray Serve, along with various stability and usability improvements.
  • In Ray Train, we are consolidating various PyTorch-based trainers into the TorchTrainer, reducing the amount of refactoring work new users need to scale existing training scripts. We are also introducing a new train.Checkpoint API, which provides a consolidated way of interacting with remote and local storage, along with various stability and usability improvements.
  • In Ray Core, we’ve added initial integrations with TPUs and AWS accelerators, enabling Ray to natively detect these devices and schedule tasks/actors onto them. Ray Core also officially now supports actor task cancellation and has an experimental streaming generator that supports streaming response to the caller.

Take a look at our refreshed documentation and the Ray 2.7 migration guide and let us know your feedback!

Ray Libraries

Ray AIR

🏗 Architecture refactoring:

Ray Data

🎉 New Features:

  • In this release, we’ve integrated the Ray Core streaming generator API by default, which allows us to reduce memory footprint throughout the data pipeline (#37736).
  • Avoid unnecessary data buffering between Read and Map operator (zero-copy fusion) (#38789)
  • Add Dataset.write_images to write images (#38228)
  • Add Dataset.write_sql() to write SQL databases (#38544)
  • Support sort on multiple keys (#37124)
  • Support reading and writing JSONL file format (#37637)
  • Support class constructor args for Dataset.map() and flat_map() (#38606)
  • Implement streamed read from Hugging Face Dataset (#38432)

💫Enhancements:

  • Read data with multi-threading for FileBasedDataSource (#39493)
  • Optimization to reduce ArrowBlock building time for blocks of size 1 (#38988)
  • Add partition_filter parameter to read_parquet (#38479)
  • Apply limit to Dataset.take() and related methods (#38677)
  • Postpone reader.get_read_tasks until execution (#38373)
  • Lazily construct metadata providers (#38198)
  • Support writing each block to a separate file (#37986)
  • Make iter_batches an Iterable (#37881)
  • Remove default limit on Dataset.to_pandas() (#37420)
  • Add Dataset.to_dask() parameter to toggle consistent metadata check (#37163)
  • Add Datasource.on_write_start (#38298)
  • Remove support for DatasetDict as input into from_huggingface() (#37555)

🔨 Fixes:

  • Backwards compatibility for Preprocessor that have been fit in older versions (#39488)
  • Do not eagerly free root RefBundles (#39085)
  • Retry open files with exponential backoff (#38773)
  • Avoid passing local_uri to all non-Parquet data sources (#38719)
  • Add ctx parameter to Datasource.write (#38688)
  • Preserve block format on map_batches over empty blocks (#38161)
  • Fix args and kwargs passed to ActorPool map_batches (#38110)
  • Add tif file extension to ImageDatasource (#38129)
  • Raise error if PIL can't load image (#38030)
  • Allow automatic handling of string features as byte features during TFRecord serialization (#37995)
  • Remove unnecessary file system wrapping (#38299)
  • Remove _block_udf from FileBasedDatasource reads (#38111)
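
The "Retry open files with exponential backoff" fix (#38773) follows a common pattern, sketched below under stated assumptions: the helper name `open_with_retry`, the retried exception type (`OSError`), and the delay values are all hypothetical and are not taken from the Ray Data source. Jitter is omitted to keep the sketch deterministic.

```python
import time
from typing import Any, Callable


def open_with_retry(open_fn: Callable[[str], Any], path: str,
                    max_attempts: int = 5,
                    base_delay_s: float = 1.0,
                    max_delay_s: float = 32.0,
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `open_fn(path)`, doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return open_fn(path)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay_s.
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
```

Doubling the delay gives a throttled or briefly unavailable storage backend progressively more time to recover, and the cap keeps any single wait bounded.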

📖Documentation:

Ray Train

🤝 API Changes

💫Enhancements:

  • Various improvements and fixes for the console output of Ray Train and Tune (#37572, #37571, #37570, #37569, #37531, #36993)
  • Raise actionable error message for missing dependencies (#38497)
  • Use posix paths throughout library code (#38319)
  • Group consecutive workers by IP (#38490)
  • Split all Ray Datasets by default (#38694)
  • Add static Trainer methods for getting tree-based models (#38344)
  • Don't set rank-specific local directories for Train workers (#38007)

🔨 Fixes:

  • Fix trainer restoration from S3 (#38251)

🏗 Architecture refactoring:

📖Documentation:

Ray Tune

🤝 API Changes

💫Enhancements:

🔨 Fixes:

🏗 Architecture refactoring:

Ray Serve

🎉 New Features:

  • Added keep_alive_timeout_s to Serve config file to allow users to configure HTTP proxy’s duration to keep idle connections alive when no requests are ongoing.
  • Added gRPC proxy to serve gRPC requests through Ray Serve. It comes with feature parity with HTTP while offering better performance. Also, replaces the previous experimental gRPC direct ingress.
  • Ray 2.7 introduces a new DeploymentHandle API that will replace the existing RayServeHandle and RayServeSyncHandle APIs in a future release. You are encoura...

Ray-2.6.3

15 Aug 16:40
8a434b4

The Ray 2.6.3 patch release contains fixes for Ray Serve, and Ray Core streaming generators.

Ray Core

🔨 Fixes:

  • [Core][Streaming Generator] Fix memory leak from the end of object stream object #38152 (#38206)

Ray Serve

🔨 Fixes:

  • [Serve] Fix serve run help message (#37859) (#38018)
  • [Serve] Decrement ray_serve_deployment_queued_queries when client disconnects (#37965) (#38020)

RLlib

📖 Documentation:

Ray-2.6.2

03 Aug 23:54
f79203d

The Ray 2.6.2 patch release contains a critical fix for Ray's logging setup, as well as fixes for Ray Serve, Ray Data, and Ray Job.

Ray Core

🔨 Fixes:

  • [Core] Pass logs through if sphinx-doctest is running (#36306) (#37879)
  • [cluster-launcher] Pick GCP cluster launcher tests and fix (#37797)

Ray Serve

🔨 Fixes:

  • [Serve] Apply request_timeout_s from Serve config to the cluster (#37884) (#37903)

Ray Air

🔨 Fixes:

Ray-2.6.1

24 Jul 05:07
d68bf04

The Ray 2.6.1 patch release contains a critical fix for the cluster launcher and a compatibility update for the Ray Serve protobuf definition with Python 3.11, as well as doc improvements.

⚠️ Cluster launcher in Ray 2.6.0 fails to start multi-node clusters. Please update to 2.6.1 if you plan to use 2.6.0 cluster launcher.

Ray Core

🔨 Fixes:

  • [core][autoscaler] Fix env variable overwrite not able to be used if the command itself uses the env #37675

Ray Serve

🔨 Fixes:

  • [serve] Cherry-pick Serve enum to_proto fixes for Python 3.11 #37660

Ray Air

📖Documentation:

  • [air][doc] Update docs to reflect head node syncing deprecation #37475

Ray-2.6.0

21 Jul 02:53
0db82e3

Release Highlights

  • Serve: Better streaming support -- In this release, support for HTTP streaming responses and WebSockets is now on by default. Also, @serve.batch-decorated methods can stream responses.
  • Train and Tune: Users are now expected to provide cloud storage or NFS path for distributed training or tuning jobs instead of a local path. This means that results written to different worker machines will not be directly synced to the head node. Instead, this will raise an error telling you to switch to one of the recommended alternatives: cloud storage or NFS. Please see #37177 if you have questions.
  • Data: We are introducing a new streaming integration of Ray Data and Ray Train. This allows streaming data ingestion for model training, and enables per-epoch data preprocessing. The DatasetPipeline API is also being deprecated in favor of Dataset with streaming execution.
  • RLlib: Public alpha release for the new multi-gpu Learner API that is less complex and more powerful compared to our previous solution (blogpost). This is used under PPO algorithm by default.

Ray Libraries

Ray AIR

🎉 New Features:

  • Added support for restoring Results from local trial directories. (#35406)

💫 Enhancements:

🔨 Fixes:

  • Pass on KMS-related kwargs for s3fs (#35938)
  • Fix infinite recursion in log redirection (#36644)
  • Remove temporary checkpoint directories after restore (#37173)
  • Removed actors that haven't been started shouldn't be tracked (#36020)
  • Fix bug in execution for actor re-use (#36951)
  • Cancel pg.ready() task for pending trials that end up reusing an actor (#35748)
  • Add case for Dict[str, np.array] batches in DummyTrainer read bytes calculation (#36484)

📖 Documentation:

  • Remove experimental features page, add github issue instead (#36950)
  • Fix batch format in dreambooth example (#37102)
  • Fix Checkpoint.from_checkpoint docstring (#35793)

🏗 Architecture refactoring:

  • Remove deprecated mlflow and wandb integrations (#36860, #36899)
  • Move constants from tune/results.py to air/constants.py (#35404)
  • Clean up a few checkpoint related things. (#35321)

Ray Data

🎉 New Features:

  • New streaming integration of Ray Data and Ray Train. This allows streaming data ingestion for model training, and enables per-epoch data preprocessing. (#35236)
  • Enable execution optimizer by default (#36294, #35648, #35621, #35952)
  • Deprecate DatasetPipeline (#35753)
  • Add Dataset.unique() (#36655, #36802)
  • Add option for parallelizing post-collation data batch operations in DataIterator.iter_batches() (#36842) (#37260)
  • Enforce strict mode batch format for DataIterator.iter_batches() (#36686)
  • Remove ray.data.range_arrow() (#35756)

💫 Enhancements:

  • Optimize block prefetching (#35568)
  • Enable isort for data directory (#35836)
  • Skip writing a file for an empty block in Dataset.write_datasource() (#36134)
  • Remove shutdown logging from StreamingExecutor (#36408)
  • Spread map task stages by default for arg size <50MB (#36290)
  • Read->SplitBlocks to ensure requested read parallelism is always met (#36352)
  • Support partial execution in Dataset.schema() with new execution plan optimizer (#36740)
  • Propagate iter stats for Dataset.streaming_split() (#36908)
  • Cache the computed schema to avoid re-executing (#37103)

🔨 Fixes:

  • Support sub-progress bars on AllToAllOperators with optimizer enabled (#34997)
  • Fix DataContext not propagated properly for Dataset.streaming_split() operator
  • Fix edge case in empty bundles with Dataset.streaming_split() (#36039)
  • Apply Arrow table indices mapping on HuggingFace Dataset prior to reading into Ray Data (#36141)
  • Fix issues with combining use of Dataset.materialize() and Dataset.streaming_split() (#36092)
  • Fix quadratic slowdown when locally shuffling tensor extension types (#36102)
  • Make sure progress bars always finish at 100% (#36679)
  • Fix wrong output order of Dataset.streaming_split() (#36919)
  • Fix the issue that StreamingExecutor is not shutdown when the iterator is not fully consumed (#36933)
  • Calculate stage execution time in StageStatsSummary from BlockMetadata (#37119)

📖 Documentation:

  • Standardize Data API ref (#36432, #36937)
  • Docs for working with PyTorch (#36880)
  • Split "Consuming data" guide (#36121)
  • Revise "Loading data" (#36144)
  • Consolidate Data user guides (#36439)

🏗 Architecture refactoring:

  • Remove simple blocks representation (#36477)

Ray Train

🎉 New Features:

  • LightningTrainer support DeepSpeedStrategy (#36165)

💫 Enhancements:

  • Unify Lightning and AIR CheckpointConfig (#36368)
  • Add support for custom pipeline class in TransformersPredictor (#36494)

🔨 Fixes:

  • Fix Deepspeed device ranks check in Lightning 2.0.5 (#37387)
  • Clear stale lazy checkpointing markers on all workers. (#36291)

📖 Documentation:

  • Migrate Ray Train code-block to testcode. (#36483)

🏗 Architecture refactoring:

Ray Tune

🔨 Fixes:

  • Optuna: Update distributions to use new APIs (#36704)
  • BOHB: Fix nested bracket processing (#36568)
  • Hyperband: Fix scheduler raising an error for good PENDING trials (#35338)
  • Fix param space placeholder injection for numpy/pandas objects (#35763)
  • Fix result restoration with Ray Client (#35742)
  • Fix trial runner/controller whitelist attributes (#35769)

📖 Documentation:

  • Remove missing example from Tune "Other examples" (#36691)

🏗 Architecture refactoring:

  • Remove tune/automl (#35557)
  • Remove hard-deprecated modules from structure refactor (#36984)
  • Remove deprecated mlflow and wandb integrations (#36860, #36899)
  • Move constants from tune/results.py to air/constants.py (#35404)
  • Deprecate redundant syncing related parameters (#36900)
  • Deprecate legacy modules in ray.tune.integration (#35160)

Ray Serve

💫 Enhancements:

  • Support for HTTP streaming response and WebSockets is now on by default.
  • @serve.batch-decorated methods can stream responses.
  • @serve.batch settings can be reconfigured dynamically.
  • Ray Serve now uses “power of two random choices” routing. This improves enforcement of max_concurrent_queries and tail latencies under load.
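
For readers unfamiliar with "power of two random choices" routing, here is a minimal sketch of the technique: sample two replicas at random and send the request to the less loaded one. The function name `pick_replica` and the in-memory `ongoing` map are hypothetical; this is not the Serve router's implementation.

```python
import random
from typing import Dict, Optional


def pick_replica(ongoing: Dict[str, int], max_concurrent_queries: int,
                 rng=random) -> Optional[str]:
    """Sample up to two replicas and return the less loaded one with capacity."""
    sampled = rng.sample(list(ongoing), k=min(2, len(ongoing)))
    # Respect the per-replica limit: skip replicas that are already full.
    candidates = [r for r in sampled if ongoing[r] < max_concurrent_queries]
    if not candidates:
        return None  # the caller would back off and try again
    return min(candidates, key=ongoing.__getitem__)
```

Comparing only two candidates keeps each routing decision cheap while still steering traffic away from the busiest replicas, which is the property that helps tail latency under load.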

🔨 Fixes:

  • Fixed a bug that prevented using a custom module named “utils”.
  • Fixed a Serve downscaling issue by adding a new draining state to the HTTP proxy. This keeps HTTP proxies from taking new requests when there are no replicas on the node and prevents interruption of ongoing requests when the node is downscaled. It also allows downscaling to proceed when requests use Ray’s object store, which previously blocked downscaling of the node.
  • Fixed non-atomic shutdown logic. Serve shutdown now runs in the background and does not require the client to wait for it to complete, and it is not interrupted when the client is force-killed.

RLlib

🎉 New Features:

  • Public alpha release for the new multi-gpu Learner API that is less complex and more powerful than the old training stack (blogpost). This is used under PPO algorithm by default.
  • Added RNN support on the new RLModule API
  • Added TF-version of DreamerV3 (link). The comprehensive results will be published soon.
  • Added support for torch 2.x compile method in sampling from environment

💫 Enhancements:

  • Added an Example on how to do pretraining with BC and then continuing finetuning with PPO (example)
  • RLlib deprecation Notices (algorithm/, evaluation/, execution/, models/jax/) (#36826)
  • Enable eager_tracing=True by default. (#36556)

🔨 Fixes:

  • Fix bug in Multi-Categorical distribution. It should use logp and not log_p. (#36814)
  • Fix LSTM + Connector bug: StateBuffer restarting states on every in_eval() call. (#36774)

🏗 Architecture refactoring:

  • Multi-GPU Learner API

Ray Core

🎉 New Features:

  • [Core][Streaming Generator] Cpp interfaces and implementation (#35291)
  • [Core][Streaming Generator] Streaming Generator. Support Core worker APIs + cython generator interface. (#35324)
  • [Core][Streaming Generator] Streaming Generator. E2e integration (#35325)
  • [Core][Streaming Generator] Support async actor and async generator interface. (#35584)
  • [Core][Streaming Generator] Streaming Generator. Support the basic retry/lineage reconstruction (#35768)
  • [Core][Streaming Generator] Allow to raise an exception to avoid check failures. (#35766)
  • [Core][Streaming Generator] Fix a reference leak when a stream is deleted with out of order writes. (#35591)
  • [Core][Streaming Generator] Fix a reference leak when pinning requests are received after refs are consumed. (#35712)
  • [Core][Streaming Generator] Handle out of order report when retry (#36069)
  • [Core][Streaming Generator] Make it compatible with wait (#36071)
  • [Core][Streaming Generator] Remove busy waiting (#36070)
  • [Core][Autoscaler v2] add test for ...

Ray-2.5.1

21 Jun 18:09
a03efd9

The Ray 2.5.1 patch release adds wheels for macOS for Python 3.11.
It also contains fixes for multiple components, along with fixes for our documentation.

Ray Train

🔨 Fixes:

  • Don't error on eventual success when running with auto-recovery (#36266)

Ray Core

🎉 New Features:

  • Build Python wheels on Mac OS for Python 3.11 (#36373)

🔨 Fixes:

  • [Autoscaler] Fix a bug that can cause undefined behavior when clusters attempt to scale up aggressively. (#36241)
  • Fix mypy error: Module "ray" does not explicitly export attribute "remote" (#36356)

Ray-2.5.0

08 Jun 16:59
586c376

The Ray 2.5 release focuses on a number of enhancements and improvements across the Ray ecosystem, including:

  • Training LLMs with Ray Train: New support for checkpointing distributed models, and PyTorch Lightning FSDP to enable training large models on Ray Train’s LightningTrainer
  • LLM applications with Ray Serve & Core: New support for streaming responses and model multiplexing
  • Improvements to Ray Data: In 2.5, strict mode is enabled by default. This means that schemas are required for all Datasets, and standalone Python objects are no longer supported. Also, the default batch format is fixed to NumPy, giving better performance for batch inference.
  • RLlib enhancements: New support for multi-gpu training, along with ray-project/rllib-contrib to contain the community contributed algorithms
  • Core enhancements: Enable new feature of lightweight resource broadcasting to improve reliability and scalability. Add many enhancements for Core reliability, logging, scheduler, and worker process.

Ray Libraries

Ray AIR

💫Enhancements:

  • Experiment restore stress tests (#33706)
  • Context-aware output engine
    • Add parameter columns to status table (#35388)
    • Context-aware output engine: Add docs, experimental feature docs, prepare default on (#35129)
    • Fix trial status at end (more info + cut off) (#35128)
    • Improve leaked mentions of Tune concepts (#35003)
    • Improve passed time display (#34951)
    • Use flat metrics in results report, use Trainable._progress_metrics (#35035)
    • Print experiment information at experiment start (#34952)
    • Print single trial config + results as table (#34788)
    • Print out worker ip for distributed train workers. (#33807)
    • Minor fix to print configuration on start. (#34575)
    • Check air_verbosity against None. (#33871)
    • Better wording for empty config. (#33811)
  • Flatten config and metrics before passing to mlflow (#35074)
  • Remote_storage: Prefer fsspec filesystems over native pyarrow (#34663)
  • Use filesystem wrapper to exclude files from upload (#34102)
  • GCE test variants for air_benchmark and air_examples (#34466)
  • New storage path configuration
    • Add RunConfig.storage_path to replace SyncConfig.upload_dir and RunConfig.local_dir. (#33463)
    • Use Ray storage URI as default storage path, if configured [no_early_kickoff] (#34470)
    • Move to new storage_path API in tests and examples (#34263)

🔨 Fixes:

  • Store unflattened metrics in _TrackedCheckpoint (#35658) (#35706)
  • Fix test_tune_torch_get_device_gpu race condition (#35004)
  • Deflake test_e2e_train_flow.py (#34308)
  • Pin deepspeed version for now to unblock ci. (#34406)
  • Fix AIR benchmark configuration link failure. (#34597)
  • Fix unused config building function in lightning MNIST example.

📖Documentation:

  • Change doc occurrences of ray.data.Dataset to ray.data.Datastream (#34520)
  • DreamBooth example: Fix code for batch size > 1 (#34398)
  • Synced tabs in AIR getting started (#35170)
  • New Ray AIR link for try it out (#34924)
  • Correctly Render the Enumerate Numbers in convert_torch_code_to_ray_air (#35224)

Ray Data Processing

🎉 New Features:

  • Implement Strict Mode and enable it by default.
  • Add column API to Dataset (#35241)
  • Configure progress bars via DataContext (#34638)
  • Support using concurrent actors for ActorPool (#34253)
  • Add take_batch API for collecting data in the same format as iter_batches and map_batches (#34217)

💫Enhancements:

  • Improve map batches error message for strict mode migration (#35368)
  • Improve docstring and warning message for from_huggingface (#35206)
  • Improve notebook widget display (#34359)
  • Implement some operator fusion logic for the new backend (#35178 #34847)
  • Use wait based prefetcher by default (#34871)
  • Implement limit physical operator (#34705 #34844)
  • Require compute spec to be explicitly spelled out #34610
  • Log a warning if the batch size is misconfigured in a way that would grossly reduce parallelism for actor pool. (#34594)
  • Add alias parameters to the aggregate function, and add quantile fn (#34358)
  • Improve repr for Arrow Table and pandas types (#34286 #34502)
  • Defer first block computation when reading a Datasource with schema information in metadata (#34251)
  • Improve handling of KeyboardInterrupt (#34441)
  • Validate aggregation key in Aggregate LogicalOperator (#34292)
  • Add usage tag for which block formats are used (#34384)
  • Validate sort key in Sort LogicalOperator (#34282)
  • Combine_chunks before chunking pyarrow.Table block into batches (#34352)
  • Use read stage name for naming Data-read tasks on Ray Dashboard (#34341)
  • Update path expansion warning (#34221)
  • Improve state initialization for ActorPoolMapOperator (#34037)

🔨 Fixes:

  • Fix ipython representation (#35414)
  • Fix bugs in handling of nested ndarrays (and other complex object types) (#35359)
  • Capture the context when the dataset is first created (#35239)
  • Cooperatively exit producer threads for iter_batches (#34819)
  • Autoshutdown executor threads when deleted (#34811)
  • Fix backpressure when reading directly from input datasource (#34809)
  • Fix backpressure handling of queued actor pool tasks (#34254)
  • Fix row count after applying filter (#34372)
  • Remove unnecessary setting of global logging level to INFO when using Ray Data (#34347)
  • Make sure tf and tensor iteration work in dataset pipeline (#34248)
  • Fix '_unwrap_protocol' for Windows systems (#31296)

📖Documentation:

  • Add batch inference object detection example (#35143)
  • Refine batch inference doc (#35041)

Ray Train

🎉 New Features:

  • Experimental support for distributed checkpointing (#34709)

💫Enhancements:

  • LightningTrainer: Enable prog bar (#35350)
  • LightningTrainer enable checkpoint full dict with FSDP strategy (#34967)
  • Support FSDP Strategy for LightningTrainer (#34148)

🔨 Fixes:

  • Fix HuggingFace -> Transformers wrapping logic (#35276, #35284)
  • LightningTrainer always resumes from the latest AIR checkpoint during restoration. (#35617) (#35791)
  • Fix lightning trainer devices setting (#34419)
  • TorchCheckpoint: Specifying pickle_protocol in torch.save() (#35615) (#35790)

📖Documentation:

  • Improve visibility of Trainer restore and stateful callback restoration (#34350)
  • Fix rendering of diff code-blocks (#34355)
  • LightningTrainer Dolly V2 FSDP Fine-tuning Example (#34990)
  • Update LightningTrainer MNIST example. (#34867)
  • LightningTrainer Advanced Example (#34082, #34429)

🏗 Architecture refactoring:

  • Restructure ray.train HuggingFace modules (#35270) (#35488)
  • rename _base_dataset to _base_datastream (#34423)

Ray Tune

🎉 New Features:

  • Ray Tune's new execution path is now enabled by default (#34840, #34833)

💫Enhancements:

  • Make `Tuner.restore(trainable=...)` a required argument (#34982)
  • Enable tune.ExperimentAnalysis to pull experiment checkpoint files from the cloud if needed (#34461)
  • Add support for nested hyperparams in PB2 (#31502)
  • Release test for durable multifile checkpoints (#34860)
  • GCE variants for remaining Tune tests (#34572)
  • Add tune frequent pausing release test. (#34501)
  • Add PyArrow to ray[tune] dependencies (#34397)
  • Fix new execution backend for BOHB (#34828)

🔨 Fixes:

  • Set config on trial restore (#35000)
  • Fix test_tune_torch_get_device_gpu race condition (#35004)
  • Fix a typo in tune/execution/checkpoint_manager state serialization. (#34368)
  • Fix tune_scalability_network_overhead by adding --smoke-test. (#34167)
  • Fix lightning_gpu_tune_.* release test (#35193)

📖Documentation:

  • Fix Tune tutorial (#34660)
  • Fix typo in Tune restore guide (#34247)

🏗 Architecture refactoring:

  • Use Ray-provided tabulate package (#34789)

Ray Serve

🎉 New Features:

  • Add support for JSON logging format. (#35118)
  • Add experimental support for model multiplexing. (#35399, #35326)
  • Add experimental support for HTTP StreamingResponses. (#35720)
  • Add support for application builders & arguments (#34584)
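To give a sense of what a JSON logging format looks like in practice, here is a minimal sketch built only on the Python standard library. It is not Ray Serve's implementation or configuration: the `JsonFormatter` class, the `make_json_logger` helper, the logger name, and the chosen fields are all hypothetical and exist only for this illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (illustrative only)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def make_json_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


stream = io.StringIO()
demo_logger = make_json_logger("demo.replica", stream)
demo_logger.info("handled %s", "/predict")
parsed = json.loads(stream.getvalue())
```

Emitting one JSON object per line means log aggregators can filter on fields such as the level or logger name without regex parsing.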

💫Enhancements:

  • Add more bucket sizes for histogram metrics. (#35242)
  • Add route information into the custom metrics. (#35246)
  • Add HTTPProxy details to Serve Dashboard UI (#35159)
  • Add status_code to http qps & latency (#35134)
  • Stream Serve logs across different drivers (#35070)
  • Add health checking for http proxy actors (#34944)
  • Better surfacing of errors in serve status (#34773)
  • Enable TLS on gRPCIngress if RAY_USE_TLS is on (#34403)
  • Wait until replicas have finished recovering (with timeout) to broadcast LongPoll updates (#34675)
  • Replace ClassNode and FunctionNode with Application in top-level Serve APIs (#34627)
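The histogram-bucket enhancement in the list above is easier to picture with a small example. The sketch below uses hypothetical bucket boundaries and latency samples (none of them taken from Ray Serve) to show how the same samples are distributed under a coarse and a finer set of buckets; it is a standard-library illustration, not Serve's metrics code.

```python
from bisect import bisect_left


def bucket_counts(samples_ms, boundaries_ms):
    """Count samples per histogram bucket.

    Bucket i holds values <= boundaries_ms[i] (and greater than the previous
    boundary); the final extra bucket collects values above the last boundary.
    """
    counts = [0] * (len(boundaries_ms) + 1)
    for value in samples_ms:
        counts[bisect_left(boundaries_ms, value)] += 1
    return counts


# Hypothetical request latencies in milliseconds.
samples = [3, 8, 12, 40, 95, 400]

coarse = bucket_counts(samples, [10, 100, 1000])
fine = bucket_counts(samples, [5, 10, 25, 50, 100, 1000])
```

With the coarse boundaries, the 12 ms, 40 ms, and 95 ms requests all land in one bucket; the finer boundaries separate them, which is what makes percentile estimates from histogram metrics more precise.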

🔨 Fixes:

  • Set app_msg to empty string by default (#35646)
  • Fix dead replica counts in the stats. (#34761)
  • Add default app name (#34260)
  • gRPC Deployment schema check & minor improvements (#34210)

📖Documentation:

  • Clean up API reference and various docstrings (#34711)
  • Clean up RayServeHandle and RayServeSyncHandle docstrings & typing (#34714)

RLlib

🎉 New Features:

  • Migrating approximately 25 of the 30 algorithms from RLlib into rllib_contrib. You can review the REP here. This release covers A3C and MAML.
  • APPO / IMPALA and PPO have all been moved to the new Learner and RLModule stack.
  • The RLModule now supports checkpointing. (#34717 #34760)

💫Enhancements:

  • Intro...