Releases: ray-project/ray
Ray-2.44.1
Under screen-lit skies
A ray of bliss in each patch
Joy at any scale
Ray-2.44.0
Release Highlights
- This release features Ray Compiled Graph (beta). Ray Compiled Graph gives you a classic Ray Core-like API, but with (1) less than 50us system overhead for workloads that repeatedly execute the same task graph; and (2) native support for GPU-GPU communication via NCCL. Ray Compiled Graph APIs simplify high-performance multi-GPU workloads such as LLM inference and training. The beta release refines the API, enhances stability, and adds or improves features like visualization, profiling, and experimental GPU compute/communication overlap. For more information, refer to the Ray documentation: https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html (a minimal usage sketch follows these highlights).
- The experimental Ray Workflows library has been deprecated and will be removed in a future version of Ray. Ray Workflows has been marked experimental since its inception and hasn’t been maintained due to the Ray team focusing on other priorities. If you are using Ray Workflows, we recommend pinning your Ray version to 2.44.
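The snippet below is a minimal sketch of the Compiled Graph usage pattern referenced above, based on the documented `ray.dag` API (`InputNode`, `experimental_compile`); exact method names may differ between versions, so treat it as illustrative rather than canonical.

```python
import ray
from ray.dag import InputNode

@ray.remote
class EchoWorker:
    def echo(self, msg):
        return msg

worker = EchoWorker.remote()

# Define the task graph once using the classic Ray Core-like bind() API.
with InputNode() as inp:
    dag = worker.echo.bind(inp)

# Compile the graph; repeated executions reuse it with low system overhead.
compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute("hello, compiled graph")))
```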
Ray Libraries
Ray Data
🎉 New Features:
- Add Iceberg write support through pyiceberg (#50590)
- [LLM] Various feature enhancements to Ray Data LLM, including LoRA support (#50804) and structured outputs (#50901)
💫 Enhancements:
- Add dataset/operator state, progress, total metrics (#50770)
- Make chunk combination threshold configurable (#51200)
- Store average memory use per task in OpRuntimeMetrics (#51126)
- Avoid unnecessary conversion to Numpy when creating Arrow/Pandas blocks (#51238)
- Append-mode API for preprocessors -- #50848, #50847, #50642, #50856, #50584. Note that vectorizers and hashers now output a single column instead of 1 column per feature. In the near future, we will be graduating preprocessors to beta.
🔨 Fixes:
- Fixing Map Operators to avoid unconditionally overriding generator's back-pressure configuration (#50900)
- Fix filter expr equating negative numbers (#50932)
- Fix error message for `override_num_blocks` when reading from a HuggingFace Dataset (#50998)
- Make num_blocks in repartition optional (#50997)
- Always pin the seed when doing file-based random shuffle (#50924)
- Fix `StandardScaler` to handle `NaN` stats (#51281)
Ray Train
🎉 New Features:
💫 Enhancements:
- Folded v2.XGBoostTrainer API into the public trainer class as an alternate constructor (#50045)
- Created a default ScalingConfig if one is not provided to the trainer (#51093)
- Improved TrainingFailedError message (#51199)
- Utilize FailurePolicy factory (#51067)
🔨 Fixes:
- Fixed trainer import deserialization when captured within a Ray task (#50862)
- Fixed serialize import test for Python 3.12 (#50963)
- Fixed RunConfig deprecation message in Tune being emitted in trainer.fit usage (#51198)
📖 Documentation:
- [Train V2] Updated API references (#51222)
- [Train V2] Updated persistent storage guide (#51202)
- [Train V2] Updated user guides for metrics, checkpoints, results, and experiment tracking (#51204)
- [Train V2] Added updated Train + Tune user guide (#51048)
- [Train V2] Added updated fault tolerance user guide (#51083)
- Improved HF Transformers example (#50896)
- Improved Train DeepSpeed example (#50906)
- Use correct mean and standard deviation norm values in image tutorials (#50240)
🏗 Architecture refactoring:
- Deprecated Torch AMP wrapper utilities (#51066)
- Hid private functions of train context to avoid abuse (#50874)
- Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
- Moved library usage tests out of core (#51161)
Ray Tune
📖 Documentation:
- Various improvements to Tune Pytorch CIFAR tutorial (#50316)
- Various improvements to the Ray Tune XGBoost tutorial (#50455)
- Various enhancements to Tune Keras example (#50581)
- Minor improvements to Hyperopt tutorial (#50697)
- Various improvements to LightGBM tutorial (#50704)
- Fixed non-runnable Optuna tutorial (#50404)
- Added documentation for Asynchronous HyperBand Example in Tune (#50708)
- Replaced reuse actors example with a fuller demonstration (#51234)
- Fixed broken PB2/RLlib example (#51219)
- Fixed typo and standardized equations across the two APIs (#51114)
- Improved PBT example (#50870)
- Removed broken links in documentation (#50995, #50996)
🏗 Architecture refactoring:
- Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
- Moved library usage tests out of core (#51161)
Ray Serve
🎉 New Features:
💫 Enhancements:
- Clean up shutdown behavior of serve (#51009)
- Add `additional_log_standard_attrs` to serve logging config (#51144)
- [LLM] Remove `asyncache` and `cachetools` from dependencies (#50806)
- [LLM] Remove `backoff` dependency (#50822)
- [LLM] Remove `asyncio_timeout` from `ray[llm]` deps on python < 3.11 (#50815)
- [LLM] Made JSON validator a singleton and `jsonref` packages lazy imported (#50821)
- [LLM] Reuse `AutoscalingConfig` and `DeploymentConfig` from Serve (#50871)
- [LLM] Use `pyarrow` FS for cloud remote storage interaction (#50820)
- [LLM] Add usage telemetry for `serve.llm` (#51221)
🔨 Fixes:
- Exclude redirects from request error count (#51130)
- [LLM] Fix the wrong `device_capability` issue in vllm on quantized models (#51007)
- [LLM] Add `gen-config` related data file to the package (#51347)
📖 Documentation:
- [LLM] Fix quickstart serve LLM docs (#50910)
- [LLM] Update `build_openai_app` to include yaml example (#51283)
- [LLM] Remove old vllm+serve doc (#51311)
RLlib
💫 Enhancements:
- APPO/IMPALA acceleration:
- Unify namings for actor managers' outstanding in-flight requests metrics. (#51159)
- Add timers to env step, forward pass, and complete connector pipelines runs. (#51160)
🔨 Fixes:
📖 Documentation:
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Enhanced `uv` support (#51233)
💫 Enhancements:
- Made infeasible task errors much more obvious (#45909)
- Log rotation for workers, runtime env agent, and dashboard agent (#50759, #50877, #50909)
- Support customizing gloo timeout (#50223)
- Support torch profiling in Compiled Graph (#51022)
- Change default tensor deserialization in Compiled Graph (#50778)
- Use current node id if no node is specified on ray drain-node (#51134)
🔨 Fixes:
- Fixed an issue where the raylet continued to have high CPU overhead after a job was terminated ([...
Ray-2.43.0
Highlights
- This release features new modules in Ray Serve and Ray Data for integration with large language models, marking the first step of addressing #50639. Existing Ray Data and Ray Serve have limited support for LLM deployments, where users have to manually configure and manage the underlying LLM engine. In this release, we offer APIs for both batch inference and serving of LLMs within Ray in `ray.data.llm` and `ray.serve.llm`. See the notes below for more details. These APIs are marked as alpha -- meaning they may change in future releases without a deprecation period.
- Ray Train V2 is available to try starting in Ray 2.43! Run your next Ray Train job with the `RAY_TRAIN_V2_ENABLED=1` environment variable. See the migration guide for more information.
- A new integration with `uv run` allows you to specify Python dependencies for both the driver and workers in a consistent way and enables quick iteration during development of Ray applications (#50160, #50462); check out our blog post.
Ray Libraries
Ray Data
🎉 New Features:
- Ray Data LLM: We are introducing a new module in Ray Data for batch inference with LLMs (currently marked as alpha). It offers a new `Processor` abstraction that interoperates with existing Ray Data pipelines. This abstraction can be configured in two ways:
  - Using the `vLLMEngineProcessorConfig`, which configures vLLM to load model replicas for high-throughput model inference
  - Using the `HttpRequestProcessorConfig`, which sends HTTP requests to an OpenAI-compatible endpoint for inference
  - Documentation for these features can be found here; a usage sketch follows this feature list.
- Implement accurate memory accounting for `UnionOperator` (#50436)
- Implement accurate memory accounting for all-to-all operations (#50290)
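A minimal sketch of the `Processor` flow described above, assuming the alpha `ray.data.llm` API (`vLLMEngineProcessorConfig`, `build_llm_processor`); field and parameter names follow the alpha documentation and may change in later releases.

```python
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Configure vLLM to load model replicas for batch inference (alpha API;
# the model name and config fields here are illustrative).
config = vLLMEngineProcessorConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    concurrency=1,
    batch_size=32,
)

processor = build_llm_processor(
    config,
    # Map each input row to a chat-style request...
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=64),
    ),
    # ...and keep only the generated text in the output rows.
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": "What is Ray Data?"}])
ds = processor(ds)
ds.show(limit=1)
```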
💫 Enhancements:
- Support class constructor args for filter() (#50245)
- Persist ParquetDatasource metadata. (#50332)
- Rebasing `ShufflingBatcher` onto `try_combine_chunked_columns` (#50296)
- Improve warning message if required dependency isn't installed (#50464)
- Move data-related test logic out of core tests directory (#50482)
- Pass executor as an argument to ExecutionCallback (#50165)
- Add operator id info to task+actor (#50323)
- Abstracting common methods, removing duplication in `ArrowBlockAccessor`, `PandasBlockAccessor` (#50498)
- Warn if map UDF is too large (#50611)
- Replace `AggregateFn` with `AggregateFnV2`, cleaning up aggregation infrastructure (#50585)
- Simplify Operator.repr (#50620)
- Adding in `TaskDurationStats` and `on_execution_step` callback (#50766)
- Print Resource Manager stats in release tests (#50801)
🔨 Fixes:
- Fix invalid escape sequences in `grouped_data.py` docstrings (#50392)
- Deflake `test_map_batches_async_generator` (#50459)
- Avoid memory leak with `pyarrow.infer_type` on datetime arrays (#50403)
- Fix parquet partition cols to support tensor types (#50591)
- Fixing aggregation protocol to be appropriately associative (#50757)
📖 Documentation:
- Remove "Stable Diffusion Batch Prediction with Ray Data" example (#50460)
Ray Train
🎉 New Features:
- Ray Train V2 is available to try starting in Ray 2.43! Run your next Ray Train job with the `RAY_TRAIN_V2_ENABLED=1` environment variable. See the migration guide for more information.
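A hedged illustration of the opt-in flow: the sketch below sets the environment variable in-process before importing Ray Train. The variable name comes from the release note; setting it programmatically like this is an assumption, and exporting it in the shell before launching the job works equally well.

```python
import os

# Opt into Ray Train V2 before Ray Train is imported (assumption: the flag is
# read at import/initialization time; exporting it in the shell also works).
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Your per-worker training code goes here.
    pass

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```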
💫 Enhancements:
- Add a training ingest benchmark release test (#50019, #50299) with a fault tolerance variant (#50399)
- Add telemetry for Trainer usage in V2 (#50321)
- Add pydantic as a `ray[train]` extra install (#46682)
- Add state tracking to Train V2 to make run status, run attempts, and training worker metadata observable (#50515)
🔨 Fixes:
- Increase doc test parallelism (#50326)
- Disable TF test for py312 (#50382)
- Increase test timeout to deflake (#50796)
📖 Documentation:
- Add missing xgboost pip install in example (#50232)
🏗 Architecture refactoring:
- Add deprecation warnings pointing to a migration guide for Ray Train V2 (#49455, #50101, #50322)
- Refactor internal Train controller state management (#50113, #50181, #50388)
Ray Tune
🔨 Fixes:
- Fix worker node failure test (#50109)
📖 Documentation:
- Update all doc examples off of ray.train imports (#50458)
- Update all ray/tune/examples off of ray.train imports (#50435)
- Fix typos in persistent storage guide (#50127)
- Remove Binder notebook links in Ray Tune docs (#50621)
🏗 Architecture refactoring:
- Update RLlib to use ray.tune imports instead of ray.air and ray.train (#49895)
Ray Serve
🎉 New Features:
- Ray Serve LLM: We are introducing a new module in Ray Serve to easily integrate open source LLMs in your Ray Serve deployment, currently marked as alpha. This opens up a powerful capability of composing complex applications with multiple LLMs, which is a use case in emerging applications like agentic workflows. Ray Serve LLM offers a couple of core components, including:
  - `VLLMService`: A prebuilt deployment that offers a full-featured vLLM engine integration, with support for features such as LoRA multiplexing and multimodal language models.
  - `LLMRouter`: An out-of-the-box OpenAI-compatible model router that can route across multiple LLM deployments.
  - Documentation can be found at https://docs.ray.io/en/releases-2.43.0/serve/llm/overview.html
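A hedged sketch of composing these components via the alpha `ray.serve.llm` module; the import path, `LLMConfig` fields, and `build_openai_app` argument shape follow the alpha documentation and may shift between releases.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app  # alpha module; import path may vary

# Describe one vLLM-backed model deployment (field names per the alpha docs).
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
)

# Build an OpenAI-compatible app that routes across the configured models
# and deploy it with Ray Serve (argument shape is an assumption).
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```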
💫 Enhancements:
- Add `required_resources` to REST API (#50058)
🔨 Fixes:
- Fix batched requests hanging after cancellation (#50054)
- Properly propagate backpressure error (#50311)
RLlib
🎉 New Features:
- Added env vectorization support for multi-agent (new API stack). (#50437)
💫 Enhancements:
- APPO/IMPALA various acceleration efforts. Reached 100k ts/sec on Atari benchmark with 400 EnvRunners and 16 (multi-node) GPU Learners: #50760, #50162, #50249, #50353, #50368, #50379, #50440, #50477, #50527, #50528, #50600, #50309
- Offline RL:
🔨 Fixes:
- Fix SPOT preemption tolerance for large AlgorithmConfig: Pass by reference to RolloutWorker...
Ray-2.42.1
Ray-2.42.0
Ray Libraries
Ray Data
🎉 New Features:
- Added read_audio and read_video (#50016)
💫 Enhancements:
- Optimized multi-column groupbys (#45667)
- Included Ray user-agent in BigQuery client construction (#49922)
🔨 Fixes:
- Fixed bug that made read tasks non-deterministic (#49897)
🗑️ Deprecations:
- Deprecated num_rows_per_file in favor of min_rows_per_file (#49978)
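A small migration sketch for the deprecation above; `min_rows_per_file` replaces `num_rows_per_file` on write APIs such as `write_parquet` (the local output path is an illustrative assumption).

```python
import ray

ds = ray.data.range(1_000)

# Before (deprecated): ds.write_parquet("/tmp/out", num_rows_per_file=100)
# After:
ds.write_parquet("/tmp/out", min_rows_per_file=100)
```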
Ray Train
💫 Enhancements:
- Add Train v2 user-facing callback interface (#49819)
- Add TuneReportCallback for propagating intermediate Train results to Tune (#49927)
Ray Tune
📖 Documentation:
- Fix BayesOptSearch docs (#49848)
Ray Serve
💫 Enhancements:
- Cache metrics in replica and report on an interval (#49971)
- Cache expensive calls to inspect.signature (#49975)
- Remove extra pickle serialization for gRPCRequest (#49943)
- Shared LongPollClient for Routers (#48807)
- DeploymentHandle API is now stable (#49840)
🔨 Fixes:
- Fix batched requests hanging after request cancellation bug (#50054)
RLlib
💫 Enhancements:
- Add metrics to replay buffers. (#49822)
- Enhance node-failure tolerance (new API stack). (#50007)
- MetricsLogger cleanup throughput logic. (#49981)
- Split AddStates... connectors into 2 connector pieces (`AddTimeDimToBatchAndZeroPad` and `AddStatesFromEpisodesToBatch`) (#49835)
🔨 Fixes:
- Old API stack IMPALA/APPO: Re-introduce mixin-replay-buffer pass, even if `replay-ratio=0` (fixes a memory leak). (#49964)
- Fix MetricsLogger race conditions. (#49888)
- APPO/IMPALA: Bug fix for > 1 Learner actor. (#49849)
📖 Documentation:
- New MetricsLogger API rst page. (#49538)
- Move "new API stack" info box right below page titles for better visibility. (#49921)
- Add example script for how to log custom metrics in `training_step()`. (#49976)
- Enhance/redo autoregressive action distribution example. (#49967)
- Make the "tiny CNN" example RLModule run with APPO (by implementing `TargetNetAPI`). (#49825)
Ray Core and Ray Clusters
Ray Core
💫 Enhancements:
- Only get single node info rather than all when needed (#49727)
- Introduce with_tensor_transport API (#49753)
🔨 Fixes:
- Fix tqdm manager thread safety (#50040)
Ray Clusters
🔨 Fixes:
- Fix token expiration for ray autoscaler (#48481)
Thanks
Thank you to everyone who contributed to this release! 🥳
@wingkitlee0, @saihaj, @win5923, @justinvyu, @kevin85421, @edoakes, @cristianjd, @rynewang, @richardliaw, @LeoLiao123, @alexeykudinkin, @simonsays1980, @aslonnie, @ruisearch42, @pcmoritz, @fscnick, @bveeramani, @mattip, @till-m, @tswast, @ujjawal-khare, @wadhah101, @nikitavemuri, @akshay-anyscale, @srinathk10, @zcin, @dayshah, @dentiny, @LydiaXwQ, @matthewdeng, @JoshKarpel, @MortalHappiness, @sven1977, @omatthew98
Ray-2.41.0
Highlights
- Major update of RLlib docs and example scripts for the new API stack.
Ray Libraries
Ray Data
🎉 New Features:
- Expression support for filters (#49016)
- Support `partition_cols` in `write_parquet` (#49411) (a combined usage sketch with expression filters follows this list)
- Feature: implement multi-directional sort over Ray Data datasets (#49281)
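A combined sketch of the expression-filter and `partition_cols` features from this list; the string expression syntax and local output path are illustrative assumptions.

```python
import ray

ds = ray.data.from_items(
    [{"id": i, "category": "even" if i % 2 == 0 else "odd"} for i in range(100)]
)

# Expression-based filter (string expressions are new in this release).
ds = ds.filter(expr="id > 50")

# Hive-style partitioned Parquet output via the new partition_cols parameter.
ds.write_parquet("/tmp/partitioned_output", partition_cols=["category"])
```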
💫 Enhancements:
- Use dask 2022.10.2 (#48898)
- Clarify schema validation error (#48882)
- Raise `ValueError` when the data sort key is `None` (#48969)
- Provide more informative messages on webdataset format errors (#48643)
- Upgrade Arrow version from 17 to 18 (#48448)
- Update `hudi` version to 0.2.0 (#48875)
- `webdataset`: expand JSON objects into individual samples (#48673)
- Support passing kwargs to map tasks. (#49208)
- Add `ExecutionCallback` interface (#49205)
- Add seed for read files (#49129)
- Make `select_columns` and `rename_columns` use Project operator (#49393)
🔨 Fixes:
- Fix partial function name parsing in `map_groups` (#48907)
- Always launch one task for `read_sql` (#48923)
- Reimplement of fix memory pandas (#48970)
- `webdataset`: flatten return args (#48674)
- Handle `numpy > 2.0.0` behaviour in `_create_possibly_ragged_ndarray` (#48064)
- Fix `DataContext` sealing for multiple datasets. (#49096)
- Fix `to_tf` for `List` types (#49139)
- Fix type mismatch error while mapping nullable column (#49405)
- Datasink: support passing write results to `on_write_completes` (#49251)
- Fix `groupby` hang when value contains `np.nan` (#49420)
- Fix bug where `file_extensions` doesn't work with compound extensions (#49244)
- Fix map operator fusion when concurrency is set (#49573)
Ray Train
🎉 New Features:
- Output JSON structured log files for system and application logs (#49414)
- Add support for AMD ROCR_VISIBLE_DEVICES (#49346)
💫 Enhancements:
🏗 Architecture refactoring:
- LightGBM: Rewrite `get_network_params` implementation (#49019)
Ray Tune
🎉 New Features:
- Update `optuna_search` to allow users to configure optuna storage (#48547)
🏗 Architecture refactoring:
Ray Serve
💫 Enhancements:
- Improved request_id generation to reduce proxy CPU overhead (#49537)
- Tune GC threshold by default in proxy (#49720)
- Use `pickle.dumps` for faster serialization from `proxy` to `replica` (#49539)
🔨 Fixes:
- Handle nested ‘=’ in serve run arguments (#49719)
- Fix bug when `ray.init()` is called multiple times with different `runtime_envs` (#49074)
🗑️ Deprecations:
- Adds a warning that the default behavior for sync methods will change in a future release. They will be run in a threadpool by default. You can opt into this behavior early by setting `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1`. (#48897)
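As a hedged illustration of what changes: a deployment like the one below with a synchronous `__call__` currently runs that method on the replica's event loop, and opting in via `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1` moves it into a threadpool instead (the deployment itself is a generic example, not from the release).

```python
from ray import serve

@serve.deployment
class SyncHandler:
    def __call__(self, request) -> str:
        # Synchronous work; with the env var set, Serve runs this in a
        # threadpool instead of blocking the replica's asyncio event loop.
        return "ok"

app = SyncHandler.bind()
# serve.run(app)  # launch with RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1 to opt in early
```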
RLlib
🎉 New Features:
- Add support for external Envs to new API stack: New example script and custom tcp-capable EnvRunner. (#49033)
💫 Enhancements:
- Offline RL:
- APPO/IMPALA acceleration (new API stack):
- Add support for `AggregatorActors` per Learner. (#49284)
- Auto-sleep time AND thread-safety for MetricsLogger. (#48868)
- Activate APPO cont. actions release- and CI tests (HalfCheetah-v1 and Pendulum-v1 new in `tuned_examples`). (#49068)
- Add "burn-in" period setting to the training of stateful RLModules. (#49680)
- Callbacks API: Add support for individual lambda-style callbacks. (#49511)
- Other enhancements: #49687, #49714, #49693, #49497, #49800, #49098
📖 Documentation:
- New example scripts:
- New/rewritten html pages:
- Rewrite checkpointing page. (#49504)
- New scaling guide. (#49528)
- New callbacks page. (#49513)
- Rewrite `RLModule` page. (#49387)
- New AlgorithmConfig page and redo `package_ref` page for algo configs. (#49464)
- Rewrite offline RL page. (#48818)
- Rewrite "key concepts" rst page. (#49398)
- Rewrite RL environments pages. (#49165, #48542)
- Fixes and enhancements: #49465, #49037, #49304, #49428, #49474, #49399, #49713, #49518
🔨 Fixes:
- Add `on_episode_created` callback to SingleAgentEnvRunner. (#49487)
- Fix `train_batch_size_per_learner` problems. (#49715)
- Various other fixes: #48540, #49363, #49418, #49191
🏗 Architecture refactoring:
- RLModule: Introduce `Default[algo]RLModule` classes (#49366, #49368)
- Remove RLlib dependencies from setup.py; add `ormsgpack` (#49489)
🗑️ Deprecations:
Ray Core and Ray Clusters
Ray Core
💫 Enhancements:
- Add `task_name`, `task_function_name` and `actor_name` in Structured Logging (#48703)
- Support redis/valkey authentication with username (#48225)
- Add v6e TPU Head Resource Autoscaling Support (#48201)
- compiled graphs: Support all driver and actor read combinations (#48963)
- compiled graphs: Add ascii based CG visualization (#48315)
- compiled graphs: Add ray[cg] pip install option (#49220)
- Allow uv cache at installation (#49176)
- Support != Filter in GCS for Task State API (#48983)
- compiled graphs: Add CPU-based NCCL communicator for development (#48440)
- Support gcs and raylet log rotation (#48952)
- compiled graphs: Support `nsight.nvtx` profiling (#49392)
🔨 Fixes:
- autoscaler: Health check logs are not visible in the autoscaler container's stdout (#48905)
- Only publish `WORKER_OBJECT_EVICTION` when the object is out of scope or manually freed (#47990)
- autoscaler: Autoscaler doesn't scale up correctly when the KubeRay RayCluster is not in the goal state (#48909)
- autoscaler: Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 (#48519)
- compiled graphs: Fix the missing dependencies when num_returns is used (#49118)
- autoscaler: Fuse scaling requests together to avoid overloading the Kubernetes API server (#49150)
- Fix bug to support S3 pre-signed url for `.whl` file (#48560)
- Fix data race on gRPC client context (#49475)
- Make sure draining node is not selected for scheduling (#49517)
Ray Clusters
💫 Enhancements:
- Azure: Enable accelerated networking as a flag in azure vms (#47988)
📖 Documentation:
- Kuberay: Logging: Add Fluent Bit `DaemonSet` and Grafana Loki to "Persist KubeRay Operator Logs" (#48725)
- Kuberay: Logging: Specify the Helm chart version in "Persist KubeRay Operator Logs" (#48937)
Dashboard
💫 Enhancements:
- Add instance variable to many default dashboard graphs (#49174)
- Display duration in milliseconds if under 1 second. (#49126)
- Add `RAY_PROMETHEUS_HEADERS` env for carrying additional headers to Prometheus (#49353)
- Document the `RAY_PROMETHEUS_HEADERS` env for carrying additional headers to Prometheus (#49700)
🏗 Architecture refactoring:
- Move `memray` dependency from default to observability (#47763)
- Move `StateHead`'s methods into free functions. (#49388)
Thanks
@raulchen, @alanwguo, @omatthew98, @xingyu-long, @tlinkin, @yantzu, @alexeykudinkin, @andrewsykim, @win5923, @csy1204, @dayshah, @richardliaw, @stephanie-wang, @gueraf, @rueian, @davidxia, @fscnick, @wingkitlee0, @KPostOffice, @GeneDer, @MengjinYan, @simonsays1980, @pcmoritz, @petern48, @kashiwachen, @pfldy2850, @zcin, @scottjlee, @Akhil-CM, @Jay-ju, @JoshKarpel, @edoakes, @ruisearch42, @gorloffslava, @jimmyxie-figma, @bthananjeyan, @sven1977, @bnorick, @jeffreyjeffreywang, @ravi-dalal, @matthewdeng, @angelinalg, @ivanthewebber, @rkooo567, @srinathk10, @maresb, @gvspraveen, @akyang-anyscale, @mimiliaogo, @bveeramani, @ryanaoleary, @kevin85421, @richardsliu, @hartikainen, @coltwood93, @mattip, @Superskyyy, @justinvyu, @hongpeng-guo, @ArturNiederfahrenhorst, @jecsand838, @Bye-legumes, @hcc429, @WeichenXu123, @martinbomio, @HollowMan6, @MortalHappiness, @dentiny, @zhe-thoughts, @anyadontfly, @smanolloff, @richo-anyscale, @khluu, @xushiyan, @rynewang, @japneet-anyscale, @jjyao, @sumanthratna, @saihaj, @aslonnie
Many thanks to all those who contributed to this release!
Ray-2.40.0
Ray Libraries
Ray Data
🎉 New Features:
- Added read_hudi (#46273)
💫 Enhancements:
- Improved performance of DelegatingBlockBuilder (#48509)
- Improved memory accounting of pandas blocks (#46939)
🔨 Fixes:
- Fixed bug where you can’t specify a schema with write_parquet (#48630)
- Fixed bug where to_pandas errors if your dataset contains Arrow and pandas blocks (#48583)
- Fixed bug where map_groups doesn’t work with pandas data (#48287)
- Fixed bug where write_parquet errors if your data contains nullable fields (#48478)
- Fixed bug where “Iteration Blocked Time” charts looks incorrect (#48618)
- Fixed bug where unique fails with null values (#48750)
- Fixed bug where “Rows Outputted” is 0 in the Data dashboard (#48745)
- Fixed bug where methods like drop_columns cause spilling (#48140)
- Fixed bug where async map tasks hang (#48861)
🗑️ Deprecations:
- Deprecated read_parquet_bulk #48691
- Deprecated iter_tf_batches #48693
- Deprecated meta_provider parameter of read functions (#48690)
- Deprecated to_torch (#48692)
Ray Train
🔨 Fixes:
- Fix StartTracebackWithWorkerRank serialization (#48548)
📖 Documentation:
- Add example for fine-tuning Llama3.1 with AWS Trainium (#48768)
Ray Tune
🔨 Fixes:
- Remove the `clear_checkpoint` function during Trial restoration error handling. (#48532)
Ray Serve
🎉 New Features:
- Initial version of local_testing_mode (#48477)
💫 Enhancements:
- Handle multiple changed objects per LongPollHost.listen_for_change RPC (#48803)
- Add more nuanced checks for http proxy status errors (#47896)
- Improve replica access log messages to include HTTP status info and better resemble standard log format (#48819)
- Propagate replica constructor error to deployment status message and print num retries left (#48531)
🔨 Fixes:
- Pending requests that are cancelled before they were assigned to a replica now also return a serve.RequestCancelledError (#48496)
RLlib
💫 Enhancements:
- Release test enhancements. (#45803, #48681)
- Make opencv-python-headless default over opencv-python (#48776)
- Reverse learner queue behavior of IMPALA/APPO (consume oldest batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)
🔨 Fixes:
- Fix torch scheduler stepping and reporting. (#48125)
- Fix accumulation of results over n training_step calls within same iteration (new API stack). (#48136)
- Various other fixes: #48563, #48314, #48698, #48869.
📖 Documentation:
- Upgrade examples script overview page (new API stack). (#48526)
- Enable RLlib + Serve example in CI and translate to new API stack. (#48687)
🏗 Architecture refactoring:
- Switch new API stack on by default, APPO, IMPALA, BC, MARWIL, and CQL. (#48516, #48599)
- Various APPO enhancements (new API stack): Circular buffer (#48798), minor loss math fixes (#48800), target network update logic (#48802), smaller cleanups (#48844).
- Remove `rllib_contrib` from repo. (#48565)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- [Core] uv runtime env support (#48479, #48486, #48611, #48619, #48632, #48634, #48637, #48670, #48731)
- [Core] GCS FT with redis sentinel (#47335)
💫 Enhancements:
- [CompiledGraphs] Refine schedule visualization (#48594)
🔨 Fixes:
- [CompiledGraphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs (#48463)
- [Core] Fix Ascend NPU discovery to support 8+ cards per node (#48543)
- [Core] Make Placement Group Wildcard and Indexed Resource Assignments Consistent (#48088)
- [Core] Stop the GRPC server before Shut down the Object Store (#48572)
Ray Clusters
🔨 Fixes:
- [KubeRay]: Fix ConnectionError on Autoscaler CR lookups in K8s clusters with custom DNS for Kubernetes API. (#48541)
Dashboard
💫 Enhancements:
- Add global UTC timezone button in navbar with local storage (#48510)
- Add memory graphs optimized for OOM debugging (#48530)
- Improve tasks/actors metric naming and add graph for running tasks (#48528)
- Add actor pid to dashboard (#48791)
🔨 Fixes:
- Fix Placement Group Table table cells overflow (#47323)
- Fix Rows Outputted being zero on Ray Data Dashboard (#48745)
- fix confusing dataset operator name (#48805)
Thanks
Thanks to all those who contributed to this release!
@rynewang, @rickyyx, @bveeramani, @marwan116, @simonsays1980, @dayshah, @dentiny, @KepingYan, @mimiliaogo, @kevin85421, @SeaOfOcean, @stephanie-wang, @mohitjain2504, @azayz, @xushiyan, @richardliaw, @can-anyscale, @xingyu-long, @kanwang, @aslonnie, @MortalHappiness, @jjyao, @SumanthRH, @matthewdeng, @alexeykudinkin, @sven1977, @raulchen, @andrewsykim, @zcin, @nadongjun, @hongpeng-guo, @miguelteixeiraa, @saihaj, @khluu, @ArturNiederfahrenhorst, @ryanaoleary, @ltbringer, @pcmoritz, @JoshKarpel, @akyang-anyscale, @frances720, @BeingGod, @edoakes, @Bye-legumes, @Superskyyy, @liuxsh9, @MengjinYan, @ruisearch42, @scottjlee, @angelinalg
Ray-2.39.0
Ray Libraries
Ray Data
🔨 Fixes:
- Fixed InvalidObjectError edge case with Dataset.split() (#48130)
- Made Concatenator preserve order of concatenated columns (#47997)
📖 Documentation:
- Improved documentation around Parquet column and predicate pushdown (#48095)
- Marked num_rows_per_file parameter of write APIs as experimental (#48208)
- One hot encoder now returns an encoded vector (#48173)
- transform_batch no longer fails on missing columns (#48137)
🏗 Architecture refactoring:
- Dataset.count() now uses a Count logical operator (#48126)
🗑 Deprecations:
- Removed long-deprecated set_progress_bars (#48203)
Ray Train
🔨 Fixes:
- Safely check if the storage filesystem is `pyarrow.fs.S3FileSystem` (#48216)
Ray Tune
🔨 Fixes:
- Safely check if the storage filesystem is `pyarrow.fs.S3FileSystem` (#48216)
Ray Serve
💫 Enhancements:
- Cancelled requests now return a serve.RequestCancelledError (#48444)
- Exposed application source in app details model (#45522)
🔨 Fixes:
- Basic HTTP deployments will now return “Internal Server Error” instead of a traceback to match FastAPI behavior (#48491)
- Fixed an issue where high values of max_ongoing_requests couldn’t be reached due to an interaction with core’s max_concurrency (#48274)
- Fixed an edge case where pending requests were not canceled properly (#47873)
- Removed deprecated API to set route_prefix per-deployment (#48223)
📖 Documentation:
- Added ProxyStatus model to reference docs (#48299)
- Added ApplicationStatus model to reference docs (#48220)
RLlib
💫 Enhancements:
- Upgrade to gymnasium==1.0.0 (support new API for vector env resets). (#48443, #45328)
- Add off-policy'ness metric to new API stack. (#48227)
- Validate episodes before adding them to the buffer. (#48083)
📖 Documentation:
- New example script for custom metrics on `EnvRunners` (using `MetricsLogger` API on the new stack). (#47969)
- Do-over: New RLlib index page. (#48285, #48442)
- Do-over: Example script for AutoregressiveActionsRLM. (#47972)
🏗 Architecture refactoring:
- New API stack on by default for PPO. (#48284)
- Change config.fault_tolerance default behavior (from `recreate_failed_env_runners=False` to `True`). (#48286)
🔨 Fixes:
Ray Core
🎉 New Features:
- [CompiledGraphs] Support all reduce collective in aDAG (#47621)
- [CompiledGraphs] Add visualization of compiled graphs (#47958)
💫 Enhancements:
- [Distributed Debugger] The distributed debugger can now be used without having to set RAY_DEBUG=1, see #48301 and https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html. If you want to restore the previous behavior and use the CLI based debugger, you need to set RAY_DEBUG=legacy.
- [Core] Add more infos to each breakpoint for ray debug CLI (#48202)
- [Core] Add demands info to GCS debug state (#48115)
- [Core] Add PENDING_ACTOR_TASK_ARGS_FETCH and PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus (#48242)
- [Core] Add metrics ray_io_context_event_loop_lag_ms. (#47989)
- [Core] Better log format when show the disk size (#46869)
- [CompiledGraphs] Support asyncio.gather on multiple CompiledDAGFutures (#47860)
- [CompiledGraphs] Raise an exception if a leaf node is found during compilation (#47757)
🔨 Fixes:
- [Core] Posts CoreWorkerMemoryStore callbacks onto io_context to fix deadlock (#47833)
Dashboard
🔨 Fixes:
- [Dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count (#48229)
Thanks
Many thanks to all those who contributed to this release!
@akyang-anyscale, @rkooo567, @bveeramani, @dayshah, @martinbomio, @khluu, @justinvyu, @slfan1989, @alexeykudinkin, @simonsays1980, @vigneshka, @ruisearch42, @rynewang, @scottjlee, @jjyao, @JoshKarpel, @win5923, @MengjinYan, @MortalHappiness, @ujjawal-khare-27, @zcin, @ccoulombe, @Bye-legumes, @dentiny, @stephanie-wang, @LeoLiao123, @dengwxn, @richo-anyscale, @pcmoritz, @sven1977, @omatthew98, @GeneDer, @srinathk10, @can-anyscale, @edoakes, @kevin85421, @aslonnie, @jeffreyjeffreywang, @ArturNiederfahrenhorst
Ray-2.38.0
Ray Libraries
Ray Data
🎉 New Features:
💫 Enhancements:
- Add `partitioning` parameter to `read_parquet` (#47553)
- Add `SERVICE_UNAVAILABLE` to list of retried transient errors (#47673)
- Re-phrase the streaming executor current usage string (#47515)
- Remove ray.kill in ActorPoolMapOperator (#47752)
- Simplify and consolidate progress bar outputs (#47692)
- Refactor `OpRuntimeMetrics` to support properties (#47800)
- Refactor `plan_write_op` and `Datasink`s (#47942)
- Link `PhysicalOperator` to its `LogicalOperator` (#47986)
- Allow specifying both `num_cpus` and `num_gpus` for map APIs (#47995)
- Allow specifying insertion index when registering custom plan optimization `Rule`s (#48039)
- Adding in better framework for substituting logging handlers (#48056)
🔨 Fixes:
- Fix bug where Ray Data incorrectly emits progress bar warning (#47680)
- Yield remaining results from async `map_batches` (#47696)
- Fix event loop mismatch with async map (#47907)
- Make sure `num_gpus` provided to Ray Data is appropriately passed to the `ray.remote` call (#47768)
- Fix unequal partitions when grouping by multiple keys (#47924)
- Fix reading multiple parquet files with ragged ndarrays (#47961)
- Removing unneeded test case (#48031)
- Adding in better json checking in test logging (#48036)
- Fix bug with inserting custom optimization rule at index 0 (#48051)
- Fix logging output from `write_xxx` APIs (#48096)
📖 Documentation:
- Add docs section for Ray Data progress bars (#47804)
- Add reference to parquet predicate pushdown (#47881)
- Add tip about how to understand map_batches format (#47394)
Ray Train
🏗 Architecture refactoring:
- Remove deprecated mosaic and sklearn trainer code (#47901)
Ray Tune
🔨 Fixes:
- Fix WandbLoggerCallback to reuse actors upon restore (#47985)
Ray Serve
🔨 Fixes:
- Stop scheduling task early when requests have been canceled (#47847)
RLlib
🎉 New Features:
- Enable cloud checkpointing. (#47682)
💫 Enhancements:
- PPO on new API stack now shuffles batches properly before each epoch. (#47458)
- Other enhancements: #47705, #47501, #47731, #47451, #47830, #47970, #47157
🔨 Fixes:
- Fix spot node preemption problem (RLlib now run stably with EnvRunner workers on spot nodes) (#47940)
- Fix action masking example. (#47817)
- Various other fixes: #47973, #46721, #47914, #47880, #47304, #47686
🏗 Architecture refactoring:
- Switch on new API stack by default for SAC and DQN. (#47217)
- Remove Tf support on new API stack for PPO/IMPALA/APPO (only DreamerV3 on new API stack remains with tf now). (#47892)
- Discontinue support for "hybrid" API stack (using RLModule + Learner, but still on RolloutWorker and Policy) (#46085)
- RLModule (new API stack) refinements: #47884, #47885, #47889, #47908, #47915, #47965, #47775
📖 Documentation:
- Add new API stack migration guide. (#47779)
- New API stack example script: BC pre training, then PPO finetuning using same RLModule class. (#47838)
- New API stack: Autoregressive actions example. (#47829)
- Remove old API stack connector docs entirely. (#47778)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- CompiledGraphs: support multi readers in multi node when DAG is created from an actor (#47601)
💫 Enhancements:
- Add a flag to raise exception for out of band serialization of `ObjectRef` (#47544)
- Store each GCS table in its own Redis Hash (#46861)
- Decouple create worker vs pop worker request. (#47694)
- Add metrics for GCS jobs (#47793)
🔨 Fixes:
- Fix broken dashboard cluster page when there are dead nodes (#47701)
- Fix the `ray_tasks{State="PENDING_ARGS_FETCH"}` metric counting (#47770)
- Separate the attempt_number with the task_status in memory summary and object list (#47818)
- Fix object reconstruction hang on arguments pending creation (#47645)
- Fix check failure: `sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end()` (#47861)
- Fix check failure `RAY_CHECK(it != current_tasks_.end());` (#47659)
📖 Documentation:
- KubeRay docs: Add docs for YuniKorn Gang scheduling #47850
Dashboard
💫 Enhancements:
- Performance improvements for large scale clusters (#47617)
🔨 Fixes:
- Placement group and required resources not showing correctly in dashboard (#47754)
Thanks
Many thanks to all those who contributed to this release!
@GeneDer, @rkooo567, @dayshah, @saihaj, @nikitavemuri, @bill-oconnor-anyscale, @WeichenXu123, @can-anyscale, @jjyao, @edoakes, @kekulai-fredchang, @bveeramani, @alexeykudinkin, @raulchen, @khluu, @sven1977, @ruisearch42, @dentiny, @MengjinYan, @Mark2000, @simonsays1980, @rynewang, @PatricYan, @zcin, @sofianhnaide, @matthewdeng, @dlwh, @scottjlee, @MortalHappiness, @kevin85421, @win5923, @aslonnie, @prithvi081099, @richardsliu, @milesvant, @omatthew98, @Superskyyy, @pcmoritz
Ray-2.37.0
Ray Libraries
Ray Data
💫 Enhancements:
- Simplify custom metadata provider API (#47575)
- Change counts of metrics to rates of metrics (#47236)
- Throw exception for non-streaming HF datasets with "override_num_blocks" argument (#47559)
- Refactor custom optimizer rules (#47605)
🔨 Fixes:
- Remove ineffective retry code in `plan_read_op` (#47456)
- Fix incorrect pending task size if outputs are empty (#47604)
Ray Train
💫 Enhancements:
- Update run status and add stack trace to `TrainRunInfo` (#46875)
Ray Serve
💫 Enhancements:
- Allow control of some serve configuration via env vars (#47533)
- [serve] Faster detection of dead replicas (#47237)
🔨 Fixes:
- [Serve] fix component id logging field (#47609)
RLlib
💫 Enhancements:
- New API stack:
- Add restart-failed-env option to EnvRunners. (#47608)
- Offline RL: Store episodes in state form. (#47294)
- Offline RL: Replace GAE in MARWILOfflinePreLearner with `GeneralAdvantageEstimation` connector in learner pipeline. (#47532)
- Off-policy algos: Add episode sampling to EpisodeReplayBuffer. (#47500)
- RLModule APIs: Add `SelfSupervisedLossAPI` for RLModules that bring their own loss and `InferenceOnlyAPI`. (#47581, #47572)
Ray Core
💫 Enhancements:
- [aDAG] Allow custom NCCL group for aDAG (#47141)
- [aDAG] support buffered input (#47272)
- [aDAG] Support multi node multi reader (#47480)
- [Core] Make is_gpu, is_actor, root_detached_id fields late bind to workers. (#47212)
- [Core] Reconstruct actor to run lineage reconstruction triggered actor task (#47396)
- [Core] Optimize GetAllJobInfo API for performance (#47530)
🔨 Fixes:
- [aDAG] Fix ranks ordering for custom NCCL group (#47594)
Ray Clusters
📖 Documentation:
- [KubeRay] add a guide for deploying vLLM with RayService (#47038)
Thanks
Many thanks to all those who contributed to this release!
@ruisearch42, @andrewsykim, @timkpaine, @rkooo567, @WeichenXu123, @GeneDer, @sword865, @simonsays1980, @angelinalg, @sven1977, @jjyao, @woshiyyya, @aslonnie, @zcin, @omatthew98, @rueian, @khluu, @justinvyu, @bveeramani, @nikitavemuri, @chris-ray-zhang, @liuxsh9, @xingyu-long, @peytondmurray, @rynewang