Releases: ray-project/ray
Ray-2.44.1
Under screen-lit skies
A ray of bliss in each patch
Joy at any scale
Ray-2.44.0
Release Highlights
- This release features Ray Compiled Graph (beta). Ray Compiled Graph gives you a classic Ray Core-like API, but with (1) less than 50us system overhead for workloads that repeatedly execute the same task graph; and (2) native support for GPU-GPU communication via NCCL. Ray Compiled Graph APIs simplify high-performance multi-GPU workloads such as LLM inference and training. The beta release refines the API, enhances stability, and adds or improves features like visualization, profiling, and experimental GPU compute/communication overlap. For more information, refer to the Ray documentation: https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html (a minimal usage sketch follows these highlights).
- The experimental Ray Workflows library has been deprecated and will be removed in a future version of Ray. Ray Workflows has been marked experimental since its inception and hasn’t been maintained due to the Ray team focusing on other priorities. If you are using Ray Workflows, we recommend pinning your Ray version to 2.44.
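The snippet below is a minimal sketch of the Compiled Graph usage pattern referenced above, based on the documented `ray.dag` API (`InputNode`, `experimental_compile`); exact method names may differ between versions, so treat it as illustrative rather than canonical.

```python
import ray
from ray.dag import InputNode

@ray.remote
class EchoWorker:
    def echo(self, msg):
        return msg

worker = EchoWorker.remote()

# Define the task graph once using the classic Ray Core-like bind() API.
with InputNode() as inp:
    dag = worker.echo.bind(inp)

# Compile the graph; repeated executions reuse it with low system overhead.
compiled_dag = dag.experimental_compile()
print(ray.get(compiled_dag.execute("hello, compiled graph")))
```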
Ray Libraries
Ray Data
🎉 New Features:
- Add Iceberg write support through pyiceberg (#50590)
- [LLM] Various feature enhancements to Ray Data LLM, including LoRA support (#50804) and structured outputs (#50901)
💫 Enhancements:
- Add dataset/operator state, progress, total metrics (#50770)
- Make chunk combination threshold configurable (#51200)
- Store average memory use per task in OpRuntimeMetrics (#51126)
- Avoid unnecessary conversion to Numpy when creating Arrow/Pandas blocks (#51238)
- Append-mode API for preprocessors -- #50848, #50847, #50642, #50856, #50584. Note that vectorizers and hashers now output a single column instead of 1 column per feature. In the near future, we will be graduating preprocessors to beta.
🔨 Fixes:
- Fixing Map Operators to avoid unconditionally overriding generator's back-pressure configuration (#50900)
- Fix filter expr equating negative numbers (#50932)
- Fix error message for `override_num_blocks` when reading from a HuggingFace Dataset (#50998)
- Make num_blocks in repartition optional (#50997)
- Always pin the seed when doing file-based random shuffle (#50924)
- Fix `StandardScaler` to handle `NaN` stats (#51281)
Ray Train
🎉 New Features:
💫 Enhancements:
- Folded v2.XGBoostTrainer API into the public trainer class as an alternate constructor (#50045)
- Created a default ScalingConfig if one is not provided to the trainer (#51093)
- Improved TrainingFailedError message (#51199)
- Utilize FailurePolicy factory (#51067)
🔨 Fixes:
- Fixed trainer import deserialization when captured within a Ray task (#50862)
- Fixed serialize import test for Python 3.12 (#50963)
- Fixed RunConfig deprecation message in Tune being emitted in trainer.fit usage (#51198)
📖 Documentation:
- [Train V2] Updated API references (#51222)
- [Train V2] Updated persistent storage guide (#51202)
- [Train V2] Updated user guides for metrics, checkpoints, results, and experiment tracking (#51204)
- [Train V2] Added updated Train + Tune user guide (#51048)
- [Train V2] Added updated fault tolerance user guide (#51083)
- Improved HF Transformers example (#50896)
- Improved Train DeepSpeed example (#50906)
- Use correct mean and standard deviation norm values in image tutorials (#50240)
🏗 Architecture refactoring:
- Deprecated Torch AMP wrapper utilities (#51066)
- Hid private functions of train context to avoid abuse (#50874)
- Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
- Moved library usage tests out of core (#51161)
Ray Tune
📖 Documentation:
- Various improvements to Tune Pytorch CIFAR tutorial (#50316)
- Various improvements to the Ray Tune XGBoost tutorial (#50455)
- Various enhancements to Tune Keras example (#50581)
- Minor improvements to Hyperopt tutorial (#50697)
- Various improvements to LightGBM tutorial (#50704)
- Fixed non-runnable Optuna tutorial (#50404)
- Added documentation for Asynchronous HyperBand Example in Tune (#50708)
- Replaced reuse actors example with a fuller demonstration (#51234)
- Fixed broken PB2/RLlib example (#51219)
- Fixed typo and standardized equations across the two APIs (#51114)
- Improved PBT example (#50870)
- Removed broken links in documentation (#50995, #50996)
🏗 Architecture refactoring:
- Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
- Moved library usage tests out of core (#51161)
Ray Serve
🎉 New Features:
💫 Enhancements:
- Clean up shutdown behavior of serve (#51009)
- Add `additional_log_standard_attrs` to serve logging config (#51144)
- [LLM] Remove `asyncache` and `cachetools` from dependencies (#50806)
- [LLM] Remove `backoff` dependency (#50822)
- [LLM] Remove `asyncio_timeout` from `ray[llm]` deps on python < 3.11 (#50815)
- [LLM] Made JSON validator a singleton and `jsonref` packages lazy imported (#50821)
- [LLM] Reuse `AutoscalingConfig` and `DeploymentConfig` from Serve (#50871)
- [LLM] Use `pyarrow` FS for cloud remote storage interaction (#50820)
- [LLM] Add usage telemetry for `serve.llm` (#51221)
🔨 Fixes:
- Exclude redirects from request error count (#51130)
- [LLM] Fix the wrong `device_capability` issue in vllm on quantized models (#51007)
- [LLM] Add `gen-config` related data file to the package (#51347)
📖 Documentation:
- [LLM] Fix quickstart serve LLM docs (#50910)
- [LLM] Update `build_openai_app` to include yaml example (#51283)
- [LLM] Remove old vllm+serve doc (#51311)
RLlib
💫 Enhancements:
- APPO/IMPALA acceleration:
- Unify namings for actor managers' outstanding in-flight requests metrics. (#51159)
- Add timers to env step, forward pass, and complete connector pipelines runs. (#51160)
🔨 Fixes:
📖 Documentation:
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Enhanced `uv` support (#51233)
💫 Enhancements:
- Made infeasible task errors much more obvious (#45909)
- Log rotation for workers, runtime env agent, and dashboard agent (#50759, #50877, #50909)
- Support customizing gloo timeout (#50223)
- Support torch profiling in Compiled Graph (#51022)
- Change default tensor deserialization in Compiled Graph (#50778)
- Use current node id if no node is specified on ray drain-node (#51134)
🔨 Fixes:
- Fixed an issue where the raylet continued to have high CPU overhead after a job was terminated ([...
Ray-2.43.0
Highlights
- This release features new modules in Ray Serve and Ray Data for integration with large language models, marking the first step of addressing #50639. Existing Ray Data and Ray Serve have limited support for LLM deployments, where users have to manually configure and manage the underlying LLM engine. In this release, we offer APIs for both batch inference and serving of LLMs within Ray in `ray.data.llm` and `ray.serve.llm`. See the notes below for more details. These APIs are marked as alpha -- meaning they may change in future releases without a deprecation period.
- Ray Train V2 is available to try starting in Ray 2.43! Run your next Ray Train job with the `RAY_TRAIN_V2_ENABLED=1` environment variable. See the migration guide for more information.
- A new integration with `uv run` allows you to specify Python dependencies for both the driver and workers in a consistent way and enables quick iteration during development of Ray applications (#50160, #50462); check out our blog post.
Ray Libraries
Ray Data
🎉 New Features:
- Ray Data LLM: We are introducing a new module in Ray Data for batch inference with LLMs (currently marked as alpha). It offers a new `Processor` abstraction that interoperates with existing Ray Data pipelines. This abstraction can be configured in two ways:
  - Using the `vLLMEngineProcessorConfig`, which configures vLLM to load model replicas for high-throughput model inference
  - Using the `HttpRequestProcessorConfig`, which sends HTTP requests to an OpenAI-compatible endpoint for inference
  - Documentation for these features can be found here; a usage sketch follows this feature list.
- Implement accurate memory accounting for `UnionOperator` (#50436)
- Implement accurate memory accounting for all-to-all operations (#50290)
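A minimal sketch of the `Processor` flow described above, assuming the alpha `ray.data.llm` API (`vLLMEngineProcessorConfig`, `build_llm_processor`); field and parameter names follow the alpha documentation and may change in later releases.

```python
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Configure vLLM to load model replicas for batch inference (alpha API;
# the model name and config fields here are illustrative).
config = vLLMEngineProcessorConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    concurrency=1,
    batch_size=32,
)

processor = build_llm_processor(
    config,
    # Map each input row to a chat-style request...
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=64),
    ),
    # ...and keep only the generated text in the output rows.
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": "What is Ray Data?"}])
ds = processor(ds)
ds.show(limit=1)
```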
💫 Enhancements:
- Support class constructor args for filter() (#50245)
- Persist ParquetDatasource metadata. (#50332)
- Rebasing `ShufflingBatcher` onto `try_combine_chunked_columns` (#50296)
- Improve warning message if required dependency isn't installed (#50464)
- Move data-related test logic out of core tests directory (#50482)
- Pass executor as an argument to ExecutionCallback (#50165)
- Add operator id info to task+actor (#50323)
- Abstracting common methods, removing duplication in `ArrowBlockAccessor`, `PandasBlockAccessor` (#50498)
- Warn if map UDF is too large (#50611)
- Replace `AggregateFn` with `AggregateFnV2`, cleaning up aggregation infrastructure (#50585)
- Simplify Operator.repr (#50620)
- Adding in `TaskDurationStats` and `on_execution_step` callback (#50766)
- Print Resource Manager stats in release tests (#50801)
🔨 Fixes:
- Fix invalid escape sequences in `grouped_data.py` docstrings (#50392)
- Deflake `test_map_batches_async_generator` (#50459)
- Avoid memory leak with `pyarrow.infer_type` on datetime arrays (#50403)
- Fix parquet partition cols to support tensor types (#50591)
- Fixing aggregation protocol to be appropriately associative (#50757)
📖 Documentation:
- Remove "Stable Diffusion Batch Prediction with Ray Data" example (#50460)
Ray Train
🎉 New Features:
- Ray Train V2 is available to try starting in Ray 2.43! Run your next Ray Train job with the `RAY_TRAIN_V2_ENABLED=1` environment variable. See the migration guide for more information.
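A hedged illustration of the opt-in flow: the sketch below sets the environment variable in-process before importing Ray Train. The variable name comes from the release note; setting it programmatically like this is an assumption, and exporting it in the shell before launching the job works equally well.

```python
import os

# Opt into Ray Train V2 before Ray Train is imported (assumption: the flag is
# read at import/initialization time; exporting it in the shell also works).
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Your per-worker training code goes here.
    pass

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```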
💫 Enhancements:
- Add a training ingest benchmark release test (#50019, #50299) with a fault tolerance variant (#50399)
- Add telemetry for Trainer usage in V2 (#50321)
- Add pydantic as a `ray[train]` extra install (#46682)
- Add state tracking to Train V2 to make run status, run attempts, and training worker metadata observable (#50515)
🔨 Fixes:
- Increase doc test parallelism (#50326)
- Disable TF test for py312 (#50382)
- Increase test timeout to deflake (#50796)
📖 Documentation:
- Add missing xgboost pip install in example (#50232)
🏗 Architecture refactoring:
- Add deprecation warnings pointing to a migration guide for Ray Train V2 (#49455, #50101, #50322)
- Refactor internal Train controller state management (#50113, #50181, #50388)
Ray Tune
🔨 Fixes:
- Fix worker node failure test (#50109)
📖 Documentation:
- Update all doc examples off of ray.train imports (#50458)
- Update all ray/tune/examples off of ray.train imports (#50435)
- Fix typos in persistent storage guide (#50127)
- Remove Binder notebook links in Ray Tune docs (#50621)
🏗 Architecture refactoring:
- Update RLlib to use ray.tune imports instead of ray.air and ray.train (#49895)
Ray Serve
🎉 New Features:
- Ray Serve LLM: We are introducing a new module in Ray Serve to easily integrate open source LLMs in your Ray Serve deployment, currently marked as alpha. This opens up a powerful capability of composing complex applications with multiple LLMs, which is a use case in emerging applications like agentic workflows. Ray Serve LLM offers a couple of core components, including:
  - `VLLMService`: A prebuilt deployment that offers a full-featured vLLM engine integration, with support for features such as LoRA multiplexing and multimodal language models.
  - `LLMRouter`: An out-of-the-box OpenAI-compatible model router that can route across multiple LLM deployments.
  - Documentation can be found at https://docs.ray.io/en/releases-2.43.0/serve/llm/overview.html
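A hedged sketch of composing these components via the alpha `ray.serve.llm` module; the import path, `LLMConfig` fields, and `build_openai_app` argument shape follow the alpha documentation and may shift between releases.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app  # alpha module; import path may vary

# Describe one vLLM-backed model deployment (field names per the alpha docs).
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
)

# Build an OpenAI-compatible app that routes across the configured models
# and deploy it with Ray Serve (argument shape is an assumption).
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```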
💫 Enhancements:
- Add `required_resources` to REST API (#50058)
🔨 Fixes:
- Fix batched requests hanging after cancellation (#50054)
- Properly propagate backpressure error (#50311)
RLlib
🎉 New Features:
- Added env vectorization support for multi-agent (new API stack). (#50437)
💫 Enhancements:
- APPO/IMPALA various acceleration efforts. Reached 100k ts/sec on Atari benchmark with 400 EnvRunners and 16 (multi-node) GPU Learners: #50760, #50162, #50249, #50353, #50368, #50379, #50440, #50477, #50527, #50528, #50600, #50309
- Offline RL:
🔨 Fixes:
- Fix SPOT preemption tolerance for large AlgorithmConfig: Pass by reference to RolloutWorker...
Ray-2.42.1
Ray-2.42.0
Ray Libraries
Ray Data
🎉 New Features:
- Added read_audio and read_video (#50016)
💫 Enhancements:
- Optimized multi-column groupbys (#45667)
- Included Ray user-agent in BigQuery client construction (#49922)
🔨 Fixes:
- Fixed bug that made read tasks non-deterministic (#49897)
🗑️ Deprecations:
- Deprecated num_rows_per_file in favor of min_rows_per_file (#49978)
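A small migration sketch for the deprecation above; `min_rows_per_file` replaces `num_rows_per_file` on write APIs such as `write_parquet` (the local output path is an illustrative assumption).

```python
import ray

ds = ray.data.range(1_000)

# Before (deprecated): ds.write_parquet("/tmp/out", num_rows_per_file=100)
# After:
ds.write_parquet("/tmp/out", min_rows_per_file=100)
```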
Ray Train
💫 Enhancements:
- Add Train v2 user-facing callback interface (#49819)
- Add TuneReportCallback for propagating intermediate Train results to Tune (#49927)
Ray Tune
📖 Documentation:
- Fix BayesOptSearch docs (#49848)
Ray Serve
💫 Enhancements:
- Cache metrics in replica and report on an interval (#49971)
- Cache expensive calls to inspect.signature (#49975)
- Remove extra pickle serialization for gRPCRequest (#49943)
- Shared LongPollClient for Routers (#48807)
- DeploymentHandle API is now stable (#49840)
🔨 Fixes:
- Fix batched requests hanging after request cancellation bug (#50054)
RLlib
💫 Enhancements:
- Add metrics to replay buffers. (#49822)
- Enhance node-failure tolerance (new API stack). (#50007)
- MetricsLogger cleanup throughput logic. (#49981)
- Split AddStates... connectors into 2 connector pieces (`AddTimeDimToBatchAndZeroPad` and `AddStatesFromEpisodesToBatch`) (#49835)
🔨 Fixes:
- Old API stack IMPALA/APPO: Re-introduce mixin-replay-buffer pass, even if `replay-ratio=0` (fixes a memory leak). (#49964)
- Fix MetricsLogger race conditions. (#49888)
- APPO/IMPALA: Bug fix for > 1 Learner actor. (#49849)
📖 Documentation:
- New MetricsLogger API rst page. (#49538)
- Move "new API stack" info box right below page titles for better visibility. (#49921)
- Add example script for how to log custom metrics in `training_step()`. (#49976)
- Enhance/redo autoregressive action distribution example. (#49967)
- Make the "tiny CNN" example RLModule run with APPO (by implementing `TargetNetAPI`). (#49825)
Ray Core and Ray Clusters
Ray Core
💫 Enhancements:
- Only get single node info rather than all when needed (#49727)
- Introduce with_tensor_transport API (#49753)
🔨 Fixes:
- Fix tqdm manager thread safety (#50040)
Ray Clusters
🔨 Fixes:
- Fix token expiration for ray autoscaler (#48481)
Thanks
Thank you to everyone who contributed to this release! 🥳
@wingkitlee0, @saihaj, @win5923, @justinvyu, @kevin85421, @edoakes, @cristianjd, @rynewang, @richardliaw, @LeoLiao123, @alexeykudinkin, @simonsays1980, @aslonnie, @ruisearch42, @pcmoritz, @fscnick, @bveeramani, @mattip, @till-m, @tswast, @ujjawal-khare, @wadhah101, @nikitavemuri, @akshay-anyscale, @srinathk10, @zcin, @dayshah, @dentiny, @LydiaXwQ, @matthewdeng, @JoshKarpel, @MortalHappiness, @sven1977, @omatthew98
Ray-2.41.0
Highlights
- Major update of RLlib docs and example scripts for the new API stack.
Ray Libraries
Ray Data
🎉 New Features:
- Expression support for filters (#49016)
- Support `partition_cols` in `write_parquet` (#49411) (a combined usage sketch with expression filters follows this list)
- Feature: implement multi-directional sort over Ray Data datasets (#49281)
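A combined sketch of the expression-filter and `partition_cols` features from this list; the string expression syntax and local output path are illustrative assumptions.

```python
import ray

ds = ray.data.from_items(
    [{"id": i, "category": "even" if i % 2 == 0 else "odd"} for i in range(100)]
)

# Expression-based filter (string expressions are new in this release).
ds = ds.filter(expr="id > 50")

# Hive-style partitioned Parquet output via the new partition_cols parameter.
ds.write_parquet("/tmp/partitioned_output", partition_cols=["category"])
```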
💫 Enhancements:
- Use dask 2022.10.2 (#48898)
- Clarify schema validation error (#48882)
- Raise `ValueError` when the data sort key is `None` (#48969)
- Provide more informative messages on webdataset format errors (#48643)
- Upgrade Arrow version from 17 to 18 (#48448)
- Update `hudi` version to 0.2.0 (#48875)
- `webdataset`: expand JSON objects into individual samples (#48673)
- Support passing kwargs to map tasks. (#49208)
- Add `ExecutionCallback` interface (#49205)
- Add seed for read files (#49129)
- Make `select_columns` and `rename_columns` use Project operator (#49393)
🔨 Fixes:
- Fix partial function name parsing in `map_groups` (#48907)
- Always launch one task for `read_sql` (#48923)
- Reimplement of fix memory pandas (#48970)
- `webdataset`: flatten return args (#48674)
- Handle `numpy > 2.0.0` behaviour in `_create_possibly_ragged_ndarray` (#48064)
- Fix `DataContext` sealing for multiple datasets. (#49096)
- Fix `to_tf` for `List` types (#49139)
- Fix type mismatch error while mapping nullable column (#49405)
- Datasink: support passing write results to `on_write_completes` (#49251)
- Fix `groupby` hang when value contains `np.nan` (#49420)
- Fix bug where `file_extensions` doesn't work with compound extensions (#49244)
- Fix map operator fusion when concurrency is set (#49573)
Ray Train
🎉 New Features:
- Output JSON structured log files for system and application logs (#49414)
- Add support for AMD ROCR_VISIBLE_DEVICES (#49346)
💫 Enhancements:
🏗 Architecture refactoring:
- LightGBM: Rewrite `get_network_params` implementation (#49019)
Ray Tune
🎉 New Features:
- Update `optuna_search` to allow users to configure optuna storage (#48547)
🏗 Architecture refactoring:
Ray Serve
💫 Enhancements:
- Improved request_id generation to reduce proxy CPU overhead (#49537)
- Tune GC threshold by default in proxy (#49720)
- Use `pickle.dumps` for faster serialization from `proxy` to `replica` (#49539)
🔨 Fixes:
- Handle nested ‘=’ in serve run arguments (#49719)
- Fix bug when `ray.init()` is called multiple times with different `runtime_envs` (#49074)
🗑️ Deprecations:
- Adds a warning that the default behavior for sync methods will change in a future release. They will be run in a threadpool by default. You can opt into this behavior early by setting `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1`. (#48897)
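As a hedged illustration of what changes: a deployment like the one below with a synchronous `__call__` currently runs that method on the replica's event loop, and opting in via `RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1` moves it into a threadpool instead (the deployment itself is a generic example, not from the release).

```python
from ray import serve

@serve.deployment
class SyncHandler:
    def __call__(self, request) -> str:
        # Synchronous work; with the env var set, Serve runs this in a
        # threadpool instead of blocking the replica's asyncio event loop.
        return "ok"

app = SyncHandler.bind()
# serve.run(app)  # launch with RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1 to opt in early
```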
RLlib
🎉 New Features:
- Add support for external Envs to new API stack: New example script and custom tcp-capable EnvRunner. (#49033)
💫 Enhancements:
- Offline RL:
- APPO/IMPALA acceleration (new API stack):
- Add support for `AggregatorActors` per Learner. (#49284)
- Auto-sleep time AND thread-safety for MetricsLogger. (#48868)
- Activate APPO cont. actions release- and CI tests (HalfCheetah-v1 and Pendulum-v1 new in `tuned_examples`). (#49068)
- Add "burn-in" period setting to the training of stateful RLModules. (#49680)
- Callbacks API: Add support for individual lambda-style callbacks. (#49511)
- Other enhancements: #49687, #49714, #49693, #49497, #49800, #49098
📖 Documentation:
- New example scripts:
- New/rewritten html pages:
- Rewrite checkpointing page. (#49504)
- New scaling guide. (#49528)
- New callbacks page. (#49513)
- Rewrite `RLModule` page. (#49387)
- New AlgorithmConfig page and redo `package_ref` page for algo configs. (#49464)
- Rewrite offline RL page. (#48818)
- Rewrite "key concepts" rst page. (#49398)
- Rewrite RL environments pages. (#49165, #48542)
- Fixes and enhancements: #49465, #49037, #49304, #49428, #49474, #49399, #49713, #49518
🔨 Fixes:
- Add `on_episode_created` callback to SingleAgentEnvRunner. (#49487)
- Fix `train_batch_size_per_learner` problems. (#49715)
- Various other fixes: #48540, #49363, #49418, #49191
🏗 Architecture refactoring:
- RLModule: Introduce `Default[algo]RLModule` classes (#49366, #49368)
- Remove RLlib dependencies from setup.py; add `ormsgpack` (#49489)
🗑️ Deprecations:
Ray Core and Ray Clusters
Ray Core
💫 Enhancements:
- Add `task_name`, `task_function_name` and `actor_name` in Structured Logging (#48703)
- Support redis/valkey authentication with username (#48225)
- Add v6e TPU Head Resource Autoscaling Support (#48201)
- compiled graphs: Support all driver and actor read combinations (#48963)
- compiled graphs: Add ascii based CG visualization (#48315)
- compiled graphs: Add ray[cg] pip install option (#49220)
- Allow uv cache at installation (#49176)
- Support != Filter in GCS for Task State API (#48983)
- compiled graphs: Add CPU-based NCCL communicator for development (#48440)
- Support gcs and raylet log rotation (#48952)
- compiled graphs: Support `nsight.nvtx` profiling (#49392)
🔨 Fixes:
- autoscaler: Health check logs are not visible in the autoscaler container's stdout (#48905)
- Only publish `WORKER_OBJECT_EVICTION` when the object is out of scope or manually freed (#47990)
- autoscaler: Autoscaler doesn't scale up correctly when the KubeRay RayCluster is not in the goal state (#48909)
- autoscaler: Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 (#48519)
- compiled graphs: Fix the missing dependencies when num_returns is used (#49118)
- autoscaler: Fuse scaling requests together to avoid overloading the Kubernetes API server (#49150)
- Fix bug to support S3 pre-signed url for `.whl` file (#48560)
- Fix data race on gRPC client context (#49475)
- Make sure draining node is not selected for scheduling (#49517)
Ray Clusters
💫 Enhancements:
- Azure: Enable accelerated networking as a flag in azure vms (#47988)
📖 Documentation:
- Kuberay: Logging: Add Fluent Bit `DaemonSet` and Grafana Loki to "Persist KubeRay Operator Logs" (#48725)
- Kuberay: Logging: Specify the Helm chart version in "Persist KubeRay Operator Logs" (#48937)
Dashboard
💫 Enhancements:
- Add instance variable to many default dashboard graphs (#49174)
- Display duration in milliseconds if under 1 second. (#49126)
- Add `RAY_PROMETHEUS_HEADERS` env for carrying additional headers to Prometheus (#49353)
- Document the `RAY_PROMETHEUS_HEADERS` env for carrying additional headers to Prometheus (#49700)
🏗 Architecture refactoring:
- Move `memray` dependency from default to observability (#47763)
- Move `StateHead`'s methods into free functions. (#49388)
Thanks
@raulchen, @alanwguo, @omatthew98, @xingyu-long, @tlinkin, @yantzu, @alexeykudinkin, @andrewsykim, @win5923, @csy1204, @dayshah, @richardliaw, @stephanie-wang, @gueraf, @rueian, @davidxia, @fscnick, @wingkitlee0, @KPostOffice, @GeneDer, @MengjinYan, @simonsays1980, @pcmoritz, @petern48, @kashiwachen, @pfldy2850, @zcin, @scottjlee, @Akhil-CM, @Jay-ju, @JoshKarpel, @edoakes, @ruisearch42, @gorloffslava, @jimmyxie-figma, @bthananjeyan, @sven1977, @bnorick, @jeffreyjeffreywang, @ravi-dalal, @matthewdeng, @angelinalg, @ivanthewebber, @rkooo567, @srinathk10, @maresb, @gvspraveen, @akyang-anyscale, @mimiliaogo, @bveeramani, @ryanaoleary, @kevin85421, @richardsliu, @hartikainen, @coltwood93, @mattip, @Superskyyy, @justinvyu, @hongpeng-guo, @ArturNiederfahrenhorst, @jecsand838, @Bye-legumes, @hcc429, @WeichenXu123, @martinbomio, @HollowMan6, @MortalHappiness, @dentiny, @zhe-thoughts, @anyadontfly, @smanolloff, @richo-anyscale, @khluu, @xushiyan, @rynewang, @japneet-anyscale, @jjyao, @sumanthratna, @saihaj, @aslonnie
Many thanks to all those who contributed to this release!
Ray-2.40.0
Ray Libraries
Ray Data
🎉 New Features:
- Added read_hudi (#46273)
💫 Enhancements:
- Improved performance of DelegatingBlockBuilder (#48509)
- Improved memory accounting of pandas blocks (#46939)
🔨 Fixes:
- Fixed bug where you can’t specify a schema with write_parquet (#48630)
- Fixed bug where to_pandas errors if your dataset contains Arrow and pandas blocks (#48583)
- Fixed bug where map_groups doesn’t work with pandas data (#48287)
- Fixed bug where write_parquet errors if your data contains nullable fields (#48478)
- Fixed bug where “Iteration Blocked Time” charts looks incorrect (#48618)
- Fixed bug where unique fails with null values (#48750)
- Fixed bug where “Rows Outputted” is 0 in the Data dashboard (#48745)
- Fixed bug where methods like drop_columns cause spilling (#48140)
- Fixed bug where async map tasks hang (#48861)
🗑️ Deprecations:
- Deprecated read_parquet_bulk #48691
- Deprecated iter_tf_batches #48693
- Deprecated meta_provider parameter of read functions (#48690)
- Deprecated to_torch (#48692)
Ray Train
🔨 Fixes:
- Fix StartTracebackWithWorkerRank serialization (#48548)
📖 Documentation:
- Add example for fine-tuning Llama3.1 with AWS Trainium (#48768)
Ray Tune
🔨 Fixes:
- Remove the `clear_checkpoint` function during Trial restoration error handling. (#48532)
Ray Serve
🎉 New Features:
- Initial version of local_testing_mode (#48477)
💫 Enhancements:
- Handle multiple changed objects per LongPollHost.listen_for_change RPC (#48803)
- Add more nuanced checks for http proxy status errors (#47896)
- Improve replica access log messages to include HTTP status info and better resemble standard log format (#48819)
- Propagate replica constructor error to deployment status message and print num retries left (#48531)
🔨 Fixes:
- Pending requests that are cancelled before they were assigned to a replica now also return a serve.RequestCancelledError (#48496)
RLlib
💫 Enhancements:
- Release test enhancements. (#45803, #48681)
- Make opencv-python-headless default over opencv-python (#48776)
- Reverse learner queue behavior of IMPALA/APPO (consume oldest batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)
🔨 Fixes:
- Fix torch scheduler stepping and reporting. (#48125)
- Fix accumulation of results over n training_step calls within same iteration (new API stack). (#48136)
- Various other fixes: #48563, #48314, #48698, #48869.
📖 Documentation:
- Upgrade examples script overview page (new API stack). (#48526)
- Enable RLlib + Serve example in CI and translate to new API stack. (#48687)
🏗 Architecture refactoring:
- Switch new API stack on by default, APPO, IMPALA, BC, MARWIL, and CQL. (#48516, #48599)
- Various APPO enhancements (new API stack): Circular buffer (#48798), minor loss math fixes (#48800), target network update logic (#48802), smaller cleanups (#48844).
- Remove `rllib_contrib` from repo. (#48565)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- [Core] uv runtime env support (#48479, #48486, #48611, #48619, #48632, #48634, #48637, #48670, #48731)
- [Core] GCS FT with redis sentinel (#47335)
💫 Enhancements:
- [CompiledGraphs] Refine schedule visualization (#48594)
🔨 Fixes:
- [CompiledGraphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs (#48463)
- [Core] Fix Ascend NPU discovery to support 8+ cards per node (#48543)
- [Core] Make Placement Group Wildcard and Indexed Resource Assignments Consistent (#48088)
- [Core] Stop the GRPC server before Shut down the Object Store (#48572)
Ray Clusters
🔨 Fixes:
- [KubeRay]: Fix ConnectionError on Autoscaler CR lookups in K8s clusters with custom DNS for Kubernetes API. (#48541)
Dashboard
💫 Enhancements:
- Add global UTC timezone button in navbar with local storage (#48510)
- Add memory graphs optimized for OOM debugging (#48530)
- Improve tasks/actors metric naming and add graph for running tasks (#48528)
- Add actor pid to dashboard (#48791)
🔨 Fixes:
- Fix Placement Group Table table cells overflow (#47323)
- Fix Rows Outputted being zero on Ray Data Dashboard (#48745)
- fix confusing dataset operator name (#48805)
Thanks
Thanks to all those who contributed to this release!
@rynewang, @rickyyx, @bveeramani, @marwan116, @simonsays1980, @dayshah, @dentiny, @KepingYan, @mimiliaogo, @kevin85421, @SeaOfOcean, @stephanie-wang, @mohitjain2504, @azayz, @xushiyan, @richardliaw, @can-anyscale, @xingyu-long, @kanwang, @aslonnie, @MortalHappiness, @jjyao, @SumanthRH, @matthewdeng, @alexeykudinkin, @sven1977, @raulchen, @andrewsykim, @zcin, @nadongjun, @hongpeng-guo, @miguelteixeiraa, @saihaj, @khluu, @ArturNiederfahrenhorst, @ryanaoleary, @ltbringer, @pcmoritz, @JoshKarpel, @akyang-anyscale, @frances720, @BeingGod, @edoakes, @Bye-legumes, @Superskyyy, @liuxsh9, @MengjinYan, @ruisearch42, @scottjlee, @angelinalg
Ray-2.39.0
Ray Libraries
Ray Data
🔨 Fixes:
- Fixed InvalidObjectError edge case with Dataset.split() (#48130)
- Made Concatenator preserve order of concatenated columns (#47997)
📖 Documentation:
- Improved documentation around Parquet column and predicate pushdown (#48095)
- Marked num_rows_per_file parameter of write APIs as experimental (#48208)
- One hot encoder now returns an encoded vector (#48173)
- transform_batch no longer fails on missing columns (#48137)
🏗 Architecture refactoring:
- Dataset.count() now uses a Count logical operator (#48126)
🗑 Deprecations:
- Removed long-deprecated set_progress_bars (#48203)
Ray Train
🔨 Fixes:
- Safely check if the storage filesystem is `pyarrow.fs.S3FileSystem` (#48216)
Ray Tune
🔨 Fixes:
- Safely check if the storage filesystem is `pyarrow.fs.S3FileSystem` (#48216)
Ray Serve
💫 Enhancements:
- Cancelled requests now return a serve.RequestCancelledError (#48444)
- Exposed application source in app details model (#45522)
🔨 Fixes:
- Basic HTTP deployments will now return “Internal Server Error” instead of a traceback to match FastAPI behavior (#48491)
- Fixed an issue where high values of max_ongoing_requests couldn’t be reached due to an interaction with core’s max_concurrency (#48274)
- Fixed an edge case where pending requests were not canceled properly (#47873)
- Removed deprecated API to set route_prefix per-deployment (#48223)
📖 Documentation:
- Added ProxyStatus model to reference docs (#48299)
- Added ApplicationStatus model to reference docs (#48220)
RLlib
💫 Enhancements:
- Upgrade to gymnasium==1.0.0 (support new API for vector env resets). (#48443, #45328)
- Add off-policy'ness metric to new API stack. (#48227)
- Validate episodes before adding them to the buffer. (#48083)
📖 Documentation:
- New example script for custom metrics on `EnvRunners` (using `MetricsLogger` API on the new stack). (#47969)
- Do-over: New RLlib index page. (#48285, #48442)
- Do-over: Example script for AutoregressiveActionsRLM. (#47972)
🏗 Architecture refactoring:
- New API stack on by default for PPO. (#48284)
- Change config.fault_tolerance default behavior (from `recreate_failed_env_runners=False` to `True`). (#48286)
🔨 Fixes:
Ray Core
🎉 New Features:
- [CompiledGraphs] Support all reduce collective in aDAG (#47621)
- [CompiledGraphs] Add visualization of compiled graphs (#47958)
💫 Enhancements:
- [Distributed Debugger] The distributed debugger can now be used without having to set RAY_DEBUG=1, see #48301 and https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html. If you want to restore the previous behavior and use the CLI based debugger, you need to set RAY_DEBUG=legacy.
- [Core] Add more infos to each breakpoint for ray debug CLI (#48202)
- [Core] Add demands info to GCS debug state (#48115)
- [Core] Add PENDING_ACTOR_TASK_ARGS_FETCH and PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus (#48242)
- [Core] Add metrics ray_io_context_event_loop_lag_ms. (#47989)
- [Core] Better log format when show the disk size (#46869)
- [CompiledGraphs] Support asyncio.gather on multiple CompiledDAGFutures (#47860)
- [CompiledGraphs] Raise an exception if a leaf node is found during compilation (#47757)
🔨 Fixes:
- [Core] Posts CoreWorkerMemoryStore callbacks onto io_context to fix deadlock (#47833)
Dashboard
🔨 Fixes:
- [Dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count (#48229)
Thanks
Many thanks to all those who contributed to this release!
@akyang-anyscale, @rkooo567, @bveeramani, @dayshah, @martinbomio, @khluu, @justinvyu, @slfan1989, @alexeykudinkin, @simonsays1980, @vigneshka, @ruisearch42, @rynewang, @scottjlee, @jjyao, @JoshKarpel, @win5923, @MengjinYan, @MortalHappiness, @ujjawal-khare-27, @zcin, @ccoulombe, @Bye-legumes, @dentiny, @stephanie-wang, @LeoLiao123, @dengwxn, @richo-anyscale, @pcmoritz, @sven1977, @omatthew98, @GeneDer, @srinathk10, @can-anyscale, @edoakes, @kevin85421, @aslonnie, @jeffreyjeffreywang, @ArturNiederfahrenhorst
Ray-2.38.0
Ray Libraries
Ray Data
🎉 New Features:
💫 Enhancements:
- Add `partitioning` parameter to `read_parquet` (#47553)
- Add `SERVICE_UNAVAILABLE` to list of retried transient errors (#47673)
- Re-phrase the streaming executor current usage string (#47515)
- Remove ray.kill in ActorPoolMapOperator (#47752)
- Simplify and consolidate progress bar outputs (#47692)
- Refactor `OpRuntimeMetrics` to support properties (#47800)
- Refactor `plan_write_op` and `Datasink`s (#47942)
- Link `PhysicalOperator` to its `LogicalOperator` (#47986)
- Allow specifying both `num_cpus` and `num_gpus` for map APIs (#47995)
- Allow specifying insertion index when registering custom plan optimization `Rule`s (#48039)
- Adding in better framework for substituting logging handlers (#48056)
🔨 Fixes:
- Fix bug where Ray Data incorrectly emits progress bar warning (#47680)
- Yield remaining results from async `map_batches` (#47696)
- Fix event loop mismatch with async map (#47907)
- Make sure `num_gpus` provided to Ray Data is appropriately passed to the `ray.remote` call (#47768)
- Fix unequal partitions when grouping by multiple keys (#47924)
- Fix reading multiple parquet files with ragged ndarrays (#47961)
- Removing unneeded test case (#48031)
- Adding in better json checking in test logging (#48036)
- Fix bug with inserting custom optimization rule at index 0 (#48051)
- Fix logging output from `write_xxx` APIs (#48096)
📖 Documentation:
- Add docs section for Ray Data progress bars (#47804)
- Add reference to parquet predicate pushdown (#47881)
- Add tip about how to understand map_batches format (#47394)
Ray Train
🏗 Architecture refactoring:
- Remove deprecated mosaic and sklearn trainer code (#47901)
Ray Tune
🔨 Fixes:
- Fix WandbLoggerCallback to reuse actors upon restore (#47985)
Ray Serve
🔨 Fixes:
- Stop scheduling task early when requests have been canceled (#47847)
RLlib
🎉 New Features:
- Enable cloud checkpointing. (#47682)
💫 Enhancements:
- PPO on new API stack now shuffles batches properly before each epoch. (#47458)
- Other enhancements: #47705, #47501, #47731, #47451, #47830, #47970, #47157
🔨 Fixes:
- Fix spot node preemption problem (RLlib now run stably with EnvRunner workers on spot nodes) (#47940)
- Fix action masking example. (#47817)
- Various other fixes: #47973, #46721, #47914, #47880, #47304, #47686
🏗 Architecture refactoring:
- Switch on new API stack by default for SAC and DQN. (#47217)
- Remove Tf support on new API stack for PPO/IMPALA/APPO (only DreamerV3 on new API stack remains with tf now). (#47892)
- Discontinue support for "hybrid" API stack (using RLModule + Learner, but still on RolloutWorker and Policy) (#46085)
- RLModule (new API stack) refinements: #47884, #47885, #47889, #47908, #47915, #47965, #47775
📖 Documentation:
- Add new API stack migration guide. (#47779)
- New API stack example script: BC pre training, then PPO finetuning using same RLModule class. (#47838)
- New API stack: Autoregressive actions example. (#47829)
- Remove old API stack connector docs entirely. (#47778)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- CompiledGraphs: support multi readers in multi node when DAG is created from an actor (#47601)
💫 Enhancements:
- Add a flag to raise exception for out of band serialization of `ObjectRef` (#47544)
- Store each GCS table in its own Redis Hash (#46861)
- Decouple create worker vs pop worker request. (#47694)
- Add metrics for GCS jobs (#47793)
🔨 Fixes:
- Fix broken dashboard cluster page when there are dead nodes (#47701)
- Fix the `ray_tasks{State="PENDING_ARGS_FETCH"}` metric counting (#47770)
- Separate the attempt_number with the task_status in memory summary and object list (#47818)
- Fix object reconstruction hang on arguments pending creation (#47645)
- Fix check failure: `sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end()` (#47861)
- Fix check failure `RAY_CHECK(it != current_tasks_.end());` (#47659)
📖 Documentation:
- KubeRay docs: Add docs for YuniKorn Gang scheduling #47850
Dashboard
💫 Enhancements:
- Performance improvements for large scale clusters (#47617)
🔨 Fixes:
- Placement group and required resources not showing correctly in dashboard (#47754)
Thanks
Many thanks to all those who contributed to this release!
@GeneDer, @rkooo567, @dayshah, @saihaj, @nikitavemuri, @bill-oconnor-anyscale, @WeichenXu123, @can-anyscale, @jjyao, @edoakes, @kekulai-fredchang, @bveeramani, @alexeykudinkin, @raulchen, @khluu, @sven1977, @ruisearch42, @dentiny, @MengjinYan, @Mark2000, @simonsays1980, @rynewang, @PatricYan, @zcin, @sofianhnaide, @matthewdeng, @dlwh, @scottjlee, @MortalHappiness, @kevin85421, @win5923, @aslonnie, @prithvi081099, @richardsliu, @milesvant, @omatthew98, @Superskyyy, @pcmoritz
Ray-2.37.0
Ray Libraries
Ray Data
💫 Enhancements:
- Simplify custom metadata provider API (#47575)
- Change counts of metrics to rates of metrics (#47236)
- Throw exception for non-streaming HF datasets with "override_num_blocks" argument (#47559)
- Refactor custom optimizer rules (#47605)
🔨 Fixes:
- Remove ineffective retry code in `plan_read_op` (#47456)
- Fix incorrect pending task size if outputs are empty (#47604)
Ray Train
💫 Enhancements:
- Update run status and add stack trace to `TrainRunInfo` (#46875)
Ray Serve
💫 Enhancements:
- Allow control of some serve configuration via env vars (#47533)
- [serve] Faster detection of dead replicas (#47237)
🔨 Fixes:
- [Serve] fix component id logging field (#47609)
RLlib
💫 Enhancements:
- New API stack:
- Add restart-failed-env option to EnvRunners. (#47608)
- Offline RL: Store episodes in state form. (#47294)
- Offline RL: Replace GAE in MARWILOfflinePreLearner with `GeneralAdvantageEstimation` connector in learner pipeline. (#47532)
- Off-policy algos: Add episode sampling to EpisodeReplayBuffer. (#47500)
- RLModule APIs: Add `SelfSupervisedLossAPI` for RLModules that bring their own loss and `InferenceOnlyAPI`. (#47581, #47572)
Ray Core
💫 Enhancements:
- [aDAG] Allow custom NCCL group for aDAG (#47141)
- [aDAG] support buffered input (#47272)
- [aDAG] Support multi node multi reader (#47480)
- [Core] Make is_gpu, is_actor, root_detached_id fields late bind to workers. (#47212)
- [Core] Reconstruct actor to run lineage reconstruction triggered actor task (#47396)
- [Core] Optimize GetAllJobInfo API for performance (#47530)
🔨 Fixes:
- [aDAG] Fix ranks ordering for custom NCCL group (#47594)
Ray Clusters
📖 Documentation:
- [KubeRay] add a guide for deploying vLLM with RayService (#47038)
Thanks
Many thanks to all those who contributed to this release!
@ruisearch42, @andrewsykim, @timkpaine, @rkooo567, @WeichenXu123, @GeneDer, @sword865, @simonsays1980, @angelinalg, @sven1977, @jjyao, @woshiyyya, @aslonnie, @zcin, @omatthew98, @rueian, @khluu, @justinvyu, @bveeramani, @nikitavemuri, @chris-ray-zhang, @liuxsh9, @xingyu-long, @peytondmurray, @rynewang