Releases: ray-project/ray

Ray-2.44.1

27 Mar 17:13
daca7b2

Under screen-lit skies
A ray of bliss in each patch
Joy at any scale

Ray-2.44.0

21 Mar 05:15
36bed82

Release Highlights

  • This release features Ray Compiled Graph (beta). Ray Compiled Graph gives you a classic Ray Core-like API, but with (1) less than 50us system overhead for workloads that repeatedly execute the same task graph; and (2) native support for GPU-GPU communication via NCCL. Ray Compiled Graph APIs simplify high-performance multi-GPU workloads such as LLM inference and training. The beta release refines the API, enhances stability, and adds or improves features such as visualization, profiling, and experimental GPU communication/computation overlap (a short usage sketch follows this list). For more information, refer to the Ray documentation: https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html
  • The experimental Ray Workflows library has been deprecated and will be removed in a future version of Ray. Ray Workflows has been marked experimental since its inception and hasn’t been maintained due to the Ray team focusing on other priorities. If you are using Ray Workflows, we recommend pinning your Ray version to 2.44.
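
As an illustration of the Compiled Graph API mentioned above, here is a minimal sketch. The actor and method names are hypothetical; InputNode, bind, experimental_compile, and execute are the documented entry points, so see the linked docs for the authoritative API.

```python
import ray
from ray.dag import InputNode

@ray.remote
class EchoActor:
    # Hypothetical actor used only to illustrate the API shape.
    def echo(self, x):
        return x

actor = EchoActor.remote()

# Define the task graph once ...
with InputNode() as inp:
    dag = actor.echo.bind(inp)

# ... then compile it so repeated executions avoid per-call system overhead.
compiled_dag = dag.experimental_compile()
ref = compiled_dag.execute("hello")
print(ray.get(ref))  # -> "hello"
```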

Ray Libraries

Ray Data

🎉 New Features:

  • Add Iceberg write support through pyiceberg (#50590)
  • [LLM] Various feature enhancements to Ray Data LLM, including LoRA support (#50804) and structured outputs (#50901)

💫 Enhancements:

  • Add dataset/operator state, progress, total metrics (#50770)
  • Make chunk combination threshold configurable (#51200)
  • Store average memory use per task in OpRuntimeMetrics (#51126)
  • Avoid unnecessary conversion to Numpy when creating Arrow/Pandas blocks (#51238)
  • Append-mode API for preprocessors -- #50848, #50847, #50642, #50856, #50584. Note that vectorizers and hashers now output a single column instead of one column per feature. In the near future, we will be graduating preprocessors to beta.

🔨 Fixes:

  • Fix Map operators to avoid unconditionally overriding the generator's back-pressure configuration (#50900)
  • Fix filter expr equating negative numbers (#50932)
  • Fix error message for override_num_blocks when reading from a HuggingFace Dataset (#50998)
  • Make num_blocks in repartition optional (#50997)
  • Always pin the seed when doing file-based random shuffle (#50924)
  • Fix StandardScaler to handle NaN stats (#51281)

Ray Train

💫 Enhancements:

  • Folded v2.XGBoostTrainer API into the public trainer class as an alternate constructor (#50045)
  • Created a default ScalingConfig if one is not provided to the trainer (#51093)
  • Improved TrainingFailedError message (#51199)
  • Utilize FailurePolicy factory (#51067)

🔨 Fixes:

  • Fixed trainer import deserialization when captured within a Ray task (#50862)
  • Fixed serialize import test for Python 3.12 (#50963)
  • Fixed RunConfig deprecation message in Tune being emitted in trainer.fit usage (#51198)

📖 Documentation:

  • [Train V2] Updated API references (#51222)
  • [Train V2] Updated persistent storage guide (#51202)
  • [Train V2] Updated user guides for metrics, checkpoints, results, and experiment tracking (#51204)
  • [Train V2] Added updated Train + Tune user guide (#51048)
  • [Train V2] Added updated fault tolerance user guide (#51083)
  • Improved HF Transformers example (#50896)
  • Improved Train DeepSpeed example (#50906)
  • Use correct mean and standard deviation norm values in image tutorials (#50240)

🏗 Architecture refactoring:

  • Deprecated Torch AMP wrapper utilities (#51066)
  • Hid private functions of train context to avoid abuse (#50874)
  • Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
  • Moved library usage tests out of core (#51161)

Ray Tune

📖 Documentation:

  • Various improvements to Tune Pytorch CIFAR tutorial (#50316)
  • Various improvements to the Ray Tune XGBoost tutorial (#50455)
  • Various enhancements to Tune Keras example (#50581)
  • Minor improvements to Hyperopt tutorial (#50697)
  • Various improvements to LightGBM tutorial (#50704)
  • Fixed non-runnable Optuna tutorial (#50404)
  • Added documentation for Asynchronous HyperBand Example in Tune (#50708)
  • Replaced reuse actors example with a fuller demonstration (#51234)
  • Fixed broken PB2/RLlib example (#51219)
  • Fixed typo and standardized equations across the two APIs (#51114)
  • Improved PBT example (#50870)
  • Removed broken links in documentation (#50995, #50996)

🏗 Architecture refactoring:

  • Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
  • Moved library usage tests out of core (#51161)

Ray Serve

🎉 New Features:

  • Faster bulk imperative Serve Application deploys (#49168)
  • [LLM] Add gen-config (#51235)

💫 Enhancements:

  • Clean up shutdown behavior of serve (#51009)
  • Add additional_log_standard_attrs to serve logging config (#51144)
  • [LLM] remove asyncache and cachetools from dependencies (#50806)
  • [LLM] remove backoff dependency (#50822)
  • [LLM] Remove asyncio_timeout from ray[llm] deps on python<3.11 (#50815)
  • [LLM] Made JSON validator a singleton and jsonref packages lazy imported (#50821)
  • [LLM] Reuse AutoscalingConfig and DeploymentConfig from Serve (#50871)
  • [LLM] Use pyarrow FS for cloud remote storage interaction (#50820)
  • [LLM] Add usage telemetry for serve.llm (#51221)

🔨 Fixes:

  • Exclude redirects from request error count (#51130)
  • [LLM] Fix the wrong device_capability issue in vllm on quantized models (#51007)
  • [LLM] add gen-config related data file to the package (#51347)

📖 Documentation:

  • [LLM] Fix quickstart serve LLM docs (#50910)
  • [LLM] update build_openai_app to include yaml example (#51283)
  • [LLM] remove old vllm+serve doc (#51311)

RLlib

💫 Enhancements:

  • APPO/IMPALA acceleration:
    • LearnerGroup should not pickle remote functions on each update-call; Refactor LearnerGroup and Learner APIs. (#50665)
    • EnvRunner sync enhancements. (#50918)
    • Various other speedups: #51302, #50923, #50919, #50791
  • Unify naming for actor managers' outstanding in-flight request metrics. (#51159)
  • Add timers to env step, forward pass, and complete connector pipeline runs. (#51160)

🔨 Fixes:

  • Multi-agent env vectorization:
    • Fix MultiAgentEnvRunner env check bug. (#50891)
    • Add single_action_space and single_observation_space to VectorMultiAgentEnv. (#51096)
  • Other fixes: #51255, #50920, #51369

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Enhanced uv support (#51233)

💫 Enhancements:

  • Made infeasible task errors much more obvious (#45909)
  • Log rotation for workers, runtime env agent, and dashboard agent (#50759, #50877, #50909)
  • Support customizing gloo timeout (#50223)
  • Support torch profiling in Compiled Graph (#51022)
  • Change default tensor deserialization in Compiled Graph (#50778)
  • Use current node id if no node is specified on ray drain-node (#51134)

🔨 Fixes:

  • Fixed an issue where the raylet continued to have high CPU overhead after a job was terminated (…)

Ray-2.43.0

27 Feb 19:57
744eaa9

Highlights

  • This release features new modules in Ray Serve and Ray Data for integration with large language models, marking the first step of addressing #50639. Previously, Ray Data and Ray Serve had limited support for LLM deployments, requiring users to manually configure and manage the underlying LLM engine. In this release, we offer APIs for both batch inference and serving of LLMs within Ray in ray.data.llm and ray.serve.llm. See the notes below for more details. These APIs are marked as alpha -- meaning they may change in future releases without a deprecation period.
  • Ray Train V2 is available to try starting in Ray 2.43! Run your next Ray Train job with the RAY_TRAIN_V2_ENABLED=1 environment variable. See the migration guide for more information.
  • A new integration with uv run that allows easily specifying Python dependencies for both driver and workers in a consistent way and enables quick iterations for development of Ray applications (#50160, #50462). Check out our blog post for more details.

Ray Libraries

Ray Data

🎉 New Features:

  • Ray Data LLM: We are introducing a new module in Ray Data for batch inference with LLMs (currently marked as alpha). It offers a new Processor abstraction that interoperates with existing Ray Data pipelines (a short usage sketch follows this list). This abstraction can be configured in two ways:
    • Using the vLLMEngineProcessorConfig, which configures vLLM to load model replicas for high throughput model inference
    • Using the HttpRequestProcessorConfig, which sends HTTP requests to an OpenAI-compatible endpoint for inference.
    • Documentation for these features can be found here.
  • Implement accurate memory accounting for UnionOperator (#50436)
  • Implement accurate memory accounting for all-to-all operations (#50290)
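
A minimal sketch of the ray.data.llm flow described above, assuming a vLLM-backed Processor. The model id, column names, and exact config field names are illustrative assumptions and may differ slightly between releases; see the module documentation for the authoritative API.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure vLLM to load replicas of an (illustrative) model for batch inference.
config = vLLMEngineProcessorConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    concurrency=1,
    batch_size=64,
)

processor = build_llm_processor(
    config,
    # Map each input row into an OpenAI-style chat request.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=128),
    ),
    # Keep only the generated text in the output rows.
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": "What is Ray?"}])
print(processor(ds).take_all())
```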

💫 Enhancements:

  • Support class constructor args for filter() (#50245)
  • Persist ParquetDatasource metadata. (#50332)
  • Rebasing ShufflingBatcher onto try_combine_chunked_columns (#50296)
  • Improve warning message if required dependency isn't installed (#50464)
  • Move data-related test logic out of core tests directory (#50482)
  • Pass executor as an argument to ExecutionCallback (#50165)
  • Add operator id info to task+actor (#50323)
  • Abstracting common methods, removing duplication in ArrowBlockAccessor, PandasBlockAccessor (#50498)
  • Warn if map UDF is too large (#50611)
  • Replace AggregateFn with AggregateFnV2, cleaning up Aggregation infrastructure (#50585)
  • Simplify Operator.repr (#50620)
  • Adding in TaskDurationStats and on_execution_step callback (#50766)
  • Print Resource Manager stats in release tests (#50801)

🔨 Fixes:

  • Fix invalid escape sequences in grouped_data.py docstrings (#50392)
  • Deflake test_map_batches_async_generator (#50459)
  • Avoid memory leak with pyarrow.infer_type on datetime arrays (#50403)
  • Fix parquet partition cols to support tensors types (#50591)
  • Fixing aggregation protocol to be appropriately associative (#50757)

📖 Documentation:

  • Remove "Stable Diffusion Batch Prediction with Ray Data" example (#50460)

Ray Train

🎉 New Features:

  • Ray Train V2 is available to try starting in Ray 2.43! Run your next Ray Train job with the RAY_TRAIN_V2_ENABLED=1 environment variable (a short sketch follows below). See the migration guide for more information.
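
A minimal sketch of opting into Train V2 from Python rather than the shell, under the assumption that the flag is set before ray.train is imported:

```python
import os

# Enable Ray Train V2 before importing ray.train.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

import ray.train  # noqa: E402  (imported after the flag is set on purpose)
```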

💫 Enhancements:

  • Add a training ingest benchmark release test (#50019, #50299) with a fault tolerance variant (#50399)
  • Add telemetry for Trainer usage in V2 (#50321)
  • Add pydantic as a ray[train] extra install (#46682)
  • Add state tracking to train v2 to make run status, run attempts, and training worker metadata observable (#50515)

🔨 Fixes:

  • Increase doc test parallelism (#50326)
  • Disable TF test for py312 (#50382)
  • Increase test timeout to deflake (#50796)

📖 Documentation:

  • Add missing xgboost pip install in example (#50232)

Ray Tune

🔨 Fixes:

  • Fix worker node failure test (#50109)

📖 Documentation:

  • Update all doc examples off of ray.train imports (#50458)
  • Update all ray/tune/examples off of ray.train imports (#50435)
  • Fix typos in persistent storage guide (#50127)
  • Remove Binder notebook links in Ray Tune docs (#50621)

🏗 Architecture refactoring:

  • Update RLlib to use ray.tune imports instead of ray.air and ray.train (#49895)

Ray Serve

🎉 New Features:

  • Ray Serve LLM: We are introducing a new module in Ray Serve to easily integrate open source LLMs in your Ray Serve deployment, currently marked as alpha. This opens up a powerful capability of composing complex applications with multiple LLMs, which is a use case in emerging applications like agentic workflows. Ray Serve LLM offers a couple of core components (a short usage sketch follows this list), including:
    • VLLMService: A prebuilt deployment that offers a full-featured vLLM engine integration, with support for features such as LoRA multiplexing and multimodal language models.
    • LLMRouter: An out-of-the-box OpenAI compatible model router that can route across multiple LLM deployments.
    • Documentation can be found at https://docs.ray.io/en/releases-2.43.0/serve/llm/overview.html
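
A hedged sketch of serving a model with these components via the build_openai_app helper (referenced elsewhere in these notes). The model ids and exact config fields are illustrative assumptions, so check the linked documentation for the release you are on.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",                       # name exposed via the OpenAI-compatible API
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # model loaded into the vLLM engine
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
)

# Builds one vLLM-backed deployment per config plus an OpenAI-compatible router in front.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```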

💫 Enhancements:

  • Add required_resources to REST API (#50058)

🔨 Fixes:

  • Fix batched requests hanging after cancellation (#50054)
  • Properly propagate backpressure error (#50311)

RLlib

🎉 New Features:

  • Added env vectorization support for multi-agent (new API stack). (#50437)

💫 Enhancements:

  • APPO/IMPALA various acceleration efforts. Reached 100k ts/sec on Atari benchmark with 400 EnvRunners and 16 (multi-node) GPU Learners: #50760, #50162, #50249, #50353, #50368, #50379, #50440, #50477, #50527, #50528, #50600, #50309
  • Offline RL:
    • Remove all weight synching to eval_env_runner_group from the training steps. (#50057)
    • Enable single-learner/multi-learner GPU training. (#50034)
    • Remove reference to MARWILOfflinePreLearner in OfflinePreLearner docstring. (#50107)
    • Add metrics to multi-agent replay buffers. (#49959)

🔨 Fixes:

  • Fix SPOT preemption tolerance for large AlgorithmConfig: Pass by reference to RolloutWorker...

Ray-2.42.1

11 Feb 22:21
c2e38f7

Ray Data

🔨 Fixes:

  • Fixes incorrect assertion (#50210)

Ray-2.42.0

05 Feb 00:42
637116a

Ray Libraries

Ray Data

🎉 New Features:

  • Added read_audio and read_video (#50016)

💫 Enhancements:

  • Optimized multi-column groupbys (#45667)
  • Included Ray user-agent in BigQuery client construction (#49922)

🔨 Fixes:

  • Fixed bug that made read tasks non-deterministic (#49897)

🗑️ Deprecations:

  • Deprecated num_rows_per_file in favor of min_rows_per_file (#49978); see the migration sketch below
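
Migration sketch for the deprecation above (the output path is illustrative):

```python
import ray

ds = ray.data.range(1000)

# Before (deprecated): ds.write_parquet("/tmp/out", num_rows_per_file=100)
# After:
ds.write_parquet("/tmp/out", min_rows_per_file=100)
```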

Ray Train

💫 Enhancements:

  • Add Train v2 user-facing callback interface (#49819)
  • Add TuneReportCallback for propagating intermediate Train results to Tune (#49927)

Ray Tune

📖 Documentation:

  • Fix BayesOptSearch docs (#49848)

Ray Serve

💫 Enhancements:

  • Cache metrics in replica and report on an interval (#49971)
  • Cache expensive calls to inspect.signature (#49975)
  • Remove extra pickle serialization for gRPCRequest (#49943)
  • Shared LongPollClient for Routers (#48807)
  • DeploymentHandle API is now stable (#49840)

🔨 Fixes:

  • Fix batched requests hanging after request cancellation bug (#50054)

RLlib

💫 Enhancements:

  • Add metrics to replay buffers. (#49822)
  • Enhance node-failure tolerance (new API stack). (#50007)
  • MetricsLogger cleanup throughput logic. (#49981)
  • Split AddStates... connectors into 2 connector pieces (AddTimeDimToBatchAndZeroPad and AddStatesFromEpisodesToBatch) (#49835)

🔨 Fixes:

  • Old API stack IMPALA/APPO: Re-introduce mixin-replay-buffer pass, even if replay-ratio=0 (fixes a memory leak). (#49964)
  • Fix MetricsLogger race conditions. (#49888)
  • APPO/IMPALA: Bug fix for > 1 Learner actor. (#49849)

📖 Documentation:

  • New MetricsLogger API rst page. (#49538)
  • Move "new API stack" info box right below page titles for better visibility. (#49921)
  • Add example script for how to log custom metrics in training_step(). (#49976)
  • Enhance/redo autoregressive action distribution example. (#49967)
  • Make the "tiny CNN" example RLModule run with APPO (by implementing TargetNetAPI) (#49825)

Ray Core and Ray Clusters

Ray Core

💫 Enhancements:

  • Only get single node info rather than all when needed (#49727)
  • Introduce with_tensor_transport API (#49753)

🔨 Fixes:

  • Fix tqdm manager thread safety (#50040)

Ray Clusters

🔨 Fixes:

  • Fix token expiration for ray autoscaler (#48481)

Thanks

Thank you to everyone who contributed to this release! 🥳
@wingkitlee0, @saihaj, @win5923, @justinvyu, @kevin85421, @edoakes, @cristianjd, @rynewang, @richardliaw, @LeoLiao123, @alexeykudinkin, @simonsays1980, @aslonnie, @ruisearch42, @pcmoritz, @fscnick, @bveeramani, @mattip, @till-m, @tswast, @ujjawal-khare, @wadhah101, @nikitavemuri, @akshay-anyscale, @srinathk10, @zcin, @dayshah, @dentiny, @LydiaXwQ, @matthewdeng, @JoshKarpel, @MortalHappiness, @sven1977, @omatthew98

Ray-2.41.0

23 Jan 10:02
021baf7

Highlights

  • Major update of RLlib docs and example scripts for the new API stack.

Ray Libraries

Ray Data

🎉 New Features:

  • Expression support for filters (#49016); see the sketch after this list
  • Support partition_cols in write_parquet (#49411)
  • Feature: implement multi-directional sort over Ray Data datasets (#49281)
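
A short sketch of the expression-based filter and partitioned parquet write mentioned above; the column names and output path are assumptions for illustration.

```python
import ray

ds = ray.data.from_items([{"value": i, "group": i % 2} for i in range(100)])

# Filter with a string expression instead of a row UDF (#49016).
ds = ds.filter(expr="value > 50")

# Hive-style partitioning on the listed columns (#49411).
ds.write_parquet("/tmp/partitioned_out", partition_cols=["group"])
```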

💫 Enhancements:

  • Use dask 2022.10.2 (#48898)
  • Clarify schema validation error (#48882)
  • Raise ValueError when the data sort key is None (#48969)
  • Provide clearer messages when the webdataset format is invalid (#48643)
  • Upgrade Arrow version from 17 to 18 (#48448)
  • Update hudi version to 0.2.0 (#48875)
  • webdataset: expand JSON objects into individual samples (#48673)
  • Support passing kwargs to map tasks. (#49208)
  • Add ExecutionCallback interface (#49205)
  • Add seed for read files (#49129)
  • Make select_columns and rename_columns use Project operator (#49393)

🔨 Fixes:

  • Fix partial function name parsing in map_groups (#48907)
  • Always launch one task for read_sql (#48923)
  • Reimplement fix for pandas memory accounting (#48970)
  • webdataset: flatten return args (#48674)
  • Handle numpy > 2.0.0 behaviour in _create_possibly_ragged_ndarray (#48064)
  • Fix DataContext sealing for multiple datasets. (#49096)
  • Fix to_tf for List types (#49139)
  • Fix type mismatch error while mapping nullable column (#49405)
  • Datasink: support passing write results to on_write_complete (#49251)
  • Fix groupby hang when value contains np.nan (#49420)
  • Fix bug where file_extensions doesn't work with compound extensions (#49244)
  • Fix map operator fusion when concurrency is set (#49573)

Ray Train

🎉 New Features:

  • Output JSON structured log files for system and application logs (#49414)
  • Add support for AMD ROCR_VISIBLE_DEVICES (#49346)

🏗 Architecture refactoring:

  • LightGBM: Rewrite get_network_params implementation (#49019)

Ray Tune

🎉 New Features:

  • Update optuna_search to allow users to configure optuna storage (#48547)

Ray Serve

💫 Enhancements:

  • Improved request_id generation to reduce proxy CPU overhead (#49537)
  • Tune GC threshold by default in proxy (#49720)
  • Use pickle.dumps for faster serialization from proxy to replica (#49539)

🔨 Fixes:

  • Handle nested ‘=’ in serve run arguments (#49719)
  • Fix bug when ray.init() is called multiple times with different runtime_envs (#49074)

🗑️ Deprecations:

  • Adds a warning that the default behavior for sync methods will change in a future release. They will be run in a threadpool by default. You can opt into this behavior early by setting RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1. (#48897)

RLlib

🎉 New Features:

  • Add support for external Envs to new API stack: New example script and custom tcp-capable EnvRunner. (#49033)

💫 Enhancements:

  • Offline RL:
    • Add sequence sampling to EpisodeReplayBuffer. (#48116)
    • Allow incomplete SampleBatch data and fully compressed observations. (#48699)
    • Add option to customize OfflineData. (#49015)
    • Enable offline training without specifying an environment. (#49041)
    • Various fixes: #48309, #49194, #49195
  • APPO/IMPALA acceleration (new API stack):
    • Add support for AggregatorActors per Learner. (#49284)
    • Auto-sleep time AND thread-safety for MetricsLogger. (#48868)
    • Activate APPO cont. actions release- and CI tests (HalfCheetah-v1 and Pendulum-v1 new in tuned_examples). (#49068)
    • Add "burn-in" period setting to the training of stateful RLModules. (#49680)
  • Callbacks API: Add support for individual lambda-style callbacks. (#49511)
  • Other enhancements: #49687, #49714, #49693, #49497, #49800, #49098

🏗 Architecture refactoring:

  • RLModule: Introduce Default[algo]RLModule classes (#49366, #49368)
  • Remove RLlib dependencies from setup.py; add ormsgpack (#49489)

Ray Core and Ray Clusters

Ray Core

💫 Enhancements:

  • Add task_name, task_function_name and actor_name in Structured Logging (#48703)
  • Support redis/valkey authentication with username (#48225)
  • Add v6e TPU Head Resource Autoscaling Support (#48201)
  • compiled graphs: Support all driver and actor read combinations (#48963)
  • compiled graphs: Add ascii based CG visualization (#48315)
  • compiled graphs: Add ray[cg] pip install option (#49220)
  • Allow uv cache at installation (#49176)
  • Support != Filter in GCS for Task State API (#48983)
  • compiled graphs: Add CPU-based NCCL communicator for development (#48440)
  • Support gcs and raylet log rotation (#48952)
  • compiled graphs: Support nsight.nvtx profiling (#49392)

🔨 Fixes:

  • autoscaler: Health check logs are not visible in the autoscaler container's stdout (#48905)
  • Only publish WORKER_OBJECT_EVICTION when the object is out of scope or manually freed (#47990)
  • autoscaler: Autoscaler doesn't scale up correctly when the KubeRay RayCluster is not in the goal state (#48909)
  • autoscaler: Fix incorrectly terminating nodes misclassified as idle in autoscaler v1 (#48519)
  • compiled graphs: Fix the missing dependencies when num_returns is used (#49118)
  • autoscaler: Fuse scaling requests together to avoid overloading the Kubernetes API server (#49150)
  • Fix bug to support S3 pre-signed url for .whl file (#48560)
  • Fix data race on gRPC client context (#49475)
  • Make sure draining node is not selected for scheduling (#49517)

Ray Clusters

💫 Enhancements:

  • Azure: Enable accelerated networking as a flag in azure vms (#47988)

📖 Documentation:

  • KubeRay: Logging: Add Fluent Bit DaemonSet and Grafana Loki to "Persist KubeRay Operator Logs" (#48725)
  • KubeRay: Logging: Specify the Helm chart version in "Persist KubeRay Operator Logs" (#48937)

Dashboard

💫 Enhancements:

  • Add instance variable to many default dashboard graphs (#49174)
  • Display duration in milliseconds if under 1 second. (#49126)
  • Add RAY_PROMETHEUS_HEADERS env for carrying additional headers to Prometheus (#49353)
  • Document about the RAY_PROMETHEUS_HEADERS env for carrying additional headers to Prometheus (#49700)

🏗 Architecture refactoring:

  • Move memray dependency from default to observability (#47763)
  • Move StateHead's methods into free functions. (#49388)

Thanks

@raulchen, @alanwguo, @omatthew98, @xingyu-long, @tlinkin, @yantzu, @alexeykudinkin, @andrewsykim, @win5923, @csy1204, @dayshah, @richardliaw, @stephanie-wang, @gueraf, @rueian, @davidxia, @fscnick, @wingkitlee0, @KPostOffice, @GeneDer, @MengjinYan, @simonsays1980, @pcmoritz, @petern48, @kashiwachen, @pfldy2850, @zcin, @scottjlee, @Akhil-CM, @Jay-ju, @JoshKarpel, @edoakes, @ruisearch42, @gorloffslava, @jimmyxie-figma, @bthananjeyan, @sven1977, @bnorick, @jeffreyjeffreywang, @ravi-dalal, @matthewdeng, @angelinalg, @ivanthewebber, @rkooo567, @srinathk10, @maresb, @gvspraveen, @akyang-anyscale, @mimiliaogo, @bveeramani, @ryanaoleary, @kevin85421, @richardsliu, @hartikainen, @coltwood93, @mattip, @Superskyyy, @justinvyu, @hongpeng-guo, @ArturNiederfahrenhorst, @jecsand838, @Bye-legumes, @hcc429, @WeichenXu123, @martinbomio, @HollowMan6, @MortalHappiness, @dentiny, @zhe-thoughts, @anyadontfly, @smanolloff, @richo-anyscale, @khluu, @xushiyan, @rynewang, @japneet-anyscale, @jjyao, @sumanthratna, @saihaj, @aslonnie

Many thanks to all those who contributed to this release!

Ray-2.40.0

04 Dec 00:01
22541c3

Ray Libraries

Ray Data

💫 Enhancements:

  • Improved performance of DelegatingBlockBuilder (#48509)
  • Improved memory accounting of pandas blocks (#46939)

🔨 Fixes:

  • Fixed bug where you can’t specify a schema with write_parquet (#48630)
  • Fixed bug where to_pandas errors if your dataset contains Arrow and pandas blocks (#48583)
  • Fixed bug where map_groups doesn’t work with pandas data (#48287)
  • Fixed bug where write_parquet errors if your data contains nullable fields (#48478)
  • Fixed bug where the “Iteration Blocked Time” chart looks incorrect (#48618)
  • Fixed bug where unique fails with null values (#48750)
  • Fixed bug where “Rows Outputted” is 0 in the Data dashboard (#48745)
  • Fixed bug where methods like drop_columns cause spilling (#48140)
  • Fixed bug where async map tasks hang (#48861)

🗑️ Deprecations:

  • Deprecated read_parquet_bulk (#48691)
  • Deprecated iter_tf_batches (#48693)
  • Deprecated meta_provider parameter of read functions (#48690)
  • Deprecated to_torch (#48692)

Ray Train

🔨 Fixes:

  • Fix StartTracebackWithWorkerRank serialization (#48548)

📖 Documentation:

  • Add example for fine-tuning Llama3.1 with AWS Trainium (#48768)

Ray Tune

🔨 Fixes:

  • Remove the clear_checkpoint function during Trial restoration error handling. (#48532)

Ray Serve

🎉 New Features:

  • Initial version of local_testing_mode (#48477)

💫 Enhancements:

  • Handle multiple changed objects per LongPollHost.listen_for_change RPC (#48803)
  • Add more nuanced checks for http proxy status errors (#47896)
  • Improve replica access log messages to include HTTP status info and better resemble standard log format (#48819)
  • Propagate replica constructor error to deployment status message and print num retries left (#48531)

🔨 Fixes:

  • Pending requests that are cancelled before they were assigned to a replica now also return a serve.RequestCancelledError (#48496)

RLlib

💫 Enhancements:

  • Release test enhancements. (#45803, #48681)
  • Make opencv-python-headless default over opencv-python (#48776)
  • Reverse learner queue behavior of IMPALA/APPO (consume oldest batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)

🔨 Fixes:

  • Fix torch scheduler stepping and reporting. (#48125)
  • Fix accumulation of results over n training_step calls within same iteration (new API stack). (#48136)
  • Various other fixes: #48563, #48314, #48698, #48869.

📖 Documentation:

  • Upgrade examples script overview page (new API stack). (#48526)
  • Enable RLlib + Serve example in CI and translate to new API stack. (#48687)

🏗 Architecture refactoring:

  • Switch new API stack on by default for APPO, IMPALA, BC, MARWIL, and CQL. (#48516, #48599)
  • Various APPO enhancements (new API stack): Circular buffer (#48798), minor loss math fixes (#48800), target network update logic (#48802), smaller cleanups (#48844).
  • Remove rllib_contrib from repo. (#48565)

Ray Core and Ray Clusters

Ray Core

💫 Enhancements:

  • [CompiledGraphs] Refine schedule visualization (#48594)

🔨 Fixes:

  • [CompiledGraphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs (#48463)
  • [Core] Fix Ascend NPU discovery to support 8+ cards per node (#48543)
  • [Core] Make Placement Group Wildcard and Indexed Resource Assignments Consistent (#48088)
  • [Core] Stop the gRPC server before shutting down the Object Store (#48572)

Ray Clusters

🔨 Fixes:

  • [KubeRay]: Fix ConnectionError on Autoscaler CR lookups in K8s clusters with custom DNS for Kubernetes API. (#48541)

Dashboard

💫 Enhancements:

  • Add global UTC timezone button in navbar with local storage (#48510)
  • Add memory graphs optimized for OOM debugging (#48530)
  • Improve tasks/actors metric naming and add graph for running tasks (#48528)
  • Add actor PID to dashboard (#48791)

🔨 Fixes:

  • Fix Placement Group Table table cells overflow (#47323)
  • Fix Rows Outputted being zero on Ray Data Dashboard (#48745)
  • Fix confusing dataset operator name (#48805)

Thanks

Thanks to all those who contributed to this release!
@rynewang, @rickyyx, @bveeramani, @marwan116, @simonsays1980, @dayshah, @dentiny, @KepingYan, @mimiliaogo, @kevin85421, @SeaOfOcean, @stephanie-wang, @mohitjain2504, @azayz, @xushiyan, @richardliaw, @can-anyscale, @xingyu-long, @kanwang, @aslonnie, @MortalHappiness, @jjyao, @SumanthRH, @matthewdeng, @alexeykudinkin, @sven1977, @raulchen, @andrewsykim, @zcin, @nadongjun, @hongpeng-guo, @miguelteixeiraa, @saihaj, @khluu, @ArturNiederfahrenhorst, @ryanaoleary, @ltbringer, @pcmoritz, @JoshKarpel, @akyang-anyscale, @frances720, @BeingGod, @edoakes, @Bye-legumes, @Superskyyy, @liuxsh9, @MengjinYan, @ruisearch42, @scottjlee, @angelinalg

Ray-2.39.0

13 Nov 19:50
5a6c335

Ray Libraries

Ray Data

🔨 Fixes:

  • Fixed InvalidObjectError edge case with Dataset.split() (#48130)
  • Made Concatenator preserve order of concatenated columns (#47997)

📖 Documentation:

  • Improved documentation around Parquet column and predicate pushdown (#48095)
  • Marked num_rows_per_file parameter of write APIs as experimental (#48208)
  • One hot encoder now returns an encoded vector (#48173)
  • transform_batch no longer fails on missing columns (#48137)

🏗 Architecture refactoring:

  • Dataset.count() now uses a Count logical operator (#48126)

🗑 Deprecations:

  • Removed long-deprecated set_progress_bars (#48203)

Ray Train

🔨 Fixes:

  • Safely check if the storage filesystem is pyarrow.fs.S3FileSystem (#48216)

Ray Tune

🔨 Fixes:

  • Safely check if the storage filesystem is pyarrow.fs.S3FileSystem (#48216)

Ray Serve

💫 Enhancements:

  • Cancelled requests now return a serve.RequestCancelledError (#48444)
  • Exposed application source in app details model (#45522)

🔨 Fixes:

  • Basic HTTP deployments will now return “Internal Server Error” instead of a traceback to match FastAPI behavior (#48491)
  • Fixed an issue where high values of max_ongoing_requests couldn’t be reached due to an interaction with core’s max_concurrency (#48274)
  • Fixed an edge case where pending requests were not canceled properly (#47873)
  • Removed deprecated API to set route_prefix per-deployment (#48223)

📖 Documentation:

  • Added ProxyStatus model to reference docs (#48299)
  • Added ApplicationStatus model to reference docs (#48220)

RLlib

💫 Enhancements:

  • Upgrade to gymnasium==1.0.0 (support new API for vector env resets). (#48443, #45328)
  • Add off-policy'ness metric to new API stack. (#48227)
  • Validate episodes before adding them to the buffer. (#48083)

📖 Documentation:

  • New example script for custom metrics on EnvRunners (using MetricsLogger API on the new stack). (#47969)
  • Do-over: New RLlib index page. (#48285, #48442)
  • Do-over: Example script for AutoregressiveActionsRLM. (#47972)

🏗 Architecture refactoring:

  • New API stack on by default for PPO. (#48284)
  • Change config.fault_tolerance default behavior (from recreate_failed_env_runners=False to True). (#48286)

Ray Core

🎉 New Features:

  • [CompiledGraphs] Support all reduce collective in aDAG (#47621)
  • [CompiledGraphs] Add visualization of compiled graphs (#47958)

💫 Enhancements:

  • [Distributed Debugger] The distributed debugger can now be used without having to set RAY_DEBUG=1; see #48301 and https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html. If you want to restore the previous behavior and use the CLI-based debugger, set RAY_DEBUG=legacy (a short sketch follows this list).
  • [Core] Add more infos to each breakpoint for ray debug CLI (#48202)
  • [Core] Add demands info to GCS debug state (#48115)
  • [Core] Add PENDING_ACTOR_TASK_ARGS_FETCH and PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus (#48242)
  • [Core] Add metrics ray_io_context_event_loop_lag_ms. (#47989)
  • [Core] Better log format when showing the disk size (#46869)
  • [CompiledGraphs] Support asyncio.gather on multiple CompiledDAGFutures (#47860)
  • [CompiledGraphs] Raise an exception if a leaf node is found during compilation (#47757)
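
One way to restore the legacy CLI debugger described above is to set the variable through a runtime environment at ray.init time; this is a sketch under the assumption that you prefer not to export it in the shell.

```python
import ray

# Opt back into the legacy CLI-based debugger for all workers in this job.
ray.init(runtime_env={"env_vars": {"RAY_DEBUG": "legacy"}})
```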

🔨 Fixes:

  • [Core] Posts CoreWorkerMemoryStore callbacks onto io_context to fix deadlock (#47833)

Dashboard

🔨 Fixes:

  • [Dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count (#48229)

Thanks

Many thanks to all those who contributed to this release!

@akyang-anyscale, @rkooo567, @bveeramani, @dayshah, @martinbomio, @khluu, @justinvyu, @slfan1989, @alexeykudinkin, @simonsays1980, @vigneshka, @ruisearch42, @rynewang, @scottjlee, @jjyao, @JoshKarpel, @win5923, @MengjinYan, @MortalHappiness, @ujjawal-khare-27, @zcin, @ccoulombe, @Bye-legumes, @dentiny, @stephanie-wang, @LeoLiao123, @dengwxn, @richo-anyscale, @pcmoritz, @sven1977, @omatthew98, @GeneDer, @srinathk10, @can-anyscale, @edoakes, @kevin85421, @aslonnie, @jeffreyjeffreywang, @ArturNiederfahrenhorst

Ray-2.38.0

23 Oct 21:57
385ee46

Ray Libraries

Ray Data

🎉 New Features:

  • Add Dataset.rename_columns (#47906)
  • Basic structured logging (#47210)

💫 Enhancements:

  • Add partitioning parameter to read_parquet (#47553)
  • Add SERVICE_UNAVAILABLE to list of retried transient errors (#47673)
  • Re-phrase the streaming executor current usage string (#47515)
  • Remove ray.kill in ActorPoolMapOperator (#47752)
  • Simplify and consolidate progress bar outputs (#47692)
  • Refactor OpRuntimeMetrics to support properties (#47800)
  • Refactor plan_write_op and Datasinks (#47942)
  • Link PhysicalOperator to its LogicalOperator (#47986)
  • Allow specifying both num_cpus and num_gpus for map APIs (#47995)
  • Allow specifying insertion index when registering custom plan optimization Rules (#48039)
  • Adding in better framework for substituting logging handlers (#48056)

🔨 Fixes:

  • Fix bug where Ray Data incorrectly emits progress bar warning (#47680)
  • Yield remaining results from async map_batches (#47696)
  • Fix event loop mismatch with async map (#47907)
  • Make sure num_gpus provided to Ray Data is appropriately passed to the ray.remote call (#47768)
  • Fix unequal partitions when grouping by multiple keys (#47924)
  • Fix reading multiple parquet files with ragged ndarrays (#47961)
  • Removing unneeded test case (#48031)
  • Adding in better json checking in test logging (#48036)
  • Fix bug with inserting custom optimization rule at index 0 (#48051)
  • Fix logging output from write_xxx APIs (#48096)

📖 Documentation:

  • Add docs section for Ray Data progress bars (#47804)
  • Add reference to parquet predicate pushdown (#47881)
  • Add tip about how to understand map_batches format (#47394)

Ray Train

🏗 Architecture refactoring:

  • Remove deprecated mosaic and sklearn trainer code (#47901)

Ray Tune

🔨 Fixes:

  • Fix WandbLoggerCallback to reuse actors upon restore (#47985)

Ray Serve

🔨 Fixes:

  • Stop scheduling task early when requests have been canceled (#47847)

RLlib

🎉 New Features:

  • Enable cloud checkpointing. (#47682)

🏗 Architecture refactoring:

  • Switch on new API stack by default for SAC and DQN. (#47217)
  • Remove Tf support on new API stack for PPO/IMPALA/APPO (only DreamerV3 on new API stack remains with tf now). (#47892)
  • Discontinue support for "hybrid" API stack (using RLModule + Learner, but still on RolloutWorker and Policy) (#46085)
  • RLModule (new API stack) refinements: #47884, #47885, #47889, #47908, #47915, #47965, #47775

📖 Documentation:

  • Add new API stack migration guide. (#47779)
  • New API stack example script: BC pre training, then PPO finetuning using same RLModule class. (#47838)
  • New API stack: Autoregressive actions example. (#47829)
  • Remove old API stack connector docs entirely. (#47778)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • CompiledGraphs: support multiple readers on multiple nodes when the DAG is created from an actor (#47601)

💫 Enhancements:

  • Add a flag to raise exception for out of band serialization of ObjectRef (#47544)
  • Store each GCS table in its own Redis Hash (#46861)
  • Decouple create worker vs pop worker request. (#47694)
  • Add metrics for GCS jobs (#47793)

🔨 Fixes:

  • Fix broken dashboard cluster page when there are dead nodes (#47701)
  • Fix the ray_tasks{State="PENDING_ARGS_FETCH"} metric counting (#47770)
  • Separate the attempt_number with the task_status in memory summary and object list (#47818)
  • Fix object reconstruction hang on arguments pending creation (#47645)
  • Fix check failure: sync_reactors_.find(reactor->GetRemoteNodeID()) == sync_reactors_.end() (#47861)
  • Fix check failure RAY_CHECK(it != current_tasks_.end()); (#47659)

📖 Documentation:

  • KubeRay docs: Add docs for YuniKorn Gang scheduling (#47850)

Dashboard

💫 Enhancements:

  • Performance improvements for large scale clusters (#47617)

🔨 Fixes:

  • Placement group and required resources not showing correctly in dashboard (#47754)

Thanks

Many thanks to all those who contributed to this release!
@GeneDer, @rkooo567, @dayshah, @saihaj, @nikitavemuri, @bill-oconnor-anyscale, @WeichenXu123, @can-anyscale, @jjyao, @edoakes, @kekulai-fredchang, @bveeramani, @alexeykudinkin, @raulchen, @khluu, @sven1977, @ruisearch42, @dentiny, @MengjinYan, @Mark2000, @simonsays1980, @rynewang, @PatricYan, @zcin, @sofianhnaide, @matthewdeng, @dlwh, @scottjlee, @MortalHappiness, @kevin85421, @win5923, @aslonnie, @prithvi081099, @richardsliu, @milesvant, @omatthew98, @Superskyyy, @pcmoritz

Ray-2.37.0

24 Sep 23:37
1b620f2

Ray Libraries

Ray Data

💫 Enhancements:

  • Simplify custom metadata provider API (#47575)
  • Change counts of metrics to rates of metrics (#47236)
  • Throw exception for non-streaming HF datasets with "override_num_blocks" argument (#47559)
  • Refactor custom optimizer rules (#47605)

🔨 Fixes:

  • Remove ineffective retry code in plan_read_op (#47456)
  • Fix incorrect pending task size if outputs are empty (#47604)

Ray Train

💫 Enhancements:

  • Update run status and add stack trace to TrainRunInfo (#46875)

Ray Serve

💫 Enhancements:

  • Allow control of some serve configuration via env vars (#47533)
  • [serve] Faster detection of dead replicas (#47237)

🔨 Fixes:

  • [Serve] fix component id logging field (#47609)

RLlib

💫 Enhancements:

  • New API stack:
    • Add restart-failed-env option to EnvRunners. (#47608)
    • Offline RL: Store episodes in state form. (#47294)
    • Offline RL: Replace GAE in MARWILOfflinePreLearner with GeneralAdvantageEstimation connector in learner pipeline. (#47532)
    • Off-policy algos: Add episode sampling to EpisodeReplayBuffer. (#47500)
    • RLModule APIs: Add SelfSupervisedLossAPI for RLModules that bring their own loss and InferenceOnlyAPI. (#47581, #47572)

Ray Core

💫 Enhancements:

  • [aDAG] Allow custom NCCL group for aDAG (#47141)
  • [aDAG] support buffered input (#47272)
  • [aDAG] Support multi node multi reader (#47480)
  • [Core] Make is_gpu, is_actor, root_detached_id fields late bind to workers. (#47212)
  • [Core] Reconstruct actor to run lineage reconstruction triggered actor task (#47396)
  • [Core] Optimize GetAllJobInfo API for performance (#47530)

🔨 Fixes:

  • [aDAG] Fix ranks ordering for custom NCCL group (#47594)

Ray Clusters

📖 Documentation:

  • [KubeRay] add a guide for deploying vLLM with RayService (#47038)

Thanks

Many thanks to all those who contributed to this release!
@ruisearch42, @andrewsykim, @timkpaine, @rkooo567, @WeichenXu123, @GeneDer, @sword865, @simonsays1980, @angelinalg, @sven1977, @jjyao, @woshiyyya, @aslonnie, @zcin, @omatthew98, @rueian, @khluu, @justinvyu, @bveeramani, @nikitavemuri, @chris-ray-zhang, @liuxsh9, @xingyu-long, @peytondmurray, @rynewang