feat(ray): Implement dynamic scale-in for RaySwordfishActor by huleilei · Pull Request #5903 · Eventual-Inc/Daft

huleilei · 2025-12-31T12:04:38Z

Changes Made

This commit implements the dynamic scaling down (scale-in) functionality for RaySwordfishActor to release idle resources.

Idle Worker Retirement: Implemented retire_idle_ray_workers in RayWorkerManager. It identifies workers that have been idle for longer than DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 60s) and releases them.
Scheduler Integration: The SchedulerActor loop now periodically checks for idle workers and triggers retirement while maintaining a minimum survivor count (DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS, default: 1).
Blacklist Mechanism: Added a pending_release_blacklist to prevent the autoscaler from immediately respawning workers that were just released.
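
The retirement rule described above can be illustrated with a small Python sketch. The PR implements this logic in Rust; the `WorkerSnapshot` type, its fields, and the function name below are hypothetical stand-ins chosen for illustration, and the "total workers minus minimum survivors" cap is an assumption about how the survivor floor is applied.

```python
from dataclasses import dataclass


@dataclass
class WorkerSnapshot:
    # Hypothetical stand-in for the per-worker state seen by the scheduler.
    worker_id: str
    idle_seconds: float  # how long the worker has had no active tasks
    active_tasks: int


def select_workers_to_retire(workers, idle_threshold_s=60.0, min_survivors=1):
    """Return IDs of workers to retire, longest-idle first."""
    # Only workers with no active tasks that have been idle for longer
    # than the threshold are candidates.
    candidates = [
        w for w in workers
        if w.active_tasks == 0 and w.idle_seconds > idle_threshold_s
    ]
    candidates.sort(key=lambda w: w.idle_seconds, reverse=True)
    # Assumption: never retire so many workers that fewer than
    # `min_survivors` remain in total.
    allowed = max(len(workers) - min_survivors, 0)
    return [w.worker_id for w in candidates[:allowed]]


workers = [
    WorkerSnapshot("w1", 300.0, 0),
    WorkerSnapshot("w2", 90.0, 0),
    WorkerSnapshot("w3", 10.0, 0),   # idle, but not for long enough
    WorkerSnapshot("w4", 0.0, 2),    # busy
]
print(select_workers_to_retire(workers))
```

With the sample data, only `w1` and `w2` exceed the 60-second threshold, so they are the only ones selected.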

Configuration

New environment variables added:

DAFT_AUTOSCALING_DOWNSCALE_ENABLED: Enable/disable downscaling (default: true).
DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS: Seconds a worker must be idle before retirement (default: 60).
DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS: Minimum number of workers to keep alive (default: 1).
DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS: TTL for blacklisted worker IDs (default: 120).
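
As a rough illustration of how these variables and their defaults might be read, here is a Python sketch. The PR parses them in Rust via `std::env::var`; the `DownscaleConfig` class, the helper names, the accepted boolean spellings, and the fall-back-to-default behaviour on unparsable values are all assumptions made for this example.

```python
import os
from dataclasses import dataclass


def _env_bool(name, default):
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as "true"; anything else is false.
    return raw.strip().lower() in ("1", "true", "yes", "on")


def _env_number(name, default, cast):
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return cast(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return default


@dataclass(frozen=True)
class DownscaleConfig:
    enabled: bool
    idle_seconds: float
    min_survivor_workers: int
    pending_release_exclude_seconds: float


def load_downscale_config():
    # Defaults mirror the values listed in the description above.
    return DownscaleConfig(
        enabled=_env_bool("DAFT_AUTOSCALING_DOWNSCALE_ENABLED", True),
        idle_seconds=_env_number(
            "DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS", 60.0, float),
        min_survivor_workers=_env_number(
            "DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS", 1, int),
        pending_release_exclude_seconds=_env_number(
            "DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS", 120.0, float),
    )


print(load_downscale_config())
```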

Related Issues

#5683

greptile-apps · 2025-12-31T12:08:10Z

Greptile Summary

This PR implements dynamic scale-in for RaySwordfishActor to release idle Ray workers and reduce cluster costs. The implementation adds idle worker retirement logic with configurable thresholds, maintains a minimum survivor count, and uses a blacklist to prevent immediate respawning of released workers.

Key changes:

Added retire_idle_ray_workers() method to WorkerManager that identifies workers idle for longer than DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 60s) and releases them
Integrated downscaling into the scheduler loop with periodic checks every 1 second when no scale-up is needed
Implemented pending_release_blacklist mechanism with TTL to prevent autoscaler from immediately respawning just-released workers
Added head node protection to prevent retirement of workers on the Ray head node (which cannot be scaled down by Ray autoscaler)
Final cleanup on job completion releases all idle workers and clears autoscaling requests
Configuration via environment variables: DAFT_AUTOSCALING_DOWNSCALE_ENABLED (default: true), DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS (default: 1), DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 60), DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS (default: 120)

The implementation is well-tested with comprehensive unit tests for the retirement logic, blacklist mechanism, and integration with the scheduler.
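
The pending-release blacklist with a TTL, as summarized above, could look roughly like the following Python sketch. The class and method names are hypothetical and the real implementation lives in Rust; the injectable clock is only there to make the expiry behaviour easy to demonstrate.

```python
import time


class PendingReleaseBlacklist:
    """Remembers recently released worker IDs for a limited time."""

    def __init__(self, ttl_seconds=120.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # worker_id -> expiry timestamp

    def add(self, worker_id):
        self._expires_at[worker_id] = self._clock() + self._ttl

    def contains(self, worker_id):
        self._evict_expired()
        return worker_id in self._expires_at

    def clear(self):
        # The review summary notes the blacklist is cleared when there is
        # scale-up demand, so new workers are not held back.
        self._expires_at.clear()

    def _evict_expired(self):
        now = self._clock()
        expired = [w for w, t in self._expires_at.items() if t <= now]
        for worker_id in expired:
            del self._expires_at[worker_id]


blacklist = PendingReleaseBlacklist(ttl_seconds=120.0)
blacklist.add("worker-1")
print(blacklist.contains("worker-1"))
```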

Confidence Score: 4/5

Safe to merge with minor edge case considerations around idle detection timing
Implementation is solid with proper head node protection, blacklist mechanism, and comprehensive testing. Score reflects a minor semantic gap where scheduler counts idle workers differently than worker manager filters them (by duration), which could cause fewer retirements than expected but isn't a correctness issue. The defensive programming (checking active tasks before release, clearing blacklist on scale-up demand) and extensive test coverage demonstrate quality work.
No files require special attention. The most complex logic in worker_manager.rs is well-structured with appropriate safety checks.
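
The "semantic gap" mentioned above can be made concrete with a short Python sketch. It assumes the scheduler counts every worker with no active tasks as idle, while the worker manager additionally requires the idle duration to exceed a threshold. The dictionary keys, the function names, and the definition of `allowed_to_retire` as "total workers minus minimum survivors" are illustrative assumptions.

```python
def scheduler_request(workers, min_survivors=1):
    # Scheduler side: a worker counts as idle if it has no active tasks.
    idle_count = sum(1 for w in workers if not w["active_tasks"])
    allowed_to_retire = max(len(workers) - min_survivors, 0)
    return min(idle_count, allowed_to_retire)


def manager_retire(workers, num_to_retire, idle_threshold_s=60.0):
    # Worker-manager side: also require a minimum idle duration,
    # then pick the longest-idle workers first.
    eligible = sorted(
        (w for w in workers
         if not w["active_tasks"] and w["idle_s"] > idle_threshold_s),
        key=lambda w: w["idle_s"],
        reverse=True,
    )
    return [w["id"] for w in eligible[:num_to_retire]]


workers = [
    {"id": "a", "active_tasks": [], "idle_s": 200.0},
    {"id": "b", "active_tasks": [], "idle_s": 5.0},
    {"id": "c", "active_tasks": [], "idle_s": 3.0},
]
requested = scheduler_request(workers)
retired = manager_retire(workers, requested)
print(requested, retired)
```

Here the scheduler asks for two retirements, but only one worker has been idle long enough, so fewer workers are retired than requested. This matches the review's point that the mismatch reduces retirements without affecting correctness.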

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/daft-distributed/src/python/ray/worker_manager.rs | Implements `retire_idle_ray_workers` method with blacklist mechanism, head node protection, and idle duration filtering. Good defensive logic to prevent premature worker respawn. |
| src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs | Integrates downscaling into scheduler loop with environment-based configuration. Counts idle candidates based on empty active tasks, but `retire_idle_ray_workers` applies additional idle duration filter which may cause mismatch. |
| src/daft-distributed/src/python/ray/worker.rs | Adds `ActorState` enum, `is_idle()`, `idle_duration()`, and `release()` methods. Clean implementation with proper state transitions and safety checks. |
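
To make the worker-side methods named in the table easier to picture, here is a minimal Python sketch of a worker with `is_idle()`, `idle_duration()`, and `release()`. Only the `Released` state and the three method names come from the review; the `READY` variant, the task bookkeeping, the boolean return value, and the logging are assumptions for illustration, and the actual code is Rust.

```python
import enum
import logging
import time

logger = logging.getLogger("worker_sketch")


class ActorState(enum.Enum):
    READY = "ready"        # assumed name for the normal running state
    RELEASED = "released"


class Worker:
    def __init__(self, worker_id, clock=time.monotonic):
        self.worker_id = worker_id
        self.state = ActorState.READY
        self.active_tasks = set()
        self._clock = clock
        self._idle_since = clock()

    def start_task(self, task_id):
        self.active_tasks.add(task_id)
        self._idle_since = None

    def finish_task(self, task_id):
        self.active_tasks.discard(task_id)
        if not self.active_tasks:
            self._idle_since = self._clock()

    def is_idle(self):
        return self.state is ActorState.READY and not self.active_tasks

    def idle_duration(self):
        if not self.is_idle():
            return 0.0
        return self._clock() - self._idle_since

    def release(self):
        # Safety check: never release a worker that still has tasks.
        if self.active_tasks:
            logger.debug("skip release of %s: tasks in flight", self.worker_id)
            return False
        self.state = ActorState.RELEASED
        return True


w = Worker("w1")
w.start_task("t1")
print(w.release())   # refused while a task is in flight
w.finish_task("t1")
print(w.release())
```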

Sequence Diagram

sequenceDiagram
    participant S as SchedulerActor
    participant WM as WorkerManager
    participant W as RaySwordfishWorker
    participant Ray as Ray Autoscaler
    
    Note over S: Main scheduler loop (every 1s tick)
    
    S->>WM: worker_snapshots()
    WM-->>S: List of worker states
    
    alt Has pending tasks requiring scale-up
        S->>WM: try_autoscale(bundles)
        WM->>WM: Clear pending_release_blacklist
        WM->>Ray: request_resources(bundles)
    else No scale-up needed AND downscale_enabled
        S->>S: Count idle workers (empty active_task_details)
        S->>S: Calculate num_to_retire = min(idle_count, allowed_to_retire)
        alt num_to_retire > 0
            S->>WM: retire_idle_ray_workers(num_to_retire, false)
            WM->>Ray: get_head_node_id()
            Ray-->>WM: head_node_id
            WM->>WM: Filter candidates: skip head node, check idle duration
            WM->>WM: Sort by longest idle duration
            WM->>WM: Select top N candidates
            loop For each selected worker
                WM->>W: release()
                W->>W: Check no active tasks
                W->>W: shutdown()
                W->>W: Set state to Released
                WM->>WM: Add worker_id to pending_release_blacklist
            end
            WM->>Ray: clear_autoscaling_requests()
        end
    end
    
    Note over S: On job completion (loop exit)
    alt downscale_enabled
        S->>WM: retire_idle_ray_workers(all_workers, true)
        Note over WM: force_all_when_cluster_idle=true
        WM->>Ray: clear_autoscaling_requests()
        WM->>WM: Release all idle workers
    end

greptile-apps

Additional Comments (3)

src/daft-distributed/src/python/ray/worker_manager.rs, line 287-340 (link)

logic: Holding the mutex lock while calling worker.release(py) and Python operations can cause significant lock contention. The state mutex is held from line 288 through line 340, during which Python GIL operations occur (lines 334-340). This blocks other operations like submit_tasks_to_workers unnecessarily.

Consider releasing the lock before Python operations:
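
The suggested code is not preserved here. As a general illustration of the pattern only (not the reviewer's actual suggestion), the Python sketch below selects and detaches workers while holding the lock, then performs the slow release calls after the lock is dropped. The `WorkerManager` class, its fields, and the `retire` method are hypothetical.

```python
import threading


class WorkerManager:
    def __init__(self, workers):
        self._lock = threading.Lock()
        self._workers = dict(workers)  # worker_id -> worker object

    def retire(self, worker_ids):
        # Phase 1: hold the lock only long enough to detach the workers
        # from shared state.
        with self._lock:
            detached = [
                self._workers.pop(worker_id)
                for worker_id in worker_ids
                if worker_id in self._workers
            ]
        # Phase 2: run the potentially slow release calls without the
        # lock, so other operations on the manager are not blocked.
        for worker in detached:
            worker.release()
        return len(detached)


class PrintingWorker:
    def release(self):
        print("released")


manager = WorkerManager({"w1": PrintingWorker()})
print(manager.retire(["w1", "unknown"]))
```

One trade-off of this shape is that a worker is removed from the shared map before its release has completed, so a failed release would need separate handling.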

src/daft-distributed/src/python/ray/worker.rs, line 138-146 (link)

style: The release method silently returns early if there are inflight tasks without setting state or logging. This could lead to confusion during debugging.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs, line 138-144 (link)

style: Environment variable parsing with defaults lacks documentation. Consider adding comments explaining these configuration options and their defaults.

8 files reviewed, 3 comments

codecov · 2025-12-31T12:54:02Z

Codecov Report

❌ Patch coverage is 48.11594% with 179 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.83%. Comparing base (7bec778) to head (fab6e3d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../daft-distributed/src/python/ray/worker_manager.rs | 0.00% | 124 Missing ⚠️ |
| src/daft-distributed/src/python/ray/worker.rs | 0.00% | 36 Missing ⚠️ |
| ...ibuted/src/scheduling/scheduler/scheduler_actor.rs | 77.14% | 8 Missing ⚠️ |
| ...aft-distributed/src/scheduling/scheduler/linear.rs | 0.00% | 6 Missing ⚠️ |
| daft/runners/flotilla.py | 20.00% | 4 Missing ⚠️ |
| src/daft-distributed/src/scheduling/worker.rs | 99.24% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5903      +/-   ##
==========================================
- Coverage   72.91%   72.83%   -0.09%     
==========================================
  Files         973      973              
  Lines      126166   126494     +328     
==========================================
+ Hits        91995    92132     +137     
- Misses      34171    34362     +191

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| ...ft-distributed/src/scheduling/scheduler/default.rs | `88.99% <100.00%> (+0.02%)` | ⬆️ |
| ...c/daft-distributed/src/scheduling/scheduler/mod.rs | `88.04% <ø> (ø)` | |
| src/daft-distributed/src/scheduling/worker.rs | `86.74% <99.24%> (+12.50%)` | ⬆️ |
| daft/runners/flotilla.py | `46.85% <20.00%> (-0.79%)` | ⬇️ |
| ...aft-distributed/src/scheduling/scheduler/linear.rs | `87.50% <0.00%> (-2.25%)` | ⬇️ |
| ...ibuted/src/scheduling/scheduler/scheduler_actor.rs | `89.22% <77.14%> (-0.96%)` | ⬇️ |
| src/daft-distributed/src/python/ray/worker.rs | `0.00% <0.00%> (ø)` | |
| .../daft-distributed/src/python/ray/worker_manager.rs | `0.00% <0.00%> (ø)` | |

... and 6 files with indirect coverage changes

huleilei · 2026-01-20T15:43:53Z

@colin-ho @universalmind303 could you please review this when convenient? Thanks.

srilman

Overall this makes sense to me and I think it would be useful. However, I feel it would make more sense for the worker manager to own the logic that determines which workers should be retired, rather than having the scheduler decide and pass the result into the worker manager.

srilman · 2026-03-31T17:38:16Z

+        // - `DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS`: The minimum number of workers to keep
+        //   running even if they are idle. This prevents the cluster from scaling down to
+        //   zero workers during brief idle periods. Defaults to `1`.
+        let downscale_enabled = std::env::var("DAFT_AUTOSCALING_DOWNSCALE_ENABLED")


Actually, I think these configurations would be useful to have in the `daft.set_runner_ray` configuration, in addition to these environment variables.

madvart · 2026-04-14T19:56:01Z

@huleilei - Checking to see if you are ok working on changes recommended by @srilman

huleilei · 2026-04-15T08:26:35Z

> @huleilei - Checking to see if you are ok working on changes recommended by @srilman

Sorry, I have been reviewing this PR recently. Thank you.

codspeed-hq · 2026-04-25T10:49:50Z

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks¹

Comparing huleilei:hll/auto (55953ef) with main (766f0f8)

¹ 10 benchmarks were skipped, so the baseline results were used instead.

huleilei · 2026-05-06T11:42:59Z

@srilman @madvart Please help me review when you have time, thanks.

DogerW666 · 2026-05-08T03:09:28Z

This PR is also useful for us.

srilman

Overall, LGTM. Had 1 small thing, but once that's addressed, plus the merge conflict, we can merge.

srilman · 2026-06-02T04:58:05Z

+                            "Downscale: retired idle workers"
+                        );
+                    }
+                }


Rather than having this code here, can we move it to the WorkerManager directly? Also, rather than a Ray-specific retire_idle_ray_workers, can we have a generic retire_idle_workers?
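
A rough Python sketch of the shape being requested: the scheduler calls one generic method, and the concrete manager owns the enable flag, survivor floor, and idle threshold. The real code is a Rust trait and struct; the class names, constructor parameters, and return type below are hypothetical.

```python
from abc import ABC, abstractmethod


class WorkerManager(ABC):
    @abstractmethod
    def retire_idle_workers(self):
        """Retire whichever workers this backend considers safely idle."""


class RayLikeWorkerManager(WorkerManager):
    def __init__(self, idle_seconds, enabled=True, min_survivors=1,
                 idle_threshold_s=60.0):
        self.idle_seconds = dict(idle_seconds)  # worker_id -> seconds idle
        self.enabled = enabled
        self.min_survivors = min_survivors
        self.idle_threshold_s = idle_threshold_s

    def retire_idle_workers(self):
        # All gating lives in the manager, not in the scheduler.
        if not self.enabled:
            return []
        allowed = max(len(self.idle_seconds) - self.min_survivors, 0)
        eligible = sorted(
            (wid for wid, secs in self.idle_seconds.items()
             if secs > self.idle_threshold_s),
            key=self.idle_seconds.get,
            reverse=True,
        )
        retired = eligible[:allowed]
        for worker_id in retired:
            del self.idle_seconds[worker_id]
        return retired


def scheduler_tick(manager, needs_scale_up):
    # The scheduler only decides whether this tick is a downscale tick.
    return [] if needs_scale_up else manager.retire_idle_workers()


manager = RayLikeWorkerManager({"a": 300.0, "b": 90.0, "c": 5.0})
print(scheduler_tick(manager, needs_scale_up=False))
```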

madvart · 2026-06-03T16:08:55Z

@huleilei - Thanks for your patience and work on this. Once you are able to resolve the last comment and the merge conflicts, we can merge this. Thanks again.

@srilman

Adds an opt-in Ray downscaling (scale-in) mechanism by retiring idle Flotilla workers to help Ray autoscaler shrink clusters when workloads become idle.

Highlights:
- Retire idle Ray workers with configurable idle threshold and a min-survivor floor.
- Head-node protection and a pending-release blacklist to avoid immediate respawn.
- Expose configuration via `daft.set_runner_ray(...)` and environment variables.
- Keep scale-up behavior aligned with upstream high-water-mark ramp-up logic.
- Document autoscaling/downscaling in `docs/distributed/ray.md`.

Follow-ups on top of upstream PR Eventual-Inc#5903:
- Resolve merge conflict with `main`: adopt the new `(scheduled, cancelled)` tuple from `Scheduler::schedule_tasks` and the `worker_id` field on `TaskEvent::Scheduled`.
- Address @srilman review: move the per-tick downscale gating from `scheduler_actor.rs` into the generic `WorkerManager::retire_idle_workers`; `RayWorkerManager` now owns the enable flag, min-survivor floor, idle thresholds, head-node protection and blacklist TTLs.

huleilei · 2026-06-05T02:55:04Z

@madvart @srilman Hello, I have made the revisions. Thank you for reviewing them again. Best regards

huleilei · 2026-06-09T08:05:48Z

@madvart @srilman Hello, I have made the revisions. Thank you for reviewing them again. Best regards

* origin/main: (115 commits)
  feat: add ignore_corrupt_files option to read_parquet, read_csv and read_iceberg (Eventual-Inc#6520)
  fix(deps): gate vllm to Linux so macOS/Windows resolve without CUDA wheels (Eventual-Inc#7095)
  fix: pass options in Gravitino PostgreSQL read method (Eventual-Inc#7047)
  feat(ray): Implement dynamic scale-in for RaySwordfishActor (Eventual-Inc#5903)
  feat(delta-lake): support column mapping for reads (Eventual-Inc#7005)
  feat(functions): add string distance/similarity functions (Eventual-Inc#7068)
  test(parquet): cover read_parquet edge cases (Eventual-Inc#7085)
  refactor(checkpoint): drop "seal" vocabulary from Rust API surface (Eventual-Inc#7078)
  fix(asof-join): use unknown clustering spec instead of hash (Eventual-Inc#7075)
  docs: standardize Slack links to use daft.ai/slack (Eventual-Inc#7066)
  feat: add try_cast function for safe type conversion (Eventual-Inc#6960)
  refactor(file): rename File byte-range fields to position/size (Eventual-Inc#6747)
  fix(ray): configure worker startup timeout on runner (Eventual-Inc#7055)
  feat(shuffle): default flight shuffle compression to lz4 (Eventual-Inc#7071)
  feat(iceberg): support branch and tag reads (Eventual-Inc#7042)
  fix(shuffle): concat recordbatches before repartition (Eventual-Inc#7064)
  perf: update jemalloc 5.3.0 → 5.3.1 to fix muzzy decay performance bug (Eventual-Inc#7059)
  feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join (Eventual-Inc#7067)
  fix(flight-shuffle): reduce coordinator memory to O(map_tasks + partitions) (Eventual-Inc#7056)
  refactor(distributed): rename needs_hash_repartition to can_skip_hash_repartition (Eventual-Inc#7053)
  ...

# Conflicts:
# daft/checkpoint.py
# src/daft-distributed/src/pipeline_node/limit.rs
# src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs
# src/daft-distributed/src/scheduling/task.rs
# src/daft-local-execution/src/pipeline.rs
# src/daft-local-execution/src/sinks/blocking_sink.rs
# src/daft-local-execution/src/sources/scan_task.rs

github-actions Bot added the feat label Dec 31, 2025

huleilei mentioned this pull request Dec 31, 2025

WIP： feat: add downscale support via idle worker retirement in flotilla mode #5516

Closed

4 tasks

huleilei marked this pull request as draft December 31, 2025 12:05

greptile-apps Bot reviewed Dec 31, 2025

huleilei force-pushed the hll/auto branch 2 times, most recently from 1dc006a to e801b78 Compare January 14, 2026 13:22

huleilei force-pushed the hll/auto branch from e801b78 to fab6e3d Compare January 20, 2026 12:39

huleilei marked this pull request as ready for review January 20, 2026 14:12

madvart requested a review from srilman March 20, 2026 00:48

madvart requested a review from desmondcheongzx March 31, 2026 20:23

srilman reviewed Mar 31, 2026

desmondcheongzx removed their request for review April 21, 2026 00:12

huleilei force-pushed the hll/auto branch from fab6e3d to ee638fe Compare April 25, 2026 09:28

huleilei requested a review from a team as a code owner April 25, 2026 09:28

huleilei force-pushed the hll/auto branch from 50a23ce to 9f3d2c2 Compare May 6, 2026 02:32

srilman approved these changes Jun 2, 2026

huleilei force-pushed the hll/auto branch from 9f3d2c2 to db17f30 Compare June 4, 2026 11:46

Merge branch 'main' into hll/auto

55953ef

madvart merged commit 69fe81e into Eventual-Inc:main Jun 9, 2026
64 of 66 checks passed
