feat(ray): Implement dynamic scale-in for RaySwordfishActor #5903

Merged: madvart merged 2 commits into Eventual-Inc:main from huleilei:hll/auto on Jun 9, 2026

Conversation

huleilei (Collaborator) commented Dec 31, 2025

Changes Made

This commit implements the dynamic scaling down (scale-in) functionality for RaySwordfishActor to release idle resources.

  • Idle Worker Retirement: Implemented retire_idle_ray_workers in RayWorkerManager. It identifies workers that have been idle for longer than DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 60s) and releases them.
  • Scheduler Integration: The SchedulerActor loop now periodically checks for idle workers and triggers retirement while maintaining a minimum survivor count (DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS, default: 1).
  • Blacklist Mechanism: Added a pending_release_blacklist to prevent the autoscaler from immediately respawning workers that were just released.
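
The blacklist idea above can be sketched as a map from worker ID to expiry time. This is a minimal illustration, not the PR's code: the `PendingReleaseBlacklist` type and its methods are hypothetical, and only the TTL concept and the name `pending_release_blacklist` come from the description.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL-based blacklist of recently released worker IDs.
pub struct PendingReleaseBlacklist {
    ttl: Duration,
    // worker_id -> instant at which the entry stops being excluded
    expires_at: HashMap<String, Instant>,
}

impl PendingReleaseBlacklist {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, expires_at: HashMap::new() }
    }

    /// Records a worker that was just released.
    pub fn insert(&mut self, worker_id: &str, now: Instant) {
        self.expires_at.insert(worker_id.to_string(), now + self.ttl);
    }

    /// Drops expired entries, then reports whether the worker is still excluded.
    pub fn contains(&mut self, worker_id: &str, now: Instant) -> bool {
        self.expires_at.retain(|_, expiry| *expiry > now);
        self.expires_at.contains_key(worker_id)
    }

    /// Clears everything, for example when new scale-up demand arrives.
    pub fn clear(&mut self) {
        self.expires_at.clear();
    }
}
```

Passing `now` explicitly keeps the sketch testable without sleeping; a real implementation might call `Instant::now()` internally instead.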

Configuration

New environment variables added:

  • DAFT_AUTOSCALING_DOWNSCALE_ENABLED: Enable/disable downscaling (default: true).
  • DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS: Seconds a worker must be idle before retirement (default: 60).
  • DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS: Minimum number of workers to keep alive (default: 1).
  • DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS: TTL for blacklisted worker IDs (default: 120).
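
As a rough illustration of how these four settings might be read, here is a minimal sketch. The `DownscaleConfig` struct, its field names, and the `from_lookup` helper are hypothetical and not taken from the PR; only the variable names and defaults come from the list above. How malformed values are handled is an assumption.

```rust
use std::time::Duration;

/// Hypothetical container for the downscaling settings listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct DownscaleConfig {
    pub enabled: bool,
    pub idle_threshold: Duration,
    pub min_survivor_workers: usize,
    pub pending_release_exclude: Duration,
}

impl DownscaleConfig {
    /// Builds the config from any key -> value lookup, so it can be fed by
    /// `std::env::var` in production and by a closure in tests.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        // Assumption: a missing or unparsable number falls back to the default.
        let secs = |key: &str, default: u64| {
            let parsed = lookup(key).and_then(|v| v.trim().parse::<u64>().ok());
            Duration::from_secs(parsed.unwrap_or(default))
        };
        // Assumption: only "1" and "true" (any case) enable the feature when set.
        let enabled = lookup("DAFT_AUTOSCALING_DOWNSCALE_ENABLED")
            .map(|v| matches!(v.trim().to_ascii_lowercase().as_str(), "1" | "true"))
            .unwrap_or(true);
        let min_survivor_workers = lookup("DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS")
            .and_then(|v| v.trim().parse::<usize>().ok())
            .unwrap_or(1);
        Self {
            enabled,
            idle_threshold: secs("DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS", 60),
            min_survivor_workers,
            pending_release_exclude: secs("DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS", 120),
        }
    }

    /// Reads the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Calling `DownscaleConfig::from_env()` with none of the variables set yields the documented defaults: enabled, 60 seconds idle, 1 survivor, 120 seconds exclusion.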

Related Issues

#5683

greptile-apps Bot (Contributor) commented Dec 31, 2025

Greptile Summary

This PR implements dynamic scale-in for RaySwordfishActor to release idle Ray workers and reduce cluster costs. The implementation adds idle worker retirement logic with configurable thresholds, maintains a minimum survivor count, and uses a blacklist to prevent immediate respawning of released workers.

Key changes:

  • Added retire_idle_ray_workers() method to WorkerManager that identifies workers idle for longer than DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 60s) and releases them
  • Integrated downscaling into the scheduler loop with periodic checks every 1 second when no scale-up is needed
  • Implemented pending_release_blacklist mechanism with TTL to prevent autoscaler from immediately respawning just-released workers
  • Added head node protection to prevent retirement of workers on the Ray head node (which cannot be scaled down by Ray autoscaler)
  • Final cleanup on job completion releases all idle workers and clears autoscaling requests
  • Configuration via environment variables: DAFT_AUTOSCALING_DOWNSCALE_ENABLED (default: true), DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS (default: 1), DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 60), DAFT_AUTOSCALING_PENDING_RELEASE_EXCLUDE_SECONDS (default: 120)

The implementation is well-tested with comprehensive unit tests for the retirement logic, blacklist mechanism, and integration with the scheduler.

Confidence Score: 4/5

  • Safe to merge with minor edge case considerations around idle detection timing
  • Implementation is solid with proper head node protection, blacklist mechanism, and comprehensive testing. Score reflects a minor semantic gap where scheduler counts idle workers differently than worker manager filters them (by duration), which could cause fewer retirements than expected but isn't a correctness issue. The defensive programming (checking active tasks before release, clearing blacklist on scale-up demand) and extensive test coverage demonstrate quality work.
  • No files require special attention. The most complex logic in worker_manager.rs is well-structured with appropriate safety checks.
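
The semantic gap mentioned above can be made concrete with a small sketch. It assumes that `allowed_to_retire` is the number of workers above the survivor floor; that definition and both function names are illustrative, not taken from the PR. The scheduler sizes its request from workers with no active tasks, while the manager only accepts workers idle longer than the threshold, so fewer workers may be retired than requested.

```rust
use std::time::Duration;

/// Scheduler-side request size: min(idle_count, allowed_to_retire).
/// Assumption: `allowed_to_retire` is the worker count above the survivor floor.
pub fn num_to_retire(idle_count: usize, total_workers: usize, min_survivors: usize) -> usize {
    let allowed_to_retire = total_workers.saturating_sub(min_survivors);
    idle_count.min(allowed_to_retire)
}

/// Manager-side filter: only workers idle for longer than `threshold` qualify.
pub fn num_eligible(idle_durations: &[Duration], threshold: Duration) -> usize {
    idle_durations.iter().filter(|d| **d > threshold).count()
}
```

With four workers, a floor of one, and three workers idle for 5 s, 20 s, and 90 s, the scheduler would request three retirements, but only the 90 s worker passes a 60 s threshold. The outcome is fewer retirements on that tick, not an incorrect one.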

Important Files Changed

| Filename | Overview |
|---|---|
| src/daft-distributed/src/python/ray/worker_manager.rs | Implements retire_idle_ray_workers method with blacklist mechanism, head node protection, and idle duration filtering. Good defensive logic to prevent premature worker respawn. |
| src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs | Integrates downscaling into scheduler loop with environment-based configuration. Counts idle candidates based on empty active tasks, but retire_idle_ray_workers applies additional idle duration filter which may cause mismatch. |
| src/daft-distributed/src/python/ray/worker.rs | Adds ActorState enum, is_idle(), idle_duration(), and release() methods. Clean implementation with proper state transitions and safety checks. |

Sequence Diagram

sequenceDiagram
    participant S as SchedulerActor
    participant WM as WorkerManager
    participant W as RaySwordfishWorker
    participant Ray as Ray Autoscaler
    
    Note over S: Main scheduler loop (every 1s tick)
    
    S->>WM: worker_snapshots()
    WM-->>S: List of worker states
    
    alt Has pending tasks requiring scale-up
        S->>WM: try_autoscale(bundles)
        WM->>WM: Clear pending_release_blacklist
        WM->>Ray: request_resources(bundles)
    else No scale-up needed AND downscale_enabled
        S->>S: Count idle workers (empty active_task_details)
        S->>S: Calculate num_to_retire = min(idle_count, allowed_to_retire)
        alt num_to_retire > 0
            S->>WM: retire_idle_ray_workers(num_to_retire, false)
            WM->>Ray: get_head_node_id()
            Ray-->>WM: head_node_id
            WM->>WM: Filter candidates: skip head node, check idle duration
            WM->>WM: Sort by longest idle duration
            WM->>WM: Select top N candidates
            loop For each selected worker
                WM->>W: release()
                W->>W: Check no active tasks
                W->>W: shutdown()
                W->>W: Set state to Released
                WM->>WM: Add worker_id to pending_release_blacklist
            end
            WM->>Ray: clear_autoscaling_requests()
        end
    end
    
    Note over S: On job completion (loop exit)
    alt downscale_enabled
        S->>WM: retire_idle_ray_workers(all_workers, true)
        Note over WM: force_all_when_cluster_idle=true
        WM->>Ray: clear_autoscaling_requests()
        WM->>WM: Release all idle workers
    end
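
The candidate-selection steps in the diagram (skip the head node, check idle duration, sort by longest idle, take the top N) can be sketched as below. The `WorkerInfo` struct and the function name are hypothetical stand-ins for illustration; the actual types in `worker_manager.rs` are not shown in this conversation.

```rust
use std::time::Duration;

/// Hypothetical snapshot of one worker, for illustration only.
#[derive(Debug, Clone)]
pub struct WorkerInfo {
    pub worker_id: String,
    pub node_id: String,
    pub active_tasks: usize,
    pub idle_for: Duration,
}

/// Picks up to `num_to_retire` workers: skips the head node and busy workers,
/// keeps those idle longer than `threshold`, and prefers the longest idle.
pub fn select_retirement_candidates(
    workers: &[WorkerInfo],
    head_node_id: &str,
    threshold: Duration,
    num_to_retire: usize,
) -> Vec<String> {
    let mut candidates: Vec<&WorkerInfo> = workers
        .iter()
        .filter(|w| w.node_id != head_node_id)
        .filter(|w| w.active_tasks == 0 && w.idle_for > threshold)
        .collect();
    // Longest idle first.
    candidates.sort_by(|a, b| b.idle_for.cmp(&a.idle_for));
    candidates
        .into_iter()
        .take(num_to_retire)
        .map(|w| w.worker_id.clone())
        .collect()
}
```

Sorting by longest idle first means the workers least likely to be needed again soon are released first.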

greptile-apps Bot (Contributor) left a comment

Additional Comments (3)

  1. src/daft-distributed/src/python/ray/worker_manager.rs, lines 287-340

    logic: Holding the mutex lock while calling worker.release(py) and Python operations can cause significant lock contention. The state mutex is held from line 288 through line 340, during which Python GIL operations occur (lines 334-340). This blocks other operations like submit_tasks_to_workers unnecessarily.

    Consider releasing the lock before Python operations:

  2. src/daft-distributed/src/python/ray/worker.rs, lines 138-146

    style: The release method silently returns early if there are in-flight tasks without setting state or logging. This could lead to confusion during debugging.

  3. src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs, lines 138-144

    style: Environment variable parsing with defaults lacks documentation. Consider adding comments explaining these configuration options and their defaults.

8 files reviewed, 3 comments
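
The lock-contention concern in comment 1 can be illustrated with a general pattern: pick the candidates while holding the mutex, drop the guard, and only then run the slow release step. This is a sketch under assumptions, not the bot's original suggestion and not the PR's code. `ManagerState` and the `release` callback are hypothetical, and the callback merely stands in for the Python/GIL work.

```rust
use std::sync::Mutex;

/// Hypothetical manager state guarded by a mutex.
pub struct ManagerState {
    pub idle_worker_ids: Vec<String>,
}

/// Takes the candidates out while holding the lock, then runs the slow
/// release step (standing in for the Python/GIL work) after the guard is dropped.
pub fn retire_without_holding_lock(
    state: &Mutex<ManagerState>,
    num_to_retire: usize,
    mut release: impl FnMut(&str),
) -> usize {
    let selected: Vec<String> = {
        let mut guard = state.lock().expect("state mutex poisoned");
        let n = num_to_retire.min(guard.idle_worker_ids.len());
        let taken: Vec<String> = guard.idle_worker_ids.drain(..n).collect();
        taken
    }; // guard dropped here; other threads can lock `state` again

    for worker_id in &selected {
        release(worker_id.as_str());
    }
    selected.len()
}
```

The trade-off is that state can change between selection and release, so a real implementation would still need the per-worker check for active tasks that the review mentions.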

codecov Bot commented Dec 31, 2025

Codecov Report

❌ Patch coverage is 48.11594% with 179 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.83%. Comparing base (7bec778) to head (fab6e3d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../daft-distributed/src/python/ray/worker_manager.rs | 0.00% | 124 Missing ⚠️ |
| src/daft-distributed/src/python/ray/worker.rs | 0.00% | 36 Missing ⚠️ |
| ...ibuted/src/scheduling/scheduler/scheduler_actor.rs | 77.14% | 8 Missing ⚠️ |
| ...aft-distributed/src/scheduling/scheduler/linear.rs | 0.00% | 6 Missing ⚠️ |
| daft/runners/flotilla.py | 20.00% | 4 Missing ⚠️ |
| src/daft-distributed/src/scheduling/worker.rs | 99.24% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5903      +/-   ##
==========================================
- Coverage   72.91%   72.83%   -0.09%     
==========================================
  Files         973      973              
  Lines      126166   126494     +328     
==========================================
+ Hits        91995    92132     +137     
- Misses      34171    34362     +191     
| Files with missing lines | Coverage Δ |
|---|---|
| ...ft-distributed/src/scheduling/scheduler/default.rs | 88.99% <100.00%> (+0.02%) ⬆️ |
| ...c/daft-distributed/src/scheduling/scheduler/mod.rs | 88.04% <ø> (ø) |
| src/daft-distributed/src/scheduling/worker.rs | 86.74% <99.24%> (+12.50%) ⬆️ |
| daft/runners/flotilla.py | 46.85% <20.00%> (-0.79%) ⬇️ |
| ...aft-distributed/src/scheduling/scheduler/linear.rs | 87.50% <0.00%> (-2.25%) ⬇️ |
| ...ibuted/src/scheduling/scheduler/scheduler_actor.rs | 89.22% <77.14%> (-0.96%) ⬇️ |
| src/daft-distributed/src/python/ray/worker.rs | 0.00% <0.00%> (ø) |
| .../daft-distributed/src/python/ray/worker_manager.rs | 0.00% <0.00%> (ø) |

... and 6 files with indirect coverage changes

huleilei force-pushed the hll/auto branch 2 times, most recently from 1dc006a to e801b78 (January 14, 2026 13:22)
huleilei marked this pull request as ready for review (January 20, 2026 14:12)

huleilei (Collaborator, Author) commented Jan 20, 2026

@colin-ho @universalmind303 please help review this when it is convenient for you. Thanks.

madvart requested a review from srilman (March 20, 2026 00:48)
madvart requested a review from desmondcheongzx (March 31, 2026 20:23)

srilman (Contributor) left a comment

Overall, this makes sense to me and I think it would be useful. However, I feel it would make more sense for the worker manager to own the logic that determines which workers should be retired, rather than having the scheduler decide and pass the result into the worker manager.

// - `DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS`: The minimum number of workers to keep
// running even if they are idle. This prevents the cluster from scaling down to
// zero workers during brief idle periods. Defaults to `1`.
let downscale_enabled = std::env::var("DAFT_AUTOSCALING_DOWNSCALE_ENABLED")

Actually, I think these options would also be useful to expose in the daft.set_runner_ray configuration, in addition to these environment variables.

madvart (Contributor) commented Apr 14, 2026

@huleilei - Checking to see if you are ok working on changes recommended by @srilman

huleilei (Collaborator, Author) commented

> @huleilei - Checking to see if you are ok working on changes recommended by @srilman

Sorry, I have been reviewing this PR recently. Thank you

desmondcheongzx removed their request for review (April 21, 2026 00:12)
huleilei requested a review from a team as a code owner (April 25, 2026 09:28)

codspeed-hq Bot commented Apr 25, 2026

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks [1]

Comparing huleilei:hll/auto (55953ef) with main (766f0f8)

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

huleilei (Collaborator, Author) commented May 6, 2026

@srilman @madvart Please help me review when you have time, thanks.

DogerW666 commented

This PR is also useful for us.

srilman (Contributor) left a comment

Overall, LGTM. I had one small thing, but once that's addressed, along with the merge conflict, we can merge.

"Downscale: retired idle workers"
);
}
}

Rather than having this code here, can we move it to the WorkerManager directly? Also, rather than a Ray-specific retire_idle_ray_workers, can we have a generic retire_idle_workers?
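
One way to read this suggestion is sketched below: the scheduler calls a single generic method each tick, and the Ray-specific manager owns the policy. All signatures and fields here are simplified assumptions for illustration; the real `WorkerManager` trait and `RayWorkerManager` have other members that this conversation does not show.

```rust
/// Hypothetical, simplified trait; the real `WorkerManager` has other methods.
pub trait WorkerManager {
    /// Retires idle workers according to the manager's own policy and
    /// returns how many were retired.
    fn retire_idle_workers(&mut self) -> usize;
}

/// Hypothetical Ray-backed manager that owns the downscale policy.
pub struct RayWorkerManager {
    pub downscale_enabled: bool,
    pub min_survivor_workers: usize,
    /// (worker_id, is_idle_past_threshold)
    pub workers: Vec<(String, bool)>,
}

impl WorkerManager for RayWorkerManager {
    fn retire_idle_workers(&mut self) -> usize {
        if !self.downscale_enabled {
            return 0;
        }
        // Never go below the survivor floor.
        let allowed = self.workers.len().saturating_sub(self.min_survivor_workers);
        let mut retired = 0;
        self.workers.retain(|(_, idle)| {
            if *idle && retired < allowed {
                retired += 1;
                false // drop this worker
            } else {
                true
            }
        });
        retired
    }
}

/// The scheduler only calls the generic method on each tick.
pub fn scheduler_tick(manager: &mut dyn WorkerManager) -> usize {
    manager.retire_idle_workers()
}
```

With this shape, the scheduler no longer needs to know about enable flags, survivor floors, or idle thresholds.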

madvart (Contributor) commented Jun 3, 2026

@huleilei - Thanks for your patience and work on this. Once you are able to resolve the last comment and the merge conflicts, we can merge this. Thanks again.

Adds an opt-in Ray downscaling (scale-in) mechanism by retiring idle Flotilla workers to help Ray autoscaler shrink clusters when workloads become idle.

Highlights:
- Retire idle Ray workers with configurable idle threshold and a min-survivor floor.
- Head-node protection and a pending-release blacklist to avoid immediate respawn.
- Expose configuration via `daft.set_runner_ray(...)` and environment variables.
- Keep scale-up behavior aligned with upstream high-water-mark ramp-up logic.
- Document autoscaling/downscaling in `docs/distributed/ray.md`.

Follow-ups on top of upstream PR Eventual-Inc#5903:
- Resolve merge conflict with `main`: adopt the new `(scheduled, cancelled)` tuple from `Scheduler::schedule_tasks` and the `worker_id` field on `TaskEvent::Scheduled`.
- Address @srilman review: move the per-tick downscale gating from `scheduler_actor.rs` into the generic `WorkerManager::retire_idle_workers`; `RayWorkerManager` now owns the enable flag, min-survivor floor, idle thresholds, head-node protection and blacklist TTLs.

huleilei (Collaborator, Author) commented Jun 5, 2026

@madvart @srilman Hello, I have made the revisions. Thank you for reviewing them again. Best regards

huleilei (Collaborator, Author) commented Jun 9, 2026

@madvart @srilman Hello, I have made the revisions. Thank you for reviewing them again. Best regards

madvart merged commit 69fe81e into Eventual-Inc:main on Jun 9, 2026
64 of 66 checks passed
chenghuichen added a commit to chenghuichen/Daft that referenced this pull request Jun 10, 2026