WIP： feat: add downscale support via idle worker retirement in flotilla mode by huleilei · Pull Request #5516 · Eventual-Inc/Daft

huleilei · 2025-11-09T14:32:53Z

Root cause

Autoscaling triggers scale-up only when the backlog ratio exceeds a threshold, and RayWorkerManager enforces a monotonic upper bound (a resource request must exceed the historical max).
Daft uses ray.autoscaler.sdk.request_resources, which only requests incremental capacity; because idle Swordfish workers are never proactively retired, Ray never sees nodes idle enough to downscale.

What’s changed

Scheduler loop (scheduler_actor.rs):
- New env configs:
  - DAFT_AUTOSCALING_DOWNSCALE_ENABLED (default: true)
  - DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS (default: 1)
WorkerManager trait: added retire_idle_workers(max_to_retire) to enable safe downscale.
RayWorkerManager:
- Implemented retire_idle_workers: selects idle workers (no active tasks) whose idle duration exceeds DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 120s), and shuts down corresponding RaySwordfishActors.
- Kept existing scale-up logic via request_resources with monotonic upper bound.
RaySwordfishWorker:
- Added last_task_finished_at and idle detection helpers (is_idle, idle_duration(now)). Updated mark_task_finished to record last activity.
- Preserved shutdown() binding to kill the underlying actor.
Scheduler (Default/Linear):
- Added get_backlog_resource_requests API to expose backlog demand to the scheduler actor for downscale decision.
Flotilla runner: documented that try_autoscale proxies Ray’s scale-up-only API; downscale is driven by proactive worker retirement.
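
For readers who want the idle-tracking idea in a compact form, here is a rough Python sketch (the real change lives in Rust). `WorkerSketch`, `select_workers_to_retire`, and the `started_at` fallback are hypothetical names invented for illustration; only `last_task_finished_at`, `is_idle`, `idle_duration(now)`, `mark_task_finished`, and the 120s default come from the description above. Picking the longest-idle workers first is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerSketch:
    """Hypothetical stand-in for a worker handle; not the actual Daft type."""

    worker_id: str
    active_tasks: int = 0
    last_task_finished_at: Optional[float] = None
    started_at: float = 0.0  # assumed fallback for workers that never ran a task

    def mark_task_finished(self, now: float) -> None:
        # Record the last activity so idle time can be measured later.
        self.active_tasks = max(0, self.active_tasks - 1)
        self.last_task_finished_at = now

    def is_idle(self) -> bool:
        return self.active_tasks == 0

    def idle_duration(self, now: float) -> float:
        if not self.is_idle():
            return 0.0
        since = self.started_at if self.last_task_finished_at is None else self.last_task_finished_at
        return max(0.0, now - since)


def select_workers_to_retire(
    workers: List[WorkerSketch],
    now: float,
    idle_threshold_s: float = 120.0,
    max_to_retire: int = 1,
) -> List[str]:
    # Only workers with no active tasks and idle longer than the threshold qualify.
    candidates = [w for w in workers if w.is_idle() and w.idle_duration(now) > idle_threshold_s]
    # Assumption: retire the longest-idle workers first.
    candidates.sort(key=lambda w: w.idle_duration(now), reverse=True)
    return [w.worker_id for w in candidates[:max_to_retire]]
```

A worker that finished its last task 50 seconds ago is skipped under the default threshold, while one that has been idle for 200 seconds is eligible.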

Changes Made

Related Issues

Checklist

Documented in API Docs (if applicable)
Documented in User Guide (if applicable)
If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
Documentation builds and is formatted properly

greptile-apps

Greptile Overview

Greptile Summary

This PR implements bidirectional autoscaling for Daft's flotilla mode by adding downscale support via idle worker retirement. Previously, the system could only scale up; now it can also scale down when utilization is low.

Key changes:

Scheduler actor (scheduler_actor.rs): Added ratio-based downscale logic that monitors backlog/capacity ratio and retires idle workers when ratio stays below 0.75 for 10+ ticks. Includes bootstrap expansion for zero-capacity clusters and global idle detection for full cleanup.
Worker manager (worker_manager.rs): Implemented retire_idle_workers() and release_idle_actors() methods with idle duration tracking (default 120s threshold). Scale-up maintains monotonic upper bound to prevent Ray autoscaler conflicts.
Worker tracking (worker.rs): Added last_task_finished_at, is_idle(), and idle_duration() to track worker idle state for safe retirement decisions.
Python integration (flotilla.py): Added actor lifecycle management functions: clear_autoscaling_requests(), sweep_force_release_swordfish_actors(), and force_release_swordfish_actor() to properly clean up Ray actors.
Configuration: 6 new environment variables for tuning downscale behavior (thresholds, stability windows, limits).
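
A minimal sketch of how the environment-driven settings might be read, assuming the three variable names and defaults given in the PR description (the remaining variables are not named in this thread, so they are left out). `DownscaleConfigSketch`, `load_downscale_config`, and the accepted boolean spellings are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DownscaleConfigSketch:
    enabled: bool
    min_survivor_workers: int
    idle_seconds: float


def load_downscale_config(env: Optional[Mapping[str, str]] = None) -> DownscaleConfigSketch:
    # Accept an explicit mapping so the function can be exercised without touching os.environ.
    env = os.environ if env is None else env
    raw_enabled = env.get("DAFT_AUTOSCALING_DOWNSCALE_ENABLED", "true")
    return DownscaleConfigSketch(
        # Assumption: "1", "true", "yes" (case-insensitive) all mean enabled.
        enabled=raw_enabled.strip().lower() in ("1", "true", "yes"),
        min_survivor_workers=int(env.get("DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS", "1")),
        idle_seconds=float(env.get("DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS", "120")),
    )
```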

How it works:

Scale-up continues via existing ratio-based autoscaling (ratio > 1.25)
Ratio-based downscale: When backlog/capacity < 0.75 for 10 ticks, retire up to 10% of idle workers
Global idle: When backlog=0 and no inflight tasks, release all idle workers
Finalize: Clear all idle actors and reset Ray demand on job completion
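
The per-tick decision summarized above can be sketched roughly as follows. This is not the PR's Rust code: `DownscalePolicySketch` and its return values are hypothetical, and making the branches mutually exclusive, rounding the 10% quota down, and retiring at least one worker are assumptions of this sketch. The 1.25 and 0.75 ratios, the 10-tick window, the 10% cap, and the minimum-survivor default of 1 come from this thread.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DownscalePolicySketch:
    scale_up_ratio: float = 1.25
    downscale_ratio: float = 0.75
    stable_ticks: int = 10
    retire_fraction: float = 0.10
    min_survivors: int = 1
    below_ticks: int = 0  # consecutive ticks with ratio below downscale_ratio

    def tick(self, backlog: float, capacity: float, num_workers: int,
             num_idle: int, inflight: int) -> Tuple[str, int]:
        # Global idle: nothing queued and nothing running, so release every idle worker.
        if backlog == 0 and inflight == 0:
            self.below_ticks = 0
            return ("release_all_idle", num_idle) if num_idle > 0 else ("hold", 0)
        # Bootstrap: zero capacity with pending work has no meaningful ratio.
        if capacity <= 0:
            self.below_ticks = 0
            return ("scale_up", 0) if backlog > 0 else ("hold", 0)
        ratio = backlog / capacity
        if ratio > self.scale_up_ratio:
            self.below_ticks = 0
            return ("scale_up", 0)
        if ratio < self.downscale_ratio:
            self.below_ticks += 1
            if self.below_ticks >= self.stable_ticks and num_workers > self.min_survivors:
                self.below_ticks = 0
                # Assumption: round the quota down but retire at least one worker.
                quota = max(1, int(num_idle * self.retire_fraction))
                n = min(quota, num_idle, num_workers - self.min_survivors)
                if n > 0:
                    return ("retire", n)
            return ("hold", 0)
        # Ratio sits between the two thresholds: reset the stability window.
        self.below_ticks = 0
        return ("hold", 0)
```

Returning a single action per tick is one way to avoid two downscale paths firing in the same tick.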

The implementation includes comprehensive tests covering bootstrap expansion, ratio-based scaling, and downscale stabilization.

Confidence Score: 3/5

This PR introduces complex autoscaling logic with potential race conditions between downscale paths
Score reflects two critical logical issues: (1) potential race between ratio-based and global idle downscale branches that could attempt to release workers twice in the same tick, and (2) stale worker snapshot state after ratio-based downscale could cause incorrect global idle detection. The core implementation is sound with good test coverage, but these edge cases need resolution before merge.
Pay close attention to scheduler_actor.rs lines 290-370 where the downscale logic has potential race conditions between the ratio-based and global idle branches

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs | 3/5 | Added downscale logic with environment-based configuration, stability windows, and worker retirement. Includes bootstrap expansion debounce and global idle detection. Logic appears sound but there are potential race conditions between ratio-based and global idle branches. |
| src/daft-distributed/src/python/ray/worker_manager.rs | 4/5 | Implements idle worker retirement with configurable thresholds, monotonic upper bound for scale-up, and proper cleanup via `sweep_force_release_swordfish_actors`. The `release_idle_actors` and `retire_idle_workers` methods provide controlled downscaling. |
| daft/runners/flotilla.py | 4/5 | Added `clear_autoscaling_requests()`, `list_swordfish_actors()`, `force_release_swordfish_actor()`, and `sweep_force_release_swordfish_actors()` functions for actor lifecycle management. Uses deprecated `ray.state.actors()` with warning. Retry logic present for robustness. |

Sequence Diagram

sequenceDiagram
    participant Scheduler as Scheduler Actor
    participant WM as Worker Manager
    participant Ray as Ray Autoscaler
    participant Workers as Ray Workers
    
    Note over Scheduler: Every tick (1s interval)
    
    Scheduler->>WM: worker_snapshots()
    WM-->>Scheduler: Current worker state
    
    Scheduler->>Scheduler: Calculate backlog/capacity ratio
    
    alt Bootstrap: Zero capacity + backlog > 0
        Scheduler->>WM: try_autoscale(backlog_requests)
        WM->>Ray: request_resources(bundles)
        Ray-->>Workers: Provision new nodes
    end
    
    alt Scale-up: ratio > autoscaling_threshold
        Scheduler->>WM: try_autoscale(pending_tasks)
        WM->>Ray: request_resources(bundles)
        Note over Scheduler: Reset downscale_ticks to 0
    end
    
    alt Ratio-based downscale: ratio < 0.75 for 10+ ticks
        Scheduler->>Scheduler: Increment downscale_below_threshold_ticks
        alt Stable & num_workers > min_survivors
            Scheduler->>WM: release_idle_actors(num_to_retire, false)
            WM->>WM: Select idle workers (idle > 120s)
            WM->>Workers: shutdown() idle actors
            WM->>Ray: sweep_force_release_swordfish_actors()
            Note over Scheduler: Reset downscale_ticks to 0
        end
    end
    
    alt Global idle: backlog=0 & no inflight tasks
        Scheduler->>WM: release_idle_actors(all, true)
        WM->>Workers: shutdown() all idle actors
        Scheduler->>WM: try_autoscale([])
        WM->>Ray: request_resources([]) to clear demand
        Note over Scheduler: Reset downscale_ticks to 0
    end
    
    Note over Scheduler: Job completion
    Scheduler->>WM: release_idle_actors(all, true)
    Scheduler->>WM: try_autoscale([])
    WM->>Ray: Clear resource requests

9 files reviewed, 4 comments

greptile-apps

Greptile Overview

Greptile Summary

This PR fixes a critical iterator invalidation bug in the worker retirement logic introduced in a previous commit. The original code was removing workers from a HashMap while iterating over the same collection, which could cause undefined behavior.

Key Changes:

Refactored internal_release_idle_workers to collect workers into a separate Vec before removal
Workers are now removed from the HashMap after the iteration is complete, eliminating the mutation-during-iteration hazard
The fix ensures that the selected iterator is consumed before any mutations to state.ray_workers
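
The collect-then-remove structure described above can be illustrated with a small Python analog; the PR's code is Rust, and `release_idle_workers` here is a hypothetical helper, not the actual `internal_release_idle_workers`. In Python, adding or removing keys while iterating over a dict raises `RuntimeError`, so the same two-phase shape applies.

```python
from typing import Callable, Dict, List, TypeVar

W = TypeVar("W")


def release_idle_workers(
    workers: Dict[str, W],
    is_idle: Callable[[W], bool],
    max_to_release: int,
) -> List[str]:
    # Phase 1: read-only pass that fully consumes the iterator and keeps only IDs.
    selected = [wid for wid, w in workers.items() if is_idle(w)][:max_to_release]
    # Phase 2: mutate the map only after iteration has finished.
    for wid in selected:
        workers.pop(wid)
    return selected
```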

Issue Found:

Lines 348-349 contain Chinese comments that should be translated to English for consistency

Confidence Score: 4/5

This PR is safe to merge - it fixes a critical iterator invalidation bug
The fix properly addresses the iterator invalidation issue by collecting worker IDs before removal. The logic is sound and prevents the HashMap from being mutated during iteration. Minor style issue with Chinese comments doesn't affect correctness.
No files require special attention - the fix is straightforward and correct

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/daft-distributed/src/python/ray/worker_manager.rs | 4/5 | Fixes iterator invalidation bug by collecting workers before removal; adds Chinese comments that should be in English |

Sequence Diagram

sequenceDiagram
    participant Scheduler as Scheduler Actor
    participant WM as RayWorkerManager
    participant State as WorkerManagerState
    participant Worker as RaySwordfishWorker
    participant Ray as Ray/Flotilla

    Note over Scheduler,Ray: Downscale Flow (idle worker retirement)
    
    Scheduler->>WM: retire_idle_workers(max_to_retire)
    WM->>State: Lock state mutex
    WM->>State: Iterate workers to find idle candidates
    State-->>WM: Return idle workers list
    
    Note over WM: Collect worker IDs (no removal yet)
    WM->>WM: Sort by idle duration (longest first)
    WM->>WM: Take up to max_to_retire workers
    
    Note over WM,State: Safe removal phase
    loop For each selected worker
        WM->>State: remove worker from HashMap
        State-->>WM: Return worker object
        WM->>WM: Add to workers_to_release Vec
    end
    
    Note over WM: Release phase (after iteration)
    loop For each worker in workers_to_release
        WM->>Worker: release(py)
        Worker->>Worker: Check no active tasks
        Worker->>Worker: Set state to Releasing
        Worker->>Ray: shutdown()
        Worker->>Worker: Set state to Released
    end
    
    WM->>Ray: clear_autoscaling_requests()
    WM->>State: Get remaining worker IDs
    WM->>Ray: sweep_force_release_swordfish_actors(exclude_ids)
    WM-->>Scheduler: Return number released
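
The release phase in the diagram reads like a small state machine, sketched below in Python under stated assumptions. `ActorStateSketch`, `WorkerHandleSketch`, the `READY` state name, and the injected `shutdown_fn` are hypothetical; only the Releasing and Released states and the "no active tasks" check come from the diagram.

```python
from enum import Enum, auto
from typing import Callable


class ActorStateSketch(Enum):
    READY = auto()      # assumed name for the normal serving state
    RELEASING = auto()
    RELEASED = auto()


class WorkerHandleSketch:
    def __init__(self, shutdown_fn: Callable[[], None]) -> None:
        self.state = ActorStateSketch.READY
        self.active_tasks = 0
        self._shutdown_fn = shutdown_fn  # stands in for killing the underlying actor

    def release(self) -> bool:
        # Refuse to release a busy worker or one that is already being released.
        if self.state is not ActorStateSketch.READY or self.active_tasks > 0:
            return False
        self.state = ActorStateSketch.RELEASING
        self._shutdown_fn()
        self.state = ActorStateSketch.RELEASED
        return True
```

Making `release` a no-op on the second call is one way to keep a repeated downscale pass from shutting down the same actor twice.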

1 file reviewed, 1 comment

greptile-apps · 2025-11-11T03:55:01Z

Greptile Overview

Greptile Summary

This PR implements downscaling support for Daft's Ray-based autoscaling by enabling proactive idle worker retirement. The implementation adds environment-driven configuration for downscaling thresholds and survivor worker counts, tracks worker idle states with timestamps, and implements a blacklist mechanism to prevent immediate worker respawn during scale operations.

Key changes:

Added idle worker retirement via retire_idle_ray_workers in the WorkerManager trait
Implemented ratio-based downscaling in the scheduler loop that respects minimum survivor worker counts
Added worker state tracking (ActorState, last_task_finished_at) for idle detection
Introduced pending release blacklist to prevent workers from being immediately reprovisioned
Added clear_autoscaling_requests helper to reset Ray's autoscaler demand
Extended both DefaultScheduler and LinearScheduler with get_backlog_resource_requests API

Issues found:

Inline imports in clear_autoscaling_requests violate import placement style guide
Documentation has typo (singular "RaySwordfishActor" should be plural)
Missing DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS in environment variables documentation section

Confidence Score: 4/5

This PR is safe to merge with minor style issues
The implementation is solid with proper state tracking, blacklist mechanisms, and test coverage. Previous thread concerns about race conditions have been addressed. The main issues are style-related (inline imports, documentation typos) rather than functional bugs. Worker retirement logic correctly checks idle thresholds and respects minimum survivors.
No files require special attention - style issues are minor and easily addressed

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| daft/runners/flotilla.py | 5/5 | Added empty bundles check in `try_autoscale` and new `clear_autoscaling_requests` helper function |
| src/daft-distributed/src/python/ray/worker.rs | 5/5 | Added actor state tracking, idle detection, and worker release mechanism with proper state transitions |
| src/daft-distributed/src/python/ray/worker_manager.rs | 4/5 | Implemented idle worker retirement logic with blacklist tracking and bootstrap handling, plus empty bundle handling |
| src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs | 4/5 | Integrated downscaling logic with env var configuration, ratio-based and final idle worker retirement |

Sequence Diagram

sequenceDiagram
    participant Scheduler as SchedulerActor
    participant WM as WorkerManager
    participant Worker as RaySwordfishWorker
    participant Ray as Ray Autoscaler

    Note over Scheduler: Main Loop Iteration
    
    Scheduler->>WM: worker_snapshots()
    WM-->>Scheduler: Current worker states
    
    Scheduler->>Scheduler: schedule_tasks()
    
    alt Scale-up needed
        Scheduler->>Scheduler: get_autoscaling_request()
        Scheduler->>WM: try_autoscale(bundles)
        WM->>WM: Check capacity vs demand
        alt Need more resources
            WM->>WM: Clear blacklist
            WM->>Ray: request_resources(bundles)
            Ray-->>WM: Scale-up triggered
        end
    else No scale-up & downscale enabled
        Scheduler->>Scheduler: Count idle workers
        alt Workers > min_survivor
            Scheduler->>WM: retire_idle_ray_workers(num_to_retire, false)
            WM->>WM: Find idle workers > threshold
            loop For each selected worker
                WM->>Worker: release(py)
                Worker->>Worker: shutdown()
                WM->>WM: Add to blacklist
            end
            WM->>Ray: clear_autoscaling_requests()
        end
    end
    
    Note over Scheduler: Job Complete
    
    alt Downscale enabled
        Scheduler->>WM: retire_idle_ray_workers(all, true)
        WM->>Worker: release() all idle
        WM->>Ray: clear_autoscaling_requests()
    end

greptile-apps

8 files reviewed, 1 comment

greptile-apps

8 files reviewed, no comments

greptile-apps

9 files reviewed, no comments

chatgpt-codex-connector · 2025-11-26T05:01:08Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

greptile-apps

9 files reviewed, 3 comments

huleilei · 2025-11-26T13:29:17Z

@stayrascal help me review. Thanks.

huleilei · 2025-12-31T12:04:52Z

see #5903

github-actions Bot added the fix label Nov 9, 2025

huleilei marked this pull request as draft November 9, 2025 14:33

greptile-apps Bot reviewed Nov 9, 2025

huleilei marked this pull request as ready for review November 10, 2025 08:33

greptile-apps Bot reviewed Nov 10, 2025

Comment thread src/daft-distributed/src/python/ray/worker_manager.rs Outdated

huleilei marked this pull request as draft November 10, 2025 08:39

huleilei force-pushed the hll_gitlab/auto_scaler branch from 481d263 to 012ee13 Compare November 11, 2025 03:49

huleilei marked this pull request as ready for review November 11, 2025 03:50

huleilei marked this pull request as draft November 11, 2025 03:52

greptile-apps Bot reviewed Nov 11, 2025

Comment thread src/daft-distributed/src/python/ray/worker_manager.rs Outdated

huleilei force-pushed the hll_gitlab/auto_scaler branch from bbeed26 to bba4dc6 Compare November 14, 2025 09:18

huleilei marked this pull request as ready for review November 14, 2025 12:04

greptile-apps Bot reviewed Nov 14, 2025

huleilei marked this pull request as draft November 14, 2025 12:12

huleilei force-pushed the hll_gitlab/auto_scaler branch from acea710 to 908c963 Compare November 20, 2025 03:57

add downscale support via idle worker retirement

9b73170

huleilei force-pushed the hll_gitlab/auto_scaler branch from 908c963 to 9b73170 Compare November 20, 2025 06:20

huleilei changed the title ~~fix(autoscaling): add downscale support via idle worker retirement in…~~ feat: add downscale support via idle worker retirement in flotilla mode Nov 20, 2025

github-actions Bot added feat and removed fix labels Nov 20, 2025

Merge branch 'main' into hll_gitlab/auto_scaler

9eba5b8

huleilei marked this pull request as ready for review November 20, 2025 08:28

greptile-apps Bot reviewed Nov 20, 2025

Merge branch 'main' into hll_gitlab/auto_scaler

8544984

huleilei marked this pull request as draft November 21, 2025 03:39

Merge branch 'main' into hll_gitlab/auto_scaler

496e034

huleilei mentioned this pull request Nov 25, 2025

feat：Support for dynamic scaling of resources #5683

Open

huleilei added 2 commits November 25, 2025 22:55

delete no need code

a514c9d

change test case

728f60f

huleilei marked this pull request as ready for review November 26, 2025 05:01

greptile-apps Bot reviewed Nov 26, 2025

Comment thread daft/runners/flotilla.py

Comment thread docs/distributed/ray.md Outdated

Comment thread docs/distributed/ray.md

huleilei and others added 3 commits November 26, 2025 14:00

Update daft/runners/flotilla.py

a2ced04

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Update docs/distributed/ray.md

afe35a8

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Revert "Update daft/runners/flotilla.py"

5226bfa

This reverts commit a2ced04.

huleilei marked this pull request as draft November 28, 2025 08:01

huleilei changed the title ~~feat: add downscale support via idle worker retirement in flotilla mode~~ WIP： feat: add downscale support via idle worker retirement in flotilla mode Nov 28, 2025

huleilei closed this Dec 31, 2025
