WIP: feat: add downscale support via idle worker retirement in flotilla mode #5516

Closed
huleilei wants to merge 9 commits into Eventual-Inc:main from huleilei:hll_gitlab/auto_scaler

@huleilei huleilei commented Nov 9, 2025

Collaborator

Root cause

  • Autoscaling triggers only scale-up, and only when the backlog ratio exceeds the threshold; RayWorkerManager also enforces a monotonic upper bound (a resource request must exceed the historical maximum).
  • Daft uses ray.autoscaler.sdk.request_resources, which only requests additional capacity. Nothing proactively retires idle Swordfish workers, so Ray never sees nodes idle enough to downscale.
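
The effect of a monotonic upper bound can be sketched in a few lines of Python. This is only an illustration: `MonotonicRequester` and its `forward` callback are hypothetical names, not Daft's `RayWorkerManager`, and comparing a single CPU count is an assumption made to keep the example small.

```python
from typing import Callable, List


class MonotonicRequester:
    """Forwards a resource request only when it exceeds every earlier request."""

    def __init__(self, forward: Callable[[int], None]) -> None:
        # `forward` is a hypothetical stand-in for the call into Ray's autoscaler.
        self._forward = forward
        self._max_requested_cpus = 0

    def try_autoscale(self, requested_cpus: int) -> bool:
        # Requests at or below the historical maximum are dropped, so the
        # demand that the autoscaler sees can only grow.
        if requested_cpus <= self._max_requested_cpus:
            return False
        self._max_requested_cpus = requested_cpus
        self._forward(requested_cpus)
        return True


forwarded: List[int] = []
requester = MonotonicRequester(forwarded.append)
results = [requester.try_autoscale(n) for n in (8, 16, 4, 16, 32)]
```

In this sketch the smaller request for 4 CPUs is never forwarded, which is why a separate retirement path is needed before any node can look idle to Ray.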

What’s changed

  • Scheduler loop (scheduler_actor.rs):
    • New env configs:
      • DAFT_AUTOSCALING_DOWNSCALE_ENABLED (default: true)
      • DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS (default: 1)
  • WorkerManager trait: added retire_idle_workers(max_to_retire) to enable safe downscale.
  • RayWorkerManager:
    • Implemented retire_idle_workers: selects idle workers (no active tasks) whose idle duration exceeds DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS (default: 120s), and shuts down corresponding RaySwordfishActors.
    • Kept existing scale-up logic via request_resources with monotonic upper bound.
  • RaySwordfishWorker:
    • Added last_task_finished_at and idle detection helpers (is_idle, idle_duration(now)). Updated mark_task_finished to record last activity.
    • Preserved shutdown() binding to kill the underlying actor.
  • Scheduler (Default/Linear):
    • Added get_backlog_resource_requests API to expose backlog demand to the scheduler actor for downscale decision.
  • Flotilla runner: documented that try_autoscale proxies Ray’s scale-up-only API; downscale is driven by proactive worker retirement.
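
The idle-detection helpers listed for RaySwordfishWorker might look roughly like the following Python sketch. The real helpers live in Rust; the class name `WorkerIdleTracker`, the `mark_task_started` method, and the choice to treat a never-used worker as idle since creation are assumptions made for illustration.

```python
class WorkerIdleTracker:
    """Hypothetical Python mirror of the idle helpers described above."""

    def __init__(self, created_at: float) -> None:
        self.active_tasks = 0
        # Assumption: a worker that never ran a task counts as idle since creation.
        self.last_task_finished_at = created_at

    def mark_task_started(self) -> None:
        self.active_tasks += 1

    def mark_task_finished(self, now: float) -> None:
        # Record the last activity so idle time is measured from here.
        self.active_tasks -= 1
        self.last_task_finished_at = now

    def is_idle(self) -> bool:
        return self.active_tasks == 0

    def idle_duration(self, now: float) -> float:
        # A busy worker reports zero idle time.
        return now - self.last_task_finished_at if self.is_idle() else 0.0


worker = WorkerIdleTracker(created_at=0.0)
worker.mark_task_started()
busy_idle = worker.idle_duration(now=50.0)
worker.mark_task_finished(now=60.0)
idle_after = worker.idle_duration(now=200.0)
retirable = idle_after > 120.0  # 120 s is the default idle threshold named above
```

Passing `now` in explicitly, as the `idle_duration(now)` signature in the description suggests, keeps the helper deterministic and easy to test.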

Changes Made

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly

@github-actions github-actions Bot added the fix label Nov 9, 2025
@huleilei huleilei marked this pull request as draft November 9, 2025 14:33

@greptile-apps greptile-apps Bot left a comment

Contributor

Greptile Overview

Greptile Summary

This PR implements bidirectional autoscaling for Daft's flotilla mode by adding downscale support via idle worker retirement. Previously, the system could only scale up; now it can also scale down when utilization is low.

Key changes:

  • Scheduler actor (scheduler_actor.rs): Added ratio-based downscale logic that monitors backlog/capacity ratio and retires idle workers when ratio stays below 0.75 for 10+ ticks. Includes bootstrap expansion for zero-capacity clusters and global idle detection for full cleanup.

  • Worker manager (worker_manager.rs): Implemented retire_idle_workers() and release_idle_actors() methods with idle duration tracking (default 120s threshold). Scale-up maintains monotonic upper bound to prevent Ray autoscaler conflicts.

  • Worker tracking (worker.rs): Added last_task_finished_at, is_idle(), and idle_duration() to track worker idle state for safe retirement decisions.

  • Python integration (flotilla.py): Added actor lifecycle management functions: clear_autoscaling_requests(), sweep_force_release_swordfish_actors(), and force_release_swordfish_actor() to properly clean up Ray actors.

  • Configuration: 6 new environment variables for tuning downscale behavior (thresholds, stability windows, limits).
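
As a rough illustration of environment-driven configuration, the Python sketch below reads the three variables that the PR description names, with the defaults stated there. The summary does not list the remaining variables, so they are left out. `DownscaleConfig`, `load_downscale_config`, and the set of strings accepted as "true" are assumptions, not Daft's actual parsing code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DownscaleConfig:
    enabled: bool = True              # DAFT_AUTOSCALING_DOWNSCALE_ENABLED
    min_survivor_workers: int = 1     # DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS
    idle_seconds: float = 120.0       # DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS


def _parse_bool(raw: str) -> bool:
    # Assumption: which spellings count as "true" is a guess for this sketch.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_downscale_config(env: Mapping[str, str] = os.environ) -> DownscaleConfig:
    defaults = DownscaleConfig()
    raw_enabled = env.get("DAFT_AUTOSCALING_DOWNSCALE_ENABLED")
    return DownscaleConfig(
        enabled=defaults.enabled if raw_enabled is None else _parse_bool(raw_enabled),
        min_survivor_workers=int(
            env.get("DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS", defaults.min_survivor_workers)
        ),
        idle_seconds=float(
            env.get("DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS", defaults.idle_seconds)
        ),
    )


default_config = load_downscale_config({})
tuned_config = load_downscale_config({
    "DAFT_AUTOSCALING_DOWNSCALE_ENABLED": "false",
    "DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS": "2",
    "DAFT_AUTOSCALING_DOWNSCALE_IDLE_SECONDS": "30",
})
```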

How it works:

  1. Scale-up continues via existing ratio-based autoscaling (ratio > 1.25)
  2. Ratio-based downscale: When backlog/capacity < 0.75 for 10 ticks, retire up to 10% of idle workers
  3. Global idle: When backlog=0 and no inflight tasks, release all idle workers
  4. Finalize: Clear all idle actors and reset Ray demand on job completion
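
The first three steps above can be sketched as a single per-tick decision function in Python. The thresholds (1.25, 0.75, 10 ticks, 10%) come from the summary; everything else is an assumption for illustration, including the names `TickState` and `decide`, the order in which the branches are checked, resetting the counter when the ratio returns to the normal band, and rounding the 10% down while retiring at least one worker.

```python
from dataclasses import dataclass
from typing import Tuple

SCALE_UP_RATIO = 1.25
DOWNSCALE_RATIO = 0.75
STABLE_TICKS = 10
RETIRE_PERCENT = 10


@dataclass
class TickState:
    below_threshold_ticks: int = 0


def decide(state: TickState, backlog: int, capacity: int, inflight: int,
           idle_workers: int, total_workers: int, min_survivors: int = 1) -> Tuple[str, int]:
    """Return one action per tick: scale_up, retire, release_all, or noop."""
    if backlog == 0 and inflight == 0:
        # Global idle: nothing queued and nothing running.
        state.below_threshold_ticks = 0
        return ("release_all", idle_workers)
    if capacity == 0:
        # Bootstrap: no capacity to divide by, so any backlog asks for more.
        state.below_threshold_ticks = 0
        return ("scale_up", 0) if backlog > 0 else ("noop", 0)
    ratio = backlog / capacity
    if ratio > SCALE_UP_RATIO:
        state.below_threshold_ticks = 0
        return ("scale_up", 0)
    if ratio >= DOWNSCALE_RATIO:
        state.below_threshold_ticks = 0  # assumption: the normal band resets the window
        return ("noop", 0)
    state.below_threshold_ticks += 1
    if state.below_threshold_ticks < STABLE_TICKS:
        return ("noop", 0)
    state.below_threshold_ticks = 0
    removable = max(total_workers - min_survivors, 0)
    count = min(max(1, idle_workers * RETIRE_PERCENT // 100), idle_workers, removable)
    return ("retire", count) if count > 0 else ("noop", 0)


state = TickState()
actions = [decide(state, backlog=2, capacity=10, inflight=3,
                  idle_workers=20, total_workers=30) for _ in range(10)]
burst = decide(state, backlog=20, capacity=10, inflight=3, idle_workers=0, total_workers=30)
drained = decide(state, backlog=0, capacity=10, inflight=0, idle_workers=20, total_workers=30)
```

Returning exactly one action per tick is a deliberate choice in this sketch: the ratio-based and global-idle branches cannot both release workers in the same tick.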

The implementation includes comprehensive tests covering bootstrap expansion, ratio-based scaling, and downscale stabilization.

Confidence Score: 3/5

  • This PR introduces complex autoscaling logic with potential race conditions between downscale paths
  • Score reflects two critical logical issues: (1) potential race between ratio-based and global idle downscale branches that could attempt to release workers twice in the same tick, and (2) stale worker snapshot state after ratio-based downscale could cause incorrect global idle detection. The core implementation is sound with good test coverage, but these edge cases need resolution before merge.
  • Pay close attention to scheduler_actor.rs lines 290-370 where the downscale logic has potential race conditions between the ratio-based and global idle branches

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs | 3/5 | Added downscale logic with environment-based configuration, stability windows, and worker retirement. Includes bootstrap expansion debounce and global idle detection. Logic appears sound but there are potential race conditions between ratio-based and global idle branches. |
| src/daft-distributed/src/python/ray/worker_manager.rs | 4/5 | Implements idle worker retirement with configurable thresholds, monotonic upper bound for scale-up, and proper cleanup via sweep_force_release_swordfish_actors. The release_idle_actors and retire_idle_workers methods provide controlled downscaling. |
| daft/runners/flotilla.py | 4/5 | Added clear_autoscaling_requests(), list_swordfish_actors(), force_release_swordfish_actor(), and sweep_force_release_swordfish_actors() functions for actor lifecycle management. Uses deprecated ray.state.actors() with warning. Retry logic present for robustness. |

Sequence Diagram

sequenceDiagram
    participant Scheduler as Scheduler Actor
    participant WM as Worker Manager
    participant Ray as Ray Autoscaler
    participant Workers as Ray Workers
    
    Note over Scheduler: Every tick (1s interval)
    
    Scheduler->>WM: worker_snapshots()
    WM-->>Scheduler: Current worker state
    
    Scheduler->>Scheduler: Calculate backlog/capacity ratio
    
    alt Bootstrap: Zero capacity + backlog > 0
        Scheduler->>WM: try_autoscale(backlog_requests)
        WM->>Ray: request_resources(bundles)
        Ray-->>Workers: Provision new nodes
    end
    
    alt Scale-up: ratio > autoscaling_threshold
        Scheduler->>WM: try_autoscale(pending_tasks)
        WM->>Ray: request_resources(bundles)
        Note over Scheduler: Reset downscale_ticks to 0
    end
    
    alt Ratio-based downscale: ratio < 0.75 for 10+ ticks
        Scheduler->>Scheduler: Increment downscale_below_threshold_ticks
        alt Stable & num_workers > min_survivors
            Scheduler->>WM: release_idle_actors(num_to_retire, false)
            WM->>WM: Select idle workers (idle > 120s)
            WM->>Workers: shutdown() idle actors
            WM->>Ray: sweep_force_release_swordfish_actors()
            Note over Scheduler: Reset downscale_ticks to 0
        end
    end
    
    alt Global idle: backlog=0 & no inflight tasks
        Scheduler->>WM: release_idle_actors(all, true)
        WM->>Workers: shutdown() all idle actors
        Scheduler->>WM: try_autoscale([])
        WM->>Ray: request_resources([]) to clear demand
        Note over Scheduler: Reset downscale_ticks to 0
    end
    
    Note over Scheduler: Job completion
    Scheduler->>WM: release_idle_actors(all, true)
    Scheduler->>WM: try_autoscale([])
    WM->>Ray: Clear resource requests

9 files reviewed, 4 comments

Comment thread src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs Outdated
Comment thread src/daft-distributed/src/python/ray/worker_manager.rs Outdated
Comment thread daft/runners/flotilla.py Outdated
Comment thread src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs Outdated
@huleilei huleilei marked this pull request as ready for review November 10, 2025 08:33

@greptile-apps greptile-apps Bot left a comment

Contributor

Greptile Overview

Greptile Summary

This PR fixes a critical iterator invalidation bug in the worker retirement logic introduced in a previous commit. The original code was removing workers from a HashMap while iterating over the same collection, which could cause undefined behavior.

Key Changes:

  • Refactored internal_release_idle_workers to collect workers into a separate Vec before removal
  • Workers are now removed from the HashMap after the iteration is complete, so the map is never mutated mid-iteration
  • The fix ensures that the selected iterator is consumed before any mutations to state.ray_workers
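
The collect-then-remove pattern described above can be illustrated with a Python analog. The actual fix is in Rust; here a plain `dict` stands in for `state.ray_workers`, and the function name, the id-to-timestamp layout, and the longest-idle-first ordering (taken from the sequence diagram) are simplifications for this sketch. In Python, deleting keys while iterating the same dict raises `RuntimeError`, so the same two-phase structure applies.

```python
from typing import Dict, List


def retire_idle_workers(workers: Dict[str, float], now: float,
                        idle_threshold: float, max_to_retire: int) -> List[str]:
    """`workers` maps worker id -> time its last task finished (idle workers only)."""
    # Phase 1: read-only pass that collects candidates into a separate list.
    candidates = [
        (now - finished_at, worker_id)
        for worker_id, finished_at in workers.items()
        if now - finished_at > idle_threshold
    ]
    # Longest-idle first, then cap at max_to_retire.
    candidates.sort(reverse=True)
    selected = [worker_id for _, worker_id in candidates[:max_to_retire]]
    # Phase 2: mutate the map only after iteration has finished.
    for worker_id in selected:
        del workers[worker_id]
    return selected


workers = {"w1": 0.0, "w2": 100.0, "w3": 190.0, "w4": 50.0}
retired = retire_idle_workers(workers, now=300.0, idle_threshold=120.0, max_to_retire=2)
remaining = sorted(workers)
```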

Issue Found:

  • Lines 348-349 contain Chinese comments that should be translated to English for consistency

Confidence Score: 4/5

  • This PR is safe to merge - it fixes a critical iterator invalidation bug
  • The fix properly addresses the iterator invalidation issue by collecting worker IDs before removal. The logic is sound and prevents the HashMap from being mutated during iteration. Minor style issue with Chinese comments doesn't affect correctness.
  • No files require special attention - the fix is straightforward and correct

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| src/daft-distributed/src/python/ray/worker_manager.rs | 4/5 | Fixes iterator invalidation bug by collecting workers before removal; adds Chinese comments that should be in English |

Sequence Diagram

sequenceDiagram
    participant Scheduler as Scheduler Actor
    participant WM as RayWorkerManager
    participant State as WorkerManagerState
    participant Worker as RaySwordfishWorker
    participant Ray as Ray/Flotilla

    Note over Scheduler,Ray: Downscale Flow (idle worker retirement)
    
    Scheduler->>WM: retire_idle_workers(max_to_retire)
    WM->>State: Lock state mutex
    WM->>State: Iterate workers to find idle candidates
    State-->>WM: Return idle workers list
    
    Note over WM: Collect worker IDs (no removal yet)
    WM->>WM: Sort by idle duration (longest first)
    WM->>WM: Take up to max_to_retire workers
    
    Note over WM,State: Safe removal phase
    loop For each selected worker
        WM->>State: remove worker from HashMap
        State-->>WM: Return worker object
        WM->>WM: Add to workers_to_release Vec
    end
    
    Note over WM: Release phase (after iteration)
    loop For each worker in workers_to_release
        WM->>Worker: release(py)
        Worker->>Worker: Check no active tasks
        Worker->>Worker: Set state to Releasing
        Worker->>Ray: shutdown()
        Worker->>Worker: Set state to Released
    end
    
    WM->>Ray: clear_autoscaling_requests()
    WM->>State: Get remaining worker IDs
    WM->>Ray: sweep_force_release_swordfish_actors(exclude_ids)
    WM-->>Scheduler: Return number released

1 file reviewed, 1 comment

Comment thread src/daft-distributed/src/python/ray/worker_manager.rs Outdated
@huleilei huleilei marked this pull request as draft November 10, 2025 08:39
@huleilei huleilei force-pushed the hll_gitlab/auto_scaler branch from 481d263 to 012ee13 Compare November 11, 2025 03:49
@huleilei huleilei marked this pull request as ready for review November 11, 2025 03:50
@huleilei huleilei marked this pull request as draft November 11, 2025 03:52
greptile-apps Bot commented Nov 11, 2025

Contributor

Greptile Overview

Greptile Summary

This PR implements downscaling support for Daft's Ray-based autoscaling by enabling proactive idle worker retirement. The implementation adds environment-driven configuration for downscaling thresholds and survivor worker counts, tracks worker idle states with timestamps, and implements a blacklist mechanism to prevent immediate worker respawn during scale operations.

Key changes:

  • Added idle worker retirement via retire_idle_ray_workers in the WorkerManager trait
  • Implemented ratio-based downscaling in the scheduler loop that respects minimum survivor worker counts
  • Added worker state tracking (ActorState, last_task_finished_at) for idle detection
  • Introduced pending release blacklist to prevent workers from being immediately reprovisioned
  • Added clear_autoscaling_requests helper to reset Ray's autoscaler demand
  • Extended both DefaultScheduler and LinearScheduler with get_backlog_resource_requests API
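
One way a pending-release blacklist could work is sketched below in Python. The summary does not say how the blacklist is keyed or where it is consulted, so this is a guess: the class `PendingReleaseBlacklist`, keying by worker id, and filtering during worker discovery are all assumptions. Clearing the blacklist on scale-up follows the sequence diagram in this comment.

```python
from typing import Iterable, List, Set


class PendingReleaseBlacklist:
    """Hypothetical record of worker ids that were just released."""

    def __init__(self) -> None:
        self._pending: Set[str] = set()

    def mark_released(self, worker_id: str) -> None:
        # Called after a worker is shut down during downscale.
        self._pending.add(worker_id)

    def clear(self) -> None:
        # Called when a scale-up request is issued, so every node is usable again.
        self._pending.clear()

    def filter_discovered(self, discovered: Iterable[str]) -> List[str]:
        # Skip ids that were just released so they are not started again at once.
        return [worker_id for worker_id in discovered if worker_id not in self._pending]


blacklist = PendingReleaseBlacklist()
blacklist.mark_released("node-b")
during_downscale = blacklist.filter_discovered(["node-a", "node-b"])
blacklist.clear()
after_scale_up = blacklist.filter_discovered(["node-a", "node-b"])
```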

Issues found:

  • Inline imports in clear_autoscaling_requests violate import placement style guide
  • Documentation has typo (singular "RaySwordfishActor" should be plural)
  • Missing DAFT_AUTOSCALING_MIN_SURVIVOR_WORKERS in environment variables documentation section

Confidence Score: 4/5

  • This PR is safe to merge with minor style issues
  • The implementation is solid with proper state tracking, blacklist mechanisms, and test coverage. Previous thread concerns about race conditions have been addressed. The main issues are style-related (inline imports, documentation typos) rather than functional bugs. Worker retirement logic correctly checks idle thresholds and respects minimum survivors.
  • No files require special attention - style issues are minor and easily addressed

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| daft/runners/flotilla.py | 5/5 | Added empty bundles check in try_autoscale and new clear_autoscaling_requests helper function |
| src/daft-distributed/src/python/ray/worker.rs | 5/5 | Added actor state tracking, idle detection, and worker release mechanism with proper state transitions |
| src/daft-distributed/src/python/ray/worker_manager.rs | 4/5 | Implemented idle worker retirement logic with blacklist tracking and bootstrap handling, plus empty bundle handling |
| src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs | 4/5 | Integrated downscaling logic with env var configuration, ratio-based and final idle worker retirement |

Sequence Diagram

sequenceDiagram
    participant Scheduler as SchedulerActor
    participant WM as WorkerManager
    participant Worker as RaySwordfishWorker
    participant Ray as Ray Autoscaler

    Note over Scheduler: Main Loop Iteration
    
    Scheduler->>WM: worker_snapshots()
    WM-->>Scheduler: Current worker states
    
    Scheduler->>Scheduler: schedule_tasks()
    
    alt Scale-up needed
        Scheduler->>Scheduler: get_autoscaling_request()
        Scheduler->>WM: try_autoscale(bundles)
        WM->>WM: Check capacity vs demand
        alt Need more resources
            WM->>WM: Clear blacklist
            WM->>Ray: request_resources(bundles)
            Ray-->>WM: Scale-up triggered
        end
    else No scale-up & downscale enabled
        Scheduler->>Scheduler: Count idle workers
        alt Workers > min_survivor
            Scheduler->>WM: retire_idle_ray_workers(num_to_retire, false)
            WM->>WM: Find idle workers > threshold
            loop For each selected worker
                WM->>Worker: release(py)
                Worker->>Worker: shutdown()
                WM->>WM: Add to blacklist
            end
            WM->>Ray: clear_autoscaling_requests()
        end
    end
    
    Note over Scheduler: Job Complete
    
    alt Downscale enabled
        Scheduler->>WM: retire_idle_ray_workers(all, true)
        WM->>Worker: release() all idle
        WM->>Ray: clear_autoscaling_requests()
    end

@greptile-apps greptile-apps Bot left a comment

Contributor

8 files reviewed, 1 comment

Comment thread src/daft-distributed/src/python/ray/worker_manager.rs Outdated
@huleilei huleilei force-pushed the hll_gitlab/auto_scaler branch from bbeed26 to bba4dc6 Compare November 14, 2025 09:18
@huleilei huleilei marked this pull request as ready for review November 14, 2025 12:04

@greptile-apps greptile-apps Bot left a comment

Contributor

8 files reviewed, no comments

@huleilei huleilei marked this pull request as draft November 14, 2025 12:12
@huleilei huleilei force-pushed the hll_gitlab/auto_scaler branch from acea710 to 908c963 Compare November 20, 2025 03:57
@huleilei huleilei force-pushed the hll_gitlab/auto_scaler branch from 908c963 to 9b73170 Compare November 20, 2025 06:20
@huleilei huleilei changed the title fix(autoscaling): add downscale support via idle worker retirement in… feat: add downscale support via idle worker retirement in flotilla mode Nov 20, 2025
@github-actions github-actions Bot added feat and removed fix labels Nov 20, 2025
@huleilei huleilei marked this pull request as ready for review November 20, 2025 08:28

@greptile-apps greptile-apps Bot left a comment

Contributor

9 files reviewed, no comments

@huleilei huleilei marked this pull request as draft November 21, 2025 03:39
@huleilei huleilei marked this pull request as ready for review November 26, 2025 05:01
@chatgpt-codex-connector


Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@greptile-apps greptile-apps Bot left a comment

Contributor

9 files reviewed, 3 comments

Comment thread daft/runners/flotilla.py
Comment thread docs/distributed/ray.md Outdated
Comment thread docs/distributed/ray.md
huleilei and others added 3 commits November 26, 2025 14:00
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@huleilei

Collaborator Author

@stayrascal help me review. Thanks.

@huleilei huleilei marked this pull request as draft November 28, 2025 08:01
@huleilei huleilei changed the title feat: add downscale support via idle worker retirement in flotilla mode WIP: feat: add downscale support via idle worker retirement in flotilla mode Nov 28, 2025
@huleilei

Collaborator Author

see #5903

@huleilei huleilei closed this Dec 31, 2025