[Core, Train] Cluster does not downscale despite no jobs running, possibly due to PlacementGroupCleaner #61689

@jleben

Description

What happened + What you expected to happen

Recently (with Ray 2.54.0) I have observed that nodes in autoscaling worker groups are still considered Active and are not deleted, even though no job is running.

This happened after running a Ray Train job. I noticed the problematic nodes still had active PlacementGroupCleaner actors on them (sometimes multiple such actors on a single node). I know the placement group cleaner was added recently, so I wonder if that's the main culprit.
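
For anyone trying to confirm the same symptom, here is a minimal sketch of how the leftover actors could be listed per node. It assumes the Ray state API (`ray.util.state.list_actors`) is reachable from the machine running the script and that the actors report the class name `PlacementGroupCleaner`. The helper names `group_actors_by_node` and `list_alive_cleaners` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def group_actors_by_node(actors):
    """Map node_id -> list of actor_ids for the given actor records.

    `actors` is an iterable of dicts with "actor_id" and "node_id" keys.
    """
    by_node = defaultdict(list)
    for actor in actors:
        by_node[actor["node_id"]].append(actor["actor_id"])
    return dict(by_node)


def list_alive_cleaners(class_name="PlacementGroupCleaner"):
    """Query the cluster for ALIVE actors of the given class.

    Assumption: the class name shown in the dashboard matches `class_name`.
    Ray is imported lazily so the grouping helper works without Ray installed.
    """
    from ray.util.state import list_actors

    states = list_actors(
        filters=[("class_name", "=", class_name), ("state", "=", "ALIVE")],
        limit=1000,
    )
    return [{"actor_id": s.actor_id, "node_id": s.node_id} for s in states]


if __name__ == "__main__":
    import ray

    # Connect to the already running cluster (for example from the head pod).
    ray.init(address="auto")
    for node_id, actor_ids in group_actors_by_node(list_alive_cleaners()).items():
        print(f"{node_id}: {len(actor_ids)} cleaner actor(s) -> {actor_ids}")
```

A node that shows up in this output while no job is running would match what I describe above. The script only reports the actors; it does not show whether they are what keeps the node from being scaled down.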

This is a significant issue for me, because it wastes money on nodes that are doing no work.

Versions / Dependencies

Ray 2.54.0
KubeRay 1.5.0

Reproduction script

Not reliably reproducible. Try running some Ray Train jobs that scale up the cluster by using a placement group.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Assignees

No one assigned

Labels

bug: Something that is supposed to be working; but isn't
community-backlog
core: Issues that should be addressed in Ray Core
performance
stability
train: Ray Train Related Issue
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
