[Core] Add fallback strategy support to placement groups #59024

Open

ryanaoleary wants to merge 14 commits into ray-project:master from ryanaoleary:pg-fallback-strategy

Conversation

@ryanaoleary
Contributor

@ryanaoleary ryanaoleary commented Nov 27, 2025

Description

This PR adds a fallback_strategy argument to placement groups. For Ray placement groups, fallback_strategy is a list of scheduling option dicts, where each dict can specify bundles and (optionally) bundle_label_selector, to try in order until one can be scheduled. This PR also updates the GCS scheduler to consider these fallbacks when calling ScheduleUnplacedBundles. If a fallback strategy is chosen, the bundles in the placement group table data are updated to the selected value.

Edit: We've updated the proto fields so that bundles now contains only the actively chosen bundles for scheduling, and scheduling_strategy contains the list of all scheduling options (primary bundles and fallback bundles).
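The fallback semantics described above (try each option in order until one fits) can be sketched as follows. This is an illustrative stand-in for the GCS scheduler's logic in ScheduleUnplacedBundles, not Ray's actual implementation; the aggregate feasibility check and all resource values are assumptions.

```python
from typing import Dict, List


def select_scheduling_option(
    options: List[Dict], available: Dict[str, float]
) -> int:
    """Return the index of the first option whose bundles all fit in
    `available` (a simplified, aggregate view of cluster resources)."""
    for i, option in enumerate(options):
        needed: Dict[str, float] = {}
        for bundle in option["bundles"]:
            for resource, amount in bundle.items():
                needed[resource] = needed.get(resource, 0.0) + amount
        if all(available.get(r, 0.0) >= amt for r, amt in needed.items()):
            return i
    raise RuntimeError("No scheduling option is feasible.")


# Primary option first, then fallbacks, mirroring the PR's dict shape.
options = [
    {"bundles": [{"GPU": 1, "CPU": 4}] * 2},
    {"bundles": [{"CPU": 8}] * 2},
    {
        "bundles": [{"CPU": 4}] * 2,
        # bundle_label_selector is optional; the labels here are illustrative.
        "bundle_label_selector": [{"zone": "us-central1-a"}] * 2,
    },
]

# A CPU-only cluster cannot satisfy the GPU option, so fallback index 1 wins.
chosen = select_scheduling_option(options, {"CPU": 16})
```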

Related issues

#51564

@ryanaoleary ryanaoleary requested a review from a team as a code owner November 27, 2025 02:05
@ryanaoleary
Contributor Author

cc: @MengjinYan

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a fallback_strategy for placement groups, allowing for alternative scheduling options if the primary configuration is not feasible. The changes are well-implemented across the Python API, Cython layer, and C++ core, including updates to the GCS scheduler. The addition of comprehensive unit and integration tests ensures the new functionality is robust. The code is well-structured, and I've identified one minor opportunity for code simplification to enhance maintainability.

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Nov 27, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public Slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 11, 2025
@MengjinYan
Contributor

not stale

@MengjinYan MengjinYan self-assigned this Dec 11, 2025
@MengjinYan MengjinYan removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 11, 2025
Contributor

@MengjinYan MengjinYan left a comment


Some high level comments. Will take a more detailed look after those are resolved. Thanks!

Contributor

@MengjinYan MengjinYan left a comment


Thanks for the patience! Added one high level comment.

Comment on lines 65 to 66
Contributor


By the new semantics of the bundles field, I think the field should be empty in the beginning. And probably same with the strategy if we add it to the placement group scheduling options

Contributor Author


I tried adding this and it runs into a lot of issues with the Python client and the autoscaler, since they expect bundles to be populated for a pending placement group. To fix it we end up having to modify how bundles get cached and essentially return scheduling_strategy index 0 (the primary strategy) any time bundles would be read anyway.

The change I tried to implement ended up being pretty expansive; is this something we should try to add in a follow-up PR? Or do you know if I might be missing something and there's an easier way to support the new semantics of the field?

Contributor


Understood. I think it is a very involved change because the semantics change.

The current placement group management logic everywhere assumes that there is only one set of requirements per placement group, so the requirements and the state tracking both live in the bundles field.

With the fallback semantics, the user-facing API stays the same (bundles still specifies the highest-priority resource requirements), but we change that assumption in the GCS placement group management logic and split the original bundles field in PlacementGroupTableData into:

  • scheduling_strategy, which only contains the scheduling requirements for the bundles
  • and bundles, which only tracks the bundle id and the corresponding node id for the currently chosen option. Only bundle_id and node_id need to be populated, and we might need an additional field to indicate the index of the chosen set of requirements in scheduling_strategy

Looking into the code, I think at a high level the semantic change means:

  • The user-facing APIs should remain the same in terms of the bundles field and just add an additional fallback_options field:
    • semantics in user-facing APIs (placement_group.py) shouldn't change
    • semantics in PlacementGroupSpecification, BundleSpecification, etc. shouldn't change
  • On the GCS side, the semantics in GcsPlacementGroup should change to the new semantics. This means:
    (1) The operations related to the bundles field (GetBundles, GetUnplacedBundles, etc.) should assume these are the chosen resource requirements; cached_bundle_specs_ should reflect the chosen set
    (2) The placement group scheduling logic needs to consider all the resource requirements in scheduling_strategy when the whole placement group needs to be scheduled
    (3) The logic that syncs pending placement groups from GCS to the autoscaler needs to provide the full list of resource requirements

In terms of PRs, the current PR handles (2) and the related changes to the PlacementGroupTableData and PlacementGroupSpec data model. And since the current PR doesn't change the existing behavior of having one set of requirements in bundles, I think we can add the changes that cover (1) and (3) in separate PRs.

Let me know whether the above makes sense or I missed some aspects of handling the bundles semantic changes.

cc: @edoakes for visibility and in case you have additional comments.
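The proposed split can be sketched in plain Python. Class and field names below are assumed from this discussion (they are not Ray's actual data model), and chosen_option_index stands in for the "additional field" suggested above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SchedulingOption:
    # One candidate set of resource requirements, e.g. [{"CPU": 4}, {"CPU": 4}].
    bundles: List[Dict[str, float]]
    bundle_label_selector: Optional[List[Dict[str, str]]] = None


@dataclass
class PlacementGroupTableData:
    # Index 0 is the primary option; indices 1..N are fallbacks.
    scheduling_strategy: List[SchedulingOption]
    # Placement state only: (bundle_id, node_id) pairs for the chosen option.
    placed_bundles: List[Dict[str, str]] = field(default_factory=list)
    # Assumed extra field: which entry of scheduling_strategy was chosen.
    chosen_option_index: Optional[int] = None

    def get_bundles(self) -> List[Dict[str, float]]:
        # Interim behavior from this thread: while no option has been chosen,
        # report the highest-priority (primary) requirements.
        index = (
            self.chosen_option_index
            if self.chosen_option_index is not None
            else 0
        )
        return self.scheduling_strategy[index].bundles


data = PlacementGroupTableData(
    scheduling_strategy=[
        SchedulingOption(bundles=[{"GPU": 1}] * 2),
        SchedulingOption(bundles=[{"CPU": 8}] * 2),
    ]
)
```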

Contributor Author


Makes sense, thanks for outlining it; that made it a lot clearer to implement. I went through and, with af673db, I think I've implemented everything from (2) and part of (1). I changed bundles to remain empty until an active strategy has been selected, but updated GetBundles to return the highest-priority strategy when bundles is empty so that it still works with pending PGs. I think a follow-up PR would still be needed for this field to fully follow the new semantics and return an empty list before a strategy has been chosen (and also update all downstream consumers to check for an empty bundles list).

For the autoscaler, I updated GetPendingGangResourceRequests, since it reads from the proto, to use scheduling_strategy(0) for PENDING groups; a follow-up PR that addresses the autoscaling side can then have it consider the full list of fallback options when scheduling.
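A hedged sketch of that interim autoscaler behavior (function and field names are illustrative, not Ray's actual autoscaler API):

```python
from typing import Dict, List


def pending_gang_resource_requests(
    scheduling_strategy: List[Dict], state: str
) -> List[Dict[str, float]]:
    """Report the resource shape the autoscaler should provision for a
    pending placement group: only scheduling_strategy index 0 (the primary
    option) for now; a follow-up could consider all fallback options."""
    if state != "PENDING" or not scheduling_strategy:
        return []
    return scheduling_strategy[0]["bundles"]


requests = pending_gang_resource_requests(
    [{"bundles": [{"GPU": 1}] * 2}, {"bundles": [{"CPU": 8}] * 2}],
    state="PENDING",
)
```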

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

ryanaoleary and others added 3 commits February 7, 2026 02:39
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@ryanaoleary
Contributor Author

cc: @MengjinYan I had a bad rebase that ended up dropping some of my changes to the files; I've now added them back in and re-tested. The semantics of the fields are now followed per #59024 (comment).

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

ryanaoleary and others added 2 commits February 7, 2026 07:42
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

  // Label selector associated with this bundle.
  // Populated from bundle_label_selector if provided.
- map<string, string> label_selector = 4;
+ LabelSelector label_selector = 4;

Proto files modified without RPC fault-tolerance review

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

Additional Locations (1)


Signed-off-by: ryanaoleary <ryanaoleary@google.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

// Index 1..N: fallback strategies.
placement_group_table_data_.mutable_scheduling_options()->MergeFrom(
    placement_group_spec.fallback_strategy());


Empty bundles field breaks bundle index validation

High Severity

The GcsPlacementGroup constructor no longer populates the bundles field (it only populates scheduling_options). This means bundles is empty for all newly created placement groups until the scheduler calls UpdateActiveBundles. Since HandleGetPlacementGroup serves from in-memory state, client-side bundle_count returns 0 before scheduling runs. This causes check_placement_group_index to reject any placement_group_bundle_index >= 0 with a ValueError. The race window can be significant when other PGs are scheduling concurrently (IsSchedulingInProgress blocks the queue).
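A simplified reproduction of the failure mode described above. This is a sketch of the client-side validation, not Ray's actual check_placement_group_index implementation:

```python
def check_placement_group_index(bundle_count: int, bundle_index: int) -> None:
    # A requested bundle index must fall inside the placement group's
    # reported bundle count; out-of-range indices raise ValueError.
    if bundle_index >= bundle_count:
        raise ValueError(
            f"Bundle index {bundle_index} is invalid; the placement group "
            f"reports {bundle_count} bundles."
        )


# Before the scheduler populates `bundles`, the client observes
# bundle_count == 0, so even bundle index 0 is rejected during the
# race window.
try:
    check_placement_group_index(bundle_count=0, bundle_index=0)
    raised = False
except ValueError:
    raised = True
```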

Additional Locations (1)



Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants