[Fix][operator] Fix volcano podgroup stuck in inqueue state after rayjob completes#4476

Open
fangyinc wants to merge 3 commits into ray-project:master from fangyinc:issues4473

fangyinc commented Feb 2, 2026

Why are these changes needed?

When a RayJob completes (SUCCEEDED/FAILED), the associated RayCluster is deleted (when shutdownAfterJobFinishes: true), but the Volcano PodGroup remains stuck in Inqueue state. This causes queue resources to remain occupied even though the job has finished, preventing new jobs from being scheduled.

Root Cause

KubeRay never implemented cleanup logic for Volcano PodGroups. When RayJob completes:

  1. RayCluster is deleted
  2. PodGroup persists (OwnerReference points to RayJob, not RayCluster)
  3. Volcano scheduler continuously recalculates PodGroup status based on pod counts
  4. With 0 running pods, Volcano's control loop resets the status to Pending/Inqueue
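
Steps 1 and 2 can be illustrated with a toy model of owner-reference garbage collection. This is only a sketch: the `Store` type, the object names, and the single-owner simplification are hypothetical and are not KubeRay or Kubernetes code. It shows why deleting the RayCluster removes its pods but leaves a PodGroup that is owned by the RayJob.

```go
package main

import "fmt"

// Object is a toy stand-in for a Kubernetes object with at most one owner.
// Real objects can carry several owner references; that is omitted here.
type Object struct {
	Kind  string
	Name  string
	Owner string // "Kind/Name" of the owner, empty if none
}

// Store is a hypothetical in-memory object store, not the Kubernetes API.
type Store map[string]Object

func key(kind, name string) string { return kind + "/" + name }

// Add registers an object under its "Kind/Name" key.
func (s Store) Add(o Object) { s[key(o.Kind, o.Name)] = o }

// Has reports whether the object is still present.
func (s Store) Has(kind, name string) bool {
	_, ok := s[key(kind, name)]
	return ok
}

// Delete removes the object and, like owner-reference garbage collection,
// every object that is directly or transitively owned by it.
func (s Store) Delete(kind, name string) {
	k := key(kind, name)
	delete(s, k)
	for _, o := range s {
		if o.Owner == k {
			s.Delete(o.Kind, o.Name)
		}
	}
}

// buildExample mirrors the ownership described above: the PodGroup points at
// the RayJob, while the pod points at the RayCluster. Names are made up.
func buildExample() Store {
	s := Store{}
	s.Add(Object{Kind: "RayJob", Name: "job"})
	s.Add(Object{Kind: "RayCluster", Name: "job-cluster"})
	s.Add(Object{Kind: "Pod", Name: "head", Owner: "RayCluster/job-cluster"})
	s.Add(Object{Kind: "PodGroup", Name: "job-pg", Owner: "RayJob/job"})
	return s
}

func main() {
	s := buildExample()
	s.Delete("RayCluster", "job-cluster")
	fmt.Println("pod exists:", s.Has("Pod", "head"))             // pod exists: false
	fmt.Println("podgroup exists:", s.Has("PodGroup", "job-pg")) // podgroup exists: true
}
```

In this model the PodGroup would only disappear if the RayJob itself were deleted, which matches the behaviour reported in the linked issue.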

Solution

Added a CleanupOnCompletion() method to the BatchScheduler interface that deletes the PodGroup when the RayJob reaches a terminal state. Deletion is necessary because marking the PodGroup as Completed does not work: Volcano's scheduler overrides the status in its next cycle.
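
A minimal sketch of the idea, under stated assumptions: the method signature, the `PodGroupClient` interface, the `fakeClient` with its `resync` loop, and the `podGroupName` helper are all hypothetical and stand in for the real controller-runtime client and Volcano types. The actual signature in this PR may differ. The sketch shows that a manually set Completed phase is overwritten by the next resync, while a deleted PodGroup stays gone, and that cleanup is safe to call more than once.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase uses the PodGroup phase names mentioned in this PR; the type is hypothetical.
type Phase string

const (
	PhaseInqueue   Phase = "Inqueue"
	PhaseCompleted Phase = "Completed"
)

// ErrNotFound is returned by the fake client when the PodGroup is already gone.
var ErrNotFound = errors.New("podgroup not found")

// PodGroupClient is a hypothetical minimal client, not a real Kubernetes client.
type PodGroupClient interface {
	Delete(name string) error
}

// BatchScheduler shows only the method added here; the signature is an assumption.
type BatchScheduler interface {
	CleanupOnCompletion(jobName string, jobTerminal bool) error
}

// fakeClient keeps PodGroups in memory, keyed by name.
type fakeClient struct{ groups map[string]Phase }

func (c *fakeClient) Delete(name string) error {
	if _, ok := c.groups[name]; !ok {
		return ErrNotFound
	}
	delete(c.groups, name)
	return nil
}

// resync imitates the external control loop described above: every PodGroup
// that still exists has its phase recomputed, discarding manual changes.
func (c *fakeClient) resync() {
	for name := range c.groups {
		c.groups[name] = PhaseInqueue
	}
}

// podGroupName is a made-up naming scheme for illustration only.
func podGroupName(jobName string) string { return "ray-" + jobName + "-pg" }

type volcanoSketch struct{ client PodGroupClient }

// CleanupOnCompletion deletes the PodGroup once the job is terminal.
// A missing PodGroup is treated as success so repeated reconciles are harmless.
func (v *volcanoSketch) CleanupOnCompletion(jobName string, jobTerminal bool) error {
	if !jobTerminal {
		return nil
	}
	err := v.client.Delete(podGroupName(jobName))
	if errors.Is(err, ErrNotFound) {
		return nil
	}
	return err
}

func main() {
	c := &fakeClient{groups: map[string]Phase{podGroupName("job"): PhaseInqueue}}
	var s BatchScheduler = &volcanoSketch{client: c}

	// Marking the PodGroup as Completed does not survive the next resync.
	c.groups[podGroupName("job")] = PhaseCompleted
	c.resync()
	fmt.Println("phase after resync:", c.groups[podGroupName("job")])

	// Deleting it does.
	if err := s.CleanupOnCompletion("job", true); err != nil {
		fmt.Println("cleanup failed:", err)
	}
	c.resync()
	fmt.Println("podgroups left:", len(c.groups))
}
```

Treating a not-found error as success is a design choice in this sketch: a reconcile loop may run the cleanup several times for the same terminal RayJob.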

Related issue number

Closes #4473

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

fangyinc changed the title from "fix: Fix volcano podgroup stuck in inqueue state after rayjob completes" to "[Fix][operator] Fix volcano podgroup stuck in inqueue state after rayjob completes" Feb 2, 2026

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

fangyinc commented Feb 4, 2026

@Future-Outlier @andrewsykim PTAL. Thank you~

The failed CI does not seem to be caused by this PR.

Copilot AI mentioned this pull request Feb 5, 2026

Development

Successfully merging this pull request may close these issues.

[Bug] Volcano PodGroup Stuck in Inqueue State After RayJob Completes