Description
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
What happened
After a RayJob completes successfully (status SUCCEEDED / Complete), the associated Volcano PodGroup remains stuck in the Inqueue state indefinitely, instead of transitioning to Completed or being deleted.
Production environment example:
kubectl get rayjob
# NAME STATUS JOBDEPLOYMENTSTATUS ...
# ray-xxxxx-ray-job-xxxxx SUCCEEDED Complete
kubectl get podgroup
# NAME PHASE MINMEMBER AGE
# ray-xxxxx-ray-job-xxxxx-pg Inqueue 31 2d1h
# ray-yyyyy-ray-job-yyyyy-pg Inqueue 21 8d
kubectl describe podgroup ray-xxxxx-ray-job-xxxxx-pg
# Status:
# Phase: Inqueue
# Succeeded: 1
# Conditions:
# Type: Unschedulable
# Message: 30/1 tasks in gang unschedulable: pod group is not ready, 1 Succeeded, 31 minAvailable
# Events:
# Warning Unschedulable volcano 0/1 tasks in gang unschedulable...
The RayJob is SUCCEEDED, but the PodGroup stays in Inqueue, with Volcano continuously trying to schedule pods that no longer exist.
What you expected to happen
When a RayJob completes successfully, the PodGroup should either:
- Transition to Completed status, OR
- Be deleted automatically
The PodGroup should not remain in the Inqueue state after the job has finished.
Root Cause Analysis
From code analysis, the issue stems from:
1. PodGroup OwnerReference (volcano_scheduler.go:198-211):
   - For RayJobs, the PodGroup's OwnerReference points to the RayJob, not the RayCluster (see the sketch after this list)
   - Kubernetes garbage collection therefore only deletes the PodGroup when the RayJob itself is deleted
2. No cleanup logic in the RayJob completion handler (rayjob_controller.go:410-431):
   - When the RayJob reaches a terminal state (Complete/Failed), the default behavior is to do nothing
   - There is NO logic to update or delete the associated PodGroup
3. shutdownAfterJobFinishes only deletes the RayCluster (rayjob_controller.go:1331-1372):
   - Only the RayCluster is deleted, not the PodGroup
   - Comment: "We don't need to delete the submitter Kubernetes Job so that users can still access the driver logs"
   - The PodGroup is completely overlooked
4. This issue has existed since the initial Volcano integration (PR [Feature] Support Volcano for batch scheduling #755, Dec 2022):
   - The PodGroup creation logic never included cleanup on RayJob completion
   - The BatchScheduler interface only has DoBatchSchedulingOnSubmission, no cleanup method
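To make point 1 concrete, here is a minimal Go sketch of the ownership wiring that causes the leak. The helper name and exact fields are illustrative assumptions, not the actual volcano_scheduler.go code; the key point is that an OwnerReference targeting the RayJob ties PodGroup garbage collection to deletion of the RayJob object, not to job completion.

```go
// Illustrative sketch only (not the real KubeRay code): a PodGroup owner
// reference that points at the RayJob. With this wiring, Kubernetes garbage
// collection removes the PodGroup only when the RayJob object is deleted;
// the job merely finishing never triggers any cleanup.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func boolPtr(b bool) *bool { return &b }

// newPodGroupOwnerRef is a hypothetical helper mirroring the behavior
// described above.
func newPodGroupOwnerRef(rayJobName string, rayJobUID types.UID) metav1.OwnerReference {
	return metav1.OwnerReference{
		APIVersion:         "ray.io/v1",
		Kind:               "RayJob", // owner is the RayJob, not the RayCluster
		Name:               rayJobName,
		UID:                rayJobUID,
		Controller:         boolPtr(true),
		BlockOwnerDeletion: boolPtr(true),
	}
}

func main() {
	ref := newPodGroupOwnerRef("ray-xxxxx-ray-job-xxxxx", types.UID("example-uid"))
	fmt.Printf("PodGroup is owned by %s/%s; GC fires only on its deletion\n", ref.Kind, ref.Name)
}
```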
Impact
- PodGroups accumulate indefinitely (observed stuck for 8d, 19d in production)
- Volcano scheduler wastes resources trying to schedule non-existent pods
- User confusion: completed jobs appear to still be waiting in queue
- Hard to distinguish jobs that are actually queued from jobs that have finished
Reproduction script
# Step 1: Create a queue with limited resources
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: ray-queue
spec:
weight: 1
reclaimable: false
capability:
cpu: 4
memory: 8Gi
---
# Step 2: Create RayJob
apiVersion: ray.io/v1
kind: RayJob
metadata:
name: test-ray-job-reproduce
labels:
ray.io/scheduler-name: volcano
volcano.sh/queue-name: ray-queue
spec:
shutdownAfterJobFinishes: true
ttlSecondsAfterFinished: 0
rayClusterSpec:
rayVersion: "2.53.0"
headGroupSpec:
rayStartParams:
dashboard-host: "0.0.0.0"
template:
spec:
containers:
- name: ray-head
image: rayproject/ray:2.53.0
resources:
limits:
cpu: "1"
memory: "2Gi"
requests:
cpu: "1"
memory: "2Gi"
workerGroupSpecs:
- replicas: 2
minReplicas: 2
maxReplicas: 2
groupName: worker-group
rayStartParams: {}
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray:2.53.0
resources:
limits:
cpu: "1"
memory: "2Gi"
requests:
cpu: "1"
memory: "2Gi"
submissionMode: K8sJobMode
entrypoint: python -c "import ray; ray.init(); print('Job started'); import time; time.sleep(30); print('Job completed successfully')"
activeDeadlineSeconds: 600
backoffLimit: 0
Steps to reproduce
1. Create the queue with limited resources (4 CPU, 8Gi):
   kubectl apply -f queue.yaml
2. Create the first RayJob (requests 3 CPU, 6Gi):
   kubectl apply -f rayjob-1.yaml
3. Wait for the first RayJob to complete:
   kubectl get rayjob test-ray-job-1
   # STATUS: SUCCEEDED, JOBDEPLOYMENTSTATUS: Complete
4. Check the first PodGroup status - BUG: stuck in Inqueue instead of Completed:
   kubectl get podgroup ray-test-ray-job-1-pg
   # PHASE: Inqueue (should be Completed or deleted)
5. Create a second RayJob with the same resource requirements:
   kubectl apply -f rayjob-2.yaml
6. Observe the second PodGroup - BUG: stuck in Pending indefinitely:
   kubectl get podgroup ray-test-ray-job-2-pg
   # PHASE: Pending (should be able to run since the first job is done)
   kubectl describe podgroup ray-test-ray-job-2-pg
   # Events: queue resource quota insufficient
The PodGroup ray-test-ray-job-2-pg shows events like:
Type     Reason         Age                 From     Message
----     ------         ----                ----     -------
Normal   Unschedulable  20s (x25 over 44s)  volcano  queue resource quota insufficient: insufficient cpu, insufficient memory
Warning  Unschedulable  20s (x25 over 44s)  volcano  3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending, 3 minAvailable; Pending: 3 Unschedulable
The second RayJob cannot run because the first PodGroup still holds the queue resources, even though the first RayJob has already completed.
Anything else
Environment
- Kubernetes version: reproduced on both v1.29 and v1.34
- KubeRay version: v1.5.1
- Volcano version: v1.14.0
Possible solutions
- Option 1 (Recommended): Update the PodGroup to Completed when the RayJob reaches a terminal state (see the sketch after this list)
  - Add a cleanup method to the BatchScheduler interface
  - Call it when the RayJob transitions to Complete/Failed
- Option 2: Delete the PodGroup when the RayJob completes
  - Simpler, but loses scheduling history
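For illustration, a minimal sketch of what Option 1 could look like, assuming a new hook on the batch-scheduler abstraction. The method name OnJobFinished, the signatures, and the "-pg" naming convention are assumptions for the sketch, not existing KubeRay APIs; the cleanup body below shows the simpler Option 2 behavior (deletion), while Option 1 would patch the PodGroup status to Completed instead.

```go
// Hypothetical sketch of Option 1: extend the scheduler abstraction with a
// cleanup hook that the RayJob controller calls on terminal states. All names
// and signatures here are illustrative, not the existing KubeRay interface.
package schedulerplugin

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// BatchScheduler mirrors the shape described above: today there is only a
// submission-time hook; OnJobFinished is the proposed addition.
type BatchScheduler interface {
	DoBatchSchedulingOnSubmission(ctx context.Context, object metav1.Object) error
	// OnJobFinished would be invoked when the RayJob transitions to
	// Complete/Failed, letting the Volcano plugin mark the PodGroup
	// Completed (Option 1) or delete it (Option 2).
	OnJobFinished(ctx context.Context, object metav1.Object) error
}

// volcanoCleanup deletes the PodGroup owned by a finished RayJob (the simpler
// Option 2 behavior). The "-pg" suffix is an assumed naming convention.
func volcanoCleanup(ctx context.Context, c client.Client, rayJob metav1.Object) error {
	pg := &unstructured.Unstructured{}
	pg.SetAPIVersion("scheduling.volcano.sh/v1beta1")
	pg.SetKind("PodGroup")
	pg.SetNamespace(rayJob.GetNamespace())
	pg.SetName(rayJob.GetName() + "-pg")
	// Ignore NotFound so the call stays idempotent across reconcile loops.
	return client.IgnoreNotFound(c.Delete(ctx, pg))
}
```

Option 1 would replace the Delete call with a status update setting the PodGroup phase to Completed, preserving the scheduling history that Option 2 discards.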
Are you willing to submit a PR?
- Yes I am willing to submit a PR!