Open
Labels: bug (Something isn't working)
Description
What happened?
I'm having trouble using KAI-scheduler on my GB200-NVL72 cluster. I have been using KAI-scheduler reliably for many months on my H200 cluster.
The key difference between the two clusters is that the GB200-NVL72 is a single rack, which k8s sees as 18 nodes with 4 GPUs each. I create a ComputeDomain for each cluster, and in turn a DRA driver pod gets scheduled on each node within the compute domain. However, when using KAI-scheduler, the DRA driver pod does not always get scheduled.
For example, when scheduling a 17-node job (i.e. 68 GPUs), I get the error:
condition message: PodSchedulingErrors: Resources were found for 5 pods while 17 are required for gang scheduling. Additional pods cannot be scheduled due to: no nodes with enough resources were found: 12 cannot allocate all claims.
If instead I use the default k8s scheduler, the job runs as expected.
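As a quick sanity check on the numbers in that condition message, here is a minimal Python sketch that pulls the counts out of the text and compares them with the cluster shape described above. The regular expressions are an assumption based only on the single message quoted in this report, not on any documented KAI-scheduler message format, and `summarize` is a hypothetical helper name.

```python
import re

# Condition message copied verbatim from the report above.
MESSAGE = (
    "PodSchedulingErrors: Resources were found for 5 pods while 17 are "
    "required for gang scheduling. Additional pods cannot be scheduled due to: "
    "no nodes with enough resources were found: 12 cannot allocate all claims."
)

GPUS_PER_NODE = 4  # from the report: each node exposes 4 GPUs
TOTAL_NODES = 18   # from the report: the rack appears as 18 nodes


def summarize(message: str) -> dict:
    """Extract pod/node counts from the message (pattern is an assumption)."""
    gang = re.search(r"found for (\d+) pods while (\d+) are required", message)
    claims = re.search(r"(\d+) cannot allocate all claims", message)
    if gang is None or claims is None:
        raise ValueError("message does not match the pattern seen in this report")
    placed, required = int(gang.group(1)), int(gang.group(2))
    return {
        "placed": placed,
        "required": required,
        "unplaced": required - placed,
        "rejected_for_claims": int(claims.group(1)),
        "gpus_requested": required * GPUS_PER_NODE,
        "gpus_in_rack": TOTAL_NODES * GPUS_PER_NODE,
    }


summary = summarize(MESSAGE)
print(summary)
```

With the message from this report, the number of pods left unplaced (17 - 5) equals the count reported as unable to allocate all claims, and the 68 GPUs requested fit within the 72 GPUs in the rack, so the shortfall appears to come from claim allocation rather than from raw GPU capacity.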
What did you expect to happen?
No response
Environment
- Kubernetes version
- KAI Scheduler version
- Cloud provider or hardware configuration
- Tools that you are using KAI together with
- Anything else that is relevant