ComputeDomain on GB200 #577

@Phlip79

Description

What happened?

I'm having trouble using KAI-scheduler on my GB200-NVL72 cluster. I have been using KAI-scheduler reliably for many months on my H200 cluster.

The key difference between the two clusters is that the GB200-NVL72 is a single rack, but k8s sees it as 18 nodes of 4 GPUs each. I create a ComputeDomain for each cluster, and in turn a DRA driver pod gets scheduled on each node within the ComputeDomain. However, when using KAI-scheduler, the DRA driver pod does not always get scheduled.

For example, when scheduling a 17-node job (i.e. 68 GPUs), I get the error:

condition message: PodSchedulingErrors: Resources were found for 5 pods while 17 are required for gang scheduling. Additional pods cannot be scheduled due to: no nodes with enough resources were found: 12 cannot allocate all claims. 
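To make the numbers in that message concrete, here is a minimal sketch of the arithmetic and of the all-or-nothing rule that gang scheduling implies. It is only an illustration: `gang_can_schedule` is a hypothetical helper, not KAI-scheduler code, and the node and GPU counts are the ones quoted in this report.

```python
# Illustrative only: not KAI-scheduler's implementation.
GPUS_PER_NODE = 4   # from the report: each node exposes 4 GPUs
NODES_IN_RACK = 18  # from the report: the NVL72 rack appears as 18 nodes


def gang_can_schedule(pods_required: int, pods_placeable: int) -> bool:
    """Hypothetical all-or-nothing check: admit the gang only if every pod fits."""
    return pods_placeable >= pods_required


requested_nodes = 17
requested_gpus = requested_nodes * GPUS_PER_NODE  # 17 * 4 = 68 GPUs
rack_gpus = NODES_IN_RACK * GPUS_PER_NODE         # 18 * 4 = 72 GPUs
fits_in_rack = requested_gpus <= rack_gpus        # capacity alone is sufficient

# Numbers from the error message: resources found for 5 pods, 17 required.
admitted = gang_can_schedule(pods_required=17, pods_placeable=5)
pods_blocked = 17 - 5  # matches the "12 cannot allocate all claims" count

print(fits_in_rack, admitted, pods_blocked)
```

So the job fits the rack by raw GPU count, and the 12 blocked pods are exactly the ones reported as unable to allocate their claims.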

If I use the default k8s scheduler instead, the job runs as expected.

What did you expect to happen?

No response

Environment

  • Kubernetes version
  • KAI Scheduler version
  • Cloud provider or hardware configuration
  • Tools that you are using KAI together with
  • Anything else that is relevant

Metadata

Labels: bug (Something isn't working)
