Description
What happened:
We are launching MPIJobs through a Kueue LocalQueue (specifically cpu-local-queue from the YAML found at the end of this issue). The ResourceFlavor associated with the ClusterQueue uses nodeLabels to target a specific GKE node pool. We do not set a NodeSelector on the MPIJob when launching it. When the job is launched, Kueue sets the correct NodeSelector on the MPIJob. However, the pods' NodeSelector is empty. Note that we do not set the suspend field on the MPIJob; we let Kueue do it for us.
What you expected to happen:
The MPIJob pods should have the same NodeSelector as the MPIJob. The ResourceFlavor documentation (https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/) states:
Kueue adds the ResourceFlavor labels to the .nodeSelector of the underlying Workload Pod templates. This occurs if the Workload didn’t specify the ResourceFlavor labels already as part of its nodeSelector.
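For illustration, this is roughly the pod-level result we expected for every replica type, assuming the cpu ResourceFlavor listed under Environment is the one assigned. The pod name is hypothetical, and the fragment is a sketch of the relevant fields only, not actual cluster output.

```yaml
# Sketch of the expected launcher pod (relevant fields only; not real output).
apiVersion: v1
kind: Pod
metadata:
  name: pi-launcher        # hypothetical pod name
  namespace: mynamespace
spec:
  nodeSelector:
    # Expected to be copied from the ResourceFlavor's spec.nodeLabels
    cloud.google.com/gke-nodepool: e2x4
  containers:
  - name: mpi-launcher
    image: mpioperator/mpi-pi:openmpi
```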
Environment:
GKE 1.30 + Kueue 0.8.1 + waitForPodsReady=true. These are the Kueue resources:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: cpu
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: e2x4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: cpu-prov-config
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: check-capacity-cpu-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: cpu-prov-config
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cpu-cluster-queue"
spec:
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu"
      resources:
      - name: "cpu"
        nominalQuota: "12"
      - name: "memory"
        nominalQuota: 52000Gi
  admissionChecks:
  - check-capacity-cpu-prov
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: cpu-local-queue
  namespace: mynamespace
spec:
  clusterQueue: cpu-cluster-queue
This can be replicated with the MPI Operator example that follows. The launcher does not have the NodeSelector set, but the workers do.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  namespace: mynamespace
  labels:
    kueue.x-k8s.io/queue-name: cpu-local-queue
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: None
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              requests:
                cpu: "1300m"
                memory: 3Gi
              limits:
                cpu: "1300m"
                memory: 3Gi