
[Bug]: CUJ1 Regression, KAI scheduler fails to install #165

@mchmarny

Description

Prerequisites

  • I searched existing issues and found no duplicates
  • I can reproduce this issue consistently
  • This is not a security vulnerability (use Security Advisories instead)

Bug Description

When executing ./deploy.sh, the deployment fails with this error:

Applying manifests for gpu-operator...
configmap/dcgm-exporter created
Installing kai-scheduler (kai-scheduler)...
Release "kai-scheduler" does not exist. Installing it now.
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.12.14
Digest: sha256:97d8f439f2432c42e996027bbbe15d5131eaa9b69aa803c5b60ea219562ac3e4
Error: resource not ready, name: default, kind: SchedulingShard, status: InProgress
context deadline exceeded

Impact

Blocking (cannot proceed)

Component

CLI (eidos)

Regression?

Yes, this worked before (please specify version below)

Steps to Reproduce

Starting with a cluster with 3 system nodes, 2 GPU nodes, and 1 CPU node:

$ k get nodes -o wide
NAME                           STATUS   ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-0-11-191.ec2.internal    Ready    <none>   102m   v1.34.3   10.0.11.191    <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28
ip-10-0-131-78.ec2.internal    Ready    <none>   102m   v1.32.5   10.0.131.78    <none>        Ubuntu 22.04.5 LTS   6.8.0-1028-aws    containerd://1.7.27
ip-10-0-207-134.ec2.internal   Ready    <none>   102m   v1.32.5   10.0.207.134   <none>        Ubuntu 22.04.5 LTS   6.8.0-1028-aws    containerd://1.7.27
ip-10-0-222-107.ec2.internal   Ready    <none>   102m   v1.34.3   10.0.222.107   <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28
ip-10-0-4-118.ec2.internal     Ready    <none>   102m   v1.34.3   10.0.4.118     <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28
ip-10-0-7-217.ec2.internal     Ready    <none>   102m   v1.34.3   10.0.7.217     <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28

Pods prior to installation:

$ k get pods -A
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   aws-node-2ck2s             2/2     Running   0          4m50s
kube-system   aws-node-7tmjt             2/2     Running   0          5m20s
kube-system   aws-node-djccv             2/2     Running   0          4m53s
kube-system   aws-node-f4n96             2/2     Running   0          4m56s
kube-system   aws-node-hd2tt             2/2     Running   0          5m2s
kube-system   aws-node-r66w4             2/2     Running   0          4m58s
kube-system   coredns-58d8fddfd7-2vb8w   1/1     Running   0          25m
kube-system   coredns-58d8fddfd7-6grj9   1/1     Running   0          25m
kube-system   kube-proxy-4fwpr           1/1     Running   0          24m
kube-system   kube-proxy-7vngs           1/1     Running   0          23m
kube-system   kube-proxy-88wfr           1/1     Running   0          23m
kube-system   kube-proxy-cbfrc           1/1     Running   0          24m
kube-system   kube-proxy-hd897           1/1     Running   0          24m
kube-system   kube-proxy-tt7vb           1/1     Running   0          24m

Connect to cluster:

aws eks update-kubeconfig --region us-east-1 --name aicr-demo --alias aicr-demo

Gen recipe:

eidos recipe \
  --service eks \
  --accelerator h100 \
  --intent training \
  --os ubuntu \
  --platform kubeflow \
  --output recipe.yaml

Gen bundle:

eidos bundle \
  --recipe recipe.yaml \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
  --output bundle

Deploy:

cd ./bundle && chmod +x deploy.sh && ./deploy.sh

Output:

Deploying Cloud Native Stack components...
Installing aws-ebs-csi-driver (kube-system)...
Release "aws-ebs-csi-driver" does not exist. Installing it now.
I0220 05:28:45.656395   91946 warnings.go:107] "Warning: spec.template.spec.containers[1].ports[0]: duplicate port name \"healthz\" with spec.template.spec.containers[0].ports[0], services and probes that select ports by name will use spec.template.spec.containers[0].ports[0]"
NAME: aws-ebs-csi-driver
LAST DEPLOYED: Fri Feb 20 05:28:38 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
To verify that aws-ebs-csi-driver has started, run:

    kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-ebs-csi-driver,app.kubernetes.io/instance=aws-ebs-csi-driver"

The "a1CompatibilityDaemonSet" parameter has been removed. For more information see the EBS CSI Helm Chart changelog:
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/CHANGELOG.md#2550
Installing aws-efa (kube-system)...
Release "aws-efa" does not exist. Installing it now.
NAME: aws-efa
LAST DEPLOYED: Fri Feb 20 05:29:08 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
EFA device plugin is installed, it can be requested as `vpc.amazonaws.com/efa` resource.
Installing cert-manager (cert-manager)...
Release "cert-manager" does not exist. Installing it now.
NAME: cert-manager
LAST DEPLOYED: Fri Feb 20 05:29:13 2026
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
⚠️  WARNING: `installCRDs` is deprecated, use `crds.enabled` instead.
cert-manager v1.17.2 has been deployed successfully!

In order to begin issuing certificates, you will need to set up a ClusterIssuer
or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).

More information on the different types of issuers and how to configure them
can be found in our documentation:

https://cert-manager.io/docs/configuration/

For information on how to configure cert-manager to automatically provision
Certificates for Ingress resources, take a look at the `ingress-shim`
documentation:

https://cert-manager.io/docs/usage/ingress/
Installing gpu-operator (gpu-operator)...
Release "gpu-operator" does not exist. Installing it now.
I0220 05:29:58.356053   93092 warnings.go:107] "Warning: spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is deprecated, use \"node-role.kubernetes.io/control-plane\" instead"
NAME: gpu-operator
LAST DEPLOYED: Fri Feb 20 05:29:54 2026
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
Applying manifests for gpu-operator...
configmap/dcgm-exporter created
Installing kai-scheduler (kai-scheduler)...
Release "kai-scheduler" does not exist. Installing it now.
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.12.14
Digest: sha256:97d8f439f2432c42e996027bbbe15d5131eaa9b69aa803c5b60ea219562ac3e4
Error: resource not ready, name: default, kind: SchedulingShard, status: InProgress
context deadline exceeded

Debug:

$ k get pods -A
NAMESPACE       NAME                                             READY   STATUS    RESTARTS   AGE
cert-manager    cert-manager-85888c5d66-qfljj                    1/1     Running   0          22m
cert-manager    cert-manager-cainjector-7476657f99-sltcp         1/1     Running   0          22m
cert-manager    cert-manager-webhook-6655cc77b8-n7psl            1/1     Running   0          22m
gpu-operator    gpu-operator-dc849bdc7-mn5vh                     1/1     Running   0          22m
gpu-operator    node-feature-discovery-gc-66bb6c8796-9bl4s       1/1     Running   0          22m
gpu-operator    node-feature-discovery-master-78d8f6d5b6-prm9v   1/1     Running   0          22m
kai-scheduler   admission-669878d9d8-hbbtf                       1/1     Running   0          21m
kai-scheduler   binder-6d45cf7c89-wbs2x                          1/1     Running   0          21m
kai-scheduler   kai-operator-54df58c759-78qz4                    1/1     Running   0          21m
kai-scheduler   kai-scheduler-default-786b65f669-nf2vb           1/1     Running   0          21m
kai-scheduler   pod-grouper-5d5c88b6fb-n66rl                     1/1     Running   0          21m
kai-scheduler   podgroup-controller-56947478b-524hr              1/1     Running   0          21m
kai-scheduler   queue-controller-5f5b6895b6-gq5qs                1/1     Running   0          21m
kube-system     aws-node-2ck2s                                   2/2     Running   0          86m
kube-system     aws-node-7tmjt                                   2/2     Running   0          86m
kube-system     aws-node-djccv                                   2/2     Running   0          86m
kube-system     aws-node-f4n96                                   2/2     Running   0          86m
kube-system     aws-node-hd2tt                                   2/2     Running   0          86m
kube-system     aws-node-r66w4                                   2/2     Running   0          86m
kube-system     coredns-58d8fddfd7-2vb8w                         1/1     Running   0          106m
kube-system     coredns-58d8fddfd7-6grj9                         1/1     Running   0          106m
kube-system     ebs-csi-controller-5d6f58b85d-2lc45              5/5     Running   0          23m
kube-system     ebs-csi-controller-5d6f58b85d-lqjwt              5/5     Running   0          23m
kube-system     ebs-csi-node-d7rbm                               3/3     Running   0          23m
kube-system     ebs-csi-node-lwksb                               3/3     Running   0          23m
kube-system     kube-proxy-4fwpr                                 1/1     Running   0          105m
kube-system     kube-proxy-7vngs                                 1/1     Running   0          105m
kube-system     kube-proxy-88wfr                                 1/1     Running   0          105m
kube-system     kube-proxy-cbfrc                                 1/1     Running   0          105m
kube-system     kube-proxy-hd897                                 1/1     Running   0          105m
kube-system     kube-proxy-tt7vb                                 1/1     Running   0          105m

Checking on the kai-scheduler and gpu-operator pods:

$ k get pods -A | grep -E "kai|gpu-operator"
gpu-operator    gpu-operator-dc849bdc7-mn5vh                     1/1     Running   0          24m
gpu-operator    node-feature-discovery-gc-66bb6c8796-9bl4s       1/1     Running   0          24m
gpu-operator    node-feature-discovery-master-78d8f6d5b6-prm9v   1/1     Running   0          24m
kai-scheduler   admission-669878d9d8-hbbtf                       1/1     Running   0          23m
kai-scheduler   binder-6d45cf7c89-wbs2x                          1/1     Running   0          23m
kai-scheduler   kai-operator-54df58c759-78qz4                    1/1     Running   0          23m
kai-scheduler   kai-scheduler-default-786b65f669-nf2vb           1/1     Running   0          23m
kai-scheduler   pod-grouper-5d5c88b6fb-n66rl                     1/1     Running   0          23m
kai-scheduler   podgroup-controller-56947478b-524hr              1/1     Running   0          23m
kai-scheduler   queue-controller-5f5b6895b6-gq5qs                1/1     Running   0          23m

Helm:

$ helm list -A
NAME              	NAMESPACE    	REVISION	UPDATED                             	STATUS  	CHART                           	APP VERSION
aws-ebs-csi-driver	kube-system  	1       	2026-02-20 05:28:38.765555 -0800 PST	deployed	aws-ebs-csi-driver-2.55.0       	1.55.0
aws-efa           	kube-system  	1       	2026-02-20 05:29:08.283708 -0800 PST	deployed	aws-efa-k8s-device-plugin-v0.5.3	v0.5.3
cert-manager      	cert-manager 	1       	2026-02-20 05:29:13.790312 -0800 PST	deployed	cert-manager-v1.17.2            	v1.17.2
gpu-operator      	gpu-operator 	1       	2026-02-20 05:29:54.231046 -0800 PST	deployed	gpu-operator-v25.10.1           	v25.10.1
kai-scheduler     	kai-scheduler	1       	2026-02-20 05:30:26.743182 -0800 PST	failed  	kai-scheduler-v0.12.14          	v0.12.14
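
The release is marked failed only because Helm's readiness wait on the SchedulingShard timed out. One hedged next step (assuming deploy.sh installs the chart from the OCI registry shown in the log, and ignoring any custom values it may pass) is to retry the install with a longer timeout, to distinguish a slow reconcile from a hung one:

```shell
# Hypothetical retry with a longer wait; any values deploy.sh passes are omitted here.
helm upgrade --install kai-scheduler \
  oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler \
  --version v0.12.14 \
  --namespace kai-scheduler \
  --wait --timeout 15m
```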

Scheduling shards:

$ kubectl get schedulingshards -A
NAME      AGE
default   27m
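
The shard exists but stays InProgress; its status conditions may say why. A possible check (the resource appears to be cluster-scoped, since `kubectl get schedulingshards -A` prints no namespace column):

```shell
# Inspect the SchedulingShard's status and conditions for the stuck reconcile
kubectl describe schedulingshard default
kubectl get schedulingshard default -o yaml
```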

No logs from the kai-operator label selector:

$ kubectl logs -n kai-scheduler -l app.kubernetes.io/name=kai-operator
No resources found in kai-scheduler namespace.
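
Since the `app.kubernetes.io/name=kai-operator` selector matches no pods even though the kai-operator pod is Running, the chart may label its pods differently. A sketch of how to find the right selector and tail the operator directly (assuming the Deployment is named `kai-operator`, matching the pod name above):

```shell
# List the labels actually set on the kai-scheduler pods
kubectl get pods -n kai-scheduler --show-labels
# Tail the operator by Deployment name instead of by label
kubectl logs -n kai-scheduler deploy/kai-operator --tail=100
```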

GPUs aren't recognized (no nvidia.com/gpu capacity or allocatable on any node):

$ kubectl describe node | grep -E "Capacity|Allocatable|nvidia.com/gpu"
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
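
The empty Capacity/Allocatable output, together with the pod list above (only the gpu-operator controller and NFD master/gc are running, with no driver or device-plugin daemonset pods), suggests the operator never created its operands on the GPU nodes. Two hedged checks, reusing the `nodeGroup=gpu-worker` selector from the bundle command (the NFD PCI label below is the standard NVIDIA vendor-ID label, assumed here to be what gpu-operator keys on):

```shell
# Did the GPU nodes get the expected selector and NFD labels?
kubectl get nodes -l nodeGroup=gpu-worker
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
# Were any gpu-operator operands (driver, toolkit, device-plugin) scheduled?
kubectl get pods -n gpu-operator -o wide
```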

Expected Behavior

The deploy script should result in a fully configured cluster.

Actual Behavior

The deployment script exits with the error shown above.

Environment

  • Eidos version (CLI eidos version, API image tag, or commit SHA): v0.7.3-next (commit: 4defc33, date: 2026-02-20T12:51:37Z)
  • Install method (release binary / build from source / container image): build
  • Platform (eks/gke/aks/self-managed): eks
  • Kubernetes version: v1.34
  • OS (ubuntu/cos/other) + version: Ubuntu 24.04
  • Kernel version: 6.8
  • GPU type (h100/gb200/a100/l40/other): h100
  • Workload intent (training/inference): training

Command / Request Used

No response

Logs / Error Output

Additional Context

No response
