Prerequisites
Bug Description
When executing ./deploy.sh I see this error:
Applying manifests for gpu-operator...
configmap/dcgm-exporter created
Installing kai-scheduler (kai-scheduler)...
Release "kai-scheduler" does not exist. Installing it now.
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.12.14
Digest: sha256:97d8f439f2432c42e996027bbbe15d5131eaa9b69aa803c5b60ea219562ac3e4
Error: resource not ready, name: default, kind: SchedulingShard, status: InProgress
context deadline exceeded
Impact
Blocking (cannot proceed)
Component
CLI (eidos)
Regression?
Yes, this worked before (please specify version below)
Steps to Reproduce
Starting with a cluster of 3 system nodes, 2 GPU nodes, and 1 CPU node:
$ k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-11-191.ec2.internal Ready <none> 102m v1.34.3 10.0.11.191 <none> Ubuntu 24.04.4 LTS 6.17.0-1007-aws containerd://1.7.28
ip-10-0-131-78.ec2.internal Ready <none> 102m v1.32.5 10.0.131.78 <none> Ubuntu 22.04.5 LTS 6.8.0-1028-aws containerd://1.7.27
ip-10-0-207-134.ec2.internal Ready <none> 102m v1.32.5 10.0.207.134 <none> Ubuntu 22.04.5 LTS 6.8.0-1028-aws containerd://1.7.27
ip-10-0-222-107.ec2.internal Ready <none> 102m v1.34.3 10.0.222.107 <none> Ubuntu 24.04.4 LTS 6.17.0-1007-aws containerd://1.7.28
ip-10-0-4-118.ec2.internal Ready <none> 102m v1.34.3 10.0.4.118 <none> Ubuntu 24.04.4 LTS 6.17.0-1007-aws containerd://1.7.28
ip-10-0-7-217.ec2.internal Ready <none> 102m v1.34.3 10.0.7.217 <none> Ubuntu 24.04.4 LTS 6.17.0-1007-aws containerd://1.7.28
Pods prior to installation:
$ k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-2ck2s 2/2 Running 0 4m50s
kube-system aws-node-7tmjt 2/2 Running 0 5m20s
kube-system aws-node-djccv 2/2 Running 0 4m53s
kube-system aws-node-f4n96 2/2 Running 0 4m56s
kube-system aws-node-hd2tt 2/2 Running 0 5m2s
kube-system aws-node-r66w4 2/2 Running 0 4m58s
kube-system coredns-58d8fddfd7-2vb8w 1/1 Running 0 25m
kube-system coredns-58d8fddfd7-6grj9 1/1 Running 0 25m
kube-system kube-proxy-4fwpr 1/1 Running 0 24m
kube-system kube-proxy-7vngs 1/1 Running 0 23m
kube-system kube-proxy-88wfr 1/1 Running 0 23m
kube-system kube-proxy-cbfrc 1/1 Running 0 24m
kube-system kube-proxy-hd897 1/1 Running 0 24m
kube-system kube-proxy-tt7vb 1/1 Running 0 24m
Connect to cluster:
aws eks update-kubeconfig --region us-east-1 --name aicr-demo --alias aicr-demo
Gen recipe:
eidos recipe \
--service eks \
--accelerator h100 \
--intent training \
--os ubuntu \
--platform kubeflow \
--output recipe.yaml
Gen bundle:
eidos bundle \
--recipe recipe.yaml \
--accelerated-node-selector nodeGroup=gpu-worker \
--accelerated-node-toleration dedicated=worker-workload:NoSchedule \
--output bundle
Deploy:
cd ./bundle && chmod +x deploy.sh && ./deploy.sh
Output:
Deploying Cloud Native Stack components...
Installing aws-ebs-csi-driver (kube-system)...
Release "aws-ebs-csi-driver" does not exist. Installing it now.
I0220 05:28:45.656395 91946 warnings.go:107] "Warning: spec.template.spec.containers[1].ports[0]: duplicate port name \"healthz\" with spec.template.spec.containers[0].ports[0], services and probes that select ports by name will use spec.template.spec.containers[0].ports[0]"
NAME: aws-ebs-csi-driver
LAST DEPLOYED: Fri Feb 20 05:28:38 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
To verify that aws-ebs-csi-driver has started, run:
kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-ebs-csi-driver,app.kubernetes.io/instance=aws-ebs-csi-driver"
The "a1CompatibilityDaemonSet" parameter has been removed. For more information see the EBS CSI Helm Chart changelog:
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/CHANGELOG.md#2550
Installing aws-efa (kube-system)...
Release "aws-efa" does not exist. Installing it now.
NAME: aws-efa
LAST DEPLOYED: Fri Feb 20 05:29:08 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
EFA device plugin is installed, it can be requested as `vpc.amazonaws.com/efa` resource.
Installing cert-manager (cert-manager)...
Release "cert-manager" does not exist. Installing it now.
NAME: cert-manager
LAST DEPLOYED: Fri Feb 20 05:29:13 2026
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
⚠️ WARNING: `installCRDs` is deprecated, use `crds.enabled` instead.
cert-manager v1.17.2 has been deployed successfully!
In order to begin issuing certificates, you will need to set up a ClusterIssuer
or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).
More information on the different types of issuers and how to configure them
can be found in our documentation:
https://cert-manager.io/docs/configuration/
For information on how to configure cert-manager to automatically provision
Certificates for Ingress resources, take a look at the `ingress-shim`
documentation:
https://cert-manager.io/docs/usage/ingress/
Installing gpu-operator (gpu-operator)...
Release "gpu-operator" does not exist. Installing it now.
I0220 05:29:58.356053 93092 warnings.go:107] "Warning: spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use \"node-role.kubernetes.io/control-plane\" instead"
NAME: gpu-operator
LAST DEPLOYED: Fri Feb 20 05:29:54 2026
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
Applying manifests for gpu-operator...
configmap/dcgm-exporter created
Installing kai-scheduler (kai-scheduler)...
Release "kai-scheduler" does not exist. Installing it now.
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.12.14
Digest: sha256:97d8f439f2432c42e996027bbbe15d5131eaa9b69aa803c5b60ea219562ac3e4
Error: resource not ready, name: default, kind: SchedulingShard, status: InProgress
context deadline exceeded
Debug:
$ k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-85888c5d66-qfljj 1/1 Running 0 22m
cert-manager cert-manager-cainjector-7476657f99-sltcp 1/1 Running 0 22m
cert-manager cert-manager-webhook-6655cc77b8-n7psl 1/1 Running 0 22m
gpu-operator gpu-operator-dc849bdc7-mn5vh 1/1 Running 0 22m
gpu-operator node-feature-discovery-gc-66bb6c8796-9bl4s 1/1 Running 0 22m
gpu-operator node-feature-discovery-master-78d8f6d5b6-prm9v 1/1 Running 0 22m
kai-scheduler admission-669878d9d8-hbbtf 1/1 Running 0 21m
kai-scheduler binder-6d45cf7c89-wbs2x 1/1 Running 0 21m
kai-scheduler kai-operator-54df58c759-78qz4 1/1 Running 0 21m
kai-scheduler kai-scheduler-default-786b65f669-nf2vb 1/1 Running 0 21m
kai-scheduler pod-grouper-5d5c88b6fb-n66rl 1/1 Running 0 21m
kai-scheduler podgroup-controller-56947478b-524hr 1/1 Running 0 21m
kai-scheduler queue-controller-5f5b6895b6-gq5qs 1/1 Running 0 21m
kube-system aws-node-2ck2s 2/2 Running 0 86m
kube-system aws-node-7tmjt 2/2 Running 0 86m
kube-system aws-node-djccv 2/2 Running 0 86m
kube-system aws-node-f4n96 2/2 Running 0 86m
kube-system aws-node-hd2tt 2/2 Running 0 86m
kube-system aws-node-r66w4 2/2 Running 0 86m
kube-system coredns-58d8fddfd7-2vb8w 1/1 Running 0 106m
kube-system coredns-58d8fddfd7-6grj9 1/1 Running 0 106m
kube-system ebs-csi-controller-5d6f58b85d-2lc45 5/5 Running 0 23m
kube-system ebs-csi-controller-5d6f58b85d-lqjwt 5/5 Running 0 23m
kube-system ebs-csi-node-d7rbm 3/3 Running 0 23m
kube-system ebs-csi-node-lwksb 3/3 Running 0 23m
kube-system kube-proxy-4fwpr 1/1 Running 0 105m
kube-system kube-proxy-7vngs 1/1 Running 0 105m
kube-system kube-proxy-88wfr 1/1 Running 0 105m
kube-system kube-proxy-cbfrc 1/1 Running 0 105m
kube-system kube-proxy-hd897 1/1 Running 0 105m
kube-system kube-proxy-tt7vb 1/1 Running 0 105m
Checking on kai and gpu operator
$ k get pods -A | grep -E "kai|gpu-operator"
gpu-operator gpu-operator-dc849bdc7-mn5vh 1/1 Running 0 24m
gpu-operator node-feature-discovery-gc-66bb6c8796-9bl4s 1/1 Running 0 24m
gpu-operator node-feature-discovery-master-78d8f6d5b6-prm9v 1/1 Running 0 24m
kai-scheduler admission-669878d9d8-hbbtf 1/1 Running 0 23m
kai-scheduler binder-6d45cf7c89-wbs2x 1/1 Running 0 23m
kai-scheduler kai-operator-54df58c759-78qz4 1/1 Running 0 23m
kai-scheduler kai-scheduler-default-786b65f669-nf2vb 1/1 Running 0 23m
kai-scheduler pod-grouper-5d5c88b6fb-n66rl 1/1 Running 0 23m
kai-scheduler podgroup-controller-56947478b-524hr 1/1 Running 0 23m
kai-scheduler queue-controller-5f5b6895b6-gq5qs 1/1 Running 0 23m
Helm:
$ helm list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
aws-ebs-csi-driver kube-system 1 2026-02-20 05:28:38.765555 -0800 PST deployed aws-ebs-csi-driver-2.55.0 1.55.0
aws-efa kube-system 1 2026-02-20 05:29:08.283708 -0800 PST deployed aws-efa-k8s-device-plugin-v0.5.3 v0.5.3
cert-manager cert-manager 1 2026-02-20 05:29:13.790312 -0800 PST deployed cert-manager-v1.17.2 v1.17.2
gpu-operator gpu-operator 1 2026-02-20 05:29:54.231046 -0800 PST deployed gpu-operator-v25.10.1 v25.10.1
kai-scheduler kai-scheduler 1 2026-02-20 05:30:26.743182 -0800 PST failed kai-scheduler-v0.12.14 v0.12.14
Scheduling shards
$ kubectl get schedulingshards -A
NAME AGE
default 27m
No logs
kubectl logs -n kai-scheduler -l app.kubernetes.io/name=kai-operator
No resources found in kai-scheduler namespace.
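Since the label selector above matched nothing, a possible fallback is to select the operator pod by name prefix instead. The sketch below is illustrative only: `pods_matching` is a hypothetical helper, and the `kai-operator` prefix is taken from the pod listing earlier in this report, not from the chart's documented labels.

```shell
# pods_matching PREFIX
# Reads `kubectl get pods --no-headers` rows on stdin and prints the
# names (first column) that start with PREFIX. Hypothetical helper.
pods_matching() {
  awk -v p="$1" 'index($1, p) == 1 { print $1 }'
}

# Against a live cluster (left commented out, needs kubectl access):
# kubectl get pods -n kai-scheduler --no-headers | pods_matching kai-operator |
#   while read -r pod; do kubectl logs -n kai-scheduler "$pod" --tail=100; done
#
# To see which labels the pods actually carry:
# kubectl get pods -n kai-scheduler --show-labels
```

Matching on the name avoids guessing at labels, at the cost of also matching any other pod whose name shares the prefix.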
GPUs aren’t recognized
$ kubectl describe node | grep -E "Capacity|Allocatable|nvidia.com/gpu"
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
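The grep above only shows that no `nvidia.com/gpu` line appears anywhere. A per-node view may be easier to read. The sketch below assumes the allocatable GPU count is printed as a second column, for example via kubectl's custom-columns output; `count_gpu_nodes` is a hypothetical helper, not part of eidos or kubectl.

```shell
# count_gpu_nodes
# Reads "NAME GPU" rows (with a header line) on stdin and prints how many
# nodes report a non-zero GPU count. kubectl prints <none> for nodes
# that do not advertise the resource. Hypothetical helper.
count_gpu_nodes() {
  awk 'NR > 1 && $2 != "<none>" && $2 + 0 > 0 { n++ } END { print n + 0 }'
}

# Against a live cluster (left commented out, needs kubectl access):
# kubectl get nodes \
#   -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
#   count_gpu_nodes
#
# To check that the selector passed to `eidos bundle` matches any node:
# kubectl get nodes -l nodeGroup=gpu-worker
```

For the cluster in this report the expected count would be 2, matching the two GPU nodes, once the GPU operator has registered the devices.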
Expected Behavior
Deploy script should result in a fully configured cluster
Actual Behavior
Deployment script exits with the error above
Environment
- Eidos version (CLI eidos version, API image tag, or commit SHA): v0.7.3-next (commit: 4defc33, date: 2026-02-20T12:51:37Z)
- Install method (release binary / build from source / container image): build
- Platform (eks/gke/aks/self-managed): eks
- Kubernetes version: v1.34
- OS (ubuntu/cos/other) + version: Ubuntu 24.04
- Kernel version: 6.8
- GPU type (h100/gb200/a100/l40/other): h100
- Workload intent (training/inference): training
Command / Request Used
No response
Logs / Error Output
Additional Context
No response