
[Bug]: CUJ1 Regression, KAI scheduler fails to install #165

@mchmarny

Description

Prerequisites

  • I searched existing issues and found no duplicates
  • I can reproduce this issue consistently
  • This is not a security vulnerability (use Security Advisories instead)

Bug Description

When executing ./deploy.sh, the deployment fails with this error:

Applying manifests for gpu-operator...
configmap/dcgm-exporter created
Installing kai-scheduler (kai-scheduler)...
Release "kai-scheduler" does not exist. Installing it now.
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.12.14
Digest: sha256:97d8f439f2432c42e996027bbbe15d5131eaa9b69aa803c5b60ea219562ac3e4
Error: resource not ready, name: default, kind: SchedulingShard, status: InProgress
context deadline exceeded

Impact

Blocking (cannot proceed)

Component

CLI (eidos)

Regression?

Yes, this worked before (please specify version below)

Steps to Reproduce

Starting with a cluster with 3 system nodes, 2 GPU nodes, and 1 CPU node:

$ k get nodes -o wide
NAME                           STATUS   ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-0-11-191.ec2.internal    Ready    <none>   102m   v1.34.3   10.0.11.191    <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28
ip-10-0-131-78.ec2.internal    Ready    <none>   102m   v1.32.5   10.0.131.78    <none>        Ubuntu 22.04.5 LTS   6.8.0-1028-aws    containerd://1.7.27
ip-10-0-207-134.ec2.internal   Ready    <none>   102m   v1.32.5   10.0.207.134   <none>        Ubuntu 22.04.5 LTS   6.8.0-1028-aws    containerd://1.7.27
ip-10-0-222-107.ec2.internal   Ready    <none>   102m   v1.34.3   10.0.222.107   <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28
ip-10-0-4-118.ec2.internal     Ready    <none>   102m   v1.34.3   10.0.4.118     <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28
ip-10-0-7-217.ec2.internal     Ready    <none>   102m   v1.34.3   10.0.7.217     <none>        Ubuntu 24.04.4 LTS   6.17.0-1007-aws   containerd://1.7.28

Pods prior to installation:

$ k get pods -A
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
kube-system   aws-node-2ck2s             2/2     Running   0          4m50s
kube-system   aws-node-7tmjt             2/2     Running   0          5m20s
kube-system   aws-node-djccv             2/2     Running   0          4m53s
kube-system   aws-node-f4n96             2/2     Running   0          4m56s
kube-system   aws-node-hd2tt             2/2     Running   0          5m2s
kube-system   aws-node-r66w4             2/2     Running   0          4m58s
kube-system   coredns-58d8fddfd7-2vb8w   1/1     Running   0          25m
kube-system   coredns-58d8fddfd7-6grj9   1/1     Running   0          25m
kube-system   kube-proxy-4fwpr           1/1     Running   0          24m
kube-system   kube-proxy-7vngs           1/1     Running   0          23m
kube-system   kube-proxy-88wfr           1/1     Running   0          23m
kube-system   kube-proxy-cbfrc           1/1     Running   0          24m
kube-system   kube-proxy-hd897           1/1     Running   0          24m
kube-system   kube-proxy-tt7vb           1/1     Running   0          24m

Connect to cluster:

aws eks update-kubeconfig --region us-east-1 --name aicr-demo --alias aicr-demo

Gen recipe:

eidos recipe \
  --service eks \
  --accelerator h100 \
  --intent training \
  --os ubuntu \
  --platform kubeflow \
  --output recipe.yaml

Gen bundle:

eidos bundle \
  --recipe recipe.yaml \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
  --output bundle

Deploy:

cd ./bundle && chmod +x deploy.sh && ./deploy.sh

Output:

Deploying Cloud Native Stack components...
Installing aws-ebs-csi-driver (kube-system)...
Release "aws-ebs-csi-driver" does not exist. Installing it now.
I0220 05:28:45.656395   91946 warnings.go:107] "Warning: spec.template.spec.containers[1].ports[0]: duplicate port name \"healthz\" with spec.template.spec.containers[0].ports[0], services and probes that select ports by name will use spec.template.spec.containers[0].ports[0]"
NAME: aws-ebs-csi-driver
LAST DEPLOYED: Fri Feb 20 05:28:38 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
To verify that aws-ebs-csi-driver has started, run:

    kubectl get pod -n kube-system -l "app.kubernetes.io/name=aws-ebs-csi-driver,app.kubernetes.io/instance=aws-ebs-csi-driver"

The "a1CompatibilityDaemonSet" parameter has been removed. For more information see the EBS CSI Helm Chart changelog:
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/charts/aws-ebs-csi-driver/CHANGELOG.md#2550
Installing aws-efa (kube-system)...
Release "aws-efa" does not exist. Installing it now.
NAME: aws-efa
LAST DEPLOYED: Fri Feb 20 05:29:08 2026
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
EFA device plugin is installed, it can be requested as `vpc.amazonaws.com/efa` resource.
Installing cert-manager (cert-manager)...
Release "cert-manager" does not exist. Installing it now.
NAME: cert-manager
LAST DEPLOYED: Fri Feb 20 05:29:13 2026
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
⚠️  WARNING: `installCRDs` is deprecated, use `crds.enabled` instead.
cert-manager v1.17.2 has been deployed successfully!

In order to begin issuing certificates, you will need to set up a ClusterIssuer
or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).

More information on the different types of issuers and how to configure them
can be found in our documentation:

https://cert-manager.io/docs/configuration/

For information on how to configure cert-manager to automatically provision
Certificates for Ingress resources, take a look at the `ingress-shim`
documentation:

https://cert-manager.io/docs/usage/ingress/
Installing gpu-operator (gpu-operator)...
Release "gpu-operator" does not exist. Installing it now.
I0220 05:29:58.356053   93092 warnings.go:107] "Warning: spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is deprecated, use \"node-role.kubernetes.io/control-plane\" instead"
NAME: gpu-operator
LAST DEPLOYED: Fri Feb 20 05:29:54 2026
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
Applying manifests for gpu-operator...
configmap/dcgm-exporter created
Installing kai-scheduler (kai-scheduler)...
Release "kai-scheduler" does not exist. Installing it now.
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.12.14
Digest: sha256:97d8f439f2432c42e996027bbbe15d5131eaa9b69aa803c5b60ea219562ac3e4
Error: resource not ready, name: default, kind: SchedulingShard, status: InProgress
context deadline exceeded

Debug:

$ k get pods -A
NAMESPACE       NAME                                             READY   STATUS    RESTARTS   AGE
cert-manager    cert-manager-85888c5d66-qfljj                    1/1     Running   0          22m
cert-manager    cert-manager-cainjector-7476657f99-sltcp         1/1     Running   0          22m
cert-manager    cert-manager-webhook-6655cc77b8-n7psl            1/1     Running   0          22m
gpu-operator    gpu-operator-dc849bdc7-mn5vh                     1/1     Running   0          22m
gpu-operator    node-feature-discovery-gc-66bb6c8796-9bl4s       1/1     Running   0          22m
gpu-operator    node-feature-discovery-master-78d8f6d5b6-prm9v   1/1     Running   0          22m
kai-scheduler   admission-669878d9d8-hbbtf                       1/1     Running   0          21m
kai-scheduler   binder-6d45cf7c89-wbs2x                          1/1     Running   0          21m
kai-scheduler   kai-operator-54df58c759-78qz4                    1/1     Running   0          21m
kai-scheduler   kai-scheduler-default-786b65f669-nf2vb           1/1     Running   0          21m
kai-scheduler   pod-grouper-5d5c88b6fb-n66rl                     1/1     Running   0          21m
kai-scheduler   podgroup-controller-56947478b-524hr              1/1     Running   0          21m
kai-scheduler   queue-controller-5f5b6895b6-gq5qs                1/1     Running   0          21m
kube-system     aws-node-2ck2s                                   2/2     Running   0          86m
kube-system     aws-node-7tmjt                                   2/2     Running   0          86m
kube-system     aws-node-djccv                                   2/2     Running   0          86m
kube-system     aws-node-f4n96                                   2/2     Running   0          86m
kube-system     aws-node-hd2tt                                   2/2     Running   0          86m
kube-system     aws-node-r66w4                                   2/2     Running   0          86m
kube-system     coredns-58d8fddfd7-2vb8w                         1/1     Running   0          106m
kube-system     coredns-58d8fddfd7-6grj9                         1/1     Running   0          106m
kube-system     ebs-csi-controller-5d6f58b85d-2lc45              5/5     Running   0          23m
kube-system     ebs-csi-controller-5d6f58b85d-lqjwt              5/5     Running   0          23m
kube-system     ebs-csi-node-d7rbm                               3/3     Running   0          23m
kube-system     ebs-csi-node-lwksb                               3/3     Running   0          23m
kube-system     kube-proxy-4fwpr                                 1/1     Running   0          105m
kube-system     kube-proxy-7vngs                                 1/1     Running   0          105m
kube-system     kube-proxy-88wfr                                 1/1     Running   0          105m
kube-system     kube-proxy-cbfrc                                 1/1     Running   0          105m
kube-system     kube-proxy-hd897                                 1/1     Running   0          105m
kube-system     kube-proxy-tt7vb                                 1/1     Running   0          105m

Checking on the kai-scheduler and gpu-operator pods:

$ k get pods -A | grep -E "kai|gpu-operator"
gpu-operator    gpu-operator-dc849bdc7-mn5vh                     1/1     Running   0          24m
gpu-operator    node-feature-discovery-gc-66bb6c8796-9bl4s       1/1     Running   0          24m
gpu-operator    node-feature-discovery-master-78d8f6d5b6-prm9v   1/1     Running   0          24m
kai-scheduler   admission-669878d9d8-hbbtf                       1/1     Running   0          23m
kai-scheduler   binder-6d45cf7c89-wbs2x                          1/1     Running   0          23m
kai-scheduler   kai-operator-54df58c759-78qz4                    1/1     Running   0          23m
kai-scheduler   kai-scheduler-default-786b65f669-nf2vb           1/1     Running   0          23m
kai-scheduler   pod-grouper-5d5c88b6fb-n66rl                     1/1     Running   0          23m
kai-scheduler   podgroup-controller-56947478b-524hr              1/1     Running   0          23m
kai-scheduler   queue-controller-5f5b6895b6-gq5qs                1/1     Running   0          23m

Helm:

$ helm list -A
NAME              	NAMESPACE    	REVISION	UPDATED                             	STATUS  	CHART                           	APP VERSION
aws-ebs-csi-driver	kube-system  	1       	2026-02-20 05:28:38.765555 -0800 PST	deployed	aws-ebs-csi-driver-2.55.0       	1.55.0
aws-efa           	kube-system  	1       	2026-02-20 05:29:08.283708 -0800 PST	deployed	aws-efa-k8s-device-plugin-v0.5.3	v0.5.3
cert-manager      	cert-manager 	1       	2026-02-20 05:29:13.790312 -0800 PST	deployed	cert-manager-v1.17.2            	v1.17.2
gpu-operator      	gpu-operator 	1       	2026-02-20 05:29:54.231046 -0800 PST	deployed	gpu-operator-v25.10.1           	v25.10.1
kai-scheduler     	kai-scheduler	1       	2026-02-20 05:30:26.743182 -0800 PST	failed  	kai-scheduler-v0.12.14          	v0.12.14
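
The release is marked failed only because Helm's readiness wait on the SchedulingShard timed out. One hedged next step (assuming deploy.sh installs the chart from the OCI registry shown in the log, and ignoring any custom values it may pass) is to retry the install with a longer timeout, to distinguish a slow reconcile from a hung one:

```shell
# Hypothetical retry with a longer wait; any values deploy.sh passes are omitted here.
helm upgrade --install kai-scheduler \
  oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler \
  --version v0.12.14 \
  --namespace kai-scheduler \
  --wait --timeout 15m
```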

Scheduling shards:

$ kubectl get schedulingshards -A
NAME      AGE
default   27m
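
The shard exists but stays InProgress; its status conditions may say why. A possible check (the resource appears to be cluster-scoped, since `kubectl get schedulingshards -A` prints no namespace column):

```shell
# Inspect the SchedulingShard's status and conditions for the stuck reconcile
kubectl describe schedulingshard default
kubectl get schedulingshard default -o yaml
```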

No logs from the kai-operator label selector:

$ kubectl logs -n kai-scheduler -l app.kubernetes.io/name=kai-operator
No resources found in kai-scheduler namespace.
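
Since the `app.kubernetes.io/name=kai-operator` selector matches no pods even though the kai-operator pod is Running, the chart may label its pods differently. A sketch of how to find the right selector and tail the operator directly (assuming the Deployment is named `kai-operator`, matching the pod name above):

```shell
# List the labels actually set on the kai-scheduler pods
kubectl get pods -n kai-scheduler --show-labels
# Tail the operator by Deployment name instead of by label
kubectl logs -n kai-scheduler deploy/kai-operator --tail=100
```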

GPUs aren't recognized (no nvidia.com/gpu capacity or allocatable on any node):

$ kubectl describe node | grep -E "Capacity|Allocatable|nvidia.com/gpu"
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
Capacity:
Allocatable:
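
The empty Capacity/Allocatable output, together with the pod list above (only the gpu-operator controller and NFD master/gc are running, with no driver or device-plugin daemonset pods), suggests the operator never created its operands on the GPU nodes. Two hedged checks, reusing the `nodeGroup=gpu-worker` selector from the bundle command (the NFD PCI label below is the standard NVIDIA vendor-ID label, assumed here to be what gpu-operator keys on):

```shell
# Did the GPU nodes get the expected selector and NFD labels?
kubectl get nodes -l nodeGroup=gpu-worker
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
# Were any gpu-operator operands (driver, toolkit, device-plugin) scheduled?
kubectl get pods -n gpu-operator -o wide
```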

Expected Behavior

The deploy script should result in a fully configured cluster.

Actual Behavior

The deployment script exits with the error shown above.

Environment

  • Eidos version (CLI eidos version, API image tag, or commit SHA): v0.7.3-next (commit: 4defc33, date: 2026-02-20T12:51:37Z)
  • Install method (release binary / build from source / container image): build
  • Platform (eks/gke/aks/self-managed): eks
  • Kubernetes version: v1.34
  • OS (ubuntu/cos/other) + version: Ubuntu 24.04
  • Kernel version: 6.8
  • GPU type (h100/gb200/a100/l40/other): h100
  • Workload intent (training/inference): training

Command / Request Used

No response

Logs / Error Output

Additional Context

No response
