Description
I'm trying to install Charmed Kubernetes with NVIDIA GPU support on an Amazon EC2 instance (g5.xlarge), using the local (localhost) cloud:
sudo snap install juju --classic
juju add-credential localhost
juju clouds
juju bootstrap
juju add-model k8s
juju deploy charmed-kubernetes
juju config calico ignore-loose-rpf=true
However, the deployment still has not finished after more than 3 hours:
ubuntu@ip-10-10-1-38:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
k8s localhost-localhost localhost/localhost 3.3.1 unsupported 09:31:30Z
App Version Status Scale Charm Channel Rev Exposed Message
calico 3.21.4 active 5 calico 1.27/stable 87 no Calico is active
containerd blocked 5 containerd 1.27/stable 65 no containerd resource binary containerd-stress failed a version check
easyrsa 3.0.1 active 1 easyrsa 1.27/stable 42 no Certificate Authority connected.
etcd 3.4.22 active 3 etcd 1.27/stable 742 no Healthy with 3 known peers
kubeapi-load-balancer 1.18.0 active 1 kubeapi-load-balancer 1.27/stable 79 yes Loadbalancer ready.
kubernetes-control-plane 1.27.10 waiting 2 kubernetes-control-plane 1.27/stable 274 no Waiting for 4 kube-system pods to start
kubernetes-worker 1.27.10 waiting 3 kubernetes-worker 1.27/stable 112 yes Waiting for kubelet to start.
Unit Workload Agent Machine Public address Ports Message
easyrsa/0* active idle 0 10.132.163.17 Certificate Authority connected.
etcd/0* active idle 1 10.132.163.184 2379/tcp Healthy with 3 known peers
etcd/1 active idle 2 10.132.163.135 2379/tcp Healthy with 3 known peers
etcd/2 active idle 3 10.132.163.233 2379/tcp Healthy with 3 known peers
kubeapi-load-balancer/0* active idle 4 10.132.163.33 443,6443/tcp Loadbalancer ready.
kubernetes-control-plane/0 waiting idle 5 10.132.163.119 6443/tcp Waiting for 4 kube-system pods to start
calico/3 active idle 10.132.163.119 Calico is active
containerd/3 blocked idle 10.132.163.119 containerd resource binary containerd-stress failed a version check
kubernetes-control-plane/1* waiting idle 6 10.132.163.146 6443/tcp Waiting for 4 kube-system pods to start
calico/4 active idle 10.132.163.146 Calico is active
containerd/4 blocked idle 10.132.163.146 containerd resource binary containerd-stress failed a version check
kubernetes-worker/0* waiting idle 7 10.132.163.121 80,443/tcp Waiting for kubelet to start.
calico/2 active idle 10.132.163.121 Calico is active
containerd/2 blocked idle 10.132.163.121 containerd resource binary containerd-stress failed a version check
kubernetes-worker/1 waiting idle 8 10.132.163.243 80,443/tcp Waiting for kubelet to start.
calico/0* active idle 10.132.163.243 Calico is active
containerd/0* blocked idle 10.132.163.243 containerd resource binary containerd-stress failed a version check
kubernetes-worker/2 waiting idle 9 10.132.163.140 80,443/tcp Waiting for kubelet to start.
calico/1 active idle 10.132.163.140 Calico is active
containerd/1 blocked idle 10.132.163.140 containerd resource binary containerd-stress failed a version check
Machine State Address Inst id Base AZ Message
0 started 10.132.163.17 juju-84dc78-0 ubuntu@22.04 Running
1 started 10.132.163.184 juju-84dc78-1 ubuntu@22.04 Running
2 started 10.132.163.135 juju-84dc78-2 ubuntu@22.04 Running
3 started 10.132.163.233 juju-84dc78-3 ubuntu@22.04 Running
4 started 10.132.163.33 juju-84dc78-4 ubuntu@22.04 Running
5 started 10.132.163.119 juju-84dc78-5 ubuntu@22.04 Running
6 started 10.132.163.146 juju-84dc78-6 ubuntu@22.04 Running
7 started 10.132.163.121 juju-84dc78-7 ubuntu@22.04 Running
8 started 10.132.163.243 juju-84dc78-8 ubuntu@22.04 Running
9 started 10.132.163.140 juju-84dc78-9 ubuntu@22.04 Running
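To make the stuck units easier to spot in a long status dump, the unit rows can be filtered by their workload column. This is only a sketch: `non_active_units` is a hypothetical helper name, and it assumes the plain tabular layout shown above, where unit rows have a `/` in the first column and the workload state in the second.

```shell
#!/bin/sh
# Hypothetical helper: print "unit workload" for every unit row whose
# workload state is not "active". Reads `juju status` text on stdin.
non_active_units() {
    awk '
        $1 == "Unit"    { in_units = 1; next }   # header of the Unit table
        $1 == "Machine" { in_units = 0 }         # Unit table has ended
        in_units && $1 ~ /\// && $2 != "active" {
            sub(/\*$/, "", $1)                   # drop the leader marker
            print $1, $2
        }
    '
}

# Sample rows copied from the status output above.
sample='Unit Workload Agent Machine Public address Ports Message
easyrsa/0* active idle 0 10.132.163.17 Certificate Authority connected.
kubernetes-control-plane/0 waiting idle 5 10.132.163.119 6443/tcp Waiting for 4 kube-system pods to start
containerd/3 blocked idle 10.132.163.119 containerd resource binary containerd-stress failed a version check
Machine State Address Inst id Base AZ Message'

printf '%s\n' "$sample" | non_active_units
```

Against a live model the same function could be fed with `juju status | non_active_units`.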
The kubernetes-control-plane status message keeps alternating between 'Restarting snap.kubelet.daemon service' and 'Waiting for 4 kube-system pods to start'.
Likewise, the containerd status message keeps alternating between 'Unpacking containerd resource' and 'containerd resource binary containerd-stress failed a version check'.
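The logs behind these alternating messages could be collected along the following lines. This is a sketch under assumptions, not a verified procedure: `debug_log_cmd` and `kubelet_journal_cmd` are hypothetical helper names, the unit names are taken from the status output above, and the helpers only print the commands so they can be reviewed before being run. The flags used (`--replay`, `--no-tail`, `--include` for `juju debug-log`, and `-u`, `--no-pager`, `-n` for `journalctl`) are standard options as far as I know, but they have not been tried on this deployment.

```shell
#!/bin/sh
# Print the command that replays the Juju log for one unit.
# Juju log entities use the tag form "unit-<app>-<n>", e.g. unit-containerd-0.
debug_log_cmd() {
    entity="unit-$(printf '%s' "$1" | tr '/' '-')"
    echo "juju debug-log --replay --no-tail --include $entity"
}

# Print the command that shows the last kubelet journal entries on the
# machine hosting the given unit (service name taken from the status message).
kubelet_journal_cmd() {
    echo "juju ssh $1 -- sudo journalctl -u snap.kubelet.daemon --no-pager -n 200"
}

debug_log_cmd containerd/0
kubelet_journal_cmd kubernetes-worker/0
```

Once the printed commands look right, they can be run by hand or by piping the script's output to `sh`.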
The following software was installed on the instance before starting the installation process:
NVIDIA GPU Driver:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
NVIDIA CUDA:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
I tried both the 1.28/stable and 1.27/stable channels, but the symptoms were almost the same.
How can I resolve this problem?