Charmed Kubernetes installation with GPU on localhost never finishes #830

@iiot-architect

I'm trying to install Charmed Kubernetes with an NVIDIA GPU on an Amazon EC2 instance (g5.xlarge), using the localhost cloud:

sudo snap install juju --classic
juju add-credential localhost
juju clouds
juju bootstrap
juju add-model k8s
juju deploy charmed-kubernetes
juju config calico ignore-loose-rpf=true

However, the deployment still has not finished after more than 3 hours:

ubuntu@ip-10-10-1-38:~$ juju status
Model  Controller           Cloud/Region         Version  SLA          Timestamp
k8s    localhost-localhost  localhost/localhost  3.3.1    unsupported  09:31:30Z

App                       Version  Status   Scale  Charm                     Channel      Rev  Exposed  Message
calico                    3.21.4   active       5  calico                    1.27/stable   87  no       Calico is active
containerd                         blocked      5  containerd                1.27/stable   65  no       containerd resource binary containerd-stress failed a version check
easyrsa                   3.0.1    active       1  easyrsa                   1.27/stable   42  no       Certificate Authority connected.
etcd                      3.4.22   active       3  etcd                      1.27/stable  742  no       Healthy with 3 known peers
kubeapi-load-balancer     1.18.0   active       1  kubeapi-load-balancer     1.27/stable   79  yes      Loadbalancer ready.
kubernetes-control-plane  1.27.10  waiting      2  kubernetes-control-plane  1.27/stable  274  no       Waiting for 4 kube-system pods to start
kubernetes-worker         1.27.10  waiting      3  kubernetes-worker         1.27/stable  112  yes      Waiting for kubelet to start.

Unit                         Workload  Agent  Machine  Public address  Ports         Message
easyrsa/0*                   active    idle   0        10.132.163.17                 Certificate Authority connected.
etcd/0*                      active    idle   1        10.132.163.184  2379/tcp      Healthy with 3 known peers
etcd/1                       active    idle   2        10.132.163.135  2379/tcp      Healthy with 3 known peers
etcd/2                       active    idle   3        10.132.163.233  2379/tcp      Healthy with 3 known peers
kubeapi-load-balancer/0*     active    idle   4        10.132.163.33   443,6443/tcp  Loadbalancer ready.
kubernetes-control-plane/0   waiting   idle   5        10.132.163.119  6443/tcp      Waiting for 4 kube-system pods to start
  calico/3                   active    idle            10.132.163.119                Calico is active
  containerd/3               blocked   idle            10.132.163.119                containerd resource binary containerd-stress failed a version check
kubernetes-control-plane/1*  waiting   idle   6        10.132.163.146  6443/tcp      Waiting for 4 kube-system pods to start
  calico/4                   active    idle            10.132.163.146                Calico is active
  containerd/4               blocked   idle            10.132.163.146                containerd resource binary containerd-stress failed a version check
kubernetes-worker/0*         waiting   idle   7        10.132.163.121  80,443/tcp    Waiting for kubelet to start.
  calico/2                   active    idle            10.132.163.121                Calico is active
  containerd/2               blocked   idle            10.132.163.121                containerd resource binary containerd-stress failed a version check
kubernetes-worker/1          waiting   idle   8        10.132.163.243  80,443/tcp    Waiting for kubelet to start.
  calico/0*                  active    idle            10.132.163.243                Calico is active
  containerd/0*              blocked   idle            10.132.163.243                containerd resource binary containerd-stress failed a version check
kubernetes-worker/2          waiting   idle   9        10.132.163.140  80,443/tcp    Waiting for kubelet to start.
  calico/1                   active    idle            10.132.163.140                Calico is active
  containerd/1               blocked   idle            10.132.163.140                containerd resource binary containerd-stress failed a version check

Machine  State    Address         Inst id        Base          AZ  Message
0        started  10.132.163.17   juju-84dc78-0  [email protected]      Running
1        started  10.132.163.184  juju-84dc78-1  [email protected]      Running
2        started  10.132.163.135  juju-84dc78-2  [email protected]      Running
3        started  10.132.163.233  juju-84dc78-3  [email protected]      Running
4        started  10.132.163.33   juju-84dc78-4  [email protected]      Running
5        started  10.132.163.119  juju-84dc78-5  [email protected]      Running
6        started  10.132.163.146  juju-84dc78-6  [email protected]      Running
7        started  10.132.163.121  juju-84dc78-7  [email protected]      Running
8        started  10.132.163.243  juju-84dc78-8  [email protected]      Running
9        started  10.132.163.140  juju-84dc78-9  [email protected]      Running

kubernetes-control-plane keeps alternating between the messages 'Restarting snap.kubelet.daemon service' and 'Waiting for 4 kube-system pods to start'.
Likewise, containerd keeps alternating between 'Unpacking containerd resource' and 'containerd resource binary containerd-stress failed a version check'.
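
Because the messages flip back and forth, it can help to poll only the applications that are not yet active instead of rereading the full table. The following is a minimal sketch, not part of the original report: it assumes the machine-readable output of `juju status --format=json` has an `applications` map whose entries carry `application-status.current` and `application-status.message`, and the names `stuck_apps` and `load_status` are hypothetical helpers.

```python
import json
import subprocess


def stuck_apps(status: dict) -> dict:
    """Return {app_name: (status, message)} for every application not in 'active'."""
    result = {}
    for name, app in status.get("applications", {}).items():
        # Assumed layout: each application has an "application-status" object.
        app_status = app.get("application-status", {})
        current = app_status.get("current", "unknown")
        if current != "active":
            result[name] = (current, app_status.get("message", ""))
    return result


def load_status(model: str = "k8s") -> dict:
    """Run `juju status` for the given model and parse its JSON output."""
    completed = subprocess.run(
        ["juju", "status", "-m", model, "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

On a machine with the juju client, `stuck_apps(load_status())` should list entries such as containerd, kubernetes-control-plane, and kubernetes-worker for the status shown above, together with their current messages.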

The following software was installed on the instance before starting the deployment:

NVIDIA GPU Driver:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
NVIDIA CUDA:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb

I tried both 1.28/stable and 1.27/stable, and the symptoms were almost the same.
How can I resolve this problem?
