mohsenkamini/DCAI

DCAI

Distributed Computing for Artificial Intelligence over k8s. The project is deployed at the SBU site.

Network & Internet setup

The script ./network/.hotspot.sh provides internet connectivity on the hosts.

Open it and add the proper credentials:

vi ./network/.hotspot.sh

Then set up the network on the nodes using Ansible. Start by editing the inventory:

cd ./ansible/
vi inventory

An example inventory could look like this:

[masters]
master1 ansible_host=master1.sbu-dcai.ir

[workers]
worker01 ansible_host=worker01.sbu-dcai.ir
worker02 ansible_host=worker02.sbu-dcai.ir

[others]
registry ansible_host=registry.sbu-dcai.ir
monitoring ansible_host=monitoring.sbu-dcai.ir

[all:vars]
ansible_host_key_checking=false
ansible_user=CHANGEME
ansible_ssh_port=CHANGEME
ansible_become=yes
ansible_ssh_private_key_file=CHANGEME

Finally, run the playbook:

ansible-playbook -i ./inventory ./plays/network-setup.yml
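
Before running the playbook, it can help to confirm that no CHANGEME placeholder is left in the inventory. The following is a minimal sketch of such a check; the function name check_inventory_placeholders is hypothetical and not part of this repo.

```shell
# Hypothetical helper (not part of this repo): fail if the inventory
# still contains CHANGEME placeholders, printing the offending lines.
check_inventory_placeholders() {
    inventory_file="$1"
    if grep -n 'CHANGEME' "$inventory_file"; then
        echo "Replace the CHANGEME values above before running the playbook." >&2
        return 1
    fi
    return 0
}
```

One possible use is to chain it in front of the playbook: check_inventory_placeholders ./inventory && ansible-playbook -i ./inventory ./plays/network-setup.yml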

Kubernetes setup

The initial cluster was created using this repo.

GPU configuration for K8s

Drivers

Before adding a worker node that has a GPU, install the NVIDIA drivers on it using apt. You can read more about that in this article.

Note: only install driver packages without suffixes. This is a valid example: nvidia-driver-470

And here's an invalid one: nvidia-driver-535-server-open
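
The naming rule above can be expressed as a small check. This is only a sketch: the helper name is_plain_nvidia_driver is hypothetical, and it assumes that a valid package name is exactly nvidia-driver- followed by a version number.

```shell
# Hypothetical helper: succeed only for package names of the form
# nvidia-driver-<number>, with no suffix such as -server or -open.
is_plain_nvidia_driver() {
    case "$1" in
        nvidia-driver-*[!0-9]*) return 1 ;;  # a non-digit follows the prefix
        nvidia-driver-?*) return 0 ;;        # the prefix is followed by digits only
        *) return 1 ;;
    esac
}
```

For example, is_plain_nvidia_driver nvidia-driver-470 succeeds, while is_plain_nvidia_driver nvidia-driver-535-server-open fails.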

Finally, test that the driver is working:

nvidia-smi

Make sure to pin the driver version so that apt does not upgrade it:

apt-mark hold nvidia-driver-470

Note: a common issue is that installing this driver also installs gdm3, which puts the server to sleep after an idle timeout. Remove gdm3 and mask the sleep targets:

apt remove gdm3
systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

Node Feature Discovery (NFD)

NFD was installed using this repo.

NVIDIA GPU Operator

These are the commands needed to install Helm and deploy the GPU Operator:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
helm install --debug --wait --generate-name --timeout 50m \
     -n gpu-operator --create-namespace      nvidia/gpu-operator
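
Once the operator pods are running, GPU nodes should advertise the nvidia.com/gpu resource. A quick way to confirm this is a one-shot pod that requests one GPU and runs nvidia-smi. The sketch below only prints such a manifest. The helper name gpu_test_pod_manifest and the pod name gpu-smoke-test are hypothetical, and the image is left to the caller on the assumption that it can run nvidia-smi (for example, a CUDA base image compatible with the installed driver).

```shell
# Hypothetical helper: print a manifest for a one-shot pod that requests
# a single GPU and runs nvidia-smi. The image is supplied by the caller.
gpu_test_pod_manifest() {
    image="$1"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-smoke-test
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

The output can be piped to kubectl apply -f - and the result read with kubectl logs gpu-smoke-test once the pod has completed.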

Sub Admin access

The userObjects directory contains the Kubernetes objects that we created ourselves. Access for a subAdmin is granted through the Roles and ClusterRoles in the subAdmin directory.

Certificates are handled in the certificates directory.
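
When Kubernetes authenticates a user by client certificate, it takes the user name from the certificate's CN field and the group from its O field, and Roles and ClusterRoles are then bound to those names. The sketch below shows how a key and signing request for a sub admin might be generated. It is an assumption about the workflow, not a description of what the certificates directory actually contains, and both helper names are hypothetical.

```shell
# Hypothetical helper: build the certificate subject for a Kubernetes user.
# Kubernetes reads the user name from CN and the group from O.
k8s_cert_subject() {
    printf '/CN=%s/O=%s' "$1" "$2"
}

# Hypothetical helper: create a private key and a signing request for a user.
# The resulting .csr still has to be signed by the cluster CA.
make_user_csr() {
    user="$1"
    group="$2"
    out_dir="$3"
    openssl genrsa -out "${out_dir}/${user}.key" 2048
    openssl req -new -key "${out_dir}/${user}.key" \
        -subj "$(k8s_cert_subject "$user" "$group")" \
        -out "${out_dir}/${user}.csr"
}
```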

Monitoring

You can use this repo.

Cluster Backup

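The backup scripts are in this directory. Currently they only support the output of kubectl get all -A as a backup.

Install the scripts and schedule them with cron. With the entries below, the backup runs daily at 23:48 and the cleanup runs every Friday at 23:49: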

cp ./backup/* /usr/bin/
echo "48 23 * * * root /usr/bin/cluster-get-all-backup.sh /usr/share/backups/get-all/
49 23 * * 5 root /usr/bin/keep-n-backups.sh /usr/share/backups/get-all/ 8 " > /etc/cron.d/backup
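
The retention step can be pictured as follows. This is a hypothetical sketch of what a script like keep-n-backups.sh might do, not a copy of it: it assumes the directory contains only backup files, that file names contain no newlines, and that "newest" means most recently modified.

```shell
# Hypothetical sketch of a retention step; the real keep-n-backups.sh may differ.
# Keeps the N most recently modified files in a directory and deletes the rest.
keep_n_newest() {
    backup_dir="$1"
    keep="$2"
    # ls -t lists newest first; tail skips the first N entries.
    ls -1t "$backup_dir" | tail -n +"$((keep + 1))" | while IFS= read -r name; do
        rm -f -- "${backup_dir}/${name}"
    done
}
```

Called as keep_n_newest /usr/share/backups/get-all/ 8, it would leave the eight newest files in place.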

Private Docker Registry

To make an image available from the private registry, pull it, tag it with the registry address, and push it:

docker pull <image> # pull the image locally
docker image tag <image> registry.mohsenkamini.ir:5000/<image>
docker image push registry.mohsenkamini.ir:5000/<image> # push it to the private reg
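
The three commands above can be wrapped into one step. This is a minimal sketch: the helper names private_ref and mirror_image are hypothetical, and the registry address is the one used in the commands above.

```shell
# Registry address taken from the commands above.
REGISTRY="registry.mohsenkamini.ir:5000"

# Hypothetical helper: print the reference an image gets in the private registry.
private_ref() {
    printf '%s/%s' "$REGISTRY" "$1"
}

# Hypothetical helper: pull, retag, and push an image in one step.
# Stops at the first docker command that fails.
mirror_image() {
    image="$1"
    docker pull "$image" \
        && docker image tag "$image" "$(private_ref "$image")" \
        && docker image push "$(private_ref "$image")"
}
```

The value printed by private_ref is also the name to put in the image field of a pod spec so that the cluster pulls from the private registry.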
