DCAI

Distributed Computing for Artificial Intelligence over Kubernetes (k8s). The project is deployed at the SBU site.

📖 Table of Contents

Network & Internet setup
Kubernetes setup
GPU configuration for K8s
Sub Admin access
Monitoring
Cluster Backup
Private Docker Registry

Network & Internet setup

The script ./network/.hotspot.sh provides internet connectivity on the hosts. Open it and add the proper credentials:

vi ./network/.hotspot.sh

Then set up the network on the nodes using Ansible. First, edit the inventory:

cd ./ansible/
vi inventory

An example inventory could look like this:

[masters]
master1 ansible_host=master1.sbu-dcai.ir

[workers]
worker01 ansible_host=worker01.sbu-dcai.ir
worker02 ansible_host=worker02.sbu-dcai.ir

[others]
registry ansible_host=registry.sbu-dcai.ir
monitoring ansible_host=monitoring.sbu-dcai.ir

[all:vars]
ansible_host_key_checking=false
ansible_user=CHANGEME
ansible_ssh_port=CHANGEME
ansible_become=yes
ansible_ssh_private_key_file=CHANGEME

Finally, run the playbook:

ansible-playbook -i ./inventory ./plays/network-setup.yml
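Before running the playbook, it can help to confirm that Ansible reaches every host in the inventory. The sketch below wraps Ansible's built-in ping module; the function name check_nodes is hypothetical and not part of this repo.

```shell
# Hypothetical helper (not part of this repo): verify connectivity to all
# inventory hosts with Ansible's built-in ping module.
check_nodes() {
    inventory="${1:-./inventory}"   # default matches the path used above
    ansible -i "$inventory" all -m ping
}

# Usage: check_nodes ./inventory
```

If a host fails here, the connection variables in [all:vars] (user, port, key file) are the first thing to check.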

Kubernetes setup

The initial cluster was created using this repo.

GPU configuration for K8s

Drivers

Before adding a worker node that has a GPU, install the NVIDIA drivers using apt. You can read more about this in this article.

Note: only install driver packages whose names have no suffix after the version number. A valid example: nvidia-driver-470

And here's an invalid one: nvidia-driver-535-server-open
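The naming rule above can be checked mechanically. The following is a minimal sketch, assuming a plain driver package is named nvidia-driver- followed only by digits; the function name is_plain_driver is hypothetical.

```shell
# Hypothetical helper: succeed only for package names of the form
# nvidia-driver-<digits>, e.g. nvidia-driver-470.
is_plain_driver() {
    version="${1#nvidia-driver-}"          # strip the expected prefix
    [ "$version" != "$1" ] || return 1     # the prefix was missing
    case "$version" in
        ''|*[!0-9]*) return 1 ;;           # empty, or contains a non-digit
        *)           return 0 ;;
    esac
}

# Usage sketch: list candidate packages and keep only the plain ones.
# apt-cache search --names-only '^nvidia-driver-' | awk '{print $1}' |
#     while read -r pkg; do is_plain_driver "$pkg" && echo "$pkg"; done
```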

Finally, test that the driver is working:

nvidia-smi

Pin the driver version so that apt does not upgrade it automatically:

apt-mark hold nvidia-driver-470
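To confirm that the hold took effect, the output of apt-mark showhold can be checked for the package name. This is a small sketch; the helper name is_held is hypothetical.

```shell
# Hypothetical helper: read `apt-mark showhold` output on stdin and
# succeed if the given package is listed as held (exact line match).
is_held() {
    grep -qxF -- "$1"
}

# Usage: apt-mark showhold | is_held nvidia-driver-470 && echo "held"
```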

Note: a common issue is that installing this driver also pulls in gdm3, which puts the server to sleep after an idle timeout. Remove it and mask the sleep targets:

apt remove gdm3
systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

Node Feature Discovery (NFD)

NFD was installed using this repo.

Nvidia GPU operator

The following commands install Helm, add the NVIDIA chart repository, and deploy the GPU operator:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

helm install --debug --wait --generate-name --timeout 50m \
    -n gpu-operator --create-namespace nvidia/gpu-operator
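Once the operator pods are running (kubectl get pods -n gpu-operator), GPU nodes should advertise an nvidia.com/gpu resource. The sketch below filters a two-column node listing for nodes with at least one GPU; the helper name gpu_nodes is hypothetical, and the custom-columns expression in the usage comment is an assumption about how you choose to query the nodes.

```shell
# Hypothetical helper: read "NAME GPU" rows on stdin (header on line 1)
# and print the nodes that advertise at least one nvidia.com/gpu.
gpu_nodes() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage:
# kubectl get nodes \
#   -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
#   gpu_nodes
```

Nodes without the resource show up as <none> in that column and are skipped by the numeric check.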

Sub Admin access

The userObjects directory contains the Kubernetes objects that we created. Access for a subAdmin is granted through the Roles and ClusterRoles in the subAdmin directory.

Certificates are handled in the certificates directory.
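One way to check what a subAdmin is actually allowed to do is kubectl auth can-i combined with impersonation. This is only a sketch: the helper name can_subadmin is hypothetical, the user and namespace are placeholders, and the caller needs permission to impersonate users (a cluster admin normally has it).

```shell
# Hypothetical helper: ask the API server whether USER may perform VERB on
# RESOURCE in NAMESPACE, using kubectl's impersonation flag (--as).
can_subadmin() {
    user="$1"; namespace="$2"; verb="$3"; resource="$4"
    kubectl auth can-i "$verb" "$resource" --as "$user" -n "$namespace"
}

# Usage (user and namespace are placeholders):
# can_subadmin CHANGEME CHANGEME list pods
```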

Monitoring

You can use this repo.

Cluster Backup

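The backup scripts are in this directory. Currently they only support the output of kubectl get all -A as a backup.

To run them on a schedule, install the scripts and add cron entries (a daily backup at 23:48 and a cleanup every Friday at 23:49):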

cp ./backup/* /usr/bin/
echo "48 23 * * * root /usr/bin/cluster-get-all-backup.sh /usr/share/backups/get-all/
49 23 * * 5 root /usr/bin/keep-n-backups.sh /usr/share/backups/get-all/ 8 " > /etc/cron.d/backup
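For readers who want a feel for the retention step, here is an illustrative sketch of keeping only the newest N files in a backup directory. It is not the repo's keep-n-backups.sh; it assumes the second argument is the number of backups to keep and that file names contain no newlines. The function name keep_n_newest is hypothetical.

```shell
# Illustrative sketch only; NOT the repo's keep-n-backups.sh.
# Keep the N most recently modified files in DIR and delete the rest.
# Assumes file names contain no newlines.
keep_n_newest() {
    dir="$1"; keep="$2"
    ls -1t "$dir" | tail -n +"$((keep + 1))" |
        while IFS= read -r name; do
            rm -f -- "$dir/$name"
        done
}

# Usage: keep_n_newest /usr/share/backups/get-all/ 8
```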

Private Docker Registry

To use images from the private registry, do the following:

docker pull <image> # pull the image locally
docker image tag <image> registry.mohsenkamini.ir:5000/<image>
docker image push registry.mohsenkamini.ir:5000/<image> # push it to the private reg
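The three steps above can be wrapped in a small function when several images need to be mirrored. This is a sketch; the names REGISTRY, registry_ref, and mirror_image are hypothetical, and the default registry address is the one used in the commands above.

```shell
# Hypothetical helpers: pull an image, retag it for the private registry,
# and push it. REGISTRY defaults to the address used above.
REGISTRY="${REGISTRY:-registry.mohsenkamini.ir:5000}"

# Print the private-registry reference for an image name.
registry_ref() {
    printf '%s/%s\n' "$REGISTRY" "$1"
}

# Pull, tag, and push; stops at the first step that fails.
mirror_image() {
    image="$1"
    docker pull "$image" &&
        docker image tag "$image" "$(registry_ref "$image")" &&
        docker image push "$(registry_ref "$image")"
}

# Usage: mirror_image <image>
```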
