Distributed Computing for Artificial Intelligence over k8s. The project is deployed at the SBU site.
- Network & Internet setup
- Kubernetes setup
- GPU configuration for K8s
- Sub Admin access
- Cluster Backup
- Monitoring
- Private Docker Registry
The following script provides internet connectivity on the hosts.
Open it for editing:
vi ./network/.hotspot.sh
Add the proper credentials to it, then set up the network on the nodes using Ansible. Start by editing the inventory:
cd ./ansible/
vi inventory
An example inventory could look like this:
[masters]
master1 ansible_host=master1.sbu-dcai.ir
[workers]
worker01 ansible_host=worker01.sbu-dcai.ir
worker02 ansible_host=worker02.sbu-dcai.ir
[others]
registry ansible_host=registry.sbu-dcai.ir
monitoring ansible_host=monitoring.sbu-dcai.ir
[all:vars]
ansible_host_key_checking=false
ansible_user=CHANGEME
ansible_ssh_port=CHANGEME
ansible_become=yes
ansible_ssh_private_key_file=CHANGEME
And finally:
ansible-playbook -i ./inventory ./plays/network-setup.yml
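Since the example inventory ships with CHANGEME placeholders, it can help to confirm they have all been replaced before running the playbook. The following is a minimal sketch, not part of the repo; the helper name check_inventory is hypothetical.

```shell
# Hypothetical helper: fail if the inventory still contains CHANGEME placeholders.
check_inventory() {
    inventory_file="$1"
    # grep -n prints each offending line with its line number.
    if grep -n 'CHANGEME' "$inventory_file"; then
        echo "Replace the CHANGEME values in $inventory_file before running the playbook" >&2
        return 1
    fi
    return 0
}

# Usage (from ./ansible/):
#   check_inventory ./inventory && ansible-playbook -i ./inventory ./plays/network-setup.yml
```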
The initial cluster was created using this repo.
Before adding a worker node with a GPU, install the NVIDIA drivers using apt. You can read more about that in this article.
Note: only install driver packages without suffixes. This is a valid example:
nvidia-driver-470
And here's an invalid one:
nvidia-driver-535-server-open
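The naming rule above can be checked mechanically before installing. This is only a sketch under the assumption that a valid package name is exactly nvidia-driver- followed by a numeric version; the function name is_plain_driver_package is hypothetical.

```shell
# Hypothetical helper: succeed only for names like nvidia-driver-470,
# i.e. "nvidia-driver-" followed by digits and nothing else.
is_plain_driver_package() {
    version="${1#nvidia-driver-}"
    # If stripping the prefix changed nothing, the name does not start with it.
    [ "$version" != "$1" ] || return 1
    case "$version" in
        ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit such as "-server-open"
        *) return 0 ;;
    esac
}

for pkg in nvidia-driver-470 nvidia-driver-535-server-open; do
    if is_plain_driver_package "$pkg"; then
        echo "$pkg: ok"
    else
        echo "$pkg: rejected"
    fi
done
```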
Finally test if everything is ready:
nvidia-smi
Make sure to pin the nvidia-driver version so that it is not upgraded automatically:
apt-mark hold nvidia-driver-470
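To confirm that the hold took effect, the list printed by apt-mark showhold can be searched for the package. A small sketch follows; the helper name is_held is hypothetical, and it reads the list of held packages from standard input.

```shell
# Hypothetical helper: read held package names on stdin (one per line)
# and succeed if the given package is among them.
is_held() {
    # -x matches whole lines only, -F treats the name as a fixed string, -q prints nothing.
    grep -qxF "$1"
}

# Usage:
#   apt-mark showhold | is_held nvidia-driver-470 && echo "driver version is pinned"
```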
Note: a common issue is that installing this driver also installs gdm3, which causes your server to go to sleep after an idle timeout. So make sure you remove it:
apt remove gdm3
systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
Installed using this repo.
These are the commands needed to install Helm and deploy the GPU operator:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm install --debug --wait --generate-name --timeout 50m \
-n gpu-operator --create-namespace nvidia/gpu-operator
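Once the operator is running, GPU nodes should advertise the nvidia.com/gpu resource. The sketch below filters a node listing down to the nodes that report at least one GPU; the helper name gpu_nodes is hypothetical, and it assumes input lines of the form "NAME GPU", which is what the kubectl command in the usage comment prints.

```shell
# Hypothetical helper: read "NAME GPU" lines on stdin and print only the
# nodes whose GPU column is a positive number. Nodes without the resource
# show "<none>" in that column and are skipped.
gpu_nodes() {
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 " " $2 }'
}

# Usage:
#   kubectl get nodes --no-headers \
#     -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | gpu_nodes
```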
The userObjects directory contains the Kubernetes objects that we created ourselves. Access for a subAdmin is granted through the roles and cluster roles in the subAdmin directory.
The certificates are handled in the certificates directory.
You can use this repo.
The scripts for taking backups are in this directory. Currently they only support saving the output of kubectl get all -A as a backup.
Schedule them to run regularly with cron:
cp ./backup/* /usr/bin/
echo "48 23 * * * root /usr/bin/cluster-get-all-backup.sh /usr/share/backups/get-all/
49 23 * * 5 root /usr/bin/keep-n-backups.sh /usr/share/backups/get-all/ 8 " > /etc/cron.d/backup
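The second cron entry passes a directory and the number 8 to keep-n-backups.sh, which suggests a retention step. The following is only an illustration of how such a rotation could work, not the repo's actual script. The function name keep_n_newest is hypothetical, and it assumes the directory contains only backup files whose names have no newlines.

```shell
# Hypothetical sketch: keep the N most recently modified files in a
# directory and delete the rest.
keep_n_newest() {
    backup_dir="$1"
    keep="$2"
    # ls -1t lists newest first; tail -n +K starts output at line K,
    # so everything after the first $keep entries is removed.
    ls -1t "$backup_dir" | tail -n +"$((keep + 1))" | while IFS= read -r name; do
        rm -f -- "$backup_dir/$name"
    done
}

# Usage:
#   keep_n_newest /usr/share/backups/get-all/ 8
```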
To use images in the private registry, do the following:
docker pull <image> # pull the image locally
docker image tag <image> registry.mohsenkamini.ir:5000/<image>
docker image push registry.mohsenkamini.ir:5000/<image> # push it to the private reg
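When several images need to be mirrored, the three steps above can be wrapped in a function. This is a minimal sketch that reuses the registry address from the commands above; the names REGISTRY, registry_image, and push_to_registry are hypothetical.

```shell
# Registry address taken from the commands above.
REGISTRY="registry.mohsenkamini.ir:5000"

# Hypothetical helper: print the registry-qualified name for an image.
registry_image() {
    printf '%s/%s\n' "$REGISTRY" "$1"
}

# Hypothetical helper: pull, retag, and push one image; stops at the first failure.
push_to_registry() {
    image="$1"
    target="$(registry_image "$image")"
    docker pull "$image" &&
        docker image tag "$image" "$target" &&
        docker image push "$target"
}

# Usage:
#   push_to_registry nginx:1.25
```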