Multi-node training can be performed easily on Gaudi with DeepSpeed for any training script as follows:
python gaudi_spawn.py \
--hostfile path_to_my_hostfile --use_deepspeed \
path_to_my_script.py --args1 --args2 ... --argsN \
--deepspeed path_to_my_deepspeed_config
where --argX is an argument of the script to run.
See the documentation to learn how to set up your Gaudi instances for multi-node runs on premises or on AWS.
We provide two Dockerfiles to help you start your multi-node runs:
- A Dockerfile provided here for multi-node runs on AWS.
- A Dockerfile provided here for multi-node runs using GaudiNIC.
The Dockerfile is based on an image compatible with Ubuntu 22.04 but you can easily adapt it to another OS.
To build the Docker image, run:
docker build -t gaudi_multi_node PATH
where PATH is the path to the folder containing the Dockerfile.
To run a Docker container with the image you just built, execute:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host gaudi_multi_node:latest
For AWS DL1 instances, --privileged must be passed to the docker run command so that EFA interfaces are visible.
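The docker run invocation above differs between AWS DL1 and other setups only by the --privileged flag. The following is a minimal Python sketch of that rule. The helper name build_docker_run_command and the on_aws_dl1 switch are hypothetical and are not part of Optimum Habana; the flags themselves are copied from the command above.

```python
import shlex


def build_docker_run_command(image="gaudi_multi_node:latest", on_aws_dl1=False):
    """Return the `docker run` argument list used in this guide.

    Hypothetical helper for illustration only; every flag is taken
    verbatim from the command shown in the text.
    """
    cmd = [
        "docker", "run", "-it",
        "--runtime=habana",
        "-e", "HABANA_VISIBLE_DEVICES=all",
        "-e", "OMPI_MCA_btl_vader_single_copy_mechanism=none",
        "--cap-add=sys_nice",
        "--net=host",
        "--ipc=host",
    ]
    if on_aws_dl1:
        # On AWS DL1, EFA interfaces are only visible with --privileged.
        cmd.append("--privileged")
    # The image name must come after all options.
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Print a copy-pastable command line for an AWS DL1 node.
    print(shlex.join(build_docker_run_command(on_aws_dl1=True)))
```

Keeping the command as a list rather than a single string avoids quoting mistakes if it is later passed to subprocess.run.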
You will need to copy the id_rsa.pub key from the leader node's Docker container to ~/.ssh/authorized_keys in the Docker container of every other node to enable passwordless SSH:
a. Copy id_rsa.pub to ~/.ssh/authorized_keys on each node:
cat id_rsa.pub > authorized_keys
vi authorized_keys
b. Copy the contents of the leader node's id_rsa.pub key into authorized_keys on the other systems.
Finally, on each system, add all hosts (including itself) to known_hosts. The IP addresses used below are just for illustration:
ssh-keyscan -p 3022 -H 10.10.100.101 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.102 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.103 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.104 >> ~/.ssh/known_hosts
You can check that the SSH port is working with the following steps:
- Run lsof -i inside the Docker container on each node to make sure sshd is up. The output should look similar to this:
  COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
  sshd 35 root 3u IPv4 23262521 0t0 TCP *:3022 (LISTEN)
  sshd 35 root 4u IPv6 23262523 0t0 TCP *:3022 (LISTEN)
  If sshd is not listed, run the following to configure and restart it:
  sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
  sed -i 's/# Port 22/ Port 3022/g' /etc/ssh/ssh_config
  sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
  service ssh restart
- Test SSH from each node to every other node with ssh -p 3022 IP-address to make sure the nodes can communicate with each other.
- Try the gaudi_spawn.py training command with world_size 8 for a few steps to make sure the command works for 8 ranks on each node.
- Start the multi-node run with gaudi_spawn.py in the Docker container on the main node (the node with the first IP address in the hostfile).
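The port check in the steps above can also be scripted. The following is a minimal sketch using only the Python standard library. The function names ssh_port_open and unreachable_nodes are hypothetical, and port 3022 is assumed from the commands above. It only verifies that something accepts TCP connections on the port; it does not check that key-based login works.

```python
import socket


def ssh_port_open(host, port=3022, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_nodes(hosts, port=3022, timeout=3.0):
    """Return the hosts (in input order) whose port could not be reached."""
    return [host for host in hosts if not ssh_port_open(host, port, timeout)]
```

Calling unreachable_nodes with the addresses listed in your hostfile returns an empty list when every node accepts connections on the SSH port. Any address it returns is a node where sshd should be checked with lsof -i as described above.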
DeepSpeed requires a hostfile that lists the address of each node and the number of devices to use on it. You can specify its path with --hostfile. This file should look like this:
ip_1 slots=8
ip_2 slots=8
...
ip_n slots=8
You can find a template here.
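If the node list is kept elsewhere (an inventory, a cluster manager), the hostfile can be generated and sanity-checked rather than written by hand. The following is a minimal sketch with hypothetical helper names format_hostfile and parse_hostfile. It assumes the strict "address slots=N" layout shown above and rejects anything else, which may be stricter than what DeepSpeed itself accepts.

```python
def format_hostfile(addresses, slots=8):
    """Render one `address slots=N` line per node, preserving the given order."""
    if slots < 1:
        raise ValueError("slots must be a positive integer")
    return "".join(f"{address} slots={slots}\n" for address in addresses)


def parse_hostfile(text):
    """Parse hostfile text into a list of (address, slots) tuples.

    Blank lines are skipped; any other line must be `<address> slots=<n>`.
    """
    nodes = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("slots="):
            raise ValueError(
                f"line {lineno}: expected '<address> slots=<n>', got {raw!r}"
            )
        nodes.append((parts[0], int(parts[1][len("slots="):])))
    return nodes


if __name__ == "__main__":
    # Illustrative addresses only; the first entry is the main node.
    text = format_hostfile(["10.10.100.101", "10.10.100.102"])
    print(text, end="")
    print("total devices:", sum(slots for _, slots in parse_hostfile(text)))
```

Order matters here: the first address in the file is the main node from which the run is started.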
If you need to set environment variables for all nodes, you can specify them in a .deepspeed_env file, which should be located either in the directory you are executing from or in your home directory. It is formatted as follows:
env_variable_1_name=value
env_variable_2_name=value
...
You can find an example for GaudiNIC instances here.
Note that the environment variables above refer to /etc/profile.d/habanalabs.sh inside the Docker container and should be set only on the GaudiNIC master node.
You can find an example for AWS instances here.
Note that one should set HCCL_OVER_OFI=1 and LD_LIBRARY_PATH=/root/hccl_ofi_wrapper:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib only on AWS DL1 instances. These should not be used otherwise.
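Because the file is plain NAME=value lines, it can be written from a script, which makes it harder to carry the AWS-only settings over to other setups by accident. The following is a minimal sketch. The names AWS_DL1_ENV, render_deepspeed_env, and write_deepspeed_env are hypothetical, and the two AWS values are the ones quoted in the note above.

```python
from pathlib import Path

# Values quoted from the note above; they apply to AWS DL1 instances only.
AWS_DL1_ENV = {
    "HCCL_OVER_OFI": "1",
    "LD_LIBRARY_PATH": "/root/hccl_ofi_wrapper:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib",
}


def render_deepspeed_env(variables):
    """Render a mapping as `NAME=value` lines, one variable per line."""
    lines = []
    for name, value in variables.items():
        # A name containing '=' or whitespace would produce an ambiguous line.
        if not name or "=" in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid variable name: {name!r}")
        lines.append(f"{name}={value}\n")
    return "".join(lines)


def write_deepspeed_env(variables, directory="."):
    """Write `.deepspeed_env` into `directory` and return the file path."""
    path = Path(directory) / ".deepspeed_env"
    path.write_text(render_deepspeed_env(variables))
    return path
```

On an AWS DL1 setup, write_deepspeed_env(AWS_DL1_ENV) would create the file in the current directory. On other setups, pass only the variables that setup needs.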
- It is strongly recommended to use gradient checkpointing for multi-node runs to get the highest speedups. You can enable it with --gradient_checkpointing in all these examples or with gradient_checkpointing=True in your GaudiTrainingArguments.
- Larger batch sizes should lead to higher speedups.
- Multi-node inference is not recommended and can provide inconsistent results.
- On AWS DL1 instances, run your Docker containers with the --privileged flag so that EFA devices are visible.