All the code for the hands-on exercises can be found in this repository.
To request an account on Zaratan, please join the Slack workspace at the link above and fill out this Google form.
We have pre-built the dependencies required for this tutorial on Zaratan. They are activated automatically when you run the bash scripts.
Model weights and the training dataset have been downloaded to /scratch/zt1/project/sc25/shared/.
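To confirm that you can see these shared files, you can list the directory (path taken from above):

```bash
ls /scratch/zt1/project/sc25/shared/
```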
```bash
CONFIG_FILE=configs/single_gpu.json sbatch train_single.sh
```

Open configs/single_gpu.json and change precision to bf16-mixed.
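If you prefer to make the change from the command line, here is a minimal sketch; it assumes the setting lives under a top-level "precision" key in the JSON file.

```bash
# Flip the precision field in place (assumes a top-level "precision" key).
sed -i 's/"precision": *"[^"]*"/"precision": "bf16-mixed"/' configs/single_gpu.json
```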
Then run:

```bash
CONFIG_FILE=configs/single_gpu.json sbatch train_single.sh
```

Multi-GPU training with DDP:

```bash
CONFIG_FILE=configs/ddp.json sbatch train_multi.sh
```

Multi-GPU training with FSDP:

```bash
CONFIG_FILE=configs/fsdp.json sbatch train_multi.sh
```

Multi-GPU training with AxoNN:

```bash
CONFIG_FILE=configs/axonn.json sbatch train_multi.sh
```

Add more prompts to data/inference/prompts.txt if you want (see the sketch below).
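Assuming prompts.txt holds one prompt per line, appending is enough; for example:

```bash
# Append an extra prompt (assumes one prompt per line in prompts.txt).
echo "The capital of France is" >> data/inference/prompts.txt
```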
Then run:

```bash
CONFIG_FILE=configs/inference_yalis.json sbatch infer_single.sh
```

Open infer.sh and change YALIS_DISABLE_COMPILE from 1 to 0.
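The exact line in infer.sh may differ, but assuming the flag is set as an environment variable there, the edit would look roughly like:

```bash
# Sketch: 0 enables torch compile for the inference engine.
export YALIS_DISABLE_COMPILE=0
```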
Then run:

```bash
CONFIG_FILE=configs/inference_yalis.json sbatch infer_single.sh
```

Open infer.sh and change YALIS_DISABLE_DECODE_CUDAGRAPHS from 1 to 0 (make sure torch compile is also enabled).
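Again assuming both flags are plain environment variables in infer.sh, the edited lines would look roughly like:

```bash
# Sketch: keep torch compile enabled and also enable CUDA graphs for decode.
export YALIS_DISABLE_COMPILE=0
export YALIS_DISABLE_DECODE_CUDAGRAPHS=0
```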
Then run:

```bash
CONFIG_FILE=configs/inference_yalis.json sbatch infer_single.sh
```

Multi-GPU inference:

```bash
CONFIG_FILE=configs/inference_yalis.json sbatch infer_multi.sh
```

Query the vLLM server we set up as follows:
```bash
# Usage: ./llm_request.sh <server_ip> "<prompt>" [max_tokens]
./llm_request.sh <vLLM Server IP> "San Francisco is a" 64
```

Change the prompt and the max_tokens argument to play around with the command.
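If you want to query the server without the helper script, vLLM also exposes an OpenAI-compatible HTTP API; a rough equivalent of the request above, assuming the server listens on the default port 8000, is:

```bash
# Direct request to the vLLM OpenAI-compatible completions endpoint.
# <vLLM Server IP> and <served model name> are placeholders to fill in.
curl http://<vLLM Server IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served model name>", "prompt": "San Francisco is a", "max_tokens": 64}'
```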