How to train/fine-tune the model with multiple GPUs?

I have pulled the code from the [train](https://github.com/graykode/gpt-2-Pytorch/tree/train) branch. Is there a way to train or fine-tune the GPT-2 model with data parallelism across multiple GPUs? Thanks for your help.