This example demonstrates how to run a PyTorch training job across multiple nodes using OSMO with torchrun.
This workflow example contains:
- `train.py`: PyTorch training script that uses torchrun for distributed training on MNIST
- `train.yaml`: Ready-to-use two-node training workflow configuration
- `train_template.yaml`: Configurable multi-node workflow template with customizable parameters
- `osmo_barrier.py`: Synchronization utility for coordinating tasks across multiple nodes
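As a rough illustration of how a script launched by torchrun discovers its place in the job, the sketch below reads the environment variables torchrun exports to every worker (`RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are standard torchrun variables; the exact logic in the repo's `train.py` may differ):

```python
# Minimal sketch: reading the identity torchrun assigns to each worker.
# worker_identity is a hypothetical helper, not a function from the repo.
import os

def worker_identity() -> dict:
    """Return this worker's place in the distributed job, as set by torchrun."""
    return {
        "rank": int(os.environ["RANK"]),              # global rank across all nodes
        "local_rank": int(os.environ["LOCAL_RANK"]),  # GPU index on this node
        "world_size": int(os.environ["WORLD_SIZE"]),  # total number of workers
    }
```

A training script typically uses `local_rank` to pick its GPU (e.g. `torch.cuda.set_device(...)`) before calling `torch.distributed.init_process_group("nccl")`.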
- Access to an OSMO cluster with GPU resources
Download the example files:

```
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/train_template.yaml
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/train.py
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/osmo_barrier.py
```
Submit the workflow:

```
osmo workflow submit train_template.yaml
```
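Before the tasks start torchrun's rendezvous, they need to agree that every node is up. Below is a conceptual sketch of what a cross-node barrier utility like `osmo_barrier.py` might do, using a plain TCP check-in (this is a hypothetical illustration; the actual implementation in the repo may differ):

```python
# Conceptual TCP barrier: rank 0 listens, waits for every other rank to
# connect, then releases all of them at once. Hypothetical sketch only.
import socket

def barrier_server(port: int, world_size: int) -> None:
    """Run on rank 0: block until world_size - 1 peers check in, then release them."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(world_size)
    peers = []
    while len(peers) < world_size - 1:
        conn, _ = srv.accept()
        peers.append(conn)
    for conn in peers:  # every peer has arrived: release them all
        conn.sendall(b"go")
        conn.close()
    srv.close()

def barrier_client(host: str, port: int) -> None:
    """Run on ranks > 0: connect to rank 0 and block until released."""
    with socket.create_connection((host, port)) as s:
        data = b""
        while len(data) < 2:  # read the 2-byte release token
            chunk = s.recv(2 - len(data))
            if not chunk:
                break
            data += chunk
        assert data == b"go"
```

In a workflow like this one, each task would pass through such a barrier before launching torchrun, so that all nodes enter the rendezvous together instead of timing out waiting for slow-starting peers.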