TorchRun: Training on Multiple Nodes

This example demonstrates how to run a PyTorch training job across multiple nodes using OSMO with torchrun.

This workflow example contains:

  • train.py: PyTorch training script that uses torchrun for distributed training on MNIST
  • train.yaml: Ready-to-use two-node training workflow configuration
  • train_template.yaml: Configurable multi-node workflow template with customizable parameters
  • osmo_barrier.py: Synchronization utility for coordinating tasks across multiple nodes
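When launched with torchrun, every worker process discovers its role through environment variables that the launcher exports (RANK, LOCAL_RANK, WORLD_SIZE). A minimal sketch of how a training script such as train.py might read them; the fallback defaults are illustrative so the snippet also runs standalone, and the distributed initialization itself is only described in comments:

```python
import os

def torchrun_env():
    """Read the rendezvous variables that torchrun exports to each worker.

    Under torchrun these are always set; the defaults below only allow the
    sketch to run outside a distributed launch.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }

if __name__ == "__main__":
    cfg = torchrun_env()
    # In an actual training script, each worker would now call
    # torch.distributed.init_process_group(backend="nccl"), pin itself to
    # GPU cfg["local_rank"], and wrap the model in DistributedDataParallel.
    # Those steps are omitted here so the sketch runs without a GPU cluster.
    print(cfg)
```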

Prerequisites

  • Access to an OSMO cluster with GPU resources

Running this workflow

curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/train_template.yaml
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/train.py
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/osmo_barrier.py
osmo workflow submit train_template.yaml