TorchRun: Training on Multiple Nodes

This example demonstrates how to run a PyTorch training job across multiple nodes using OSMO with torchrun.

This workflow example contains:

  • train.py: PyTorch training script that uses torchrun for distributed training on MNIST
  • train.yaml: Ready-to-use two-node training workflow configuration
  • train_template.yaml: Configurable multi-node workflow template with customizable parameters
  • osmo_barrier.py: Synchronization utility for coordinating tasks across multiple nodes
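When launched with torchrun, every worker process discovers its role through environment variables that the launcher exports (RANK, LOCAL_RANK, WORLD_SIZE). A minimal sketch of how a training script such as train.py might read them; the fallback defaults are illustrative so the snippet also runs standalone, and the distributed initialization itself is only described in comments:

```python
import os

def torchrun_env():
    """Read the rendezvous variables that torchrun exports to each worker.

    Under torchrun these are always set; the defaults below only allow the
    sketch to run outside a distributed launch.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }

if __name__ == "__main__":
    cfg = torchrun_env()
    # In an actual training script, each worker would now call
    # torch.distributed.init_process_group(backend="nccl"), pin itself to
    # GPU cfg["local_rank"], and wrap the model in DistributedDataParallel.
    # Those steps are omitted here so the sketch runs without a GPU cluster.
    print(cfg)
```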

Prerequisites

  • Access to an OSMO cluster with GPU resources

Running this workflow

curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/train_template.yaml
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/train.py
curl -O https://raw.githubusercontent.com/NVIDIA/OSMO/main/cookbook/dnn_training/torchrun_multinode/osmo_barrier.py
osmo workflow submit train_template.yaml