HPC-launcher Repository

The HPC launcher repository contains a set of helpful scripts and Python bindings for launching PyTorch (torchrun), LBANN 2.0 (PyTorch-core), or generic scripts on multiple leadership-class HPC systems. There are optimized routines for FLUX, SLURM, and LSF launchers. Additionally, there are optimized environments for systems at known compute centers. Currently there are supported systems at:

LLNL Livermore Computing (LC)
LBL NERSC (Pending)
ORNL OLCF (Pending)
RIKEN (Pending)

There are two main entry points into HPC-Launcher from the CLI: launch and torchrun-hpc. torchrun-hpc is intended as a replacement for torchrun, while launch is a generic interface for launching parallel jobs.

Installation

To install a released version of the package from PyPI, run:

pip install hpc-launcher

Or install directly from GitHub:

pip install git+https://github.com/LBANN/HPC-launcher.git
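
After either install command, it can be useful to confirm which version ended up in the active environment. The following is a minimal sketch using only the Python standard library; it assumes the distribution is registered under the name hpc-launcher, as in the pip command above, and the helper name installed_version is hypothetical (not part of HPC-Launcher).

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "hpc-launcher") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    The default name is an assumption taken from `pip install hpc-launcher`.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("hpc-launcher is not installed in this environment")
    else:
        print(f"hpc-launcher {version} is installed")
```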

Example Usage

Using the launch command to execute a command in parallel:

launch -N1 -n1 hostname

Using the torchrun-hpc command to execute a PyTorch Python file in parallel on two nodes with four processes per node (8 in total):

torchrun-hpc -N2 -n4 file.py [arguments to Python file]
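
When jobs are driven from a Python script rather than typed by hand, the same command line can be assembled programmatically. The sketch below only builds the argument list and computes the total process count; it uses just the -N and -n options shown above. The helper names and the --epochs script argument are hypothetical and exist only for illustration.

```python
import shlex
from typing import List, Sequence


def build_torchrun_hpc_command(
    nodes: int,
    procs_per_node: int,
    script: str,
    script_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a torchrun-hpc argument list (hypothetical helper).

    -N is the number of nodes and -n the number of processes per node,
    matching the usage shown in this README.
    """
    if nodes < 1 or procs_per_node < 1:
        raise ValueError("nodes and procs_per_node must be positive")
    return ["torchrun-hpc", f"-N{nodes}", f"-n{procs_per_node}", script, *script_args]


def total_processes(nodes: int, procs_per_node: int) -> int:
    """Total number of processes across all nodes."""
    return nodes * procs_per_node


# "--epochs 1" stands in for arguments to the user's own Python file.
cmd = build_torchrun_hpc_command(2, 4, "file.py", ["--epochs", "1"])
print(shlex.join(cmd))        # torchrun-hpc -N2 -n4 file.py --epochs 1
print(total_processes(2, 4))  # 8
```

On a system where torchrun-hpc is installed, the resulting list could be passed to subprocess.run(cmd, check=True) to start the job.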

Using HPC-Launcher within existing PyTorch code when explicitly invoking it from the command line (CLI): within the top-level Python file, import hpc_launcher.torch first to ensure that torch is configured per HPC-Launcher's specification.

import hpc_launcher.torch
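
If the same script also needs to run on a machine where HPC-Launcher is not installed (for example, a development workstation), the import can be guarded. This is only a sketch of one possible pattern: the fallback behavior, the HPC_LAUNCHER_ACTIVE flag, and the describe_setup helper are assumptions made for illustration and are not part of HPC-Launcher.

```python
# Attempt the HPC-Launcher import before anything else imports torch,
# so that its configuration is applied first when it is available.
try:
    import hpc_launcher.torch  # noqa: F401  (imported for its side effects)
    HPC_LAUNCHER_ACTIVE = True
except ImportError:
    # Assumption: falling back to an unconfigured torch is acceptable here.
    HPC_LAUNCHER_ACTIVE = False


def describe_setup(active: bool) -> str:
    """Return a short, human-readable note about how torch was set up."""
    if active:
        return "torch configured via hpc_launcher.torch"
    return "hpc_launcher.torch unavailable; using default torch configuration"


print(describe_setup(HPC_LAUNCHER_ACTIVE))
```

Any import of torch in the rest of the file would come after this block, preserving the ordering described above.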

CLI options for HPC-Launcher `launch` and `torchrun-hpc` commands

launch - General purpose HPC job launcher
torchrun-hpc - PyTorch-specific distributed training launcher

LBANN: Livermore Big Artificial Neural Network Toolkit

The Livermore Big Artificial Neural Network toolkit (LBANN) is an open-source, HPC-centric, deep learning training framework that is optimized to compose multiple levels of parallelism.

LBANN provides model-parallel acceleration through domain decomposition to optimize for strong scaling of network training. It also allows for composition of model parallelism with both data parallelism and ensemble training methods for training large neural networks with massive amounts of data. LBANN is able to take advantage of tightly-coupled accelerators, low-latency high-bandwidth networking, and high-bandwidth parallel file systems.

LBANN v2.x is composed of a custom backend LBANN device that is used to provide processor-centric optimizations such as copy-elision for AMD MI300A APUs. Additionally, it is composed of Python, C++, CUDA, and ROCm custom kernels that extend PyTorch 2.4+. Libraries such as DGraph, DistConv, and CheckMate implement key algorithms using the PyTorch 2.x API. Each of these libraries should be both composable and fully separable. The suite of LBANN 2.x optimizations is found in the LBANN GitHub group.

Publications

A list of publications, presentations, and posters is shown here.

Reporting issues

Issues, questions, and bugs can be raised on the GitHub issue tracker.
