00_workload:
  title: Workload
  description: In this module, you’ll learn when to use Ray Train to scale deep learning
    workloads and how to train a Stable Diffusion UNet with PyTorch Lightning. You’ll
    build a simple Parquet-backed PyTorch Dataset/DataLoader and run single-GPU training
    as a baseline before moving to distributed training on a multi-GPU Ray cluster.
  sources:
    - 02b_Intro_Ray_Train_with_PyTorch_Lightning.ipynb
  lessons:
    00_lesson:
      title: 'Introduction to Ray Train: Ray Train + PyTorch Lightning'
      description: Learn when to use Ray Train and how to integrate it with PyTorch
        Lightning to scale model training from a single GPU to a multi-GPU Ray cluster.
        You’ll apply this workflow by training a Stable Diffusion model using distributed
        training with Ray Train.
    01_lesson:
      title: When to use Ray Train
      description: Learn when to use Ray Train to speed up and scale machine learning
        training workloads that are slow or require significant compute. This lesson
        explains the key challenges Ray Train addresses and how its distributed training
        framework helps solve them.
    02_lesson:
      title: Single GPU Training with PyTorch Lightning
      description: In this lesson, you’ll set up single-GPU training for a Stable
        Diffusion UNet using PyTorch Lightning, starting from preprocessed image and
        text latents stored in Parquet. You’ll build a simple custom `Dataset` and
        `DataLoader`, validate batch shapes/dtypes, and define a LightningModule-ready
        UNet configuration for training.
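      # Illustrative sketch of the kind of Parquet-backed Dataset/DataLoader this
      # lesson builds, kept as a comment. The file name `latents.parquet` and the
      # columns `image_latents`/`text_embedding` are assumptions for illustration,
      # not the course's exact schema.
      #
      #   import pandas as pd
      #   import torch
      #   from torch.utils.data import DataLoader, Dataset
      #
      #   class LatentsDataset(Dataset):
      #       """Serves precomputed image/text latents from a Parquet file."""
      #
      #       def __init__(self, path: str):
      #           self.df = pd.read_parquet(path)
      #
      #       def __len__(self) -> int:
      #           return len(self.df)
      #
      #       def __getitem__(self, idx: int):
      #           row = self.df.iloc[idx]
      #           image_latents = torch.tensor(row["image_latents"], dtype=torch.float32)
      #           text_embedding = torch.tensor(row["text_embedding"], dtype=torch.float32)
      #           return image_latents, text_embedding
      #
      #   loader = DataLoader(LatentsDataset("latents.parquet"), batch_size=8, shuffle=True)
      #   images, texts = next(iter(loader))  # sanity-check batch shapes/dtypes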
    03_lesson:
      title: Distributed Training with Ray Train and PyTorch Lightning
      description: Learn how to scale a PyTorch Lightning image-generation training
        loop from a single GPU to multi-GPU Distributed Data Parallel (DDP) training
        using Ray Train. You’ll migrate your code to a Ray Train–compatible training
        function, configure GPU scaling with `ScalingConfig`, and launch distributed
        runs with `TorchTrainer` while managing checkpoints and metrics.
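      # Illustrative sketch of the migration pattern this lesson covers, with a toy
      # LightningModule standing in for the Stable Diffusion UNet. Assumes Ray 2.x
      # (`ray[train]`) and Lightning 2.x; `num_workers=2` and `use_gpu=True` are
      # example values, not requirements of the course.
      #
      #   import lightning.pytorch as pl
      #   import torch
      #   from torch.utils.data import DataLoader, TensorDataset
      #
      #   import ray.train.lightning
      #   from ray.train import ScalingConfig
      #   from ray.train.torch import TorchTrainer
      #
      #   class ToyModule(pl.LightningModule):
      #       def __init__(self):
      #           super().__init__()
      #           self.layer = torch.nn.Linear(4, 1)
      #
      #       def training_step(self, batch, batch_idx):
      #           x, y = batch
      #           return torch.nn.functional.mse_loss(self.layer(x), y)
      #
      #       def configure_optimizers(self):
      #           return torch.optim.Adam(self.parameters(), lr=1e-3)
      #
      #   def train_func():
      #       # Per-worker loop: swap in Ray's DDP strategy, environment plugin, and
      #       # report callback so checkpoints and metrics flow back to the driver.
      #       data = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
      #       trainer = pl.Trainer(
      #           max_epochs=1,
      #           devices="auto",
      #           accelerator="auto",
      #           strategy=ray.train.lightning.RayDDPStrategy(),
      #           plugins=[ray.train.lightning.RayLightningEnvironment()],
      #           callbacks=[ray.train.lightning.RayTrainReportCallback()],
      #       )
      #       trainer = ray.train.lightning.prepare_trainer(trainer)
      #       trainer.fit(ToyModule(), train_dataloaders=DataLoader(data, batch_size=8))
      #
      #   trainer = TorchTrainer(
      #       train_func,
      #       scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
      #   )
      #   result = trainer.fit()  # returns final metrics and the latest checkpoint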
    04_lesson:
      title: Ray Train in Production
      description: Learn how Ray Train is used in real-world production workflows
        through a case study showing how Canva combined Ray Train and Ray Data to
        reduce Stable Diffusion training costs by 3.7x. You’ll see practical patterns
        and outcomes for scaling training efficiently and cost-effectively.