Commit 1a6ab1d

workloads
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
1 parent ee27863 commit 1a6ab1d

File tree

6 files changed (+67, -72 lines)

courses/.gitignore

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# Production metadata files
+scenes.json
+production.json
+timeline.json
+
+# Production base files
+slides.md
+slides.pdf
+lesson.html
+lesson.ipynb
+public/
+
+# Content production folder
+production/
+outputs/

courses/foundations/Observability/course.yaml

Lines changed: 0 additions & 70 deletions
This file was deleted.
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+00_workload:
+  title: Workload
+  description: In this module, you’ll learn when to use Ray Train to scale deep learning
+    workloads and how to train a Stable Diffusion UNet with PyTorch Lightning. You’ll
+    build a simple Parquet-backed PyTorch Dataset/DataLoader and run single-GPU training
+    as a baseline before moving to distributed training on a multi-GPU Ray cluster.
+  sources:
+  - 02b_Intro_Ray_Train_with_PyTorch_Lightning.ipynb
+  lessons:
+    00_lesson:
+      title: 'Introduction to Ray Train: Ray Train + PyTorch Lightning'
+      description: Learn when to use Ray Train and how to integrate it with PyTorch
+        Lightning to scale model training from a single GPU to a multi-GPU Ray cluster.
+        You’ll apply this workflow by training a Stable Diffusion model using distributed
+        training with Ray Train.
+    01_lesson:
+      title: When to use Ray Train
+      description: Learn when to use Ray Train to speed up and scale machine learning
+        training workloads that are slow or require significant compute. This lesson
+        explains the key challenges Ray Train addresses and how its distributed training
+        framework helps solve them.
+    02_lesson:
+      title: Single GPU Training with PyTorch Lightning
+      description: In this lesson, you’ll set up single-GPU training for a Stable
+        Diffusion UNet using PyTorch Lightning, starting from preprocessed image and
+        text latents stored in Parquet. You’ll build a simple custom `Dataset` and
+        `DataLoader`, validate batch shapes/dtypes, and define a LightningModule-ready
+        UNet configuration for training.
+    03_lesson:
+      title: Distributed Training with Ray Train and PyTorch Lightning
+      description: Learn how to scale a PyTorch Lightning image training loop from
+        a single GPU to multi-GPU Distributed Data Parallel using Ray Train. You’ll
+        migrate your code to a Ray Train–compatible training function, configure GPU
+        scaling with `ScalingConfig`, and launch distributed runs with `TorchTrainer`
+        while managing checkpoints and metrics.
+    04_lesson:
+      title: Ray Train in Production
+      description: Learn how Ray Train is used in real-world production workflows
+        through a case study showing how Canva combined Ray Train and Ray Data to
+        reduce Stable Diffusion training costs by 3.7x. You’ll see practical patterns
+        and outcomes for scaling training efficiently and cost-effectively.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+title: Scaling Stable Diffusion Training with Ray Train
+description: Learn when and how to use Ray Train to scale deep learning workloads
+  by training a Stable Diffusion UNet with PyTorch Lightning. You’ll build a Parquet-backed
+  PyTorch Dataset/DataLoader, establish a single-GPU baseline, and then scale the
+  same training job to multi-GPU distributed training on a Ray cluster.
+author: ''
+mediaStorage: ''
+category: workload
+thumbnail: thumbnail.png

courses/workloads/Ray_Data_Batch_Inference/course.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,5 +6,5 @@ description: Learn to run scalable batch inference with Ray Data by loading a Hu
 embeddings across an entire dataset.
 author: ''
 mediaStorage: ''
-category: foundation
+category: workload
 thumbnail: thumbnail.png

courses/workloads/Ray_Serve_Online_Serving/course.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,5 +5,5 @@ description: Deploy a Hugging Face sentiment analysis model as a scalable online
 app and Ray cluster.
 author: ''
 mediaStorage: ''
-category: foundation
+category: workload
 thumbnail: thumbnail.png

0 commit comments
