---
title: 'Deploying Targets on HPC'
teaching: 10
exercises: 2
---
```{R, echo=FALSE}
# Exit sensibly when Slurm isn't installed
if (!nzchar(Sys.which("sbatch"))){
knitr::knit_exit("sbatch was not detected. Likely Slurm is not installed. Exiting.")
}
```
:::::::::::::::::::::::::::::::::::::: questions
- Why would we use HPC to run Targets workflows?
- How can we run Targets workflows on Slurm?
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: objectives
- Be able to run a `targets` workflow on an HPC cluster using Slurm
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: instructor
Episode summary: Show how to run `targets` workflows on an HPC cluster with Slurm
::::::::::::::::::::::::::::::::::::::::::::::::
```{r}
#| label: setup
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
library(targets)
library(tarchetypes)
library(quarto) # don't actually need to load, but put here so renv catches it
source("https://raw.githubusercontent.com/joelnitta/targets-workshop/main/episodes/files/functions.R?token=$(date%20+%s)") # nolint
# Increase width for printing tibbles
options(width = 140)
```
## Advantages of HPC
If your analysis involves computationally intensive or long-running tasks, such as training machine learning models or processing very large amounts of data, a single machine quickly becomes insufficient.
If you are part of an organisation with access to a High Performance Computing (HPC) cluster, you can use Targets to spread the work across the cluster's many machines and scale up your analysis.
This differs from the parallel execution we have learned about so far, which spawns extra R processes on the *same machine* to speed up execution.
## Configuring Targets for Slurm
Fortunately, using HPC is as simple as changing the Targets `controller`.
In this section we will assume that our HPC uses Slurm as its job scheduler, but you can easily use other schedulers such as PBS/TORQUE, Sun Grid Engine (SGE) or LSF.
In the Parallel Processing section, we used the following configuration:
```{R}
library(crew)
tar_option_set(
controller = crew_controller_local(workers = 2)
)
```
To configure this for Slurm, we just swap out the controller with a new one from the `crew.cluster` package:
```{R}
library(crew.cluster)
tar_option_set(
controller = crew_controller_slurm(
workers = 3,
script_lines = "module load R"
)
)
```
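The same pattern applies to other schedulers: `crew.cluster` also provides `crew_controller_sge()`, `crew_controller_pbs()` and `crew_controller_lsf()`. As a rough sketch (the module name is an assumption about your cluster, not part of this lesson), an SGE cluster might use:

```{R, eval=FALSE}
library(crew.cluster)

tar_option_set(
  controller = crew_controller_sge(
    workers = 3,
    script_lines = "module load R" # assumes an environment module named "R"
  )
)
```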
There are a number of options you can pass to `crew_controller_slurm()` to fine-tune the Slurm execution, [which you can find here](https://wlandau.github.io/crew.cluster/reference/crew_controller_slurm.html).
Here we are only using two:
* `workers` sets the number of jobs that are submitted to Slurm to process targets.
* `script_lines` adds lines to the Slurm submit script used by Targets. This is useful for loading Environment Modules and adding `#SBATCH` options, as sketched below.
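For example, `script_lines` accepts a character vector, and each element becomes one line of the generated submit script. The account name and time limit below are hypothetical placeholders, not values from this lesson:

```{R, eval=FALSE}
crew_controller_slurm(
  workers = 3,
  script_lines = c(
    "#SBATCH --account=myproject", # hypothetical Slurm account
    "#SBATCH --time=01:00:00",     # wall-time limit for each worker job
    "module load R"
  )
)
```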
Let's run the modified workflow:
```{R, eval=FALSE}
source("R/packages.R")
source("R/functions.R")
library(crew.cluster)
tar_option_set(
controller = crew_controller_slurm(
workers = 3,
script_lines = "module load R"
)
)
tar_plan(
# Load raw data
tar_file_read(
penguins_data_raw,
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
# Clean data
penguins_data = clean_penguin_data(penguins_data_raw),
# Build models
models = list(
combined_model = lm(
bill_depth_mm ~ bill_length_mm, data = penguins_data),
species_model = lm(
bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
interaction_model = lm(
bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
),
# Get model summaries
tar_target(
model_summaries,
glance_with_mod_name_slow(models),
pattern = map(models)
),
# Get model predictions
tar_target(
model_predictions,
augment_with_mod_name_slow(models),
pattern = map(models)
)
)
```
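As before, we run the pipeline with `tar_make()`; `crew` submits and manages the Slurm worker jobs for us:

```{R, eval=FALSE}
# Run the pipeline; crew submits the Slurm worker jobs automatically
tar_make()
```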
::: challenge
## Increasing Resources
Q: How would you modify your `_targets.R` if your targets needed 200GB of RAM?
::: hint
Check the arguments for [`crew_controller_slurm`](https://wlandau.github.io/crew.cluster/reference/crew_controller_slurm.html#arguments-1).
:::
::: solution
```R
tar_option_set(
controller = crew_controller_slurm(
workers = 3,
script_lines = "module load R",
# Added this
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
)
```
:::
:::
## HPC Workers
Despite what you might expect, `crew` does not submit one Slurm job for each target.
Instead, it uses persistent workers, meaning that you define a pool of workers when configuring the workflow.
In our example above we used 3 workers.
For each worker, `crew` submits a single Slurm job, and these workers will process multiple targets over their lifetime.
We can verify this by listing our recent Slurm jobs with `sacct`; there should be one job per worker, not one per target:
```{bash}
sacct
```
The upside of this approach is that we don't have to work out the minutiae of how long each target takes to build or what resources it needs.
It also means that we don't submit a lot of jobs, which makes our Slurm usage more efficient and easier to monitor.
The downside of this mechanism is that **the resources of the worker have to be sufficient to build each of your targets**.
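If you are concerned about idle workers holding on to their Slurm allocation once there is nothing left for them to build, the controller can be told to retire them. A sketch, assuming the `seconds_idle` argument of `crew_controller_slurm()`:

```{R, eval=FALSE}
crew_controller_slurm(
  workers = 3,
  script_lines = "module load R",
  seconds_idle = 60 # a worker exits after sitting idle for 60 seconds
)
```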
::: challenge
## Choosing a Worker
Q: Say we have two targets. One uses 100 GB of RAM and 1 CPU, and the other needs 10 GB of RAM and 8 CPUs to run a multi-threaded function. What worker configuration do we use?
::: solution
With a single worker configuration, we need to provision for the maximum of each resource across all of our targets: here that is 100 GB of RAM and 8 CPUs.
To do this we might use a controller a bit like the one below.
Note that `slurm_memory_gigabytes_per_cpu` is requested per CPU, so roughly 13 GB per CPU across 8 CPUs covers the 100 GB target:
```{R, results="hide"}
crew_controller_slurm(
name = "cpu_worker",
workers = 3,
script_lines = "
#SBATCH --cpus-per-task=8
module load R",
  slurm_memory_gigabytes_per_cpu = 13 # 13 GB x 8 CPUs is roughly the 100 GB we need
)
```
:::
:::
## Heterogeneous Workers
In some cases we may prefer heterogeneous workers, especially if some of our targets need a GPU and others need a CPU.
To do this, we first define each worker configuration, giving each one a `name` argument in `crew_controller_slurm()`.
Note that this time we aren't passing the controller directly into `tar_option_set()`:
```{R, results="hide"}
library(crew.cluster)
crew_controller_slurm(
name = "cpu_worker",
workers = 3,
script_lines = "module load R",
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
```
Then we specify this controller by name in each target definition:
```{R, results="hide"}
tar_target(
name = cpu_task,
command = run_model2(data),
resources = tar_resources(
crew = tar_resources_crew(controller = "cpu_worker")
)
)
```
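For `targets` to route work to these named controllers, they still need to be registered with the pipeline. A sketch of one way to do this, assuming the controllers are combined with `crew::crew_controller_group()` and the group is passed to `tar_option_set()`:

```{R, eval=FALSE}
library(crew)
library(crew.cluster)

cpu_worker <- crew_controller_slurm(
  name = "cpu_worker",
  workers = 3,
  script_lines = "module load R"
)

# Register every named controller; targets that request a controller via
# tar_resources_crew() are sent to the matching worker configuration.
tar_option_set(
  controller = crew_controller_group(cpu_worker)
)
```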
::: challenge
## Mixing GPU and CPU targets
Q: Say we have the following targets workflow. How would we modify it so that `gpu_hardware` is only built in a GPU Slurm job?
```{R, eval=FALSE}
graphics_devices <- function(){
system2("lshw", c("-class", "display"), stdout=TRUE, stderr=FALSE)
}
tar_plan(
tar_target(
cpu_hardware,
graphics_devices()
),
tar_target(
gpu_hardware,
graphics_devices()
)
)
```
::: hint
You will need to define two different crew controllers.
:::
::: solution
```R
graphics_devices <- function(){
system2("lshw", c("-class", "display"), stdout=TRUE, stderr=FALSE)
}
library(crew.cluster)
crew_controller_slurm(
name = "cpu_worker"
workers = 3,
script_lines = "module load R",
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
crew_controller_slurm(
name = "gpu_worker"
workers = 3,
script_lines = "#SBATCH --gres=gpu:1
module load R",
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
tar_plan(
tar_target(
cpu_hardware,
graphics_devices(),
resources = tar_resources(
crew = tar_resources_crew(controller = "cpu_worker")
)
),
tar_target(
gpu_hardware,
graphics_devices(),
resources = tar_resources(
crew = tar_resources_crew(controller = "gpu_worker")
)
)
)
```
:::
:::
::::::::::::::::::::::::::::::::::::: keypoints
- `crew.cluster::crew_controller_slurm()` is used to configure a workflow to use Slurm
- Crew uses persistent workers on HPC, and you need to choose your resources accordingly
- You can create heterogeneous workers by using multiple calls to `crew_controller_slurm(name=)`
::::::::::::::::::::::::::::::::::::::::::::::::