-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path3.3-PP_in_R.qmd
347 lines (248 loc) · 7.37 KB
/
3.3-PP_in_R.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
---
title: "Parallel Processing Methods"
subtitle: "Within R"
author:
- Elizabeth King
- Kevin Middleton
format:
revealjs:
theme: [default, custom.scss]
standalone: true
self-contained: true
logo: QMLS_Logo.png
slide-number: true
show-slide-number: all
code-annotations: hover
bibliography: Randomization.bib
csl: evolution.csl
---
## Parallel Processing
```{r}
#| label: setup
#| echo: false
#| warning: false
#| message: false
library(tidyverse)
library(cowplot)
library(readxl)
library(viridis)
library(parallel)
library(parallelly)
library(furrr)
library(tictoc)
ggplot2::theme_set(theme_cowplot(font_size = 18))
```
- Monte Carlo methods are often computationally intensive
- Many iterations
- Often low memory usage
- Iterations are well suited to parallelization
- Run sets of iterations on each core
- Often get time close to 1 / n cores
## Local and Remote Computers
- Your computer likely has at least 4 cores
- Clusters located elsewhere (remote) often have hundreds to thousands of compute cores
- e.g., the Lewis/Hellbender clusters at Mizzou

## Resources for Using Remote Computers
- [Software Carpentry](https://software-carpentry.org/lessons/)
- [RSS Help Pages](https://docs.rnet.missouri.edu/)
- [High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)
- Many packages have functions that can use parallelization
## Introduction to Parallelization in R
1. Within a single R script
- Base R: `parallel`
- [parallely](https://parallelly.futureverse.org/)
- [foreach](https://github.com/RevolutionAnalytics/foreach) and `%dopar%` from `doParallel`
- [future](https://www.futureverse.org/) and [furrr](https://furrr.futureverse.org/)
2. Using bash to run many R scripts via a scheduler (e.g., SLURM)
## Non-parallel
```{r}
#| echo: true
MM <- read_excel("Data/Jackals.xlsx")
obs <- MM |> filter(Sex == "F") |> summarize(m = mean(Mandible)) |> pull(m) -
MM |> filter(Sex == "M") |> summarize(m = mean(Mandible)) |> pull(m)
set.seed(3850234) # <1>
nreps <- 1e4
diffs <- numeric(length = nreps)
diffs[1] <- obs
for (ii in 2:nreps) { # <2>
Rand_Sex <- sample(MM$Sex)
diffs[ii] <- mean(MM$Mandible[Rand_Sex == "F"]) -
mean(MM$Mandible[Rand_Sex == "M"])
}
# Two-tailed test
round(mean(abs(diffs) >= abs(diffs[1])), 4) # <3>
```
1. Setting different seeds for iterations
2. Splitting up iterations
3. Keeping and combining the output
## Timing code in R
- `Sys.time()`
- `t0 <- Sys.time()`
- `Sys.time() - t0`
- `tictoc` package
- `tic()`
- `toc()`
- [bench](https://bench.r-lib.org/) and [microbenchmark](https://cran.r-project.org/web/packages/microbenchmark/index.html) packages
## Is it worth parallelizing?
```{r}
#| label: tictoc
#| echo: false
eta <- tibble(nreps = c(1000, 5000, 10000, 25000, 50000, 75000, 100000),
eta = numeric(7))
for (jj in 1:nrow(eta)) {
nreps <- eta$nreps[jj]
t0 <- Sys.time()
for (ii in 2:nreps) {
Rand_Sex <- sample(MM$Sex)
diffs[ii] <- mean(MM$Mandible[Rand_Sex == "F"]) -
mean(MM$Mandible[Rand_Sex == "M"])
}
eta$eta[jj] <- Sys.time() - t0
}
ggplot(eta, aes(nreps, eta)) +
geom_point(size = 3, color = "dodgerblue4") +
labs(x = "Iterations", y = "Time (s)")
```
## `parallely`
```{r}
#| echo: true
library(parallel)
library(parallelly)
(ncores <- parallelly::availableCores(omit = 1))
cl <- makeClusterPSOCK(rep("localhost", times = ncores),
dryrun = FALSE)
print(cl)
# Other code goes here
```
Manually stop the cluster:
```{r}
#| echo: true
stopCluster(cl)
```
## Converting a loop for parallelization
Notice that we don't use `ii` in the function. It's there for "accounting" later on.
```{r}
#| echo: true
rand_diffs <- function(ii, MM) {
Rand_Sex <- sample(MM$Sex)
diffs <- mean(MM$Mandible[Rand_Sex == "F"]) -
mean(MM$Mandible[Rand_Sex == "M"])
return(diffs)
}
rand_diffs(1, MM)
rand_diffs(1, MM)
rand_diffs(1, MM)
```
## `lapply()`: Apply a function to a list or vector
The sequence 1 to 10 is passed as the first argument to the function
```{r}
#| echo: true
lapply(seq_len(10), FUN = rand_diffs, MM = MM) |>
as.numeric()
```
## Parallelizing `lapply()`
`mclapply()` and `parLapply()`
- Manually start and stop the cluster
- Tend to the random number generator
OS differences
- Linux & MacOS: `mclapply()`
- Windows: `parLapply()`
## `mclapply()`
```{r}
#| echo: true
if (Sys.info()[['sysname']] != "Windows") {
RNGkind("L'Ecuyer-CMRG") # <1>
set.seed(123, "L'Ecuyer")
mclapply(seq_len(10), FUN = rand_diffs, MM = MM,
mc.set.seed = TRUE,
mc.cores = ncores) |> as.numeric()
mclapply(seq_len(10), FUN = rand_diffs, MM = MM,
mc.set.seed = TRUE,
mc.cores = ncores) |> as.numeric()
}
```
1. Set up a parallel-friendly random number generator
```{r}
#| echo: true
if (Sys.info()[['sysname']] != "Windows") {
RNGkind("Mersenne-Twister") # <2>
mclapply(seq_len(10), FUN = rand_diffs, MM = MM,
mc.set.seed = TRUE,
mc.cores = ncores) |> as.numeric()
}
```
2. Reset to the default R random number generator
## Compare timings
```{r}
#| label: compare_lapply_mclapply
#| echo: true
if (Sys.info()[['sysname']] != "Windows") {
tic()
x <- lapply(seq_len(1e6), FUN = rand_diffs, MM = MM) |> as.numeric()
toc()
tic()
x <- mclapply(seq_len(1e6), FUN = rand_diffs, MM = MM,
mc.set.seed = TRUE,
mc.cores = ncores) |> as.numeric()
toc()
}
```
## The futureverse
- More information: [https://www.futureverse.org/](https://www.futureverse.org/) [@Bengtsson2021-su]
- Unification of parallelization for R
- No windows vs. non-windows code differences
- "Drop in" replacements for
- `lapply()` $\rightarrow$ future_lapply() (`future.apply` package)
- `map()` $\rightarrow$ `future_map()` (`furrr` package)
## `purrr` and `furrr`
- `purrr::map_` family of functions
- `furrr::future_map_` analogues of `map_`
- `plan()` to setup how to the analysis will run
- Easily switch back and forth between parallel and non-parallel (sequential)
- Defaults to sequential
- Works well with Windows (other parallelization methods often don't)
## `purrr` and `furrr`
```{r}
#| echo: true
map_dbl(.x = seq_len(10),
.f = rand_diffs,
MM = MM)
future_map_dbl(.x = seq_len(10),
.f = rand_diffs,
MM = MM,
.options = furrr_options(seed = TRUE))
```
`.options = furrr_options(seed = TRUE)` handles setting the random number generator
## Parallelization with `furrr`
```{r}
#| echo: true
plan(multisession, workers = ncores)
future_map_dbl(.x = seq_len(10), .f = rand_diffs, MM = MM,
.options = furrr_options(seed = TRUE))
```
Back to sequential evaluation:
```{r}
#| echo: true
plan(sequential)
future_map_dbl(.x = seq_len(10), .f = rand_diffs, MM = MM,
.options = furrr_options(seed = TRUE))
```
## How much faster?
```{r}
#| label: furrr
#| echo: true
plan(multisession, workers = ncores)
tic()
x <- future_map_dbl(.x = seq_len(1e6), .f = rand_diffs, MM = MM,
.options = furrr_options(seed = TRUE))
toc()
tic()
plan(sequential)
x <- future_map_dbl(.x = seq_len(1e6), .f = rand_diffs, MM = MM,
.options = furrr_options(seed = TRUE))
toc()
```
## References
::: {#refs}
:::