3.4-PP_Rscript.qmd

---
title: "Parallel Processing Methods"
subtitle: "Rscript"
author:
  - Elizabeth King
  - Kevin Middleton
format:
  revealjs:
    theme: [default, custom.scss]
    standalone: true
    self-contained: true
    logo: QMLS_Logo.png
    slide-number: true
    show-slide-number: all
code-annotations: hover
---

## Introduction to Parallelization in R

```{r}
#| label: setup
#| echo: false
#| warning: false
#| message: false

library(tidyverse)
library(cowplot)
library(readxl)
library(viridis)
library(parallel)
library(parallelly)
library(furrr)

ggplot2::theme_set(theme_cowplot(font_size = 18))
```

1. Within a single R script
    - Base R: `parallel`
        - [parallely](https://parallelly.futureverse.org/)
    - [foreach](https://github.com/RevolutionAnalytics/foreach) and `%dopar%` from `doParallel`
    - [future](https://www.futureverse.org/) and [furrr](https://furrr.futureverse.org/)
2. Using bash to run many R scripts via a scheduler (e.g., SLURM)
    

## Using bash to run many R scripts

- Use `.R` not `.Rmd`
- Run the script from the shell
- Pass inputs & collect outputs
    - Consider number of files  
- Sometimes base R will behave preferably compared to tidyverse 

## Non-parallel

```{r}
#| echo: true

MM <- read_excel("Data/Jackals.xlsx")

obs <- MM |> filter(Sex == "F") |> summarize(m = mean(Mandible)) |> pull(m) -
  MM |> filter(Sex == "M") |> summarize(m = mean(Mandible)) |> pull(m)

set.seed(3850234) # <1>

nreps <- 1e4
diffs <- numeric(length = nreps)
diffs[1] <- obs

for (ii in 2:nreps) { # <2>
  Rand_Sex <- sample(MM$Sex)
  diffs[ii] <- mean(MM$Mandible[Rand_Sex == "F"]) -
    mean(MM$Mandible[Rand_Sex == "M"])
}

#two-tailed test
round(mean(abs(diffs) >= abs(diffs[1])), 4) # <3>

```

1. Setting different seeds for iterations
2. Splitting up iterations
3. Keeping and combining the output

## `Rscript`

```{bash}
#| echo: true

Rscript my_R_file.R variables >console_output.txt 2>errors_and_warnings.txt &

```
- You can run R from the shell with the command `Rscript`
- Any variables you pass come in as characters
    - `as.numeric()` if you need numbers
- Allows you to start many R processes 


## Setting up the script

```{r}
#| echo: true
#| eval: false

args=(commandArgs(TRUE)) # <1>

MM <- read_excel("Data/Jackals.xlsx")

set.seed(as.numeric(args[1])) # <2> 
iter_id <- args[2] # <3>

nreps <- 100 # <4>
diffs <- numeric(length = nreps)

for (ii in 1:nreps) 
{
  Rand_Sex <- sample(MM$Sex)
  diffs[ii] <- mean(MM$Mandible[Rand_Sex == "F"]) -
    mean(MM$Mandible[Rand_Sex == "M"])
}

saveRDS(file = paste0("/Output/jackal_iters_", iter_id, ".Rds")) # <5>

```

1. take arguments from bash
2. seed will be passed to script
3. track which set this is
4. do 100 iterations per core
5. keep the output as a Rds file (keeps R attributes)

## Setting up the bash commands

```{r}
#| echo: true
#| eval: false

set.seed(87239)
ncores <- 10
iter_ids <- 1:ncores
seeds <- sample(1:10000000, ncores)

cat("", file = "jackal_cmds.txt", append = FALSE)

for(jj in 1:ncores)
{
  cat(paste0("Rscript Jackals_iter.R ", 
             seeds[jj], " ",
             iter_ids[jj], 
             " >temp", jj,".txt",
             " 2>error", jj,".txt",
             " \n"), 
      file = "jackal_cmds.txt",
      append = TRUE)
}

```

## Setting up the bash commands

```{bash}
#| echo: true
#| eval: false

head jackal_cmds.txt

```

```{r}
#| echo: false
#| eval: true

set.seed(87239)
ncores <- 10
iter_ids <- 1:ncores
seeds <- sample(1:10000000, ncores)

for(jj in 1:ncores)
{
  cat(paste0("Rscript Jackals_iter.R ", 
             seeds[jj], " ",
             iter_ids[jj], 
             " >temp", jj,".txt",
             " 2>error", jj,".txt",
             " \n"))
}

```

## Running the commands

## Processing the results

```{r}
#| echo: true
#| eval: false

MM <- read_excel("Data/Jackals.xlsx")

obs <- MM |> filter(Sex == "F") |> summarize(m = mean(Mandible)) |> pull(m) -
  MM |> filter(Sex == "M") |> summarize(m = mean(Mandible)) |> pull(m)

ncores <- 10
iter_ids <- 1:ncores
nreps <- 100

alldiffs <- rep(NA, ncores*nreps)
counter <- 1  

for(jj in 1:ncores)
{
  diffs <- readRDS(file = paste0("/Output/jackal_iters_", iter_ids[jj], ".Rds"))

  alldiffs[counter:(counter + 99)] <- diffs

  counter <- counter + 100
  
}

empP <- mean(abs(alldiffs) >= abs(obs))

```

## Learn more

- [Software Carpentry](https://software-carpentry.org/lessons/)
- [RSS Help Pages](https://docs.rnet.missouri.edu/)
- [High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html)