---
title: 'Deploying Targets on HPC'
teaching: 10
exercises: 2
---
```{R, echo=FALSE}
# Exit sensibly when Slurm isn't installed
if (!nzchar(Sys.which("sbatch"))){
knitr::knit_exit("sbatch was not detected. Likely Slurm is not installed. Exiting.")
}
```
:::::::::::::::::::::::::::::::::::::: questions
- Why would we use HPC to run Targets workflows?
- How can we run Targets workflows on Slurm?
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: objectives
- Be able to run a `targets` workflow on an HPC cluster using Slurm
::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::: instructor
Episode summary: Show how to run `targets` workflows on an HPC cluster with Slurm
::::::::::::::::::::::::::::::::::::::::::::::::
```{r}
#| label: setup
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
library(targets)
library(tarchetypes)
library(quarto) # don't actually need to load, but put here so renv catches it
source("https://raw.githubusercontent.com/joelnitta/targets-workshop/main/episodes/files/functions.R?token=$(date%20+%s)") # nolint
# Increase width for printing tibbles
options(width = 140)
```
## Advantages of HPC
If your analysis involves computationally intensive or long-running tasks, such as training machine learning models or processing very large amounts of data, a single machine quickly becomes insufficient.
If you are part of an organisation with access to a High Performance Computing (HPC) cluster, you can use Targets to spread the work across the cluster's many machines and scale up your analysis.
This differs from the parallel execution we have learned about so far, which spawns extra R processes on the *same machine* to speed up execution.
## Configuring Targets for Slurm
Fortunately, using HPC is as simple as changing the Targets `controller`.
In this section we will assume that our HPC uses Slurm as its job scheduler, but you can easily use other schedulers such as PBS/TORQUE, Sun Grid Engine (SGE) or LSF.
In the Parallel Processing section, we used the following configuration:
```{R}
library(crew)
tar_option_set(
controller = crew_controller_local(workers = 2)
)
```
To configure this for Slurm, we just swap out the controller with a new one from the `crew.cluster` package:
```{R}
library(crew.cluster)
tar_option_set(
controller = crew_controller_slurm(
workers = 3,
script_lines = "module load R"
)
)
```
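The same pattern applies to other schedulers: `crew.cluster` also provides `crew_controller_sge()`, `crew_controller_pbs()` and `crew_controller_lsf()`. As a rough sketch (the module name is an assumption about your cluster, not part of this lesson), an SGE cluster might use:

```{R, eval=FALSE}
library(crew.cluster)

tar_option_set(
  controller = crew_controller_sge(
    workers = 3,
    script_lines = "module load R" # assumes an environment module named "R"
  )
)
```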
There are a number of options you can pass to `crew_controller_slurm()` to fine-tune the Slurm execution, [which you can find here](https://wlandau.github.io/crew.cluster/reference/crew_controller_slurm.html).
Here we are only using two:
* `workers` sets the number of jobs that are submitted to Slurm to process targets.
* `script_lines` adds lines to the Slurm submit script used by Targets. This is useful for loading Environment Modules and adding `#SBATCH` options, as sketched below.
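For example, `script_lines` accepts a character vector, and each element becomes one line of the generated submit script. The account name and time limit below are hypothetical placeholders, not values from this lesson:

```{R, eval=FALSE}
crew_controller_slurm(
  workers = 3,
  script_lines = c(
    "#SBATCH --account=myproject", # hypothetical Slurm account
    "#SBATCH --time=01:00:00",     # wall-time limit for each worker job
    "module load R"
  )
)
```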
Let's run the modified workflow:
```{R, eval=FALSE}
source("R/packages.R")
source("R/functions.R")
library(crew.cluster)
tar_option_set(
controller = crew_controller_slurm(
workers = 3,
script_lines = "module load R"
)
)
tar_plan(
# Load raw data
tar_file_read(
penguins_data_raw,
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
# Clean data
penguins_data = clean_penguin_data(penguins_data_raw),
# Build models
models = list(
combined_model = lm(
bill_depth_mm ~ bill_length_mm, data = penguins_data),
species_model = lm(
bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
interaction_model = lm(
bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
),
# Get model summaries
tar_target(
model_summaries,
glance_with_mod_name_slow(models),
pattern = map(models)
),
# Get model predictions
tar_target(
model_predictions,
augment_with_mod_name_slow(models),
pattern = map(models)
)
)
```
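As before, we run the pipeline with `tar_make()`; `crew` submits and manages the Slurm worker jobs for us:

```{R, eval=FALSE}
# Run the pipeline; crew submits the Slurm worker jobs automatically
tar_make()
```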
::: challenge
## Increasing Resources
Q: How would you modify your `_targets.R` if your targets needed 200GB of RAM?
::: hint
Check the arguments for [`crew_controller_slurm`](https://wlandau.github.io/crew.cluster/reference/crew_controller_slurm.html#arguments-1).
:::
::: solution
```R
tar_option_set(
controller = crew_controller_slurm(
workers = 3,
script_lines = "module load R",
# Added this
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
)
```
:::
:::
## HPC Workers
Despite what you might expect, `crew` does not submit one Slurm job for each target.
Instead, it uses persistent workers, meaning that you define a pool of workers when configuring the workflow.
In our example above we used 3 workers.
For each worker, `crew` submits a single Slurm job, and these workers will process multiple targets over their lifetime.
We can verify this by listing our recent Slurm jobs with `sacct`; there should be one job per worker, not one per target:
```{bash}
sacct
```
The upside of this approach is that we don't have to work out the minutiae of how long each target takes to build or what resources it needs.
It also means that we don't submit a lot of jobs, which makes our Slurm usage more efficient and easier to monitor.
The downside of this mechanism is that **the resources of the worker have to be sufficient to build each of your targets**.
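If you are concerned about idle workers holding on to their Slurm allocation once there is nothing left for them to build, the controller can be told to retire them. A sketch, assuming the `seconds_idle` argument of `crew_controller_slurm()`:

```{R, eval=FALSE}
crew_controller_slurm(
  workers = 3,
  script_lines = "module load R",
  seconds_idle = 60 # a worker exits after sitting idle for 60 seconds
)
```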
::: challenge
## Choosing a Worker
Q: Say we have two targets. One uses 100 GB of RAM and 1 CPU, and the other needs 10 GB of RAM and 8 CPUs to run a multi-threaded function. What worker configuration do we use?
::: solution
With a single worker configuration, we need to provision for the maximum of each resource across all of our targets: here that is 100 GB of RAM and 8 CPUs.
To do this we might use a controller a bit like the one below.
Note that `slurm_memory_gigabytes_per_cpu` is requested per CPU, so roughly 13 GB per CPU across 8 CPUs covers the 100 GB target:
```{R, results="hide"}
crew_controller_slurm(
name = "cpu_worker",
workers = 3,
script_lines = "
#SBATCH --cpus-per-task=8
module load R",
  slurm_memory_gigabytes_per_cpu = 13 # 13 GB x 8 CPUs is roughly the 100 GB we need
)
```
:::
:::
## Heterogeneous Workers
In some cases we may prefer heterogeneous workers, especially if some of our targets need a GPU and others need a CPU.
To do this, we first define each worker configuration, giving each one a `name` argument in `crew_controller_slurm()`.
Note that this time we aren't passing the controller directly into `tar_option_set()`:
```{R, results="hide"}
library(crew.cluster)
crew_controller_slurm(
name = "cpu_worker",
workers = 3,
script_lines = "module load R",
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
```
Then we specify this controller by name in each target definition:
```{R, results="hide"}
tar_target(
name = cpu_task,
command = run_model2(data),
resources = tar_resources(
crew = tar_resources_crew(controller = "cpu_worker")
)
)
```
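For `targets` to route work to these named controllers, they still need to be registered with the pipeline. A sketch of one way to do this, assuming the controllers are combined with `crew::crew_controller_group()` and the group is passed to `tar_option_set()`:

```{R, eval=FALSE}
library(crew)
library(crew.cluster)

cpu_worker <- crew_controller_slurm(
  name = "cpu_worker",
  workers = 3,
  script_lines = "module load R"
)

# Register every named controller; targets that request a controller via
# tar_resources_crew() are sent to the matching worker configuration.
tar_option_set(
  controller = crew_controller_group(cpu_worker)
)
```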
::: challenge
## Mixing GPU and CPU targets
Q: Say we have the following targets workflow. How would we modify it so that `gpu_hardware` is only built in a GPU Slurm job?
```{R, eval=FALSE}
graphics_devices <- function(){
system2("lshw", c("-class", "display"), stdout=TRUE, stderr=FALSE)
}
tar_plan(
tar_target(
cpu_hardware,
graphics_devices()
),
tar_target(
gpu_hardware,
graphics_devices()
)
)
```
::: hint
You will need to define two different crew controllers.
:::
::: solution
```R
graphics_devices <- function(){
system2("lshw", c("-class", "display"), stdout=TRUE, stderr=FALSE)
}
library(crew.cluster)
crew_controller_slurm(
name = "cpu_worker"
workers = 3,
script_lines = "module load R",
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
crew_controller_slurm(
name = "gpu_worker"
workers = 3,
script_lines = "#SBATCH --gres=gpu:1
module load R",
slurm_memory_gigabytes_per_cpu = 200,
slurm_cpus_per_task = 1
)
tar_plan(
tar_target(
cpu_hardware,
graphics_devices(),
resources = tar_resources(
crew = tar_resources_crew(controller = "cpu_worker")
)
),
tar_target(
gpu_hardware,
graphics_devices(),
resources = tar_resources(
crew = tar_resources_crew(controller = "gpu_worker")
)
)
)
```
:::
:::
::::::::::::::::::::::::::::::::::::: keypoints
- `crew.cluster::crew_controller_slurm()` is used to configure a workflow to use Slurm
- Crew uses persistent workers on HPC, and you need to choose your resources accordingly
- You can create heterogeneous workers by using multiple calls to `crew_controller_slurm(name=)`
::::::::::::::::::::::::::::::::::::::::::::::::