
Alterations to build system

Alan O'Callaghan edited this page Jun 28, 2022 · 1 revision

Background

This page describes the changes to the build system (chiefly the Makefile) to handle Bioconductor packages, to manage R package dependencies, and to reduce build time.

Outline

First, I will briefly describe the default build system for generating Carpentries sites, which this repo is based on. It's worth noting that a new system is being designed as of writing (June 2022), but it is uncertain when it will be released. I will then describe the changes I made to the default system.

Aims

The aims of the changes I made to the build system were as follows:

  • To support Bioconductor packages
  • To produce a tab-delimited set of dependencies for the management of workshop computational environments (e.g., RStudio)
  • To reduce the runtime when building the site locally (e.g., by not re-installing packages every time)
  • To ensure that all relevant resources (including data and figures) are rebuilt when their source code is updated

Standard Carpentries build system

The basic rule for the standard build system is make site, which rebuilds the markdown pages that are rendered by Jekyll on GitHub Pages. For R Markdown pages, this involves running rmarkdown::render on each Rmd file in _episodes_rmd and rendering to _episodes.

Before building an R Markdown page, the Makefile runs the install-rmd-deps rule, as follows:

## * install-rmd-deps : Install R packages dependencies to build the RMarkdown lesson
install-rmd-deps:
	@${SHELL} bin/install_r_deps.sh

which is just:

Rscript -e "source(file.path('bin', 'dependencies.R')); install_required_packages(); install_dependencies(identify_dependencies())"

This installs the required packages c("rprojroot", "desc", "remotes", "renv") (needed to identify and install dependencies), and then runs identify_dependencies, which parses the Rmd file to identify all library (etc.) calls in the code chunks that are run; i.e., a chunk with the option eval=FALSE won't count towards the dependencies. It uses renv::dependencies() to do that, which means it cannot parse library() calls in callout or exercise blocks (due to the leading > ). Dependencies of the bin directory are also installed (as you might expect, we need the deps of the helper functions).

We run install_dependencies on this list, which internally dumps the list of dependencies into a mock DESCRIPTION file, and runs remotes::install_deps(). install_deps() then thinks we're in an R package directory and tries to install all the dependencies we've listed in the mock DESCRIPTION.

In an ideal world this means we've got all our dependencies installed, which means we can now render the Rmd file:

	@mkdir -p _episodes
	@bin/knit_lessons.sh $< $@
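To see how the two steps fit together, here is a minimal sketch of the render rule. The recipe lines are the ones shown above, but the target line (a pattern rule from _episodes_rmd/%.Rmd to _episodes/%.md with install-rmd-deps as a prerequisite) is an assumption, since it is not reproduced on this page.

```makefile
# Dependency installation rule, as shown earlier on this page.
install-rmd-deps:
	@${SHELL} bin/install_r_deps.sh

# ASSUMPTION: the target/prerequisite line below is illustrative.
# install-rmd-deps never creates a file called "install-rmd-deps", so make
# always considers it out of date. Anything that lists it as a normal
# prerequisite is therefore rebuilt on every run as well.
_episodes/%.md: _episodes_rmd/%.Rmd install-rmd-deps
	@mkdir -p _episodes
	@bin/knit_lessons.sh $< $@
```

If the rule does have this shape, make cannot skip either the dependency installation or the render step, even when no Rmd file has changed.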

Alterations

Now, how did I change this? Let's recall the motivations:

  1. To support Bioconductor packages
  2. To produce a tab-delimited set of dependencies for the management of workshop computational environments (e.g., RStudio)
  3. To reduce the runtime when building the site locally (e.g., by not re-installing packages every time)
  4. To ensure that all relevant resources (including data and figures) are rebuilt when their source code is updated

Why are these problems?

  1. remotes::install_deps() can't tell when we've specified Bioconductor packages, so unless options("repos") is set, we won't be able to install any bioc packages
  2. We dump the dependencies into a mock DESCRIPTION file, but this isn't a full list! It only contains the packages that the lesson uses directly, and doesn't include their recursive dependencies. Plus, the mock DESCRIPTION is a build artifact that's removed after the deps are installed.
  3. Every time we try to build an Rmd, we try to rebuild the dependencies. That means a lot of wasted time when maybe all the packages are already installed.
  4. make site assumes that whatever data/figures we use outside of those directly made by the Rmd are fixed, rather than maybe being generated by other scripts in the repo (like data-raw in an R package).

So how did we solve these?

  1. Replace remotes::install_deps() with BiocManager::install(). This means we can't dump everything into a DESCRIPTION any more.
  2. We add a step to create a plain text list of dependencies, dependencies.csv:
    dependencies.csv: _episodes_rmd/*.Rmd
    	@${SHELL} bin/list_r_deps.sh
    
    This rule is similar to the one I described earlier, but here we dump the list of dependencies identified using identify_dependencies into dependencies.csv.
  3. We also run renv::dependencies on the fig directory, in case we have any R scripts there that also need dependencies installed.
  4. We add rules to create any figures from R scripts, eg:
    fig/pendulum.gif: fig/pca-animation.R
    	Rscript $<
    
    fig/kmeans.gif: fig/kmeans.R
    	Rscript $<
    
    We also need to ensure that all these figures are prerequisites of the site rule.
    We also added rules for generating data from similarly-named R scripts:
    data/%.rds: data/%.R
    	Rscript $<
    
    We also need to ensure that we re-generate all the data before rendering the Rmds, so we define a list of all the datasets as DATA_DST and ensure that this is listed as a prerequisite of the site rule (similar to figures).
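The prerequisite wiring described in the last step can be sketched as follows. This is a minimal sketch, not the repo's actual Makefile: DATA_DST is the name used above, but DATA_SRC, FIG_DST, and the use of wildcard and patsubst to build the lists are assumptions made for illustration.

```makefile
# ASSUMPTION: every R script in data/ writes an .rds file with the same stem.
# wildcard lists the scripts that exist; patsubst maps each one to its output,
# so the outputs are known to make even before they have been generated.
DATA_SRC := $(wildcard data/*.R)
DATA_DST := $(patsubst data/%.R,data/%.rds,$(DATA_SRC))

# Hypothetical variable collecting the figure targets shown above.
FIG_DST := fig/pendulum.gif fig/kmeans.gif

# A rule line with no recipe only adds prerequisites to an existing target,
# so the real recipe for `site` can stay wherever it is already defined.
site: $(DATA_DST) $(FIG_DST)

# Pattern rule from above: rebuild a dataset when its script is newer.
data/%.rds: data/%.R
	Rscript $<
```

Because make compares timestamps, a dataset or figure is regenerated only when its generating script is newer than the output, and make site picks up the fresh files before rendering.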

That summarises most of the changes made to the build system. If anything is unclear please get in touch with me (Alan) or open an issue on this repo.

There's also some functionality for building slides automatically from the lesson material, which I have not covered here because, as far as I know, it's not currently used.
