Skip to content

[WIP] Issue482 job array episode not templated#489

Draft
tkphd wants to merge 9 commits intomainfrom
issue482-Job_array_episode_not_templated
Draft

[WIP] Issue482 job array episode not templated#489
tkphd wants to merge 9 commits intomainfrom
issue482-Job_array_episode_not_templated

Conversation

@tkphd
Copy link
Copy Markdown
Contributor

@tkphd tkphd commented Apr 2, 2026

This PR addresses some deficiencies of #481 and resolves some issues from #482.

Touching on job arrays is a good start. This first draft does that, but inside of a challenge without any context or actual introduction of the concept, before asking learners to apply what they've been taught -- so it is a very rough draft.


Do not merge it until it is made abundantly clear in the below conversation that it is time to do so.

Do not merge without building a clean, local copy and reviewing it for errors.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

Thank you!

Thank you for your pull request 😃

🤖 This automated message can help you check the rendered files in your submission for clarity. If you have any questions, please feel free to open an issue in {sandpaper}.

If you have files that automatically render output (e.g. R Markdown), then you should check for the following:

  • 🎯 correct output
  • 🖼️ correct figures
  • ❓ new warnings
  • ‼️ new errors

Rendered Changes

🔍 Inspect the changes: https://github.com/carpentries-incubator/hpc-intro/compare/md-outputs..md-outputs-PR-489

The following changes were observed in the rendered markdown documents:

 12-cluster.md                                      |   3 +-
 13-scheduler.md                                    |  10 +-
 14-environment-variables.md                        |   2 +-
 15-modules.md                                      |   3 +-
 17-parallel.md                                     |  12 +--
 18-resources.md                                    |   8 +-
 config.yaml (gone)                                 | 119 ---------------------
 .../Ghastly_Mistakes/_config_options.yml           |   8 +-
 .../HPCC_MagicCastle_slurm/_config_options.yml     |  23 +++-
 .../snippets/array/analyzed_test-data.Rmd (new)    |  33 ++++++
 .../snippets/array/job_array_word-freq.Rmd (new)   |  22 ++++
 .../snippets/array/job_single_word-freq.Rmd (new)  |  12 +++
 .../NCL_Comet_slurm/_config_options.yml (new)      |  41 +++++++
 .../snippets/modules/available-modules.Rmd (new)   |  24 +++++
 .../modules/python-executable-dir.Rmd (new)        |   9 ++
 .../modules/software-dependencies.Rmd (new)        |  63 +++++++++++
 md5sum.txt                                         |   8 +-
 17 files changed, 252 insertions(+), 148 deletions(-)
What does this mean?

If you have source files that require output and figures to be generated (e.g. R Markdown), then it is important to make sure the generated figures and output are reproducible.

This output provides a way for you to inspect the output in a diff-friendly manner so that it's easy to see the changes that occur due to new software versions or randomisation.

⏱️ Updated at 2026-04-07 19:43:03 +0000

@tkphd tkphd marked this pull request as draft April 2, 2026 20:24
@tkphd tkphd requested review from jsteyn, ocaisa and reid-a April 2, 2026 20:24
github-actions bot pushed a commit that referenced this pull request Apr 2, 2026
@reid-a
Copy link
Copy Markdown
Contributor

reid-a commented Apr 5, 2026

Edit: Having spent some quality time with load_config.R, this is now clear, it does inherit from the Magic Castle set-up.

My prior confusion:

I recognize that this is WIP, but looking around, I'm a bit confused about what's going on. It looks like the hierarchical scheme is not yet set up? E.g. looking for the definition of the "tasks" variable, it's present in episodes/files/customization/HPCC_MagicCastle_slurm/_config_options.yml, but not in either of the other customization branches (Ghastly_Mistakes and NCL_Comet_slurm ), both of which also have _config_options.yml files.

Is the intention that these should inherit from HPCC_MagicCastle (which is a peer directory of the others, but also the default, so maybe weird but defensible) for from a top-level config?

@reid-a
Copy link
Copy Markdown
Contributor

reid-a commented Apr 5, 2026

Some additional points, while experimenting with the different possible configs, I ran into an issue that the workflow is not really meant for incremental builds -- when you switch the environment variable and rebuild, the R machinery will note that there are no changes to the source, and decline to rebuild things, so it will not re-localize the targets.

One fix is to just blitz the whole /site directory -- probably overkill, but it works for testing.

@reid-a
Copy link
Copy Markdown
Contributor

reid-a commented Apr 7, 2026

So, focusing in on the actual subject, it looks like starting point is that the array lesson (/epsidoes/21-array.Rmd) is currently using the base configuration's snippet for the job array submission file, which is templated, and even has a solution for optional variables, e.g. the snippet file checks to see if the account variable is present, and renders it if it is and not if it isn't.

But it's not rendering correctly? If I un-comment-out the array lesson from the top-level config.yml and build with HPC_CARPENTRY_CUSTOMIZATION=episodes/files/customization/NCL_Comet_slurm/_config_options.yml, then the array lesson does not seem to correctly incorporate the account setting -- the #SBATCH --account=comet_training is rendered outside of the bash block.

A bit of printf debugging shows that the conditional on line 4 of job_single_word-freq.Rmd is generating a blank line, but I can't figure out where the line above the bash block is coming from? It looks like a footnote, so maybe it's coming from somewhere higher up in the file?

Anyways, the blank render from the conditional seems to indicate that either the NCL_Comet overlay that sets the config$sched$account string is not working, or that it is but the conditional in job_single_word-freq.Rmd is not.

@reid-a
Copy link
Copy Markdown
Contributor

reid-a commented Apr 7, 2026

Also of note, the Magic Castle config file in epsides/files/customization/HPCC_MagicCastle_slurm/_config_options.yml contains a lesson-ordering block, but this does not seem to be operative, I was able to include the array lesson by uncommenting it from the top-level config.yml file, and it's not present in the Magic Castle custom config.

This is an incomplete test, it's possible that additions are effective here, but not deletions.

github-actions bot pushed a commit that referenced this pull request Apr 7, 2026
@tkphd
Copy link
Copy Markdown
Contributor Author

tkphd commented Apr 7, 2026

Thank you for the commentary as you grok changes, @reid-a.

the array lesson (/epsidoes/21-array.Rmd) is currently using the base configuration's snippet for the job array submission file, which is templated and even has a solution for optional variables, e.g. the snippet file checks to see if the account variable is present, and renders it if it is and not if it isn't.

Correct! The specification for episodes/load_config.R was to load the snippet from the custom library if present, but default to MagicCastle if absent. I am pleased that, having purged unmodified snippets from the NCL library, the script functions as intended. Thank you, @ocaisa!!!

But it's not rendering correctly? If I un-comment-out the array lesson from the top-level config.yml and build with HPC_CARPENTRY_CUSTOMIZATION=episodes/files/customization/NCL_Comet_slurm/_config_options.yml, then the array lesson does not seem to correctly incorporate the account setting -- the #SBATCH --account=comet_training is rendered outside of the bash block.

Ope! My bad. I wrapped my brain in a pretzel, segfaulted, and snapped back to reality. It's fixed now.

@tkphd
Copy link
Copy Markdown
Contributor Author

tkphd commented Apr 7, 2026

epsides/files/customization/HPCC_MagicCastle_slurm/_config_options.yml contains a lesson-ordering block, but this does not seem to be operative

Correct! It appears the lesson ordering variable is untouched by episodes/load_config.R, or the values have already been propagated into the machinery before the updated sequence is loaded from the custom _config_options.yml.

I was able to include the array lesson by uncommenting it from the top-level config.yml file, and it's not present in the Magic Castle custom config.

Yep, that seems to be the way of it.

This is an incomplete test, it's possible that additions are effective here, but not deletions.

I believe this is true. We would want to replace the episode ordering entirely, not attempt to reconcile the default vs. custom lists.

@tkphd
Copy link
Copy Markdown
Contributor Author

tkphd commented Apr 7, 2026

Some additional points, while experimenting with the different possible configs, I ran into an issue that the workflow is not really meant for incremental builds -- when you switch the environment variable and rebuild, the R machinery will note that there are no changes to the source, and decline to rebuild things, so it will not re-localize the targets.

One fix is to just blitz the whole /site directory -- probably overkill, but it works for testing.

Yeah, when the lesson fails to build or sees no reason to rebuild, {sandpaper} can just trip all over itself on the next attempt. I've been invoking the following in R, often:

servr::daemon_stop(); sandpaper::reset_site(); sandpaper::serve()

@tkphd
Copy link
Copy Markdown
Contributor Author

tkphd commented Apr 7, 2026

Is the intention that these should inherit from HPCC_MagicCastle (which is a peer directory of the others, but also the default, so maybe weird but defensible) for from a top-level config?

Correct. The Slurm flag names need to be defined only once, so they are in the MagicCastle library. The flag values are defined in the custom libraries.
NCL's Comet customizations clarified this with a corner case: MagicCastle jobs do not require an account, but Comet jobs do. To template the job script snippets, Comet needs to know both the flag and value for "account," while MagicCastle needed to be oblivious. So, the default config includes all of the flags invoked by any custom library (config$sched$flag entries, eg. config$sched$flag$account), while only the custom libraries define values (config$sched$<<variable>> entries, eg. config$sched$account).

Clear as mud?

@reid-a
Copy link
Copy Markdown
Contributor

reid-a commented Apr 9, 2026

I think I am past the build issues, and can focus on some of the content.

With respect to the question of the relationship of --task and --array, my reading of the SLURM documentation is that each array element uses all the other arguments in the sbatch file, so the directives in the job_array_word-req.Rmd dispatches four array elements, with four tasks each, for a total of 16. Since the actual executable is serial, I don't see the point of this, and if we do want multiple tasks for some reason which escapes me, having both the task number and array size be the same number invites confusion.

I also endorse the TODO items beginning on line 163 of episodes/21-array.Rmd. With respect to the cognitive burden in particular, it could be lowered if we used the amdahl executable instead? The pros are, we would only be setting up one executable, and it would help provide a through-line to the lesson, which might help retention. The cons are that amdahl is not really built for this -- the obvious non-array point one could perhaps make with this scheme is to ensure that amdahl is run with the jitter turned on, examine the differences between the array element timings, and use them to reinforce the point that the system is shared, jobs interact with the system in unpredictable ways, and that timings can vary for that reason. There's some cognitive burden to that too, of course, but it's aligned with other lesson elements, which is attractive to me.

@tkphd
Copy link
Copy Markdown
Contributor Author

tkphd commented Apr 15, 2026

With respect to the question of the relationship of --ntasks and --array, my reading of the SLURM documentation is that each array element uses all the other arguments in the sbatch file, so the directives in the job_array_word-req.Rmd dispatches four array elements, with four tasks each, for a total of 16. Since the actual executable is serial, I don't see the point of this, and if we do want multiple tasks for some reason which escapes me, having both the task number and array size be the same number invites confusion.

Having reviewed the docs, this appears true: most JobArray examples specify --array=a..b and show one task spawned per array job $a\in\{x_0, x_1\}$ in serial, for $(x_1-x_0)$ total tasks.

Our scripts launch amdahl serially, so we could explicitly state --ntasks=1 or eliminate the specification, equivalently.

We can also launch amdahl in parallel, as a job array, to simultaneously spawn a bunch of identical parallel jobs. This could be useful for examining the effects of jitter on runtime, for example, giving this episode an actual challenge problem -- after working through the purely serial edge cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants