@@ -53,8 +53,8 @@ For instance, the name :code:`1234.5` refers to step id 5 of job id 1234.
 On ALPS, each job step within an allocation has a unique id that can be obtained
 through :code:`apstat`.

-Ignoring node failures
-----------------------
+Tolerating node failures
+------------------------

 Before running an SCR job, it is recommended to configure the job allocation to withstand node failures.
 By default, most resource managers terminate the job allocation if a node fails,
@@ -65,12 +65,12 @@ one must specify the appropriate flags from the table below.
 SCR job allocation flags

 ================== ================================================================
+LSF batch script   :code:`#BSUB -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS"`
+LSF interactive    :code:`bsub -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS" ...`
 MOAB batch script  :code:`#MSUB -l resfailpolicy=ignore`
 MOAB interactive   :code:`qsub -I ... -l resfailpolicy=ignore`
 SLURM batch script :code:`#SBATCH --no-kill`
 SLURM interactive  :code:`salloc --no-kill ...`
-LSF batch script   :code:`#BSUB -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS"`
-LSF interactive    :code:`bsub -env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS" ...`
 ================== ================================================================

 The SCR wrapper script
@@ -120,9 +120,8 @@ An example SLURM batch script with :code:`scr_srun` is shown below
 .. code-block:: bash

   #!/bin/bash
-  #SBATCH --partition pbatch
-  #SBATCH --nodes 66
   #SBATCH --no-kill
+  #SBATCH --nodes 66

   # above, tell SLURM to not kill the job allocation upon a node failure
   # also note that the job requested 2 spares -- it uses 64 nodes but allocated 66
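
The flags table in the diff above maps each resource manager to the option that keeps an allocation alive after a node failure. As a minimal sketch of how a launch script might select that option, the helper below returns the flag for a given resource manager name. The function name :code:`scr_nokill_flag` is hypothetical and not part of SCR; the flag strings are copied from the table as written.

```bash
#!/bin/bash
# Hypothetical helper (not part of SCR): print the allocation flag listed in
# the "SCR job allocation flags" table for the named resource manager.
scr_nokill_flag() {
  case "$1" in
    LSF)   printf '%s\n' '-env "all, LSB_DJOB_COMMFAIL_ACTION=KILL_TASKS"' ;;
    MOAB)  printf '%s\n' '-l resfailpolicy=ignore' ;;
    SLURM) printf '%s\n' '--no-kill' ;;
    *)     echo "unknown resource manager: $1" >&2
           return 1 ;;
  esac
}
```

For example, :code:`scr_nokill_flag SLURM` prints the option to pass to :code:`salloc` or to place on an :code:`#SBATCH` line. An unrecognized name returns a non-zero status so that a calling script can stop before requesting an allocation that would be killed on a node failure.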