Commit 34cc265

update README, pre-merge cleanup
1 parent 1445b6a commit 34cc265

File tree

2 files changed: +32 −8 lines changed


README.md

Lines changed: 32 additions & 0 deletions
@@ -44,8 +44,31 @@ bash setup.bash
```
NOTE: For path invariance, the setup script will automatically move the cloned repo to your home directory (<code>/home/$USER</code>)

---
<h3> TLDR </h3>

To launch a job on the cluster, consider the following case:

The file <code>setup.bash</code> has been executed and all paths have been set up correctly (note that at this point, this repository will be located in <code>/home/$USER</code>). The script file <code>file.py</code>, located in the directory <code>/home/$USER/test</code>, is to be executed in a conda environment named <code>env</code>.

Let us assume that the script requires 2 GPU cards and as many CPUs as possible (14*N_GPU = 28; for info on why 28, read the help message of joblauncher.bash).
Let us also assume that the script is to be run on the "cocosys" partition using the "cocosys" queue, and that the user estimates a maximum runtime of 2.5 days. Additionally, the user determines that, should the job run longer than 2.5 days, the necessary checkpoint and metadata saving will take ~97 seconds.

Given these considerations, the job should be launched using the following command:

```
bash joblauncher.bash -j jobsubmissionscript.sub -t python -d ~/test/ -f file.py -e env -g 2 -c 28 -q cocosys -p cocosys -T 2-12:00:00 -s 97
```
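The <code>-T</code> value follows SLURM's <code>D-HH:MM:SS</code> time format, so the 2.5-day estimate becomes <code>2-12:00:00</code>. A minimal sketch of that conversion (the helper <code>days_to_slurm</code> is hypothetical, not part of this repo):

```
# days_to_slurm: hypothetical helper (not part of this repo) that converts a
# possibly fractional number of days to SLURM's D-HH:MM:SS time format.
days_to_slurm() {
    local days=$1
    # total minutes, rounded to the nearest minute (awk handles the fraction)
    local total_min
    total_min=$(awk -v d="$days" 'BEGIN { printf "%d", d * 24 * 60 + 0.5 }')
    printf '%d-%02d:%02d:00\n' \
        $(( total_min / 1440 )) $(( total_min % 1440 / 60 )) $(( total_min % 60 ))
}

days_to_slurm 2.5   # -> 2-12:00:00
```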
If email updates of job status are desired, then the <code>-m</code> flag should be added to the previous command.

A detailed list of command line args accepted by this script (or any of the scripts in this repo) may be found using

```
bash joblauncher.bash -h
```
---
<h3> Scripts </h3>

<h4> Setup Conda </h4>
RCAC clusters require use of the IT-managed conda module loadable using Lmod. While installing conda locally in your own directory <code>/home/$USER/</code> is possible, environments installed using your own conda installation will not be importable in code, i.e., they will not work.
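A typical session following that rule might look like the sketch below. The module name is an assumption and varies by cluster; run <code>module avail</code> first to find the managed conda module on yours.

```
# Load the IT-managed conda module via Lmod (module name is an assumption;
# run `module avail` to find the right one on your cluster)
module load anaconda

# Create and activate an environment using the managed installation
conda create -n env python
conda activate env
```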

@@ -98,6 +121,12 @@ A Job Submission Script is supposed to do three main things:
More information on job submission scripts, specifically for RCAC clusters, may be found <a href="https://www.rcac.purdue.edu/knowledge/gautschi/run/slurm/script">here</a>.
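As a rough illustration, such a script might look like the sketch below. All resource values and names are placeholders for this walkthrough, not the template shipped with this repo.

```
#!/bin/bash
# Minimal illustrative SLURM submission script (placeholder values; not the
# template shipped with this repo).
#SBATCH --job-name=test
#SBATCH --partition=cocosys
#SBATCH --nodes=1
#SBATCH --cpus-per-task=28
#SBATCH --gres=gpu:2
#SBATCH --time=2-12:00:00
#SBATCH --output=slurm-%j.out

module load anaconda    # load the managed conda module (name is an assumption)
conda activate env      # activate the target conda environment
python ~/test/file.py   # run the workload
```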

<h4> Launching a Job </h4>
To abstract away the details of the <code>sbatch</code> and <code>srun</code> SLURM commands, this repo provides a util script <code>joblauncher.bash</code>. A detailed list of command line args accepted by this script may be found using

```
bash joblauncher.bash -h
```
Jobs are launched using either the <code>srun</code> or the <code>sbatch</code> command. Both commands accept the same set of parameters. The main difference is that <code>srun</code> is interactive and blocking (you get the result in your terminal and cannot enter other commands until it finishes), while <code>sbatch</code> is batch processing and non-blocking (results are written to a file and you can submit other commands right away).

If you run <code>srun</code> in the background with the <code>&</code> sign, you remove the 'blocking' feature of <code>srun</code>, which becomes interactive but non-blocking. It is still interactive though, meaning that its output will clutter your terminal and the <code>srun</code> processes remain tied to your terminal. If you disconnect, you will lose control over them, or they may be killed (depending basically on whether they use stdout or not). And they will be killed if the machine to which you connect to submit jobs is rebooted.
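To make the contrast concrete, a sketch of the three launch styles described above (partition, resource, and time values are illustrative only):

```
# Interactive and blocking: output streams to this terminal
srun -p cocosys --gres=gpu:1 -t 0-01:00:00 python ~/test/file.py

# Batch and non-blocking: output goes to a file, the prompt returns immediately
sbatch -p cocosys --gres=gpu:1 -t 0-01:00:00 jobsubmissionscript.sub

# Backgrounded srun: non-blocking, but still tied to this terminal session
srun -p cocosys --gres=gpu:1 -t 0-01:00:00 python ~/test/file.py &
```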
@@ -199,4 +228,7 @@ To cancel a running/pending job, use
```
scancel $JOB_ID
```

---
<h3> Common Pitfalls </h3>
Take a look at the issues for answers to the most common pitfalls. If your problem does not appear there, do raise a fresh issue!

joblauncher.bash

Lines changed: 0 additions & 8 deletions
@@ -115,14 +115,6 @@ if [[ $N_GPUS -gt 0 ]] && [[ $((${CLUSTER}"_gpu_"${PARTITION})) -eq 0 ]]; then
     exit 1
 fi

-# GPUS_PER_NODE=$N_GPUS
-# MAX_GPUS_PER_NODE=$((${CLUSTER}"_gpu_"${PARTITION}))
-# N_GPU_NODES=$(((($N_GPUS+$MAX_GPUS_PER_NODE-1))/$MAX_GPUS_PER_NODE))
-# if [[ $N_GPUS -gt $MAX_GPUS_PER_NODE ]]; then
-#     DIV=$(( $N_NODES > $N_GPU_NODES ? $N_NODES : $N_GPU_NODES ))
-#     GPUS_PER_NODE=$(($N_GPUS/$DIV))
-# fi

 # essential computation
 DIV=$((${CLUSTER}"_cpu_"${PARTITION}))
 N_NODES=$(((($N_CPUS+$DIV-1))/$DIV))
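The "essential computation" kept above is integer ceiling division: dividing the requested CPU count by the per-node CPU capacity and rounding up gives the node count. The same arithmetic in isolation:

```
# (n + d - 1) / d is integer ceiling division: ceil(n / d) without floats
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

ceil_div 28 64    # 28 CPUs, 64 CPUs per node -> 1 node
ceil_div 28 14    # -> 2 nodes
ceil_div 129 64   # -> 3 nodes
```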
