Commit c87f4a4

Clean up hpc basics section and add a gams project for tutorial
1 parent 2b4075a commit c87f4a4

6 files changed

Lines changed: 283 additions & 96 deletions

docs/source/examples/gams_on_hpc.rst

Lines changed: 51 additions & 0 deletions
@@ -106,6 +106,14 @@ Running a Spine Toolbox project on an HPC
106106

107107
Line endings in Slurm scripts such as ``run_on_hpc.sh`` must be Unix-style (LF).
108108

109+
**Key parameters**
110+
- ``--job-name``: Job name
111+
- ``--time``: Maximum runtime
112+
- ``--cpus-per-task``: CPU cores
113+
- ``--mem``: Memory
114+
- ``--output``: Output file
115+
- ``--error``: Error file
116+
109117
6. Edit the Slurm script by adding the license
110118
7. Submit job to Slurm Scheduler
111119

@@ -171,6 +179,12 @@ This command returns something like:
171179
172180
watch -n 2 squeue -u $USER
173181
182+
Another option is to follow the job's output file with ``tail``:
183+
184+
.. code-block:: bash
185+
186+
tail -f out.txt
187+
174188
Again, if ``$USER`` is not defined, replace it with your user name. The ``watch`` command above reruns ``squeue``
175189
every two seconds, so the job status on screen stays up to date.
176190

@@ -377,6 +391,43 @@ File Not Found
377391
378392
echo $PWD
379393
394+
Job Stuck in Queue
395+
++++++++++++++++++
396+
397+
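- The cluster is full, so the job waits until enough resources become free
398+
- The resource request (time, CPU cores, or memory) is larger than what is currently free or allowed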
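
To see why a job is still pending, Slurm can report the pending reason and the estimated start time. The snippet
below is a minimal sketch, assuming the standard ``squeue`` format codes (``%i`` job ID, ``%T`` state, ``%r`` reason,
``%S`` start time). The fallback message is only there so that the snippet also runs on a machine without Slurm.

.. code-block:: bash

   # Use $USER if it is defined, otherwise ask the system for the user name.
   user_name="${USER:-$(id -un)}"

   if command -v squeue >/dev/null 2>&1; then
       # List only pending jobs with their ID, state, reason, and start time.
       queue_info="$(squeue -u "$user_name" -t PENDING -o "%i %T %r %S")"
   else
       queue_info="squeue not found: run this on the cluster login node"
   fi
   echo "$queue_info"

A reason such as ``Resources`` or ``Priority`` usually means the job is simply waiting for its turn.
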
399+
400+
Memory Errors
401+
+++++++++++++
402+
403+
If a job fails because it runs out of memory, request more memory in the Slurm script:
404+
405+
.. code-block:: bash
406+
407+
#SBATCH --mem=16G
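
Before raising the limit, it can help to compare the requested memory with what the job actually used. The
following is a minimal sketch, assuming that job accounting (``sacct``) is enabled on your cluster. The job ID
``10345`` is a placeholder, and the fallback message only keeps the snippet runnable on a machine without Slurm.

.. code-block:: bash

   job_id=10345   # placeholder: replace with the ID printed by sbatch

   if command -v sacct >/dev/null 2>&1; then
       # ReqMem is the requested memory, MaxRSS the peak memory used by a job step.
       sacct -j "$job_id" --format=JobID,JobName,ReqMem,MaxRSS,State
   else
       echo "sacct not found: run this on the cluster login node"
   fi

If ``MaxRSS`` is close to ``ReqMem``, a larger ``--mem`` value is likely needed.
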
408+
409+
410+
Solver Not Found
411+
++++++++++++++++
412+
413+
.. code-block:: bash
414+
415+
module load gurobi
416+
417+
Check that the solver executable is on your ``PATH``:
418+
419+
.. code-block:: bash
420+
421+
which gurobi_cl
422+
423+
424+
Python Module Missing
425+
+++++++++++++++++++++
426+
427+
.. code-block:: bash
428+
429+
pip install pyomo
430+
380431
Module Not Found
381432
++++++++++++++++
382433

docs/source/examples/hpc_basics.rst

Lines changed: 113 additions & 96 deletions
@@ -28,7 +28,8 @@ Most HPC systems follow a similar architecture composed of three main components
2828

2929
Conceptual structure of an HPC cluster
3030

31-
Users typically interact only with the login node. After a job is submitted, the scheduler assigns it to available compute nodes, where it runs—often in parallel across multiple CPUs or nodes.
31+
Users typically interact only with the login node. After a job is submitted, the scheduler assigns it to available
32+
compute nodes, where it runs—often in parallel across multiple CPUs or nodes.
3233

3334
.. warning::
3435

@@ -52,9 +53,7 @@ This separation allows HPC systems to efficiently share resources among many use
5253
:width: 100%
5354
:align: center
5455

55-
Overview of the HPC workflow for running an energy optimization model.
56-
The user prepares and submits a job from a local machine, which is scheduled
57-
and executed on compute nodes. Results are then retrieved back to the local environment.
56+
Overview of the HPC workflow
5857

5958
*************
6059
Job Lifecycle
@@ -66,7 +65,8 @@ When a job is submitted to the scheduler (such as Slurm), it goes through severa
6665
- **RUNNING**: the job is actively executing on compute nodes
6766
- **COMPLETED**: the job has finished successfully (a job that ends with an error is reported as **FAILED**)
6867

69-
Understanding these states helps explain why jobs may not start immediately—waiting in the queue is normal and depends on factors such as resource availability and system load.
68+
Understanding these states helps explain why jobs may not start immediately—waiting in the queue is normal and
69+
depends on factors such as resource availability and system load.
7070

7171
.. figure:: ../img/tutorials/job_lifecycle.png
7272
:width: 100%
@@ -83,163 +83,180 @@ Connecting to the Cluster
8383
Login
8484
-----
8585

86+
You need access rights (username/password) for your HPC cluster, which you should request from the HPC administrator.
87+
Once you have the necessary rights, you can make an SSH connection from your terminal to the HPC login node.
8688

8789
.. code-block:: bash
8890
8991
ssh username@cluster.address
9092
93+
On Windows, we recommend using PuTTY as an SSH client. You can install it from the Microsoft
94+
Store or from the `PuTTY site <https://putty.software/>`_.
9195

9296
File Transfer
9397
-------------
9498

99+
You can use ``scp`` to transfer files between your local system and the login node.
100+
95101
.. code-block:: bash
96102
97103
scp -r my_project/ username@cluster:/home/username/
98104
99-
or:
100-
101-
.. code-block:: bash
102-
103-
rsync -avz my_project/ username@cluster:/home/username/my_project/
105+
However, in the long run this may become tedious. On Windows, a graphical client such as
106+
`WinSCP <https://winscp.net/>`_ makes file transfers quicker by providing an easy-to-use drag-and-drop UI.
104107

105108
*************************************
106109
Understanding the Cluster Environment
107110
*************************************
108111

109-
Common Directories
110-
------------------
112+
Working on an HPC cluster differs from working on a local machine. The filesystem is typically distributed and
113+
shared across compute nodes, and different directories are designed for specific purposes. Using them correctly
114+
is essential for performance, data safety, and efficient workflows.
111115

112-
- ``$HOME``: persistent storage
113-
- ``$SCRATCH``: fast temporary storage
114-
115-
116-
Module System
117-
-------------
116+
Common Directory Types
117+
----------------------
118118

119-
.. code-block:: bash
119+
Running applications on a cluster requires that all compute nodes involved in a job can access the same files.
120+
This is usually achieved through a shared parallel filesystem. While directory names vary between systems,
121+
most clusters provide the following *types* of storage:
120122

121-
module avail
122-
module list
123+
- ``$HOME`` (or home directory)
123124

124-
**************************
125-
Writing a Slurm Job Script
126-
**************************
125+
Your personal home directory. This is **persistent storage**, meaning files are kept long-term and often backed up.
126+
However, it is typically **not optimized for heavy I/O workloads**, so it should mainly be used for:
127127

128-
Create ``job.sh``:
128+
- Source code
129+
- Configuration files
130+
- Small input datasets
131+
- Scripts and job submission files
129132

130-
.. code-block:: bash
133+
- High-performance temporary storage (often called ``$SCRATCH`` or similar)
131134

132-
#!/bin/bash
133-
#SBATCH --job-name=energy_model
134-
#SBATCH --output=output.log
135-
#SBATCH --error=error.log
136-
#SBATCH --time=02:00:00
137-
#SBATCH --cpus-per-task=4
138-
#SBATCH --mem=8G
135+
A fast storage area intended for **temporary data and intensive I/O operations**.
136+
The exact location and name vary by system. Examples include:
139137

140-
module load python
141-
module load gurobi
138+
- ``$SCRATCH`` (if defined)
139+
- ``/scratch``
140+
- ``/tmp``
141+
- ``/jobs`` (on some systems)
142142

143-
source venv/bin/activate
143+
Consult your system documentation to find the correct path.
144144

145-
python run_model.py
145+
Use this storage for:
146146

147+
- Large simulation outputs
148+
- Intermediate data
149+
- Temporary working files
147150

148-
Key Parameters
149-
--------------
151+
.. note::
152+
These locations are usually not backed up and may be cleaned automatically
153+
after a retention period. Always copy important data to persistent storage.
150154

151-
- ``--job-name``: Job name
152-
- ``--time``: Maximum runtime
153-
- ``--cpus-per-task``: CPU cores
154-
- ``--mem``: Memory
155-
- ``--output``: Output file
156-
- ``--error``: Error file
155+
- Project or shared storage (often called ``$PROJECT`` or ``$WORK``)
157156

157+
A shared directory intended for **collaborative work within a project or research group**.
158+
Not all systems provide this, but when available it typically offers more space than ``$HOME``
159+
and longer retention than temporary storage.
158160

159-
Submitting the Job
160-
------------------
161+
Common uses include:
161162

162-
.. code-block:: bash
163+
- Shared datasets
164+
- Group software installations
165+
- Results that need to be preserved longer-term
163166

164-
sbatch job.sh
167+
.. note::
168+
If your system does not provide a dedicated project directory, you may need
169+
to manage shared data manually in agreed-upon locations.
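
Because the variable names differ between clusters, a quick check of what your shell actually defines can save
guesswork. This is a minimal sketch, assuming bash. It only inspects the four variable names mentioned above, and
your system may use others.

.. code-block:: bash

   # Report which of the commonly used storage variables are defined in this shell.
   for var in HOME SCRATCH PROJECT WORK; do
       value="${!var:-}"   # indirect expansion: the value of the variable named in $var
       if [ -n "$value" ]; then
           echo "$var is set to $value"
       else
           echo "$var is not set on this system"
       fi
   done

   # Show the free space on the filesystem that holds the home directory.
   df -h "${HOME:-.}"
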
165170

171+
Best Practices
172+
--------------
166173

167-
Monitoring the Job
168-
------------------
174+
Because directory names vary between systems, always adapt these guidelines to the paths provided on your cluster:
169175

170-
Check queue:
176+
- Copy input data from persistent storage (e.g. ``$HOME`` or project space) to temporary high-performance storage
177+
before running jobs.
178+
- Perform all **compute-intensive tasks** using the high-performance temporary storage (e.g. ``$SCRATCH`` or ``/jobs``).
179+
- Regularly clean up unnecessary files from temporary storage.
180+
- Avoid running large jobs directly from your home directory.
171181

172-
.. code-block:: bash
182+
Example Workflow
183+
----------------
173184

174-
squeue -u username
185+
A typical workflow might look like:
175186

176-
Job details:
187+
1. Prepare input files and scripts in your home directory (``$HOME``)
188+
2. Copy necessary data to the system's temporary work directory (e.g. ``$SCRATCH`` or ``/jobs``)
189+
3. Run the job using the scheduler
190+
4. Copy final results back to persistent storage (``$HOME`` or project space)
191+
5. Clean up temporary data
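
The five steps can be sketched in shell as follows. This is an illustration under stated assumptions, not a
ready-made job: temporary directories stand in for persistent and scratch storage so that it runs anywhere, the
file names are hypothetical, and a simple line count stands in for the real computation that would be submitted
with ``sbatch``.

.. code-block:: bash

   # Stand-ins for the real locations: replace with the paths of your cluster.
   persistent_dir="$(mktemp -d)"              # plays the role of $HOME or project space
   work_dir="$(mktemp -d)/demo_run"           # plays the role of scratch storage

   # 1. Prepare input files in persistent storage.
   mkdir -p "$persistent_dir/input" "$persistent_dir/results"
   printf 'a,10\nb,20\nc,30\n' > "$persistent_dir/input/data.csv"

   # 2. Copy the input to the temporary work directory.
   mkdir -p "$work_dir"
   cp -r "$persistent_dir/input" "$work_dir/"

   # 3. Run the job (placeholder computation: count the input lines).
   wc -l < "$work_dir/input/data.csv" > "$work_dir/result.txt"

   # 4. Copy the final results back to persistent storage.
   cp "$work_dir/result.txt" "$persistent_dir/results/"

   # 5. Clean up temporary data.
   rm -rf "$work_dir"

   echo "Result stored in $persistent_dir/results/result.txt"
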
177192

178-
.. code-block:: bash
193+
This approach ensures efficient use of cluster resources while keeping your data safe and organized. The next section
194+
provides an actual example workflow for executing Spine Toolbox projects in an HPC environment.
179195

180-
scontrol show job JOBID
196+
Module System
197+
-------------
181198

182-
View logs:
199+
Environment modules are a tool for managing the shell environment. They make it easier to switch between
200+
software packages, for example when multiple versions of the same software are installed. You can list the available
201+
modules with the command:
183202

184203
.. code-block:: bash
185204
186-
tail -f output.log
187-
188-
***********************
189-
Debugging Common Issues
190-
***********************
205+
module avail
191206
192-
Job Stuck in Queue
193-
------------------
207+
Load the module you need with the command:
194208

195-
- Cluster is full
196-
- Resource request too large
209+
.. code-block:: bash
197210
198-
Memory Errors
199-
-------------
211+
module load <module_name>
200212
201-
Increase memory:
213+
A module can be unloaded with the command:
202214

203215
.. code-block:: bash
204216
205-
#SBATCH --mem=16G
217+
module unload <module_name>
206218
219+
You can load any modules you need in a Slurm script with the ``module load`` command.
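
As an illustration, a job script might combine ``#SBATCH`` options with ``module load`` as sketched below. The
resource values, the log file names, and the ``gurobi`` module name are assumptions. Check ``module avail`` for the
names used on your cluster. The guard around ``module`` only keeps the sketch runnable on machines without
environment modules.

.. code-block:: bash

   #!/bin/bash
   # Resource requests: job name, maximum runtime, CPU cores, memory, and log files.
   #SBATCH --job-name=energy_model
   #SBATCH --time=02:00:00
   #SBATCH --cpus-per-task=4
   #SBATCH --mem=8G
   #SBATCH --output=output.log
   #SBATCH --error=error.log

   # Load the software the job needs (the module name is an assumption).
   if command -v module >/dev/null 2>&1; then
       module load gurobi
       module list
   else
       echo "module command not available on this machine"
   fi

   # Slurm sets SLURM_JOB_NAME inside a job; fall back to a note otherwise.
   job_name="${SLURM_JOB_NAME:-not running under Slurm}"
   echo "job name: $job_name"
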
207220

208-
Solver Not Found
209-
----------------
210-
211-
.. code-block:: bash
212-
213-
module load gurobi
221+
Slurm - Basic Commands
222+
----------------------
214223

215-
Check installation:
224+
These basic commands should be available on the HPC login node to help you get started. Many more commands
225+
exist, so also read your cluster's documentation in case there are special circumstances you should be
226+
aware of when using Slurm.
216227

217-
.. code-block:: bash
228+
sinfo
229+
*****
218230

219-
which gurobi_cl
231+
This command gives a quick overview of the cluster status. It shows the status of different partitions, time
232+
limits, and available nodes.
220233

234+
squeue
235+
******
221236

222-
Python Module Missing
223-
---------------------
237+
This command lists the pending and running jobs on all partitions. You can filter the output by partition with the
238+
``-p`` argument, or by username with the ``-u`` argument. For example, ``squeue -u <username>`` prints the queued
239+
jobs of user ``<username>``.
224240

225-
.. code-block:: bash
241+
sbatch
242+
******
226243

227-
pip install pyomo
244+
This command submits a job script to the queue. For example, ``sbatch job.sh``. Please make sure that the batch
245+
script file uses **Unix (LF)** line endings.
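
Scripts edited on Windows often carry CRLF line endings, which ``sbatch`` typically rejects. The snippet below is a
minimal sketch for detecting and removing them, assuming bash and GNU ``sed``. The file ``demo_job.sh`` is a
throwaway script created only for the demonstration.

.. code-block:: bash

   # Create a sample script with Windows (CRLF) line endings.
   printf '#!/bin/bash\r\necho hello\r\n' > demo_job.sh

   # Count the lines that end in a carriage return; non-zero means CRLF is present.
   crlf_before=$(grep -c $'\r$' demo_job.sh)

   # Strip the trailing carriage return from every line, editing the file in place.
   sed -i 's/\r$//' demo_job.sh

   # grep exits with status 1 when nothing matches, so ignore the exit status here.
   crlf_after=$(grep -c $'\r$' demo_job.sh || true)

   echo "before: $crlf_before, after: $crlf_after"
   rm -f demo_job.sh

On many systems the ``dos2unix`` utility performs the same conversion.
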
228246

247+
scancel
248+
*******
229249

230-
**************
231-
Best Practices
232-
**************
250+
This command cancels a job. It needs a job ID as an argument. For example, ``scancel 10346``.
233251

234-
Always:
252+
srun
253+
****
235254

236-
- Test locally first
237-
- Run small cases
238-
- Use version control
239-
- Log outputs
255+
This command runs a single command through Slurm. It accepts the same resource reservation arguments as ``sbatch``.
256+
For example, ``srun -J jobname --mem=64000 --pty -n 36 -p large36 /bin/bash`` starts an interactive shell (``large36`` is a site-specific partition name).
240257

241-
Avoid:
258+
scontrol show job
259+
*****************
242260

243-
- Running on login node
244-
- Over-requesting resources
245-
- Excessive file writing
261+
This command outputs detailed information about a job. It needs a job ID as an argument. For example,
262+
``scontrol show job 10345``.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
a,10
2+
b,20
3+
c,30
