Commit 4d6af99

HPC Platform docs (#122)

1 parent 2ebbf45

10 files changed: +492 -46 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -1,2 +1,4 @@
 # path that contains html generated by `mkdocs build`
 site
+
+*.sw[nopq]

docs/alps/platforms.md

Lines changed: 6 additions & 6 deletions
@@ -7,17 +7,17 @@ A platform can consist of one or multiple [clusters][ref-alps-clusters], and its

 <div class="grid cards" markdown>

-- :fontawesome-solid-mountain: __Machine Learning Platform__
+- :fontawesome-solid-mountain: __HPC Platform__

-    The Machine Learning Platform (MLP) hosts ML and AI researchers.
+    The HPC Platform (HPCP) provides services for the HPC community in Switzerland and abroad. The majority of compute cycles are provided to the [User Lab](https://www.cscs.ch/user-lab/overview) via peer-reviewed allocation schemes.

-    [:octicons-arrow-right-24: MLP][ref-platform-mlp]
+    [:octicons-arrow-right-24: HPCP][ref-platform-hpcp]

-- :fontawesome-solid-mountain: __HPC Platform__
+- :fontawesome-solid-mountain: __Machine Learning Platform__

-    !!! todo
+    The Machine Learning Platform (MLP) hosts ML and AI researchers, particularly the SwissAI initiative.

-    [:octicons-arrow-right-24: HPCP][ref-platform-hpcp]
+    [:octicons-arrow-right-24: MLP][ref-platform-mlp]

 - :fontawesome-solid-mountain: __Climate and Weather Platform__

docs/clusters/daint.md

Lines changed: 189 additions & 0 deletions
[](){#ref-cluster-daint}
# Daint

Daint is the main [HPC Platform][ref-platform-hpcp] cluster that provides compute nodes and file systems for GPU-enabled workloads.

## Cluster specification

### Compute nodes

Daint consists of around 800-1000 [Grace-Hopper nodes][ref-alps-gh200-node].
The number of nodes can vary as nodes are moved to or from other clusters on Alps.

There are four login nodes, `daint-ln00[1-4]`.
You will be assigned to one of the four login nodes when you log in to the system via SSH, from where you can edit files, compile applications and launch batch jobs.

| node type | number of nodes | total CPU sockets | total GPUs |
|-----------|-----------------|-------------------|------------|
| [gh200][ref-alps-gh200-node] | 1,022 | 4,088 | 4,088 |

### Storage and file systems

Daint uses the [HPCP filesystems and storage policies][ref-hpcp-storage].

## Getting started

### Logging into Daint

To connect to Daint via SSH, first refer to the [ssh guide][ref-ssh].

!!! example "`~/.ssh/config`"
    Add the following to your [SSH configuration][ref-ssh-config] to enable you to connect directly to Daint using `ssh daint`.
    ```
    Host daint
        HostName daint.alps.cscs.ch
        ProxyJump ela
        User cscsusername
        IdentityFile ~/.ssh/cscs-key
        IdentitiesOnly yes
    ```

### Software

[](){#ref-cluster-daint-uenv}
#### uenv

Daint provides uenv to deliver programming environments and application software.
Please refer to the [uenv documentation][ref-uenv] for detailed information on how to use the uenv tools on the system.
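
A typical session pulls a uenv image and then starts a shell with it mounted. A minimal sketch, assuming the uenv v8 CLI described in the change log below; the image name, version and view are illustrative, so check `uenv image find` for what is actually available:

```bash
# list the uenv images available on the system
uenv image find

# pull an image to your local repository (name/version are illustrative)
uenv image pull prgenv-gnu/24.11:v1

# start an interactive shell with the uenv mounted and its default view enabled
uenv start --view=default prgenv-gnu/24.11:v1
```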

<div class="grid cards" markdown>

- :fontawesome-solid-layer-group: __Scientific Applications__

    Provide the latest versions of scientific applications, tuned for Daint, and the tools required to build your own versions of the applications.

    * [CP2K][ref-uenv-cp2k]
    * [GROMACS][ref-uenv-gromacs]
    * [LAMMPS][ref-uenv-lammps]
    * [NAMD][ref-uenv-namd]
    * [QuantumESPRESSO][ref-uenv-quantumespresso]
    * [VASP][ref-uenv-vasp]

</div>

<div class="grid cards" markdown>

- :fontawesome-solid-layer-group: __Programming Environments__

    Provide compilers, MPI, Python, common libraries and tools used to build your own applications.

    * [prgenv-gnu][ref-uenv-prgenv-gnu]
    * [prgenv-nvfortran][ref-uenv-prgenv-nvfortran]
    * [linalg][ref-uenv-linalg]
    * [julia][ref-uenv-julia]

</div>

<div class="grid cards" markdown>

- :fontawesome-solid-layer-group: __Tools__

    Provide tools like:

    * [Linaro Forge][ref-uenv-linaro]

</div>

[](){#ref-cluster-daint-containers}
#### Containers

Daint supports container workloads using the [container engine][ref-container-engine].

To build images, see the [guide to building container images on Alps][ref-build-containers].
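
As a sketch of the workflow: the container engine reads an Environment Definition File (EDF) and makes the environment available to Slurm jobs via the `--environment` option mentioned in the change log below. The file name and image here are illustrative assumptions; see the container engine documentation for the exact EDF schema:

```bash
# create a minimal EDF in the default search path (file name and image are illustrative)
mkdir -p ~/.edf
cat > ~/.edf/ubuntu.toml <<'EOF'
image = "ubuntu:24.04"
EOF

# run a command on a compute node inside the container environment
srun --environment=ubuntu cat /etc/os-release
```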

#### Cray Modules

!!! warning
    The Cray Programming Environment (CPE), loaded using `module load cray`, is no longer supported by CSCS.

    CSCS will continue to support and update uenv and the container engine, and users are encouraged to update their workflows to use these methods at the first opportunity.

    The CPE is still installed on Daint, however it will receive no support or updates, and will be replaced with a container in a future update.

## Running jobs on Daint

### Slurm

Daint uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor compute-intensive workloads.

There are four [Slurm partitions][ref-slurm-partitions] on the system:

* the `normal` partition is for all production workloads.
* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal].
* the `low` partition is a low-priority partition, which may be enabled for specific projects at specific times.

| name | nodes | max nodes per job | time limit |
| -- | -- | -- | -- |
| `normal` | unlim | - | 24 hours |
| `debug` | 24 | 2 | 30 minutes |
| `xfer` | 2 | 1 | 24 hours |
| `low` | unlim | - | 24 hours |

* nodes in the `normal`, `debug` and `low` partitions are not shared
* nodes in the `xfer` partition can be shared

See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
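
As an illustration, a minimal batch script for the `debug` partition might look as follows. The job name and application binary are placeholders; the four ranks per node correspond to the four GPUs per GH200 node listed in the table above:

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=debug       # small allocation for testing, 30 minute limit
#SBATCH --nodes=2               # debug allows at most 2 nodes per job
#SBATCH --ntasks-per-node=4     # one rank per GH200 GPU
#SBATCH --time=00:30:00

# launch the application (placeholder binary)
srun ./my_app
```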

### FirecREST

Daint can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint.

!!! warning "The FirecREST v1 API is still available, but deprecated"
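
As an illustration of reaching the v2 API with `curl`: the `/status/systems` resource path and the bearer-token workflow below are assumptions for the sake of the example; consult the FirecREST documentation for the actual endpoint list and how to obtain a token:

```bash
# query the FirecREST v2 API with an OAuth2 access token (resource path is illustrative)
curl -s -H "Authorization: Bearer ${TOKEN}" \
    "https://api.cscs.ch/ml/firecrest/v2/status/systems"
```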

## Maintenance and status

### Scheduled maintenance

!!! todo "move this to HPCP top level docs"
    Wednesday mornings 8:00-12:00 CET are reserved for periodic updates, with services potentially unavailable during this time frame. If the batch queues must be drained (for redeployment of node images, rebooting of compute nodes, etc.) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.

    Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).

### Change log

!!! change "2025-05-21"
    Minor enhancements to the system configuration have been applied.
    These changes should reduce the frequency of compute nodes being marked as `NOT_RESPONDING` by the workload manager, while we continue to investigate the issue.

!!! change "2025-05-14"
    ??? note "Performance hotfix"
        The [access-counter-based memory migration feature](https://developer.nvidia.com/blog/cuda-toolkit-12-4-enhances-support-for-nvidia-grace-hopper-and-confidential-computing/#access-counter-based_migration_for_nvidia_grace_hopper_memory) in the NVIDIA driver for Grace Hopper is disabled to address performance issues affecting NCCL-based workloads (e.g. LLM training).

    ??? note "NVIDIA boost slider"
        Added an option to enable the NVIDIA boost slider (vboost) via Slurm using the `-C nvidia_vboost_enabled` flag.
        This feature, disabled by default, may increase GPU frequency and performance while staying within the power budget.

    ??? note "Enroot update"
        The container runtime is upgraded from version 2.12.0 to 2.13.0. This update includes libfabric version 1.22.0 (previously 1.15.2.0), which has demonstrated improved performance during LLM checkpointing.

!!! change "2025-04-30"
    ??? note "uenv is updated from v7.0.1 to v8.1.0"
        * improved uenv view management
        * automatic generation of a default uenv repository the first time uenv is called
        * configuration files
        * bash completion
        * relative paths can be used for referring to squashfs images
        * support for the `SLURM_UENV` and `SLURM_UENV_VIEW` environment variables (useful in CI/CD pipelines)
        * better error messages and small bug fixes

    ??? note "Pyxis is upgraded from v24.5.0 to v24.5.3"
        * Added image caching for Enroot
        * Added support for environment variable expansion in EDFs
        * Added support for relative path expansion in EDFs
        * Print a message about the experimental status of the `--environment` option when used outside of the `srun` command
        * Merged small features and bug fixes from upstream Pyxis releases v0.16.0 to v0.20.0
        * Internal changes: various bug fixes and refactoring

??? change "2025-03-12"
    1. The number of compute nodes has been increased to 1018.
    1. The restriction on the number of running jobs per project has been lifted.
    1. A `low` priority partition has been added, which allows some project types to consume up to 130% of the project's quarterly allocation.
    1. We have increased the power cap for the GH module from 624 W to 660 W. You might see increased application performance as a consequence.
    1. Small changes in kernel tuning parameters.

### Known issues

!!! todo
    Most of these issues (see the original [KB docs](https://confluence.cscs.ch/spaces/KB/pages/868811400/Daint.Alps#Daint.Alps-Knownissues)) should be consolidated in a location where they can be linked to by all clusters.

    We have some known issues documented under [communication libraries][ref-communication-cray-mpich], however these might be a bit too dispersed for centralised linking.
