Commit a7ac0cd

Fix Information.md and sub-pages (#315)
* Fix Information.md
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix Information_data_transmission.md
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix Information_monitoring.md
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix Information_nodes.md, remove anchor
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix Information_roles_and_resources.md
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix Information_sharing_policies.md, remove anchor
  Signed-off-by: Fabrice Normandin <[email protected]>
* Fix Information_storage.md, remove anchor
  Signed-off-by: Fabrice Normandin <[email protected]>

---------

Signed-off-by: Fabrice Normandin <[email protected]>
1 parent 8b8b15e commit a7ac0cd

7 files changed, with 59 additions and 70 deletions.

docs/Information.md

Lines changed: 2 additions & 3 deletions
@@ -1,11 +1,10 @@
-## Computing infrastructure and policies
+# Computing infrastructure and policies

 This section seeks to provide factual information and policies on the Mila cluster computing environments.

 <!--nav-->
 * [Roles and authorizations](Information_roles_and_resources.md)
-<!-- * [Overview of available computing resources at Mila](Information_computing_resources.md) -->
-* [Node profile description](Information_node_profiles.md)
+* [Node profile description](Information_nodes.md)
 * [Storage](Information_storage.md)
 * [Data sharing policies](Information_sharing_policies.md)
 * [Data Transmission](Information_data_transmission.md)

docs/Information_data_transmission.md

Lines changed: 1 addition & 2 deletions
@@ -1,5 +1,4 @@
-### Data Transmission
-
+# Data Transmission

 Multiple methods can be used to transfer data to/from the cluster:

docs/Information_monitoring.md

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-### Monitoring
+# Monitoring


 Every compute node on the Mila cluster has a [Netdata](https://www.netdata.cloud/)
@@ -26,7 +26,7 @@ of the Mila cluster and to sound the alarm if outages occur
 (e.g. if the nodes crash or if GPUs mysteriously become unavailable for SLURM).


-#### Example with Netdata on cn-c001
+## Example with Netdata on cn-c001


 For example, if we have a job running on `cn-c001`, we can type
@@ -36,7 +36,7 @@ page will appear.
 ![monitoring](monitoring.png)


-#### Example watching the CPU/RAM/GPU usage
+## Example watching the CPU/RAM/GPU usage


 Given that compute nodes are generally shared
@@ -95,7 +95,7 @@ make sure that this resources is always kept busy.
 ![monitoring_users](monitoring_users.png)


-#### Example with Mila dashboard
+## Example with Mila dashboard


 ![mila dashboard](mila_dashboard_2021-06-15.png)

docs/Information_nodes.md

Lines changed: 28 additions & 33 deletions
@@ -1,43 +1,38 @@
-### Node profile description
-
-
-<a id="node_list"></a>
+# Node profile description

 <!-- I find it somewhat futile to keep this documentation up to date
 manually. Perhaps we could create scripts in this folder that
 would generate an RST entry and that could be run on a Mila node
 for updates. -->
+<!-- TODO: Maybe add the tablesort feature of mkdocs: https://squidfunk.github.io/mkdocs-material/reference/data-tables/#sortable-tables -->
+
+| Name | GPU Model | Mem | # | CPUs | Sockets | Cores/Socket | Threads/Core | Memory (GB) | TmpDisk (TB) | Arch | Slurm Features |
+| --------------------- | --------- | --- | --- | ---- | ------- | ------------ | ------------ | ----------- | ------------ | ------ | ---------------------- |
+| **GPU Compute Nodes** | | | | | | | | | | | |
+| **cn-a[001-011]** | RTX8000 | 48 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | turing,48gb |
+| **cn-b[001-005]** | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | volta,nvlink,32gb |
+| **cn-c[001-040]** | RTX8000 | 48 | 8 | 64 | 2 | 32 | 1 | 384 | 3 | x86_64 | turing,48gb |
+| **cn-g[001-029]** | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 7 | x86_64 | ampere,nvlink,80gb |
+| **cn-i001** | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 3.6 | x86_64 | ampere,80gb |
+| **cn-j001** | A6000 | 48 | 8 | 64 | 2 | 32 | 1 | 1024 | 3.6 | x86_64 | ampere,48gb |
+| **cn-k[001-004]** | A100 | 40 | 4 | 48 | 2 | 24 | 1 | 512 | 3.6 | x86_64 | ampere,nvlink,40gb |
+| **cn-l[001-091]** | L40S | 48 | 4 | 48 | 2 | 24 | 1 | 1024 | 7 | x86_64 | lovelace,48gb |
+| **cn-n[001-002]** | H100 | 80 | 8 | 192 | 2 | 96 | 1 | 2048 | 35 | x86_64 | hopper,nvlink,80gb |
+| **DGX Systems** | | | | | | | | | | | |
+| **cn-d[001-002]** | A100 | 40 | 8 | 128 | 2 | 64 | 1 | 1024 | 14 | x86_64 | ampere,nvlink,dgx,40gb |
+| **cn-d[003-004]** | A100 | 80 | 8 | 128 | 2 | 64 | 1 | 2048 | 28 | x86_64 | ampere,nvlink,dgx,80gb |
+| **cn-e[002-003]** | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 512 | 7 | x86_64 | volta,nvlink,dgx,32gb |
+| **CPU Compute Nodes** | | | | | | | | | | | |
+| **cn-f[001-004]** | - | - | - | 32 | 1 | 32 | 1 | 256 | 10 | x86_64 | rome |
+| **cn-h[001-004]** | - | - | - | 64 | 2 | 32 | 1 | 768 | 7 | x86_64 | milan |
+| **cn-m[001-004]** | - | - | - | 96 | 2 | 48 | 1 | 1024 | 7 | x86_64 | sapphire |
+
+## Special nodes and outliers
+
+
+### DGX A100


-| Name | GPU Model | Mem | # | CPUs | Sockets | Cores/Socket | Threads/Core | Memory (GB) | TmpDisk (TB) | Arch | Slurm Features |
-|------|-----------|-----|---|------|---------|--------------|--------------|-------------|--------------|------|---------------|
-| **GPU Compute Nodes** | | | | | | | | | | | |
-| **cn-a[001-011]** | RTX8000 | 48 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | turing,48gb |
-| **cn-b[001-005]** | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 384 | 3.6 | x86_64 | volta,nvlink,32gb |
-| **cn-c[001-040]** | RTX8000 | 48 | 8 | 64 | 2 | 32 | 1 | 384 | 3 | x86_64 | turing,48gb |
-| **cn-g[001-029]** | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 7 | x86_64 | ampere,nvlink,80gb |
-| **cn-i001** | A100 | 80 | 4 | 64 | 2 | 32 | 1 | 1024 | 3.6 | x86_64 | ampere,80gb |
-| **cn-j001** | A6000 | 48 | 8 | 64 | 2 | 32 | 1 | 1024 | 3.6 | x86_64 | ampere,48gb |
-| **cn-k[001-004]** | A100 | 40 | 4 | 48 | 2 | 24 | 1 | 512 | 3.6 | x86_64 | ampere,nvlink,40gb |
-| **cn-l[001-091]** | L40S | 48 | 4 | 48 | 2 | 24 | 1 | 1024 | 7 | x86_64 | lovelace,48gb |
-| **cn-n[001-002]** | H100 | 80 | 8 | 192 | 2 | 96 | 1 | 2048 | 35 | x86_64 | hopper,nvlink,80gb |
-| **DGX Systems** | | | | | | | | | | | |
-| **cn-d[001-002]** | A100 | 40 | 8 | 128 | 2 | 64 | 1 | 1024 | 14 | x86_64 | ampere,nvlink,dgx,40gb |
-| **cn-d[003-004]** | A100 | 80 | 8 | 128 | 2 | 64 | 1 | 2048 | 28 | x86_64 | ampere,nvlink,dgx,80gb |
-| **cn-e[002-003]** | V100 | 32 | 8 | 40 | 2 | 20 | 1 | 512 | 7 | x86_64 | volta,nvlink,dgx,32gb |
-| **CPU Compute Nodes** | | | | | | | | | | | |
-| **cn-f[001-004]** | - | - | - | 32 | 1 | 32 | 1 | 256 | 10 | x86_64 | rome |
-| **cn-h[001-004]** | - | - | - | 64 | 2 | 32 | 1 | 768 | 7 | x86_64 | milan |
-| **cn-m[001-004]** | - | - | - | 96 | 2 | 48 | 1 | 1024 | 7 | x86_64 | sapphire |
-
-#### Special nodes and outliers
-
-
-##### DGX A100
-
-
-<a id="dgx_a100_nodes"></a>
-
 DGX A100 nodes are NVIDIA appliances with 8 NVIDIA A100 Tensor Core GPUs. Each
 GPU has either 40 GB or 80 GB of memory, for a total of 320 GB or 640 GB per
 appliance. The GPUs are interconnected via 6 NVSwitches which allow for 600 GB/s
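
Reviewer note on the node table in this file: the `Slurm Features` column lists tags such as `ampere` or `nvlink`. In Slurm, node features are requested through the `--constraint` option of `sbatch` or `srun`, where `&` means that all listed features are required. The sketch below only illustrates how a cell of that column could be turned into such an expression. `features_to_constraint` and `job.sh` are hypothetical names that do not appear in the documentation, and whether a given combination is schedulable depends on the cluster configuration.

```shell
# Hypothetical helper: turn a comma-separated feature list, as written in the
# "Slurm Features" column, into a Slurm constraint expression joined with "&".
features_to_constraint() {
    printf '%s\n' "$1" | tr ',' '&'
}

# Example use (commented out, since it needs a Slurm cluster and a job script):
# sbatch --gres=gpu:1 --constraint="$(features_to_constraint ampere,nvlink,80gb)" job.sh
```

Keeping the conversion in a small function avoids hand-typing the `&` separators, which must be quoted so the shell does not treat them as background operators.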

docs/Information_roles_and_resources.md

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-### Roles and authorizations
+# Roles and authorizations


 There are mainly two types of researchers statuses at Mila :
@@ -11,15 +11,15 @@ computing cluster. See your supervisor's Mila status to know what is your own
 status.


-### Overview of available computing resources at Mila
+## Overview of available computing resources at Mila


 The Mila cluster is to be used for regular development and relatively small
 number of jobs (< 5). It is a heterogeneous cluster. It uses
 [SLURM](Userguide_running_code.md) to schedule jobs.


-#### Mila cluster versus Digital Research Alliance of Canada clusters
+### Mila cluster versus Digital Research Alliance of Canada clusters


 There are a lot of commonalities between the Mila cluster and the clusters from
@@ -39,7 +39,7 @@ true in times when your favorite cluster is oversubscribed, because you can
 easily switch over to a different one if you are used to it.


-#### Guarantees about one GPU as absolute minimum
+### Guarantees about one GPU as absolute minimum


 There are certain guarantees that the Mila cluster tries to honor when it comes

docs/Information_sharing_policies.md

Lines changed: 2 additions & 4 deletions
@@ -1,8 +1,6 @@
-### Data sharing policies
+# Data sharing policies


-<a id="acl_note"></a>
-
 !!! note
 [`/network/scratch`](Information_storage.md#scratch) aims to support [Access
 Control Lists
@@ -11,7 +9,7 @@
 model checkpoints, etc.


-[`/network/projects`](Information_storage.md) aims to offer a collaborative
+[`/network/projects`](Information_storage.md#projects) aims to offer a collaborative
 space for long-term projects. Data that should be kept for a longer period than
 90 days can be stored in that location but first a request to [Mila's
 helpdesk](https://it-support.mila.quebec) has to be made to create the project

docs/Information_storage.md

Lines changed: 18 additions & 20 deletions
@@ -1,21 +1,19 @@
-<a id="milacluster_storage"></a>
-
-### Storage
-
-| Path | Performance | Usage | Quota (Space/Files) | Backup | Auto-cleanup |
-|------|-------------|-------|---------------------|--------|--------------|
-| `/network/datasets/` | High | Curated raw datasets (read only) ||||
-| `/network/weights/` | High | Curated models weights (read only) ||||
-| `$HOME` or `/home/mila/<u>/<username>/` | Low | Personal user space; specific libraries, code, binaries | 100GB/1000K | Daily | no |
-| `$SCRATCH` or `/network/scratch/<u>/<username>/` | High | Temporary job results; processed datasets; optimized for small files | 5TB/no | no | 90 days |
-| `$SLURM_TMPDIR` | Highest | High speed disk for temporary job results | no/no | no | at job end |
-| `/network/projects/<groupname>/` | Fair | Shared space for collaboration; long-term project storage | 1TB/1000K | Daily | no |
-| `$ARCHIVE` or `/network/archive/<u>/<username>/` | Low | Long-term personal storage | 5TB | no | no |
+# Storage
+
+| Path | Performance | Usage | Quota (Space/Files) | Backup | Auto-cleanup |
+| ------------------------------------------------ | ----------- | -------------------------------------------------------------------- | ------------------- | ------ | ------------ |
+| `/network/datasets/` | High | Curated raw datasets (read only) ||||
+| `/network/weights/` | High | Curated models weights (read only) ||||
+| `$HOME` or `/home/mila/<u>/<username>/` | Low | Personal user space; specific libraries, code, binaries | 100GB/1000K | Daily | no |
+| `$SCRATCH` or `/network/scratch/<u>/<username>/` | High | Temporary job results; processed datasets; optimized for small files | 5TB/no | no | 90 days |
+| `$SLURM_TMPDIR` | Highest | High speed disk for temporary job results | no/no | no | at job end |
+| `/network/projects/<groupname>/` | Fair | Shared space for collaboration; long-term project storage | 1TB/1000K | Daily | no |
+| `$ARCHIVE` or `/network/archive/<u>/<username>/` | Low | Long-term personal storage | 5TB | no | no |

 !!! note
 The `$HOME` file system is backed up once a day. For any file restoration request, file a request to [Mila's IT support](https://it-support.mila.quebec) with the path to the file or directory to restore, with the required date.

-#### $HOME
+## $HOME

 `$HOME` is appropriate for codes and libraries which are small and read once,
 as well as the experimental results that would be needed at a later time (e.g.
@@ -29,7 +27,7 @@ million per user. The command to check the quota usage from a login node is:
 disk-quota
 ```

-#### $SCRATCH
+## $SCRATCH


 `$SCRATCH` can be used to store processed datasets, work in progress datasets
@@ -47,14 +45,14 @@ the quota usage from a login node is:
 disk-quota
 ```

-#### $SLURM_TMPDIR
+## $SLURM_TMPDIR


 `$SLURM_TMPDIR` points to the local disk of the node on which a job is
 running. It should be used to copy the data on the node at the beginning of the
 job and write intermediate checkpoints. This folder is cleared after each job.

-#### projects
+## projects


 `projects` can be used for collaborative projects. It aims to ease the
@@ -67,7 +65,7 @@ of files (inodes). The limits for blocks and inodes are respectively 1TiB and
 !!! note
 It is possible to request higher quota limits if the project requires it. File a request to [Mila's IT support](https://it-support.mila.quebec).

-#### $ARCHIVE
+## $ARCHIVE


 `$ARCHIVE` purpose is to store data other than datasets that has to be kept
@@ -96,7 +94,7 @@ df -h $ARCHIVE
 !!! note
 There is **NO** backup of this file system.

-#### datasets
+## datasets


 `datasets` contains curated datasets to the benefit of the Mila community. To
@@ -120,7 +118,7 @@ command:
 ssh [CLUSTER_LOGIN] -C "projects/rrg-bengioy-ad/data/curated/list_datasets_cc.sh"
 ```

-#### weights
+## weights


 `weights` contains curated models weights to the benefit of the Mila
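
Reviewer note on the `$SLURM_TMPDIR` paragraph in this file: the text recommends copying data onto the node at the beginning of a job. A minimal sketch of that staging step follows, assuming the job runs under Slurm so that `SLURM_TMPDIR` is set. The function name `stage_to_tmpdir`, the `data` subdirectory, and the example source path are hypothetical and are not taken from the documentation.

```shell
# Hypothetical helper: copy the contents of a source directory to the
# node-local disk and print the destination path.
# Aborts with an error message if SLURM_TMPDIR is unset or empty.
stage_to_tmpdir() {
    stage_src="$1"
    stage_dest="${SLURM_TMPDIR:?SLURM_TMPDIR is not set}/data"
    mkdir -p "$stage_dest"
    # "src/." copies the directory contents rather than the directory itself.
    cp -r "$stage_src"/. "$stage_dest"/
    printf '%s\n' "$stage_dest"
}

# Example use inside a job script (the source path is an assumption):
# DATA_DIR=$(stage_to_tmpdir "$SCRATCH/my_dataset")
```

Since the folder is cleared after each job, anything written under the returned path that must survive (for example final checkpoints) would still need to be copied back to `$SCRATCH` or another persistent location before the job ends.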
