Commit 6651daf

Merge
2 parents 78c1e9a + 9304b12 commit 6651daf

5 files changed

Lines changed: 215 additions & 89 deletions

docs/about.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
1-
# About LiGHT
1+
# Who We Are
22

3-
We are LiGHT (yes)
3+
We, at LiGHT, are a team of passionate individuals committed to building AI systems that improve healthcare in innovative and ethical ways. We are a student-led organization based at the Ecole Polytechnique Fédérale de Lausanne (EPFL) in Switzerland.

docs/clusters/cscs/cscs.md

Lines changed: 30 additions & 29 deletions
@@ -2,9 +2,8 @@
22

33
!!! danger "DO NOT RUN ON LOGIN NODE"
44

5-
When you establish a direct connection using `ssh` you connect to the login node. Everyone is on that node and as such **YOU SHOULD NEVER RUN ANY
6-
JOBS DIRECTLY ON THE LOGIN NODE**. If you want to run a process, *like a training*, you can run it on a [dedicated allocated job](#launching-job)
7-
5+
When you establish a direct connection using `ssh` you connect to the login node. Everyone is on that node and as such **YOU SHOULD NEVER RUN ANY
6+
JOBS DIRECTLY ON THE LOGIN NODE**. If you want to run a process, *like a training*, you can run it on a [dedicated allocated job](#launching-job)
87

98
## Pre-setup (access to the CSCS)
109

@@ -186,13 +185,15 @@ ssh-add -t 1d ~/.ssh/cscs-key
186185

187186
You will have to execute this *bash* script every day. Warning: you are limited in the number of keys you can generate per day (roughly 5), so be mindful when debugging. You should
188187
launch this script with `bash`, using the following commands:
188+
189189
```bash
190190
nano cscs-refresh.sh # update the file
191191
chmod +x ./cscs-refresh.sh
192192
bash ./cscs-refresh.sh
193193
```
194194

195195
If you don't want to have your login ID stored in a script, you can comment out the lines:
196+
196197
```bash
197198
#read -p "Username : " USERNAME
198199
#read -s -p "Password: " PASSWORD
@@ -216,7 +217,7 @@ Host ela
216217
ForwardAgent yes
217218
ForwardX11 yes
218219
forwardX11Trusted yes
219-
IdentityFile ~/.ssh/cscs-key
220+
IdentityFile ~/.ssh/cscs-key
220221
221222
222223
Host todi
@@ -226,7 +227,7 @@ Host todi
226227
ForwardAgent yes
227228
ForwardX11 yes
228229
forwardX11Trusted yes
229-
IdentityFile ~/.ssh/cscs-key
230+
IdentityFile ~/.ssh/cscs-key
230231
231232
Host clariden
232233
HostName clariden.cscs.ch
@@ -311,7 +312,6 @@ When running job, you will need to execute your job inside docker images. This i
311312
mkdir /users/$USER/.edf
312313
```
313314

314-
315315
Create a `/users/$USER/.edf/multimodal.toml` file:
316316

317317
```toml
@@ -341,7 +341,6 @@ FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD = "16777216"
341341
FI_CXI_COMPAT = "0"
342342
```
343343

344-
345344
Notice 3 things:
346345

347346
* We specify the path to the `.sqsh` file in the `image` attribute. This is the image used by the job that stores all of the dependencies.
@@ -354,8 +353,8 @@ Note that for other types of job, you will probably require a different image an
354353

355354
There are 2 types of job that you can launch:
356355

357-
- Interactive using `srun` (which gives you a terminal)
358-
- Non-interactive using `sbatch` (which schedules a job)
356+
* Interactive using `srun` (which gives you a terminal)
357+
* Non-interactive using `sbatch` (which schedules a job)
359358

360359
### Interactive job
361360

@@ -367,12 +366,12 @@ srun --time=1:29:59 --partition debug -A a127 --environment=/users/$USER/.edf/mu
367366

368367
Here is a breakdown of the command:
369368

370-
- `--time` is the maximum running time of the job (here, the job runs for 1h30 before it gets killed)
371-
- `--partition debug` is the node partition in which the job is executed. As of 14/08/2025, there are 3 partitions:
369+
* `--time` is the maximum running time of the job (here, the job runs for 1h30 before it gets killed)
370+
* `--partition debug` is the node partition in which the job is executed. As of 14/08/2025, there are 3 partitions:
372371

373-
- `normal`: with a maximum running time of 12 hours and no limit on the number of distributed nodes. This partition is the partition used for non-interactive jobs and long interactive jobs
374-
- `debug`: with a maximum running time of 1h30 with only one node. This partition is meant for interactive jobs
375-
- `xfer`: this partition is meant for data transfer and doesn't claim any GPU
372+
* `normal`: with a maximum running time of 12 hours and no limit on the number of distributed nodes. This partition is the partition used for non-interactive jobs and long interactive jobs
373+
* `debug`: with a maximum running time of 1h30 with only one node. This partition is meant for interactive jobs
374+
* `xfer`: this partition is meant for data transfer and doesn't claim any GPU
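
The limits above can be checked before submitting a job. The following is a minimal sketch, not part of the CSCS tooling: the function names are hypothetical, the limits are simply the ones listed above (12 hours for `normal`, 1h30 for `debug`), and only the `H:MM:SS` form of `--time` is handled. It assumes `bash`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the maximum running time, in minutes,
# for the partitions described above.
partition_max_minutes() {
  case "$1" in
    normal) echo 720 ;;  # 12 hours
    debug)  echo 90 ;;   # 1h30
    *)      return 1 ;;  # xfer or unknown: no limit is stated in this document
  esac
}

# Convert an H:MM:SS string (the form used with --time above) into whole
# minutes, rounding any leftover seconds up to the next minute.
time_to_minutes() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 60 + 10#$m + (10#$s > 0 ? 1 : 0) ))
}

# Succeeds when the requested time fits within the partition limit.
fits_partition() {
  local limit requested
  limit=$(partition_max_minutes "$1") || return 2
  requested=$(time_to_minutes "$2")
  [ "$requested" -le "$limit" ]
}
```

For example, `fits_partition debug 1:29:59` succeeds, while `fits_partition debug 1:30:01` fails because the request exceeds the 1h30 limit.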
376375

377376
To check if you have been allocated a node, run the following command in another terminal:
378377

@@ -384,15 +383,15 @@ squeue --me --start
384383

385384
This command will give you a dynamic estimation of the scheduled time (may change as people pass you in the priority queue). Note that this command doesn't output anything if your job has been allocated.
386385

387-
Once you have been allocated a job, you will have a terminal inside the allocated node. Make sure that your `bash prompt` is of the form `$USER@nidxxxxxx` (and __not__ `[clariden][$USER@clariden-lnxxx]`).
386+
Once you have been allocated a job, you will have a terminal inside the allocated node. Make sure that your `bash prompt` is of the form `$USER@nidxxxxxx` (and __not__ `[clariden][$USER@clariden-lnxxx]`).
388387

389388
Furthermore:
390389

391390
```bash
392391
echo $HF_HOME
393392
```
394393

395-
Make sure that the output is `/iopsstor/scratch/cscs/$USER/hf`. This is extremely important because if you run trainings without telling it where to download the Llama-3.1 model, it will do so in your working directory `/users/$USER` and you do not have enough storage for that.
394+
Make sure that the output is `/iopsstor/scratch/cscs/$USER/hf`. This is extremely important because if you run trainings without telling it where to download the Llama-3.1 model, it will do so in your working directory `/users/$USER` and you do not have enough storage for that.
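
Both checks can be combined into a small guard to run before launching a training. This is only a sketch under assumptions: the function names are hypothetical, and it assumes that the `HOSTNAME` variable set by `bash` matches the `nidxxxxxx` part of the prompt on a compute node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when the given hostname looks like a
# compute node ("nid" followed by digits), per the prompt format described above.
is_compute_node() {
  [[ "$1" =~ ^nid[0-9]+$ ]]
}

# Hypothetical helper: succeeds only when the given HF_HOME value ($1)
# is the scratch location expected for the given username ($2).
hf_home_is_scratch() {
  [ "$1" = "/iopsstor/scratch/cscs/$2/hf" ]
}

# Usage inside the allocated job (assumes HOSTNAME reflects the node name).
if is_compute_node "${HOSTNAME:-}" && hf_home_is_scratch "${HF_HOME:-}" "${USER:-}"; then
  echo "Environment looks ready for training"
else
  echo "Not on a compute node, or HF_HOME is not on scratch" >&2
fi
```

If the guard prints the second message, fix the environment before starting a training rather than after the download has filled `/users/$USER`.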
396395

397396
Launch a training with MultiMeditron by running the following commands:
398397

@@ -487,20 +486,20 @@ echo "END TIME: $(date)"
487486

488487
Make sure to replace every `$USER` with your username and `$HF_TOKEN` with your Hugging Face token. Pay attention to the following parameters:
489488

490-
- `#SBATCH --job-name demo-job` sets the job name to `demo-job`
491-
- `#SBATCH --nodes 1` means that we are claiming one node (of 4 GPUs). You should increase this if you are launching bigger jobs
492-
- `#SBATCH --output /users/$USER/meditron/reports/R-%x.%j.out` and `#SBATCH --error /users/$USER/meditron/reports/R-%x.%j.err` mean that this will create a folder `/users/$USER/meditron/reports` that stores all the job logs
493-
- Note that here, we execute a training of MultiMeditron with `config/config_alignment.yaml`, thus you need to make sure that the paths of the dataset are correct
494-
- Note that the part which follows the `#SBATCH` commands will be executed on every node
489+
* `#SBATCH --job-name demo-job` sets the job name to `demo-job`
490+
* `#SBATCH --nodes 1` means that we are claiming one node (of 4 GPUs). You should increase this if you are launching bigger jobs
491+
* `#SBATCH --output /users/$USER/meditron/reports/R-%x.%j.out` and `#SBATCH --error /users/$USER/meditron/reports/R-%x.%j.err` mean that this will create a folder `/users/$USER/meditron/reports` that stores all the job logs
492+
* Note that here, we execute a training of MultiMeditron with `config/config_alignment.yaml`, thus you need to make sure that the paths of the dataset are correct
493+
* Note that the part which follows the `#SBATCH` commands will be executed on every node
495494

496495
To queue your job, run:
497496
bash
497+
498498
```
499499
# CSCS login node
500500
501501
sbatch sbatch_train.sh
502502
```
503-
504503

505504
You can check if your job has been allocated GPUs by running:
506505

@@ -509,6 +508,7 @@ You can check if your job has been allocated GPUs by running:
509508

510509
squeue --me
511510
```
511+
512512
This command gives you the `JOBID` of the job you have launched
513513

514514
Once the job enters the `R` (running) state, you can check its logs by going into the `reports` directory:
@@ -522,7 +522,6 @@ tail -f R-%x.%j.err
522522

523523
where you need to replace `R-%x.%j.err` by the actual report name.
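
In these file name patterns, Slurm replaces `%x` with the job name and `%j` with the job ID. The sketch below builds the resulting report name; the function name is hypothetical and the job ID used in the example is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the file name produced by the R-%x.%j.<ext>
# pattern, given the job name (%x), the job ID (%j) and the extension.
report_name() {
  local job_name=$1 job_id=$2 ext=${3:-err}
  printf 'R-%s.%s.%s\n' "$job_name" "$job_id" "$ext"
}

# Example with the job name from the script above and a made-up job ID.
report_name demo-job 123456 err
```

The output can be passed directly to `tail -f` from inside the `reports` directory, using the `JOBID` reported by `squeue --me`.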
524524

525-
526525
You can either let the job finish or cancel it.
527526

528527
```bash
@@ -538,11 +537,12 @@ where `$JOBID`is the `JOBID` that you get when running `squeue --me`
538537
If you want to join the modern era of computers and have something more involved than a terminal to code (unlike some people), you may want to *"connect"* your Visual Studio Code instance directly to the cluster. This allows you to modify the code directly, using the correct environment (so that it doesn't flag half the packages as nonexistent).
539538

540539
#### Procedure
541-
- Install the [Remote development extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
542-
- [Launch a job on the cluster](launching-job)
543-
540+
541+
* Install the [Remote development extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
542+
* [Launch a job on the cluster](#launching-job)
543+
544544
You will need the vscode *CLI* installed on the job you launched.
545-
545+
546546
=== "Use prebuilt image"
547547

548548
You can use the image that I personally use: update your environment file to point to the image at `/capstor/store/cscs/swissai/a127/meditron/docker/multimeditron_latest_2.sqsh`. With this solution, however, you will inherit all of my Python dependencies. If you want to use your own image, see the manual installation.
@@ -565,13 +565,14 @@ If you want to join the modern era of computers and have something more involve
565565
RUN rm -rf /workspace/code
566566
```
567567

568-
- Once your job has been launched with *vscode* CLI installed, it's time to run the *code tunnel* **within the job**. Go to the folder of your project and run the following command
568+
* Once your job has been launched with *vscode* CLI installed, it's time to run the *code tunnel* __within the job__. Go to the folder of your project and run the following command
569+
569570
```bash
570571
cd /path/to/my/awesome/project
571572
code tunnel --name=cluster-tunnel
572573
```
573-
This will prompt you to connect to your `github` account, do so.
574574

575+
This will prompt you to connect to your `github` account, do so.
575576

576577
!!! warning "Bug in CSCS after update"
577578

docs/index.md

Lines changed: 18 additions & 1 deletion
@@ -1,6 +1,23 @@
1-
# Welcome to LiGHT!
1+
# Welcome to LiGHT
22

33
LiGHT stands for Laboratory for Intelligent Global Health and Humanitarian Response Technologies
44

55
If you are new to LiGHT, start by reading the [Getting Started tutorial](gettingstarted.md)
66

7+
## Projects
8+
9+
If you are a student interested in joining LiGHT (or just curious about what we do), you
10+
can explore our ongoing projects in the [Projects section](projects.md)
11+
12+
## Clusters
13+
14+
Modern LLMs require an unreasonable amount of computational resources. As such, to develop
15+
and test our models, we have access to multiple clusters. If you are part of the laboratory,
16+
and want to learn more about how to use them, check the [Clusters section](clusters.md)
17+
18+
Please refer to your supervisor or the lab admin for any questions regarding access to the clusters.
19+
20+
## Documentation
21+
22+
This documentation is a work in progress. If you have any suggestions or want to contribute,
23+
please reach out to the lab admin or open an issue on our [GitHub repository](https://github.com/EPFLiGHT/LiGHT-doc/tree/main)
