We, at LiGHT, are a team of passionate individuals committed to building and developing AI systems that improve healthcare innovatively and ethically. We are a student-led organization based at the Ecole Polytechnique Fédérale de Lausanne (EPFL) in Switzerland.
You will have to run this *bash* script every day. Warning: you are limited in the number of keys you can generate per day (roughly 5), so be mindful when debugging. Launch the script with `bash` by running the following commands:

```bash
nano cscs-refresh.sh  # update the file
chmod +x ./cscs-refresh.sh
bash ./cscs-refresh.sh
```

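Because the number of keys per day is limited, it can help to skip the refresh when the current key is still recent. The sketch below is only an illustration: the helper name `key_is_fresh`, the key path `~/.ssh/cscs-key` as the file to inspect, and the 23-hour threshold (assuming a key stays valid for roughly a day, as the daily refresh suggests) are assumptions, not part of `cscs-refresh.sh`. A possible use is `key_is_fresh ~/.ssh/cscs-key || bash ./cscs-refresh.sh`.

```bash
# Hypothetical helper (not part of cscs-refresh.sh).
# Succeeds if the key file exists and was modified less than
# max_age_min minutes ago (default: 1380 minutes, i.e. 23 hours).
key_is_fresh() {
    local key_file="$1" max_age_min="${2:-1380}"
    # find prints the path only if the file is newer than the threshold
    [ -f "$key_file" ] && [ -n "$(find "$key_file" -mmin "-$max_age_min" 2>/dev/null)" ]
}
```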
If you don't want your login ID stored in the script, you can uncomment the following lines so that the script prompts for your credentials instead:

```bash
#read -p "Username : " USERNAME
#read -s -p "Password: " PASSWORD
@@ -216,7 +217,7 @@ Host ela
    ForwardAgent yes
    ForwardX11 yes
    ForwardX11Trusted yes
    IdentityFile ~/.ssh/cscs-key

Host todi
@@ -226,7 +227,7 @@ Host todi
    ForwardAgent yes
    ForwardX11 yes
    ForwardX11Trusted yes
    IdentityFile ~/.ssh/cscs-key

Host clariden
    HostName clariden.cscs.ch
@@ -311,7 +312,6 @@ When running job, you will need to execute your job inside docker images. This i
mkdir /users/$USER/.edf
```

Create a `/users/$USER/.edf/multimodal.toml` file:
* `--time` is the maximum running time of the job (here, the job runs for 1h30 before it gets killed)
* `--partition debug` is the node partition on which the job is executed. As of 14/08/2025, there are 3 partitions:

    * `normal`: maximum running time of 12 hours and no limit on the number of distributed nodes. This partition is used for non-interactive jobs and long interactive jobs
    * `debug`: maximum running time of 1h30 on a single node. This partition is meant for interactive jobs
    * `xfer`: meant for data transfer; it doesn't claim any GPU

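The partition limits above can be turned into a small check before requesting a node. This is a minimal sketch, assuming the limits are exactly the ones listed (90 minutes and one node for `debug`, 12 hours for `normal`); the function name `pick_partition` is hypothetical, and it ignores `xfer` as well as the interactive versus non-interactive distinction. For example, `pick_partition 60 1` prints `debug`.

```bash
# Hypothetical helper: choose a partition from the limits listed above.
# Arguments: requested running time in minutes, number of nodes.
pick_partition() {
    local minutes="$1" nodes="$2"
    if [ "$nodes" -eq 1 ] && [ "$minutes" -le 90 ]; then
        echo debug      # single node, at most 1h30
    elif [ "$minutes" -le 720 ]; then
        echo normal     # at most 12 hours, any number of nodes
    else
        echo "error: no listed partition allows $minutes minutes" >&2
        return 1
    fi
}
```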
To check if you have been allocated a node, run the following command in another terminal:
@@ -384,15 +383,15 @@ squeue --me --start
This command will give you a dynamic estimate of the scheduled start time (it may change as other jobs pass you in the priority queue). Note that this command doesn't output anything once your job has been allocated.

Once your job has been allocated, you will get a terminal inside the allocated node. Make sure that your bash prompt is of the form `$USER@nidxxxxxx` (and __not__ `[clariden][$USER@clariden-lnxxx]`).

Furthermore:

```bash
echo $HF_HOME
```

Make sure that the output is `/iopsstor/scratch/cscs/$USER/hf`. This is extremely important: if you run a training without telling it where to download the Llama-3.1 model, it will be downloaded into your home directory `/users/$USER`, which does not have enough storage for it.

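The same check can be scripted so that a training never starts with the wrong cache location. This is a minimal sketch; the function name `check_hf_home` is hypothetical and the expected path is the one given above. A possible use is `check_hf_home || exit 1` at the top of a launch script.

```bash
# Hypothetical guard: succeed only if HF_HOME points to the scratch location.
check_hf_home() {
    local expected="/iopsstor/scratch/cscs/$USER/hf"
    if [ "${HF_HOME:-}" != "$expected" ]; then
        # Report the mismatch on stderr and signal failure to the caller
        echo "HF_HOME is '${HF_HOME:-unset}', expected '$expected'" >&2
        return 1
    fi
}
```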
Launch a training with MultiMeditron by running the following commands:
@@ -487,20 +486,20 @@ echo "END TIME: $(date)"
Make sure to replace every `$USER` with your username and `$HF_TOKEN` with your Hugging Face token. Pay attention to the following parameters:

* `#SBATCH --job-name demo-job` sets the job name to `demo-job`
* `#SBATCH --nodes 1` means that we are claiming one node (with 4 GPUs). You should increase this if you are launching bigger jobs
* `#SBATCH --output /users/$USER/meditron/reports/R-%x.%j.out` and `#SBATCH --error /users/$USER/meditron/reports/R-%x.%j.err` mean that all the job logs are written to the folder `/users/$USER/meditron/reports` (`%x` is the job name and `%j` the job ID)
* Note that here we run a MultiMeditron training with `config/config_alignment.yaml`, so you need to make sure that the dataset paths are correct
* Note that the part that follows the `#SBATCH` lines will be executed on every node

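The `%x` and `%j` patterns in the log paths can be confusing at first. The sketch below shows what they expand to for a given job name and job ID, and also creates the log folder beforehand, which is harmless and avoids relying on it being created for you. The function name `report_paths` is hypothetical; for example, `report_paths /users/$USER/meditron/reports demo-job 123456` prints the two log paths for job `123456`.

```bash
# Hypothetical helper: create the log folder if needed, then print the
# .out and .err paths that R-%x.%j.* expands to (%x = job name, %j = job ID).
report_paths() {
    local reports_dir="$1" job_name="$2" job_id="$3"
    mkdir -p "$reports_dir"
    echo "$reports_dir/R-$job_name.$job_id.out"
    echo "$reports_dir/R-$job_name.$job_id.err"
}
```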
To queue your job, run:

```bash
# CSCS login node

sbatch sbatch_train.sh
```

You can check if your job has been allocated GPUs by running:
@@ -509,6 +508,7 @@ You can check if your job has been allocated GPUs by running:
squeue --me
```

This command gives you the `JOBID` of the job you have launched.

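
If you only want the IDs of your running jobs, the `squeue` output can be filtered. This is a minimal sketch: the function name `running_job_ids` is hypothetical, and it assumes input lines of the form `<jobid> <state>`, which `squeue --me -h -o "%i %t"` should produce (`-h` drops the header, `%i` is the job ID and `%t` the compact state).

```bash
# Hypothetical filter: print the IDs of jobs whose compact state is R (running).
# Reads "<jobid> <state>" lines on standard input.
running_job_ids() {
    awk '$2 == "R" { print $1 }'
}
```

On the login node this could be used as `squeue --me -h -o "%i %t" | running_job_ids`.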

Once the job enters the `R` (running) state, you can check its logs by going into the `reports` directory:
@@ -522,7 +522,6 @@ tail -f R-%x.%j.err
where you need to replace `R-%x.%j.err` with the actual report name.

You can either let the job finish or cancel it.

```bash
@@ -538,11 +537,12 @@ where `$JOBID`is the `JOBID` that you get when running `squeue --me`
If you want to join the modern era of computers and have something more involved than a terminal to code in (unlike some people), you may want to *"connect"* your Visual Studio Code instance directly to the cluster. This lets you modify the code directly, using the correct environment (so that it doesn't show you half of the packages as nonexistent).

#### Procedure

* Install the [Remote Development extension pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
* [Launch a job on the cluster](#launching-job)

You will need the vscode *CLI* installed in the job you launched.

=== "Use prebuilt image"

    You can use the image that I personally use: update your environment file to point to the image at `/capstor/store/cscs/swissai/a127/meditron/docker/multimeditron_latest_2.sqsh`. With this solution, however, you will inherit all of my Python dependencies. If you want to use your own image, see the manual installation.
@@ -565,13 +565,14 @@ If you want to join the modern era of computers and have something more involve
RUN rm -rf /workspace/code
```

* Once your job has been launched with the *vscode* CLI installed, it's time to run the *code tunnel* __within the job__. Go to the folder of your project and run the following command:

```bash
cd /path/to/my/awesome/project
code tunnel --name=cluster-tunnel
```

This will prompt you to connect to your GitHub account; do so.

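If the tunnel command fails with "command not found", the job is most likely running an image without the VS Code CLI. The sketch below wraps the two commands above with that check; the function name `start_tunnel` and its default tunnel name are illustrative assumptions.

```bash
# Hypothetical wrapper: start the tunnel only if the VS Code CLI is on the PATH.
# Arguments: project folder, optional tunnel name (default: cluster-tunnel).
start_tunnel() {
    local project_dir="$1" tunnel_name="${2:-cluster-tunnel}"
    if ! command -v code >/dev/null 2>&1; then
        echo "VS Code CLI ('code') not found in this job" >&2
        return 1
    fi
    cd "$project_dir" && code tunnel --name="$tunnel_name"
}
```

Inside the job, it would be called as `start_tunnel /path/to/my/awesome/project`.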