Commit 4b0e5ec

Enhance documentation for CSCS and projects, including formatting updates and new project descriptions
1 parent de74a5a commit 4b0e5ec

3 files changed

Lines changed: 194 additions & 50 deletions

docs/clusters/cscs/cscs.md

Lines changed: 36 additions & 35 deletions
@@ -2,9 +2,8 @@
22

33
!!! danger "DO NOT RUN ON LOGIN NODE"
44

5-
When you establish a direct connection using `ssh` you connect to the login node. Everyone is on that node and as such **YOU SHOULD NEVER RUN ANY
6-
JOBS DIRECTLY ON THE LOGIN NODE**. If you want to run a process, *like a training*, you can run it on a [dedicated allocated job](#launching-job)
7-
5+
When you establish a direct connection using `ssh` you connect to the login node. Everyone is on that node and as such **YOU SHOULD NEVER RUN ANY
6+
JOBS DIRECTLY ON THE LOGIN NODE**. If you want to run a process, *like a training*, you can run it on a [dedicated allocated job](#launching-job)
87

98
## Pre-setup (access to the CSCS)
109

@@ -186,13 +185,15 @@ ssh-add -t 1d ~/.ssh/cscs-key
186185

187186
You will have to execute this *bash* script every day. Warning: you are limited in the number of keys you can generate per day (roughly 5), so be mindful when debugging. You should
188187
launch this script in `bash`; you can run the following commands:
188+
189189
```bash
190190
nano cscs-refresh.sh # update the file
191191
chmod +x ./cscs-refresh.sh
192192
bash ./cscs-refresh.sh
193193
```
194194

195195
If you don't want to have your login ID stored in a script, you can comment out the lines:
196+
196197
```bash
197198
#read -p "Username : " USERNAME
198199
#read -s -p "Password: " PASSWORD
@@ -216,7 +217,7 @@ Host ela
216217
ForwardAgent yes
217218
ForwardX11 yes
218219
forwardX11Trusted yes
219-
IdentityFile ~/.ssh/cscs-key
220+
IdentityFile ~/.ssh/cscs-key
220221
221222
222223
Host todi
@@ -226,7 +227,7 @@ Host todi
226227
ForwardAgent yes
227228
ForwardX11 yes
228229
forwardX11Trusted yes
229-
IdentityFile ~/.ssh/cscs-key
230+
IdentityFile ~/.ssh/cscs-key
230231
231232
Host clariden
232233
HostName clariden.cscs.ch
@@ -311,7 +312,6 @@ When running job, you will need to execute your job inside docker images. This i
311312
mkdir /users/$USER/.edf
312313
```
313314

314-
315315
Create a `/users/$USER/.edf/multimodal.toml` file:
316316

317317
```toml
@@ -341,7 +341,6 @@ FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD = "16777216"
341341
FI_CXI_COMPAT = "0"
342342
```
343343

344-
345344
Notice 3 things:
346345

347346
* We specify the path to the `.sqsh` file in the `image` attribute. This is the image used by the job that stores all of the dependencies.
@@ -354,8 +353,8 @@ Note that for other types of job, you will probably require a different image an
354353

355354
There are 2 types of job that you can launch:
356355

357-
- Interactive using `srun` (which gives you a terminal)
358-
- Non-interactive using `sbatch` (which schedule a job)
356+
* Interactive using `srun` (which gives you a terminal)
357+
* Non-interactive using `sbatch` (which schedules a job)
359358

360359
### Interactive job
361360

@@ -367,12 +366,12 @@ srun --time=1:29:59 --partition debug -A a127 --environment=/users/$USER/.edf/mu
367366

368367
Here is a breakdown of the command:
369368

370-
- `--time` is the maximum running time of the job (here, the job runs for 1h30 before it gets killed)
371-
- `--partition debug` is the node partition in which the job executed. As of 14/08/2025, there are 3 partitions:
369+
* `--time` is the maximum running time of the job (here, the job runs for 1h30 before it gets killed)
370+
* `--partition debug` is the node partition in which the job is executed. As of 14/08/2025, there are 3 partitions:
372371

373-
- `normal`: with a maximum running time of 12 hours and no limit on the number of distributed nodes. This partition is the partition used for non-interactive jobs and long interactive jobs
374-
- `debug`: with a maximum running time of 1h30 with only one node. This partition is meant for interactive jobs
375-
- `xfer`: this partition is meant for data transfer and doesn't claim any GPU
372+
* `normal`: with a maximum running time of 12 hours and no limit on the number of distributed nodes. This partition is the partition used for non-interactive jobs and long interactive jobs
373+
* `debug`: with a maximum running time of 1h30 with only one node. This partition is meant for interactive jobs
374+
* `xfer`: this partition is meant for data transfer and doesn't claim any GPU
376375

377376
To check if you have been allocated a node, run the following command in another terminal:
378377

@@ -384,15 +383,15 @@ squeue --me --start
384383

385384
This command will give you a dynamic estimation of the scheduled time (may change as people pass you in the priority queue). Note that this command doesn't output anything if your job has been allocated.
386385

387-
Once you have been allocated a job, you will have a terminal inside the allocated node. Make sure that your `bash prompt` is of the form `$USER@nidxxxxxx` (and __not__ `[clariden][$USER@clariden-lnxxx]`.
386+
Once you have been allocated a job, you will have a terminal inside the allocated node. Make sure that your `bash prompt` is of the form `$USER@nidxxxxxx` (and __not__ `[clariden][$USER@clariden-lnxxx]`).
388387

389388
Furthermore:
390389

391390
```bash
392391
echo $HF_HOME
393392
```
394393

395-
Make sure that the output is `/iopsstor/scratch/cscs/$USER/hf`. This is extremely important because if you run trainings without telling it where to download the Llama-3.1 model, it will do so in your working directory `/users/$USER` and you do not have enough storage for that.
394+
Make sure that the output is `/iopsstor/scratch/cscs/$USER/hf`. This is extremely important because if you run trainings without telling it where to download the Llama-3.1 model, it will do so in your working directory `/users/$USER` and you do not have enough storage for that.
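
The check above can be scripted so that a wrong cache location is caught before a training starts. This is a minimal sketch: the function name `check_hf_home` is hypothetical, and it assumes the expected path is exactly the one given above.

```bash
# Hypothetical helper (not part of the CSCS tooling): succeeds only when
# HF_HOME points at the scratch cache directory described above.
check_hf_home() {
    local expected="/iopsstor/scratch/cscs/${USER}/hf"
    if [ "${HF_HOME:-}" = "$expected" ]; then
        echo "HF_HOME is set correctly: $HF_HOME"
    else
        # Report the mismatch on stderr and signal failure to the caller.
        echo "HF_HOME is '${HF_HOME:-unset}', expected '$expected'" >&2
        return 1
    fi
}

# Example use inside the allocated node:
# check_hf_home || echo "Fix HF_HOME before launching a training"
```

Returning a non-zero status rather than exiting lets you call the helper from an interactive shell without closing your session.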
396395

397396
Launch a training with MultiMeditron by running the following commands:
398397

@@ -487,20 +486,20 @@ echo "END TIME: $(date)"
487486

488487
Make sure to replace all the `$USER` by your username and the `$HF_TOKEN` with your huggingface token. Pay attention to the following parameters:
489488

490-
- `#SBATCH --job-name demo-job` sets the job name to `demo-job`
491-
- `#SBATCH --nodes 1` means that we are claiming one node (of 4 GPUs). You should increase this if you are launching bigger jobs
492-
- `#SBATCH --output /users/$USER/meditron/reports/R-%x.%j.out` and `#SBATCH --error /users/$USER/meditron/reports/R-%x.%j.err` mean that this will create a folder `/users/$USER/meditron/reports` that stores all the job logs
493-
- Note that here, we execute a training of MultiMeditron with `config/config_alignment.yaml`, thus you need to make sure that the paths of the dataset are correct
494-
- Note that the part which follows the `#SBATCH` commands will be executed on every node
489+
* `#SBATCH --job-name demo-job` sets the job name to `demo-job`
490+
* `#SBATCH --nodes 1` means that we are claiming one node (of 4 GPUs). You should increase this if you are launching bigger jobs
491+
* `#SBATCH --output /users/$USER/meditron/reports/R-%x.%j.out` and `#SBATCH --error /users/$USER/meditron/reports/R-%x.%j.err` mean that this will create a folder `/users/$USER/meditron/reports` that stores all the job logs
492+
* Note that here, we execute a training of MultiMeditron with `config/config_alignment.yaml`, thus you need to make sure that the paths of the dataset are correct
493+
* Note that the part which follows the `#SBATCH` commands will be executed on every node
495494

496495
To queue your job, run:
497496
bash
497+
498498
```
499499
# CSCS login node
500500
501501
sbatch sbatch_train.sh
502502
```
503-
504503

505504
You can check if your job has been allocated GPUs by running:
506505

@@ -509,6 +508,7 @@ You can check if your job has been allocated GPUs by running:
509508

510509
squeue --me
511510
```
511+
512512
This command gives you the `JOBID` of the job you have launched
513513

514514
Once the job enters the `R` state (for running), the job is running. You can check the logs of your job by going into the `reports` directory:
@@ -522,7 +522,6 @@ tail -f R-%x.%j.err
522522

523523
where you need to replace `R-%x.%j.err` by the actual report name.
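
In Slurm filename patterns, `%x` expands to the job name and `%j` to the job ID, so the report name can be built from the values shown by `squeue --me`. The sketch below illustrates this; the function name `report_name` and the job ID `123456` are made up for the example.

```bash
# Hypothetical helper: build the report file name that Slurm produces for
# the pattern R-%x.%j.<ext>, where %x is the job name and %j is the job ID.
report_name() {
    local job_name="$1" job_id="$2" ext="${3:-err}"
    echo "R-${job_name}.${job_id}.${ext}"
}

# Example: follow the error log of job 123456 named demo-job
# tail -f "$(report_name demo-job 123456)"
```

Passing `out` as the third argument gives the name of the standard output log instead.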
524524

525-
526525
You can either let the job finish or cancel it.
527526

528527
```bash
@@ -538,11 +537,12 @@ where `$JOBID`is the `JOBID` that you get when running `squeue --me`
538537
If you want to join the modern era of computers and have something more involved than a terminal to code (unlike some people), you may want to *"connect"* your Visual Studio Code instance directly to the cluster. This lets you modify the code directly, using the correct environment (so that it doesn't show you half the packages as nonexistent).
539538

540539
#### Procedure
541-
- Install the [Remote development extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
542-
- [Launch a job on the cluster](launching-job)
543-
540+
541+
* Install the [Remote development extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
542+
* [Launch a job on the cluster](#launching-job)
543+
544544
You will need the vscode *CLI* installed on the job you launched.
545-
545+
546546
=== "Use prebuilt image"
547547

548548
You can use the image that I personally used: update your environment file to use the image at `/capstor/store/cscs/swissai/a127/meditron/docker/multimeditron_latest_2.sqsh`. With this solution, however, you'll inherit all of my Python dependencies. If you want to use your own image, see the manual installation.
@@ -565,21 +565,22 @@ If you want to join the modern era of computers and have something more involve
565565
RUN rm -rf /workspace/code
566566
```
567567

568-
- Once your job has been launched with *vscode* CLI installed, it's time to run the *code tunnel* **within the job**. Go to the folder of your project and run the following command
568+
* Once your job has been launched with *vscode* CLI installed, it's time to run the *code tunnel* __within the job__. Go to the folder of your project and run the following command
569+
569570
```bash
570571
cd /path/to/my/awesome/project
571572
code tunnel --name=cluster-tunnel
572573
```
573-
This will prompt you to connect to your `github` account, do so.
574574

575+
This will prompt you to connect to your GitHub account; do so.
575576

576577
!!! warning "Bug in CSCS after update"
577578

578579
After a previous maintenance of CSCS, there was a bug where the command above no longer worked. This was due to multiple proxy variables being set. To fix that bug, please use the following code:
579580

580-
```base
581-
unset {http,https,no}_proxy
582-
unset {HTTP,HTTPS,NO}_PROXY
583-
```
584-
585-
- Finally, open vscode locally on your computer then in the remote extension select the appropriate tunnel and that's it, you are in !
581+
```bash
582+
unset {http,https,no}_proxy
583+
unset {HTTP,HTTPS,NO}_PROXY
584+
```
585+
586+
* Finally, open vscode locally on your computer, then select the appropriate tunnel in the remote extension, and that's it: you are in!

docs/projects.md

Lines changed: 143 additions & 4 deletions
@@ -1,6 +1,147 @@
11
# Projects for Fall 2025
22

3-
## 1. Meditron
3+
<div class="grid cards" markdown>
4+
5+
- :material-text-box-search-outline: __MMORE__
6+
7+
---
8+
9+
[MMORE](https://github.com/swiss-ai/mmore) is our Python library for a scalable multimodal pipeline for *processing*, *indexing*, and *querying* multimodal documents. It is used for *retrieval augmented generation* (RAG) applications.
10+
11+
[:octicons-arrow-right-24: See MMORE information](#mmore-mirage)
12+
13+
- :material-palm-tree:{ .lg .middle } __MIRAGE__
14+
15+
---
16+
17+
[MIRAGE](https://github.com/EPFLiGHT/MIRAGE) is a platform designed to streamline the processing of datasets using generative models.
18+
19+
[:octicons-arrow-right-24: See MIRAGE information](#mmore-mirage)
20+
21+
- :fontawesome-solid-language:{ .lg .middle } __Polyglot__
22+
23+
---
24+
25+
*Polyglot Meditron* is a project aimed at evaluating and enhancing the multilingual capabilities of our Meditron model.
26+
27+
[:octicons-arrow-right-24: See Multilingual Meditron information](#polyglot-meditron)
28+
29+
- :material-human-male-board-poll:{ .lg .middle } __LiGHT Bootcamp__
30+
31+
---
32+
33+
Improve the content of the MOOC on *AI applied to healthcare*
34+
35+
[:octicons-arrow-right-24: See LiGHT Bootcamp information](#light-ai-bootcamp)
36+
37+
- :material-file-image-plus:{ .lg .middle } __MultiMeditron__
38+
39+
---
40+
41+
Improve Meditron's multimodal capabilities by enabling it to process and understand multiple modalities.
42+
43+
[:octicons-arrow-right-24: See MultiMeditron information](#multimeditron)
44+
45+
- :material-cpu-32-bit:{ .lg .middle } __Quantisation of Medical LLMs__
46+
47+
---
48+
49+
Explore model quantisation in practice and document the results in a reproducible way.
50+
51+
[:octicons-arrow-right-24: See Quantisation of Medical LLMs information](#quantisation-of-medical-llms)
52+
53+
- :material-bottle-tonic:{ .lg .middle } __Distillation of Medical LLMs__
54+
55+
---
56+
57+
Explore knowledge distillation for language models, with an emphasis on comparing different distillation strategies, data choices, and model architectures.
58+
59+
[:octicons-arrow-right-24: See Distillation of Medical LLMs information](#distillation-of-medical-llms)
60+
61+
</div>
62+
63+
## MMORE & Mirage
64+
65+
*This project is supervised by Fabrice Nemo*
66+
67+
[MMORE](https://github.com/swiss-ai/mmore) stands for Massive Multimodal Open RAG & Extraction. It is our Python library for a scalable multimodal pipeline for processing, indexing, and querying multimodal documents.
68+
[MIRAGE](https://github.com/EPFLiGHT/MIRAGE) stands for Multimodal Intelligent Reformatting and Augmentation Generation Engine. It is our advanced platform designed to streamline the processing of datasets using generative models.
69+
70+
The aim of these two projects is to maintain the libraries: resolve the issues raised by the community, fix bugs, and build new features that are useful and challenging for students (suggested by students, or by Fabrice if an idea is important enough).
71+
72+
## Polyglot Meditron
73+
74+
*This project is supervised by Fabrice Nemo*
75+
76+
Speaking English is nice: most content online is in English. Having a performant LLM for medical tasks formulated in English is useful. But not enough! In low-resource settings, and even in most parts of the world, people usually prefer using their first language rather than English.
77+
78+
This project aims to make Meditron models more proficient in other languages, in both written and spoken form, with a focus on low-resource languages (current focus on Amharic, Hindi, Swahili, Tamil, eventually also Arabic, Bembe, French, Kinyarwanda, Luo, Nyanja, Twi, Urdu). Work is needed, since having a polyglot base model is generally not enough: popular models do not focus on low-resource languages, and the model also needs to be taught non-English medical terminology.
79+
80+
## LiGHT AI Bootcamp
81+
82+
*This project is supervised by Fabrice Nemo*
83+
84+
This project is about teaching the basics of AI applied to healthcare. We already have a MOOC (almost) ready. The target audience is healthcare workers and computer scientists in Africa, who would follow the MOOC with human mentoring provided by LiGHT. Our work in LiGHT is to improve the content of the MOOC so that students learn better, and to mentor students in Africa, guiding them through the bootcamp.
85+
Students may work on this project either as a side project (1 or 2 hours per week of mentoring) or as a full-time semester/optional project (for instance, for deeper research on how to improve the MOOC with evidence from educational science, or for developing more evaluation content. To be discussed with Fabrice).
86+
87+
## MultiMeditron
88+
89+
*This project is co-supervised by David Sasu, Lars Klein, Fabrice Nemo and Arianna Francesconi*
90+
91+
This project aims to improve Meditron's __multimodal capabilities__. Healthcare data is often multimodal, combining text, images, signals, and other data types. Enabling Meditron to process and understand multiple modalities can significantly enhance its performance in medical applications.
92+
93+
The goals of this project are:
94+
95+
- Adapting the Meditron codebase to a multimodal architecture that supports new modalities (for now, the codebase only supports images)
96+
- Making and improving the "expert" models that process the modalities and produce the embeddings fed to Meditron.
97+
98+
## Quantisation of Medical LLMs
99+
100+
*This project is supervised by Lars Klein*
101+
102+
This project focuses on exploring model quantisation in practice and documenting the results in a reproducible way. The goal is to gain hands-on experience with commonly used quantisation tools and to produce clear notes and artifacts that capture their behavior, trade-offs, and performance characteristics.
103+
104+
__Tasks:__
105+
106+
- Select and evaluate 1-3 quantisation tools (e.g. __bitsandbytes__, __llama.cpp__)
107+
- Apply quantisation to one or more models and document the process in detail, including:
108+
- exact steps taken,
109+
- runtime of the quantisation process,
110+
- resulting model size and size reduction.
111+
- Run the quantised models through benchmarks and record:
112+
- inference speed,
113+
- resource usage,
114+
- any observable changes in output quality or behavior.
115+
116+
__Required Experience:__
117+
118+
- Basic __Python__ programming
119+
- Familiarity with running machine learning models from the command line or in scripts
120+
121+
## Distillation of Medical LLMs
122+
123+
*This project is supervised by Lars Klein and Arianna Francesconi*
124+
125+
This project explores knowledge distillation for language models, with an emphasis on comparing different distillation strategies, data choices, and model architectures. The aim is to better understand how teacher selection, loss functions, and training data affect the performance and efficiency of distilled student models.
126+
127+
__Tasks:__
128+
129+
- Identify suitable teacher models and, if needed, construct datasets for intermediate representations (e.g. activations).
130+
- Implement and compare different distillation losses, including:
131+
- logit-based distillation,
132+
- MiniLM-style objectives
133+
- Experiment with different training datasets, such as:
134+
- general-purpose corpora,
135+
- task-specific datasets.
136+
- Benchmark distilled models on relevant evaluation tasks and performance metrics.
137+
- Explore distillation across architectures, including heterogeneous setups (e.g. LFM-style distillation between models such as Apertus and Meditron).
138+
139+
__Required Experience:__
140+
141+
- Solid Python programming
142+
- Familiarity with training and evaluating neural networks
143+
- Basic understanding of language models and knowledge distillation techniques
144+
<!-- ## 1. Meditron
4145
5146
- **MultiMeditron**
6147
@@ -98,6 +239,4 @@ This project extends the [IMBALMED method](https://www.sciencedirect.com/science
98239
99240
Contact: Arianna Francesconi (arianna.francesconi@epfl.ch)
100241
101-
102-
103-
242+
-->
