👉 For information on generic organisation of projects, see these slides (accessible with EPFL gdrive login).
Sections:
- GitHub repository
- Working with a remote EPFL server
- Managing jobs on the cluster
- GPU usage
- Data management
- Best practices for working on a cluster node (DHLAB - iccluster040)
- Debugging and troubleshooting
- Reproducibility and good practices
For the repository name:
- use lower case;
- use hyphens to separate tokens;
- if related to a larger project, start with the name of this project, followed by the name of your project (e.g. `impresso-image-classification`, where `impresso` would be the name of the project);
- in case of doubt, ask your supervisors.
You are free to structure your repository as you wish, but we advise having:
- a `notebooks` folder, for your working notebooks;
- a `lib` folder, in case you convert your notebooks to scripts with a command-line interface;
- a `report` folder, where you put the PDF and LaTeX sources of your report;
- a README, with the information specified below.
The README should include:

- Basic information
  - your name
  - the names of supervisors
  - the academic year
- About: include a brief introduction of your project.
- Research summary: include a brief summary of your approaches/implementations and an illustration of your results.
- Installation and Usage
  - dependencies: platform, libraries (for Python, include a `requirements.txt` file)
  - compilation (if necessary)
  - usage: how to run your code
- License
We encourage you to choose an open license (e.g. AGPL, GPL, LGPL or MIT).
License files are already available on GitHub (add a new file, start typing "license", and the available choices will appear).
You can also add the following at the end of your README:
```
project_name - Jean Dupont
Copyright (c) Year EPFL
This program is licensed under the terms of the [license].
```
If necessary, your lab (DHLAB or LHST) can grant you access to a machine on the IC cluster:
- ask your supervisor to gain access;
- you need to be on campus or use VPN to access the machine;
- `$USER` = your Gaspar username
- log in with your Gaspar credentials: `ssh $USER@iccluster0XX.iccluster.epfl.ch`, where `XX` is the machine number
- 256GB of RAM
- 2 GPUs
- 200GB of disk space on `/`
- 12TB of disk space under `/scratch` (for DHLAB, note that this has recently changed to `/rcp-scratch/iccluster040_scratch/`)

Since your home directory lives on `/`, do not store your data (i.e. datasets, intermediate results, models, etc.) in your home but under `/rcp-scratch/iccluster040_scratch/students/$USER/`.
When first connecting via SSH you will see a fingerprint message. Type `yes` to continue.
```
The authenticity of host 'iccluster0XX.iccluster.epfl.ch (XX.XX.XX.XX)' can't be established.
ECDSA key fingerprint is SHA256:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
```

Efficient data transfer is essential when working with remote cluster machines. You can use either `scp` or `rsync` to copy files or entire folders between your local machine and the cluster.
From your local machine to the remote server:
```
# Generic nodes
scp -r /path/to/local/file.txt $USER@iccluster0XX.iccluster.epfl.ch:/scratch/students/$USER/
# DHLAB iccluster040
scp -r /path/to/local/file.txt $USER@iccluster040.iccluster.epfl.ch:/rcp-scratch/iccluster040_scratch/students/$USER/
```
From the remote server to your local machine:
```
# Generic nodes
scp $USER@iccluster0XX.iccluster.epfl.ch:/scratch/students/$USER/file.txt /path/to/local/
# DHLAB iccluster040
scp $USER@iccluster040.iccluster.epfl.ch:/rcp-scratch/iccluster040_scratch/students/$USER/file.txt /path/to/local/
```
Copying a folder recursively:
```
# Generic nodes
scp -r $USER@iccluster0XX.iccluster.epfl.ch:/scratch/students/$USER/folder /path/to/local/
# DHLAB iccluster040
scp -r $USER@iccluster040.iccluster.epfl.ch:/rcp-scratch/iccluster040_scratch/students/$USER/folder /path/to/local/
```
`rsync` is more efficient for large datasets or for syncing folders incrementally.

From your local machine to the remote server:
```
# Generic nodes
rsync -avh /path/to/local/folder/ $USER@iccluster0XX.iccluster.epfl.ch:/scratch/students/$USER/folder/
# DHLAB iccluster040
rsync -avh /path/to/local/folder/ $USER@iccluster040.iccluster.epfl.ch:/rcp-scratch/iccluster040_scratch/students/$USER/folder/
```
From the remote server to your local machine:
```
# Generic nodes
rsync -avh $USER@iccluster0XX.iccluster.epfl.ch:/scratch/students/$USER/folder/ /path/to/local/folder/
# DHLAB iccluster040
rsync -avh $USER@iccluster040.iccluster.epfl.ch:/rcp-scratch/iccluster040_scratch/students/$USER/folder/ /path/to/local/folder/
```
Common flags:
- `-a`: archive mode (preserves permissions, symbolic links, etc.)
- `-v`: verbose (shows what's happening)
- `-h`: human-readable sizes
- `--progress`: (optional) shows progress during transfer
⚠️ Important: Always transfer data to and from the `/scratch/students/$USER/` directory, not `/home`, to avoid quota limits and ensure good practices on shared machines.

⚠️ Important: For DHLAB, on node `iccluster040`, always transfer data to and from the `/rcp-scratch/iccluster040_scratch/students/$USER/` directory, not `/home`, to avoid quota limits and ensure good practices on shared machines.
The /home partition on iccluster040 is very small and shared across all users. To avoid filling it up and causing problems for everyone, please follow these guidelines.
- Where to put it
  - For student projects: `/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/`
  - Otherwise: `/rcp-scratch/iccluster040_scratch/YOUR_USERNAME/`
- Why
  - `/home` (root) is small and not meant for heavy use.
  - Storing in `/rcp-scratch` avoids quota issues and prevents system lock-ups of `iccluster040`.
  - Keep your code in Git and clone it on `iccluster040` or other machines/nodes (best practice everywhere).
  - (Extra) If you have access to Run:AI, you can run the code directly from there.
- Where to put it
  - Short-term (work in progress): `/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/`
  - Long-term (final datasets or valuable resources, if you have access): `/mnt/u12632_cdh_dhlab_002_files_nfs/` on `cdhvm0002.xaas.epfl.ch` (NAS)
- Why
  - `/scratch` has been copied to `~/rcp-scratch/iccluster040_scratch/` but may be wiped.
  - `/home` is space-limited and not safe for long-term storage.
- Move your conda environments

  Follow these instructions, but instead of `/scratch/students/` use:
  `/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/`

- Move your `.cache` directory

  From `/home/YOUR_USERNAME` to:
  `/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/.cache/`

  Create the folder first:
  ```
  mkdir -p ~/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/.cache/
  ```
  Then export the following variables:
  ```
  export HF_HOME=~/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/.cache/
  export HF_BASE=~/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/.cache/
  export TRANSFORMERS_CACHE=~/rcp-scratch/iccluster040_scratch/students/YOUR_USERNAME/.cache/
  ```
  You can either run these commands every time or add them permanently to your `~/.bashrc`.
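  To check that the variables are actually picked up (a quick sanity check that only inspects the environment, nothing library-specific):
  ```python
  import os

  # Each variable should point to your scratch space, not to /home.
  for var in ("HF_HOME", "HF_BASE", "TRANSFORMERS_CACHE"):
      print(var, "=", os.environ.get(var, "<not set>"))
  ```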
In short: avoid `/home`. Always use the scratch space.
- Generic cluster nodes: use `/scratch/students/$USER/`
- DHLAB-specific node (`iccluster040`): use `/rcp-scratch/iccluster040_scratch/students/$USER/`
- Create a local Python environment using `conda`, `virtualenv`, or `pipenv`.
- Note: before creating environments, you need to create your user folder, e.g.:
  ```
  mkdir -p /rcp-scratch/iccluster040_scratch/students/$USER
  ```
  (replace with `/scratch/students/$USER/` on other nodes).
- To easily code locally and run things remotely, configure your IDE to sync code to the remote server (e.g. PyCharm, VS Code).
Configure conda so environments are stored in the scratch space:
```
# On generic nodes
conda config --add envs_dirs /scratch/students/$USER/.conda/envs
conda config --add pkgs_dirs /scratch/students/$USER/.conda/pkgs

# On iccluster040
conda config --add envs_dirs /rcp-scratch/iccluster040_scratch/students/$USER/.conda/envs
conda config --add pkgs_dirs /rcp-scratch/iccluster040_scratch/students/$USER/.conda/pkgs
```
Then create a new environment with Python 3.11:
```
export PYTHONUTF8=1
export LANG=C.UTF-8
conda create -n py311 python=3.11 anaconda
```
Activate: `source activate py311`
Deactivate: `source deactivate`

The default system Python is 3.8, but we recommend using the latest version.
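Once the environment is active, you can quickly confirm that you are not running the system Python (a minimal check):
```python
import sys

# The interpreter path should point inside your conda environment (not the system Python),
# and the version should match what you requested (e.g. 3.11.x).
print(sys.executable)
print(sys.version)
```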
```
# Generic nodes
virtualenv /scratch/students/$USER/testenv

# iccluster040
virtualenv /rcp-scratch/iccluster040_scratch/students/$USER/testenv

source /.../students/$USER/testenv/bin/activate
```
Avoid memory errors:
```
mkdir /.../students/$USER/tmp
export TMPDIR=/.../students/$USER/tmp/
pip install torch
```
(Replace `/.../students/$USER/` with either `/scratch/students/$USER/` or `/rcp-scratch/iccluster040_scratch/students/$USER/`, depending on the node.)
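Once installed, a quick way to check that `torch` works and sees the node's GPUs (a minimal sketch, run inside the activated environment):
```python
import torch

# Installed version, whether CUDA is usable from this node, and how many GPUs are visible.
print(torch.__version__)
print(torch.cuda.is_available(), torch.cuda.device_count())
```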
To avoid `/home` usage, configure pipenv to use the scratch space:
```
# Generic nodes
mkdir -p /scratch/students/$USER/.pipenv_tmpdir
export TMPDIR="/scratch/students/$USER/.pipenv_tmpdir"

# iccluster040
mkdir -p /rcp-scratch/iccluster040_scratch/students/$USER/.pipenv_tmpdir
export TMPDIR="/rcp-scratch/iccluster040_scratch/students/$USER/.pipenv_tmpdir"
```
Also add to your `~/.bashrc`:
```
export PIPENV_VENV_IN_PROJECT=1
```
To access an instance of Jupyter notebook or Jupyter Lab running on a remote server, you can either configure Jupyter accordingly or use an SSH tunnel.
Have a look at the official Jupyter documentation.
In summary, you have to:
- Run `jupyter notebook --generate-config`. This will create a `.jupyter` folder in your home (hidden, use `ls -a`).
- Use `jupyter notebook password` to set a password.
- Edit the `jupyter_notebook_config.py` file in order to set the port where Jupyter will broadcast. The following three lines are needed; all the rest can be commented out:
```
c = get_config()
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.port = XXXX  # change this port; use the last four digits of your SCIPER number (to avoid colliding with other people on the same port)
```
You can ignore SSL certificates.
Notebook will be accessible at:
http://iccluster0XX.iccluster.epfl.ch:XXXX
In order to leave it open while you are executing things, you can run the notebook in screen (see below).
You can also use an SSH tunnel, which is easier, but somewhat more brittle (you will need to reconnect the SSH tunnel when you lose the connection, e.g. because you suspended your machine).
- Connect to the remote server by setting up an SSH tunnel. Run the following command from your local machine. As the port number XXXX, use the last four digits of your SCIPER number (to avoid colliding with other people on the same port):
  ```
  ssh -L XXXX:localhost:XXXX [gasparname]@iccluster0NN.iccluster.epfl.ch
  ```
- Launch Jupyter notebook or lab on the node (again replacing XXXX with the same port number):
  ```
  jupyter notebook --no-browser --port=XXXX
  ```

Your notebook is now accessible at `http://localhost:XXXX`. You may need a token, so look at the message printed by Jupyter notebook / lab when you run it.
In order to leave it open while you are executing things, you should run the notebook in a screen (see below).
Main commands:
- create a session: `screen -S name_of_the_session` => you are in
- "detach" from a screen session: `Ctrl-A D` => the session keeps running, you are out
- reconnect (reattach) to a session: `screen -r name_of_the_session`
- kill a session: from within the session, `Ctrl-A K`
- in case the session is still attached from another connection (e.g. you disconnected by mistake): `screen -rd name_of_the_session`
- list all active screens: `screen -ls`
Cluster nodes are shared. To avoid blocking others:
- Always run heavy processes inside `screen` or `tmux`.
- Monitor resources:
  ```
  top
  htop
  nvidia-smi -l 2
  ```
- Kill runaway processes:
  ```
  ps -u $USER
  kill -9 <PID>
  ```
Some sources:
- Tutorial: https://linuxize.com/post/how-to-use-linux-screen/
- Full documentation: https://linux.die.net/man/1/screen
- Also available via `man screen`
Steps to work with a notebook in a screen:
- `cd [your repo]`
- `screen -S work` => you are in a screen named "work", where you will launch the notebook
- activate your env
- start Jupyter notebook (`jupyter notebook`, or `jupyter notebook --no-browser --port=XXXX` if you use an SSH tunnel)
- open the URL in your web browser:
  - `http://iccluster0XX.iccluster.epfl.ch:XXXX` if you configured remote access
  - `http://localhost:XXXX` if you use an SSH tunnel
- if everything is OK, detach the screen (`Ctrl-A D`). You can now work in the notebook and open and close your browser as you want; it will keep running.
- Default precision: FP32 (float32)
- FP16/bfloat16 may not be supported on some GPUs.
Example (PyTorch):
```
import torch

# Move your model (and input tensors) to the GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```
Free memory when done:
```
del model
torch.cuda.empty_cache()
```
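If your GPU supports it, running the forward pass in reduced precision can save memory. A minimal sketch (assuming a recent PyTorch with `torch.autocast`; `model` and `batch` are placeholders for your own objects):
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# bfloat16 is only supported on some GPUs; fall back to float16 otherwise.
use_bf16 = device.type == "cuda" and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

# Mixed-precision forward pass; disabled automatically when running on CPU.
with torch.autocast(device_type="cuda", dtype=amp_dtype, enabled=(device.type == "cuda")):
    outputs = model(batch)
```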
- Always store data under: `/scratch/students/$USER/data/`
- Compress large corpora with `.tar.gz` or `.bz2` (see the sketch after this list).
- Coordinate with supervisors before duplicating datasets.
- Clean up unused checkpoints and logs.
- Use `rsync --progress` for efficient large transfers.
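If you prefer scripting the compression from Python rather than calling `tar` directly, here is a minimal sketch using the standard library (the folder name `my_corpus` is a hypothetical example):
```python
import os
import tarfile
from pathlib import Path

# Hypothetical corpus folder under your scratch space (adjust to your node / project).
corpus = Path("/scratch/students") / os.environ["USER"] / "data" / "my_corpus"

# Pack the folder into my_corpus.tar.gz next to it; use "w:bz2" for a .tar.bz2 archive instead.
with tarfile.open(corpus.parent / (corpus.name + ".tar.gz"), "w:gz") as tar:
    tar.add(corpus, arcname=corpus.name)
```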
- Environment issues
  - Check: `which python`
  - If `pip` fails:
    ```
    mkdir /rcp-scratch/iccluster040_scratch/students/$USER/tmp
    export TMPDIR=/rcp-scratch/iccluster040_scratch/students/$USER/tmp
    ```
- GPU errors
  - CUDA out of memory: reduce the batch size or sequence length, or use gradient accumulation (see the sketch after this list).
  - Driver mismatch: restart your session and check with `nvidia-smi`.
- SSH disconnects
  - Use `screen` or `tmux`.
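A minimal gradient-accumulation sketch (assuming PyTorch; `model`, `optimizer`, `loss_fn` and `loader` stand in for your own objects). It keeps the effective batch size while lowering per-step memory:
```python
accumulation_steps = 4  # effective batch size = loader batch size * accumulation_steps

optimizer.zero_grad()
for step, (batch, target) in enumerate(loader):
    loss = loss_fn(model(batch), target) / accumulation_steps  # scale so gradients average correctly
    loss.backward()                                            # gradients accumulate across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```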
- Version code with Git
- Save environment files (`requirements.txt`, `environment.yml`)
- Fix random seeds for experiments (see the sketch below)
- Document experiments in the README or in notebooks
- Clean up unused files on the cluster
- Use open licenses (AGPL, GPL, LGPL, MIT)
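A common seed-fixing helper (a sketch; keep only the calls for the libraries you actually use, e.g. the `torch` lines only apply to PyTorch projects):
```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix the usual random number generators so experiments are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op if no GPU is available

set_seed(42)
```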