Commit 3a66189

Revise RCP connection documentation
Updated documentation for connecting to RCP, including changes to email contacts, paths for scratch storage, and command examples for job submission.
1 parent 6651daf commit 3a66189

1 file changed

Lines changed: 37 additions & 38 deletions

docs/clusters/rcp/rcp.md

@@ -2,7 +2,7 @@
 
 ## 1. Pre-setup (access to scratch and cluster)
 
-Please ask [Mark](mailto:mark.wagner@epfl.ch) or [Peter](mailto:peter.ahumuza@epfl.ch) to add you to the corresponding groups (send a message on Slack to get a faster answer). You can check your groups at https://groups.epfl.ch/
+Please ask [Peter](mailto:peter.ahumuza@epfl.ch) to add you to the corresponding groups (send a message on Slack to get a faster answer). You can check your groups at https://groups.epfl.ch/
 
 ## 2. Setting-up credentials
 
@@ -65,7 +65,7 @@ Finally, we will execute an action that requires our identification on GitHub to
 ```bash
 # SSH terminal
 
-git clone https://github.com/EPFLiGHT/MultiMeditron.git
+git clone https://github.com/some/private_repo.git
 ```
 If you were able to clone the repo, then your setup is correct.
 
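Cloning a full repository only to confirm that the credentials work can be slow. As a lighter sketch, `git ls-remote` lists the remote refs without downloading any objects, so it exercises the same authentication path. The function name `can_access_repo` is hypothetical (it is not part of the original guide), and the URL in the usage comment is the placeholder from the step above.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the original guide.
# Returns 0 if the remote can be read with the current Git credentials,
# and a non-zero status otherwise.
can_access_repo() {
  local url="$1"
  # GIT_TERMINAL_PROMPT=0 makes Git fail instead of prompting for a password.
  GIT_TERMINAL_PROMPT=0 git ls-remote "$url" HEAD >/dev/null 2>&1
}

# Usage (placeholder URL from the step above):
# can_access_repo https://github.com/some/private_repo.git && echo "credentials OK"
```

A failure here does not say why access was denied; running the same `git ls-remote` command without the output redirection shows Git's error message.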

@@ -89,8 +89,8 @@ Install kubectl
 ```bash
 # Your terminal (either WSL, Linux or Mac)
 
-curl -LO "https://dl.k8s.io/release/v1.29.6/bin/darwin/arm64/kubectl"
-# Linux: curl -LO "https://dl.k8s.io/release/v1.29.6/bin/linux/amd64/kubectl"
+curl -LO "https://dl.k8s.io/release/v1.29.6/bin/linux/amd64/kubectl" # Linux
+# curl -LO "https://dl.k8s.io/release/v1.29.6/bin/darwin/arm64/kubectl" # macOS
 
 # Give it the right permissions and move it.
 chmod +x ./kubectl
@@ -113,8 +113,8 @@ Install the run:ai CLI for RCP (two RCP clusters):
 # Your terminal
 
 # Download the CLI from the link shown in the help section.
-# for Linux: replace `darwin` with `linux`
-wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin
+# for macOS: replace `linux` with `darwin`
+wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/linux
 # Give it the right permissions and move it.
 chmod +x ./runai
 sudo mv ./runai /usr/local/bin/runai
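Both install steps ask the reader to swap `linux` and `darwin` by hand. A minimal sketch of deriving the two download URLs from `uname` output instead is given below. The function names are hypothetical; the URL patterns are copied from the commands in this commit. That the run:ai URL depends only on the OS, not the architecture, is an assumption based on the two variants shown.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original guide.

# Map `uname -s` output to the OS segment used in both URLs.
os_segment() {
  case "$1" in
    Linux)  echo linux ;;
    Darwin) echo darwin ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Map `uname -m` output to the architecture segment used by dl.k8s.io.
arch_segment() {
  case "$1" in
    x86_64|amd64)  echo amd64 ;;
    arm64|aarch64) echo arm64 ;;
    *)             echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the kubectl download URL (version pinned as in the guide).
kubectl_url() {
  local os arch
  os="$(os_segment "$1")" || return 1
  arch="$(arch_segment "$2")" || return 1
  echo "https://dl.k8s.io/release/v1.29.6/bin/${os}/${arch}/kubectl"
}

# Build the run:ai CLI download URL (assumed to vary by OS only).
runai_cli_url() {
  local os
  os="$(os_segment "$1")" || return 1
  echo "https://rcp-caas-prod.rcp.epfl.ch/cli/${os}"
}

# Usage on the current machine:
# curl -LO "$(kubectl_url "$(uname -s)" "$(uname -m)")"
# wget --content-disposition "$(runai_cli_url "$(uname -s)")"
```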
@@ -123,7 +123,7 @@ sudo chown root: /usr/local/bin/runai
 
 ## 4. Login
 
-The RCP is organized into a [3 level hierarchy](https://wiki.rcp.epfl.ch/en/home/CaaS/FAQ/how-to-use-runai#access-hierarchy). The department is the laboratory (e.g. LIGHT or MLO). The projects determine which scratch (aka persistent storage) we have access to. Note that you should choose the SSO option when executing `runai login`.
+The RCP is organized into a [3-level hierarchy](https://wiki.rcp.epfl.ch/en/home/CaaS/FAQ/how-to-use-runai#access-hierarchy). The department is the laboratory (e.g. LiGHT). The projects determine which scratch (aka persistent storage) we have access to. Note that you should choose the SSO option when executing `runai login`.
 
 
 ```bash
@@ -137,65 +137,64 @@ runai config project light-$GASPAR
 
 ## 5. Submit a job
 
-Time to test if we can submit a job! This command will allocate 1 GPU from the cluster and "sleep" to infinity (meaning that it will do essentially nothing)
+Build your image following the [Docker tutorial](rcp_docker.md). Once you are done, it's time to test whether we can submit a job! This command will allocate 1 GPU from the cluster and "sleep" to infinity (meaning that it will do essentially nothing).
 
 ```bash
 # Your terminal
 
 runai submit \
---name meditron-basic \
+--name base-job \
 --image registry.rcp.epfl.ch/multimeditron/basic:latest-$GASPAR \
---pvc light-scratch:/mloscratch \
+--pvc light-scratch:/lightscratch \
 --large-shm \
--e NAS_HOME=/mloscratch/users/$GASPAR \
--e HF_API_KEY_FILE_AT=/mloscratch/users/$GASPAR/keys/hf_key.txt \
--e WANDB_API_KEY_FILE_AT=/mloscratch/users/$GASPAR/keys/wandb_key.txt \
--e GITCONFIG_AT=/mloscratch/users/$GASPAR/.gitconfig \
--e GIT_CREDENTIALS_AT=/mloscratch/users/$GASPAR/.git-credentials \
--e VSCODE_CONFIG_AT=/mloscratch/users/$GASPAR/.vscode-server \
+-e NAS_HOME=/lightscratch/users/$GASPAR \
+-e HF_API_KEY_FILE_AT=/lightscratch/users/$GASPAR/keys/hf_key.txt \
+-e WANDB_API_KEY_FILE_AT=/lightscratch/users/$GASPAR/keys/wandb_key.txt \
+-e GITCONFIG_AT=/lightscratch/users/$GASPAR/.gitconfig \
+-e GIT_CREDENTIALS_AT=/lightscratch/users/$GASPAR/.git-credentials \
+-e VSCODE_CONFIG_AT=/lightscratch/users/$GASPAR/.vscode-server \
 --backoff-limit 0 \
 --run-as-gid 84257 \
 --node-pool h100 \
 --gpu 1 \
 -- sleep infinity
 ```
 
-> Note: If you have issue with the job not being launched (after doing a `describe`), ensure that there is such an image in [the registry](registry.rcp.epfl.ch). You can build your image following the docker tutorial.
+> Note: If the job fails to launch (check with `runai describe job`), ensure that the image exists in [the registry](https://registry.rcp.epfl.ch). You can build your image following the Docker tutorial.
+> Note: It is strongly recommended to save this command in a shell script, so that you can easily edit it and launch jobs with `bash connect.sh`, for instance. You may keep two scripts, one for CPU-only jobs and one for GPU jobs (make sure to give them different names). Generally you don't need more than those two. **Important: GPU jobs are time-limited. They are automatically shut down after 2 hours without GPU use, whereas CPU-only jobs have no such limit.**
 
 Explanation:
 
 * `name` is the name of the job
 * `image` is the link to the docker image that will be attached to the cluster. **Please note that you may need to change the image path if you pushed your image on another link. See [Building Docker image for the RCP](rcp_docker.md)**
-* `pvc` determines which scratch will be mounted to the job. The argument is of the form: `name_of_the_scratch:/mount/path/to/scratch`. Here the we are mounting the scratch named `light-scratch` to the local path `/mloscratch` **This is part may cause an error because of the LIGHT migration**
-* `gpu` is the number of GPU that you want to claim for this job (larger amount of GPU will be harder to get as ressources are limited)
-
-> Note: It is heavily recommended to save this command into a `shell` file, to easily edit it and launch jobs with `bash connect.sh` for instance. You may have two files, one for CPU-only jobs and one for GPU (make sure to give them different names). Generally you don't need more than those two. Important to know: jobs with GPUs are time limited, they are automatically shut down after 2 hours of using no GPUs, whereas there is no such limitation for CPU-only jobs.
+* `pvc` determines which scratch will be mounted to the job. The argument is of the form: `name_of_the_scratch:/mount/path/to/scratch`. Here we are mounting the scratch named `light-scratch` to the local path `/lightscratch`. **This part may cause an error because of the LIGHT migration.**
+* `gpu` is the number of GPUs that you want to claim for this job (a larger number of GPUs will be harder to get, as resources are limited)
 
 We can check the outputs of our container and the status of the job using the following commands respectively.
 ```bash
 # Your terminal
 
-runai logs meditron-basic
-runai describe job meditron-basic
+runai logs base-job
+runai describe job base-job
 ```
 
 To end a job, run the command:
 ```bash
 # Your terminal
 
-runai delete job meditron-basic
+runai delete job base-job
 ```
 
 You can access your job by doing
 ```bash
 # Your terminal
 
-runai bash meditron-basic
+runai bash base-job
 ```
 
-You should see a terminal opening. By default it redirects you to the folder `/workspace`. This folder is not persistent, it is heavily recommended to always start using the job by moving to your personal folder you made earlier, at `/mloscratch/users/$GASPAR_USER`.
+You should see a terminal open. By default it starts in the folder `/workspace`. This folder is not persistent, so **it is strongly recommended to always start by moving to the personal folder you made earlier, at `/lightscratch/users/$GASPAR_USER`**.
 
-Enter the following command in your new terminal to ensure that you have indeed a GPU:
+You may enter the following command in your new terminal to confirm that you indeed have a GPU:
 ```bash
 # Job terminal
 
@@ -206,7 +205,7 @@ Once you are done, run the following command to delete the job:
 ```bash
 # Your terminal
 
-runai delete job meditron-basic
+runai delete job base-job
 ```
 
 ## 6. VSCode connection
@@ -218,7 +217,7 @@ Once we have the container running on a node of the RCP cluster, we can attach t
 * [Kubernetes](https://marketplace.visualstudio.com/items?itemName=ms-kubernetes-tools.vscode-kubernetes-tools)
 * [Dev containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
 
-From the Kubernetes menu, we can see the IC and the RCP Cluster. We will enter the menu of the RCP Cluster -> Workloads -> Pods and we will see our container with a green indicator showing that it is running. Right-clicking on it will give us the option to "Attach to Visual Studio". Upon clicking, the editor will open in a new window within the container. We are then invited to open a folder, it should be our personal folder (`/mloscratch/users/$GASPAR`) by default, select it. When opening a new terminal, we should find ourselves directly in our personal folder, if needed we can move there with `cd` in the terminal. We can install new extensions on VS code, and they will be saved for future sessions.
+From the Kubernetes menu, we can see the IC and the RCP Cluster. We will enter the menu of the RCP Cluster -> Workloads -> Pods, where we will see our container with a green indicator showing that it is running. Right-clicking on it will give us the option to "Attach to Visual Studio". Upon clicking, the editor will open in a new window within the container. We are then invited to open a folder; it should be our personal folder (`/lightscratch/users/$GASPAR`) by default, so select it. When opening a new terminal, we should find ourselves directly in our personal folder; if needed, we can move there with `cd` in the terminal. We can install new extensions in VS Code, and they will be saved for future sessions.
 
 ### Windows (WSL connection)
 
@@ -245,16 +244,16 @@ In **WSL**, claim a job and copy the kube configuration file from WSL to Windows
 # WSL terminal
 
 runai submit \
---name meditron-basic \
+--name base-job \
 --image registry.rcp.epfl.ch/multimeditron/basic:latest-$GASPAR \
---pvc light-scratch:/mloscratch \
+--pvc light-scratch:/lightscratch \
 --large-shm \
--e NAS_HOME=/mloscratch/users/$GASPAR \
--e HF_API_KEY_FILE_AT=/mloscratch/users/$GASPAR/keys/hf_key.txt \
--e WANDB_API_KEY_FILE_AT=/mloscratch/users/$GASPAR/keys/wandb_key.txt \
--e GITCONFIG_AT=/mloscratch/users/$GASPAR/.gitconfig \
--e GIT_CREDENTIALS_AT=/mloscratch/users/$GASPAR/.git-credentials \
--e VSCODE_CONFIG_AT=/mloscratch/users/$GASPAR/.vscode-server \
+-e NAS_HOME=/lightscratch/users/$GASPAR \
+-e HF_API_KEY_FILE_AT=/lightscratch/users/$GASPAR/keys/hf_key.txt \
+-e WANDB_API_KEY_FILE_AT=/lightscratch/users/$GASPAR/keys/wandb_key.txt \
+-e GITCONFIG_AT=/lightscratch/users/$GASPAR/.gitconfig \
+-e GIT_CREDENTIALS_AT=/lightscratch/users/$GASPAR/.git-credentials \
+-e VSCODE_CONFIG_AT=/lightscratch/users/$GASPAR/.vscode-server \
 --backoff-limit 0 \
 --run-as-gid 84257 \
 --node-pool h100 \
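The note in section 5 recommends saving the submit command in a shell script, with one variant for GPU jobs and one for CPU-only jobs. A minimal sketch of a single helper that builds either variant is shown below. Several points are assumptions rather than anything the guide confirms: the function name, the `base-job-<mode>` naming, and the idea that a CPU-only job is obtained simply by omitting `--node-pool h100` and `--gpu 1`. The remaining flags are copied from the command in this commit.

```bash
#!/usr/bin/env bash
# connect.sh (sketch): print the `runai submit` command for a GPU or CPU-only job.
# Hypothetical helper; see the assumptions listed in the paragraph above.

build_submit_cmd() {
  local mode="$1" gaspar="$2"
  local home="/lightscratch/users/${gaspar}"
  local cmd="runai submit --name base-job-${mode}"
  cmd="$cmd --image registry.rcp.epfl.ch/multimeditron/basic:latest-${gaspar}"
  cmd="$cmd --pvc light-scratch:/lightscratch --large-shm"
  cmd="$cmd -e NAS_HOME=${home}"
  cmd="$cmd -e HF_API_KEY_FILE_AT=${home}/keys/hf_key.txt"
  cmd="$cmd -e WANDB_API_KEY_FILE_AT=${home}/keys/wandb_key.txt"
  cmd="$cmd -e GITCONFIG_AT=${home}/.gitconfig"
  cmd="$cmd -e GIT_CREDENTIALS_AT=${home}/.git-credentials"
  cmd="$cmd -e VSCODE_CONFIG_AT=${home}/.vscode-server"
  cmd="$cmd --backoff-limit 0 --run-as-gid 84257"
  # Only GPU jobs request the h100 pool and a GPU (assumption, see above).
  if [ "$mode" = gpu ]; then
    cmd="$cmd --node-pool h100 --gpu 1"
  fi
  # Print the command on one line so it can be reviewed before running.
  echo "$cmd -- sleep infinity"
}

# Review the command first, then run it once it looks right:
# build_submit_cmd gpu "$GASPAR"
# eval "$(build_submit_cmd cpu "$GASPAR")"
```

Printing the command instead of executing it directly keeps the script safe to run while the scratch paths are still being migrated; the `eval` form only works as written because none of the values contain spaces.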
@@ -267,7 +266,7 @@ cp ~/.kube/config /mnt/c/Users/$WINDOWS_USERNAME/.kube/config
 Open VSCode. Install this extension: https://marketplace.visualstudio.com/items?itemName=mtsmfm.vscode-k8s-quick-attach.
 
 To attach VSCode to your job:
-Go to View -> Command Palette (or Ctrl+Shift+P), search for "k8s quick attach: Quick attach k8s Pod" -> rcp-caas -> runai-mlo-GASPAR -> meditron-basic-0-0 -> /mloscratch/users/$GASPAR_USER.
+Go to View -> Command Palette (or Ctrl+Shift+P), search for "k8s quick attach: Quick attach k8s Pod" -> rcp-caas -> runai-mlo-GASPAR -> base-job-0-0 -> /lightscratch/users/$GASPAR_USER.
 
 #### VSCode Troubleshooting
 