Commit a7e46d2

switch GPU workers to generic-worker & dockerize GPU tasks (#700)

* switch GPU workers to d2g images

* Switch GPU tasks to run within docker

  The new image we're upgrading GPU workers to uses Ubuntu 24.04, which makes it incompatible with various parts of the pipeline (mostly due to Python package pinning). As it turns out, the easiest way to fix this is to dockerize the GPU tasks.

  We need slight updates to GPU task payloads to accommodate this.

  This will fix #391.

* pull in cuda-toolkit in GPU tasks to make necessary libraries available

  Now that we're running inside a docker image we don't have these available on the filesystem already. I explored the idea of installing them into the Docker image, but it's quite impractical. The host image is Ubuntu 24.04, and the containers are Ubuntu 22.04. We need to have a matching toolkit version, and we require version 12 at this point, which isn't available on 22.04.

* enable cache for ~/.task-cache/pip

  Without this we end up with these files being inaccessible in subsequent tasks.

* fix: add 'requests' to ctranslate2 requirements

  This has always been needed, but it was found on the host system on the previous image.

* update interactive task documentation

  All tasks now use the docker-worker format; no need to distinguish between GPU and non-GPU tasks. Also add a bit more detail about working in the container.

* drop now-unused cuda-toolkit-11 toolchain

* feat: use volume mount for artifacts

  This ensures artifacts will be uploaded if a spot termination happens.

1 parent d97a199 commit a7e46d2

49 files changed: 456 additions, 232 deletions

Some content is hidden

docs/training/task-cluster.md

Lines changed: 4 additions & 12 deletions
@@ -136,24 +136,16 @@ To start an interactive task, follow these steps:
 
 5. Reduce the maxRunTime to a best guess at how long you'll need the task and worker running for. (We pay for every minute a worker runs - so they should not be kept running, eg: overnight.)
 
-6. Adjust the payload to simply run bash and sleep (instead of a full pipeline step). For docker-worker tasks use something like:
+6. Adjust the payload to simply run bash and sleep (instead of a full pipeline step):
 ```
 command:
 - bash
 - '-c'
 - 'sleep 7200'
 ```
 
-For generic-worker tasks (those needing a GPU), use:
-```
-command:
-- - bash
-  - '-c'
-  - 'sleep 7200'
-```
-
-(docker-worker tasks have an `image` section in the payload)
-
 7. Click "Create Task"
 
-After a few minutes you should be able to get a shell (a link will show up in the tab when it's ready).
+After a few minutes you should be able to get a shell (a link will show up in the tab when it's ready). This shell should drop you inside of docker container as root, running the same image as the task you started this process with. Most tasks drop privileges to the `worker` user before doing any work, so you may want to run `su - worker` before doing anything of note.
+
+When you are done with the worker you can use "Cancel" from the three dots menu to immediately shut it down. (This should happen within a few minutes of closing your last shell to the worker, but it's good practice to do it yourself to minimize costs.)
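
Steps 5 and 6 of the updated documentation can be sketched in Python as a transformation of a task payload. This is only an illustration: the helper name `make_interactive_payload`, the sample payload, and the choice to set `maxRunTime` equal to the sleep duration are assumptions, not part of the commit. In practice the payload is edited by hand in the Taskcluster UI.

```python
import copy


def make_interactive_payload(payload, sleep_seconds=7200):
    """Return a copy of a docker-worker style payload that only sleeps.

    Hypothetical helper: the commit edits payloads by hand in the UI.
    """
    if sleep_seconds <= 0:
        raise ValueError("sleep_seconds must be positive")
    interactive = copy.deepcopy(payload)  # leave the caller's payload untouched
    # Step 6: replace the pipeline step with a plain sleep.
    interactive["command"] = ["bash", "-c", f"sleep {sleep_seconds}"]
    # Step 5: cap maxRunTime (in seconds) so the worker is not left running.
    # Using the sleep duration as the cap is an assumption of this sketch.
    interactive["maxRunTime"] = sleep_seconds
    return interactive


# Placeholder payload; the image and command values are made up.
original = {
    "image": "example-image",
    "command": ["bash", "-c", "run-pipeline-step"],
    "maxRunTime": 86400,
}
print(make_interactive_payload(original))
```

Because every task now uses the docker-worker format, one flat `command` list covers GPU and non-GPU tasks alike; the nested list that generic-worker payloads needed is no longer required.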

pipeline/bicleaner/bicleaner.sh

Lines changed: 0 additions & 4 deletions
@@ -13,10 +13,6 @@ echo "###### Bicleaner filtering"
 test -v SRC
 test -v TRG
 test -v CUDA_DIR
-test -v CUDNN_DIR
-
-# cuda and cudnn libs
-export LD_LIBRARY_PATH=${CUDA_DIR}/lib64:${CUDNN_DIR}:${LD_LIBRARY_PATH:+LD_LIBRARY_PATH:}
 
 corpus_prefix=$1
 output_prefix=$2
output_prefix=$2

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 ctranslate2==4.3.1
 sentencepiece==0.2.0
 gpustat==1.1.1
+requests==2.32.3

pipeline/translate/requirements/translate-ctranslate2.txt

Lines changed: 242 additions & 120 deletions
Large diffs are not rendered by default.

taskcluster/config.yml

Lines changed: 14 additions & 14 deletions
@@ -93,39 +93,39 @@ workers:
     worker-type: 'b-linux-large-gcp-1tb-64-512-std-d2g'
 b-linux-v100-gpu:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g'
 b-linux-v100-gpu-4:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4'
 b-linux-v100-gpu-4-300gb:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-300gb'
 b-linux-v100-gpu-4-300gb-standard:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-300gb-standard'
 b-linux-v100-gpu-4-1tb:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-1tb'
 b-linux-v100-gpu-4-2tb:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-2tb'
 b-linux-v100-gpu-4-1tb-standard:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-1tb-standard'
 images:
     provisioner: '{trust-domain}-{level}'
     implementation: docker-worker
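
The effect of replacing `'{alias}'` with explicit worker-type names can be illustrated with a small Python sketch of placeholder substitution. The `resolve_worker` helper, the `translations` trust domain, and the level value are assumptions made for this example; the actual resolution is done by the task graph tooling and may differ in detail.

```python
def resolve_worker(alias, aliases, trust_domain, level):
    """Fill the {trust-domain}, {level} and {alias} placeholders of one alias.

    Hypothetical helper that mimics how the templated values could expand.
    """
    substitutions = {"trust-domain": trust_domain, "level": level, "alias": alias}
    return {
        key: value.format(**substitutions) if isinstance(value, str) else value
        for key, value in aliases[alias].items()
    }


# Entry as it looked before the commit: the worker type equals the alias.
before = {
    "b-linux-v100-gpu-4": {
        "provisioner": "{trust-domain}-{level}",
        "implementation": "generic-worker",
        "os": "linux",
        "worker-type": "{alias}",
    }
}
# Entry after the commit: the d2g worker pool is named explicitly.
after = {
    "b-linux-v100-gpu-4": {
        "provisioner": "{trust-domain}-{level}",
        "implementation": "docker-worker",
        "os": "linux",
        "worker-type": "b-linux-v100-gpu-d2g-4",
    }
}

# "translations" and level 1 are placeholder values for illustration.
print(resolve_worker("b-linux-v100-gpu-4", before, "translations", 1)["worker-type"])
print(resolve_worker("b-linux-v100-gpu-4", after, "translations", 1)["worker-type"])
```

The new pool names insert `-d2g` in the middle of the alias rather than appending a suffix, so they cannot be derived from `{alias}` by simple templating. This may be why the commit spells each one out.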

taskcluster/docker/base/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -59,6 +59,6 @@ ENV SHELL=/bin/bash \
     TERM=hterm-256color
 
 VOLUME /builds/worker/checkouts
-VOLUME /builds/worker/.cache
+VOLUME /builds/worker/.task-cache/pip
 
 USER root

taskcluster/docker/inference/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -68,6 +68,6 @@ ENV SHELL=/bin/bash \
     PATH="/builds/worker/.local/bin:$PATH"
 
 VOLUME /builds/worker/checkouts
-VOLUME /builds/worker/.cache
+VOLUME /builds/worker/.task-cache/pip
 
 USER root

taskcluster/docker/test/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -27,4 +27,4 @@ RUN curl -sSLf "https://github.com/go-task/task/releases/download/v3.35.1/task_l
     | tar -xz -C /usr/local/bin
 
 VOLUME /builds/worker/checkouts
-VOLUME /builds/worker/.cache
+VOLUME /builds/worker/.task-cache/pip

taskcluster/docker/toolchain-build/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -45,6 +45,6 @@ ENV SHELL=/bin/bash \
     PATH="/builds/worker/.local/bin:$PATH"
 
 VOLUME /builds/worker/checkouts
-VOLUME /builds/worker/.cache
+VOLUME /builds/worker/.task-cache/pip
 
 USER root

taskcluster/docker/train/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -16,5 +16,5 @@ RUN apt-get update -qq \
     wget \
     && apt-get clean
 
-
 VOLUME /builds/worker/checkouts
+VOLUME /builds/worker/.task-cache/pip
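
Every Dockerfile in this commit gets the same `VOLUME /builds/worker/.task-cache/pip` line, so a small consistency check could guard against an image being missed later. The Python sketch below is one possible way to do that; the helper names and the sample Dockerfile text are assumptions for illustration, and the parser deliberately ignores backslash line continuations.

```python
import json


def declared_volumes(dockerfile_text):
    """List the paths named by VOLUME instructions in a Dockerfile.

    Handles both `VOLUME /a /b` and the JSON form `VOLUME ["/a", "/b"]`.
    Simplification: instructions split across lines with `\\` are not joined.
    """
    volumes = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or parts[0].upper() != "VOLUME":
            continue
        argument = parts[1].strip()
        if argument.startswith("["):
            volumes.extend(json.loads(argument))
        else:
            volumes.extend(argument.split())
    return volumes


def missing_volumes(dockerfile_text, required):
    """Return the required volume paths that the Dockerfile does not declare."""
    declared = set(declared_volumes(dockerfile_text))
    return [path for path in required if path not in declared]


# Sample text modelled on the lines shown in the diffs above.
sample = """\
VOLUME /builds/worker/checkouts
VOLUME /builds/worker/.task-cache/pip

USER root
"""
required = ["/builds/worker/checkouts", "/builds/worker/.task-cache/pip"]
print(missing_volumes(sample, required))
```

Run against the pre-commit version of a Dockerfile (which declared `/builds/worker/.cache` instead), the check would report the pip cache path as missing.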
