Commit 4577c11

matt-boris and bhearsum authored
feat: reland new GPU worker images and dockerization (#1154)
* feat: reland new GPU worker images and dockerization (#1131). This is a backout of #1120, except for the bits that were re-landed in #1123. It effectively gets us back to where we were re: GPU workers and dockerization prior to #1120.
* bump to cuda v12.9.0

Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>
1 parent 032ea1d commit 4577c11

File tree

19 files changed: +366 −201 lines


docs/training/task-cluster.md

Lines changed: 4 additions & 12 deletions
@@ -139,24 +139,16 @@ To start an interactive task, follow these steps:

 5. Reduce the maxRunTime to a best guess at how long you'll need the task and worker running for. (We pay for every minute a worker runs - so they should not be kept running, eg: overnight.)
-6. Adjust the payload to simply run bash and sleep (instead of a full pipeline step). For docker-worker tasks use something like:
+6. Adjust the payload to simply run bash and sleep (instead of a full pipeline step):
    ```
    command:
    - bash
    - '-c'
    - 'sleep 7200'
    ```
-
-   For generic-worker tasks (those needing a GPU), use:
-   ```
-   command:
-   - - bash
-     - '-c'
-     - 'sleep 7200'
-   ```
-
-   (docker-worker tasks have an `image` section in the payload)
-
 7. Click "Create Task"
-
-   After a few minutes you should be able to get a shell (a link will show up in the tab when it's ready).
+
+   After a few minutes you should be able to get a shell (a link will show up in the tab when it's ready). This shell should drop you inside a docker container as root, running the same image as the task you started this process with. Most tasks drop privileges to the `worker` user before doing any work, so you may want to run `su - worker` before doing anything of note.
+
+   When you are done with the worker you can use "Cancel" from the three dots menu to immediately shut it down. (This should happen within a few minutes of closing your last shell to the worker, but it's good practice to do it yourself to minimize costs.)
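With generic-worker gone from the GPU pools, both interactive payload shapes in the docs collapse into the docker-worker one. A minimal interactive payload might look like this sketch (the `image` value is a placeholder, not a real image from this repository):

```yaml
# Hypothetical interactive-task payload - the image value is illustrative.
payload:
  maxRunTime: 7200        # keep this low; workers bill for every minute they run
  image: example/translations-train:latest    # docker-worker payloads carry an `image` section
  command:
    - bash
    - '-c'
    - 'sleep 7200'
```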

pipeline/bicleaner/bicleaner.sh

Lines changed: 0 additions & 4 deletions
@@ -16,10 +16,6 @@ python3 -c "from pipeline.common.marian import assert_gpus_available; assert_gpu
 test -v SRC
 test -v TRG
 test -v CUDA_DIR
-test -v CUDNN_DIR
-
-# cuda and cudnn libs
-export LD_LIBRARY_PATH=${CUDA_DIR}/lib64:${CUDNN_DIR}:${LD_LIBRARY_PATH:+LD_LIBRARY_PATH:}

 corpus_prefix=$1
 output_prefix=$2
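The deleted export relied on two bash idioms: `test -v NAME` as a guard for required environment variables, and `${VAR:+word}` conditional expansion to avoid a dangling separator when the variable is empty. (As written, the removed line appears to expand to the literal string `LD_LIBRARY_PATH:` rather than its value; the corrected form of the pattern is shown below.) A standalone sketch, not the pipeline script:

```shell
#!/usr/bin/env bash
set -eu

# `test -v NAME` succeeds only if NAME is set - the script's guard for required env vars.
CUDA_DIR=/opt/cuda
test -v CUDA_DIR

# `${VAR:+${VAR}:}` expands to "value:" when VAR is non-empty and to nothing
# otherwise, so the final path list never gains a stray ":".
unset -v LD_LIBRARY_PATH
NEW_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${CUDA_DIR}/lib64"
echo "$NEW_PATH"        # -> /opt/cuda/lib64

LD_LIBRARY_PATH=/usr/lib
NEW_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${CUDA_DIR}/lib64"
echo "$NEW_PATH"        # -> /usr/lib:/opt/cuda/lib64
```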
Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 ctranslate2==4.3.1
 sentencepiece==0.2.0
 gpustat==1.1.1
+requests==2.32.3

pipeline/translate/requirements/translate-ctranslate2.txt

Lines changed: 242 additions & 120 deletions
Large diffs are not rendered by default.

taskcluster/config.yml

Lines changed: 14 additions & 14 deletions
@@ -97,39 +97,39 @@ workers:
     worker-type: 'b-linux-large-gcp-1tb-64-512-std-d2g'
   b-linux-v100-gpu:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g'
   b-linux-v100-gpu-4:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4'
   b-linux-v100-gpu-4-300gb:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-300gb'
   b-linux-v100-gpu-4-300gb-standard:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-300gb-standard'
   b-linux-v100-gpu-4-1tb:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-1tb'
   b-linux-v100-gpu-4-2tb:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-2tb'
   b-linux-v100-gpu-4-1tb-standard:
     provisioner: '{trust-domain}-{level}'
-    implementation: generic-worker
+    implementation: docker-worker
     os: linux
-    worker-type: '{alias}'
+    worker-type: 'b-linux-v100-gpu-d2g-4-1tb-standard'
   images:
     provisioner: '{trust-domain}-{level}'
     implementation: docker-worker
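The schema difference behind this swap shows up in the docs change earlier in the commit: docker-worker payloads name an image and take a flat command list, while generic-worker payloads take a list of command lists and have no `image` key. A hedged sketch of the two payload shapes (values illustrative):

```yaml
# docker-worker payload shape (what these pools now expect) - values illustrative.
payload:
  image: example/translations-train:latest   # docker-worker requires an image
  command: [bash, '-c', 'nvidia-smi']

# generic-worker payload shape (what these pools used before) - note the
# nested command list and the absence of an `image` key.
# payload:
#   command:
#     - [bash, '-c', 'nvidia-smi']
```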

taskcluster/kinds/backtranslations-mono-trg-translate/kind.yml

Lines changed: 6 additions & 2 deletions
@@ -101,11 +101,14 @@ tasks:
         from-parameters: training_config.marian-args.decoding-backward
     worker-type: b-largegpu
     worker:
+      docker-image: {"in-tree": "train"}
       max-run-time: 2592000
+      volumes:
+        - /builds/worker/artifacts
       artifacts:
         - name: public/build
-          path: artifacts
-          type: directory
+          path: /builds/worker/artifacts
+          type: volume
       env:
         CUDA_DIR: fetches/cuda-toolkit
         CUDNN_DIR: fetches/cuda-toolkit
@@ -135,6 +138,7 @@ tasks:
         pip3 install -U pip==25.0.1 &&
         pip3 install -r $VCS_PATH/pipeline/translate/requirements/translate-ctranslate2.txt &&
         export PYTHONPATH=$PYTHONPATH:$VCS_PATH &&
+        export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$MOZ_FETCHES_DIR/cuda-toolkit/lib64" &&
         python3 $VCS_PATH/pipeline/translate/translate.py
           --input "$MOZ_FETCHES_DIR/file.{{this_chunk}}.zst"
           --models_glob "$MOZ_FETCHES_DIR/*.npz" "$MOZ_FETCHES_DIR/model*/*.npz"

taskcluster/kinds/backtranslations-train-backwards-model/kind.yml

Lines changed: 7 additions & 2 deletions
@@ -74,6 +74,7 @@ tasks:
       cron: b-largegpu
       default: b-largegpu-largedisk
     worker:
+      docker-image: {"in-tree": "train"}
       max-run-time: 2592000
       # train_taskcluster.py exits with 17 if a request to Taskcluster fails
       # 75 - EX_TEMPFAIL, used for when the GPUs aren't available on the machine.
@@ -86,10 +87,12 @@ tasks:

         # Weight & Biases publication token is stored in that secret
         TASKCLUSTER_SECRET: project/translations/level-1/weights-and-biases
+      volumes:
+        - /builds/worker/artifacts
       artifacts:
         - name: public/build
-          path: artifacts
-          type: directory
+          path: /builds/worker/artifacts
+          type: volume

       # Taskcluster proxy is required to read secrets
       taskcluster-proxy: true
@@ -118,6 +121,7 @@ tasks:
         export MARIAN=$MOZ_FETCHES_DIR &&
         export MOZ_FETCHES_DIR=$MOZ_FETCHES_DIR &&
         export PYTHONPATH=$PYTHONPATH:$VCS_PATH &&
+        export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$MOZ_FETCHES_DIR/cuda-toolkit/lib64" &&
         $VCS_PATH/taskcluster/scripts/pipeline/train_taskcluster.py
           backward
           train
@@ -143,6 +147,7 @@ tasks:
     fetches:
       toolchain:
         - marian
+        - cuda-toolkit
       corpus-merge-parallel:
         - artifact: corpus.{src_locale}.zst
           extract: false
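The comments in this hunk reference retryable exit codes: 17 when a Taskcluster request fails inside train_taskcluster.py, and 75 (EX_TEMPFAIL) when the GPUs aren't available on the machine. A minimal sketch of how a wrapper might classify such codes for retry (illustrative only, not the worker's actual retry logic):

```shell
#!/usr/bin/env bash
# Illustrative retry classifier - not the actual worker implementation.
# 17 = Taskcluster request failure, 75 = EX_TEMPFAIL (GPUs unavailable).
is_retryable() {
  case "$1" in
    17|75) return 0 ;;
    *)     return 1 ;;
  esac
}

if is_retryable 75; then echo retry; else echo fail; fi   # -> retry
```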

taskcluster/kinds/corpus-clean-parallel-bicleaner-ai/kind.yml

Lines changed: 12 additions & 6 deletions
@@ -67,17 +67,18 @@ tasks:

     worker-type: b-largegpu-largedisk
     worker:
+      docker-image: {"in-tree": "train"}
+      volumes:
+        - /builds/worker/artifacts
       artifacts:
         - name: public/build
-          path: artifacts
-          type: directory
+          path: /builds/worker/artifacts
+          type: volume
       # 7 days. yes, it can take a while to clean a huge dataset
       max-run-time: 604800
       env:
         SRC: "{src_locale}"
         TRG: "{trg_locale}"
-        CUDA_DIR: fetches/cuda-toolkit
-        CUDNN_DIR: fetches/cuda-toolkit
         # 128 happens when cloning this repository fails
         # 75 is the unix code EX_TEMPFAIL, which indicates a temporary failure.
         # This is used when the GPUs can't be accessed. Bicleaner reverts to CPU
@@ -104,7 +105,10 @@ tasks:
         pip install $MOZ_FETCHES_DIR/kenlm-0.0.0-cp310-cp310-linux_x86_64.whl &&
         pip install -r {bicleaner_reqs} &&
         export PYTHONPATH=$PYTHONPATH:$VCS_PATH &&
-        export PATH=$PATH:~/.local/bin &&
+        export CUDA_DIR="$MOZ_FETCHES_DIR/cuda-toolkit" &&
+        export PATH="$PATH:~/.local/bin:$CUDA_DIR/bin" &&
+        export XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_DIR" &&
+        export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$MOZ_FETCHES_DIR/cuda-toolkit/lib64:$MOZ_FETCHES_DIR/cudnn/lib" &&
         $VCS_PATH/pipeline/bicleaner/bicleaner.sh
           $MOZ_FETCHES_DIR/{dataset_sanitized}
           $TASK_WORKDIR/artifacts/{dataset_sanitized}
@@ -115,12 +119,14 @@ tasks:
       "{provider}": corpus-clean-parallel-{provider}-{dataset_sanitized}-{src_locale}-{trg_locale}
       corpus-clean-parallel-fetch-bicleaner-model: corpus-clean-parallel-fetch-bicleaner-model-{src_locale}-{trg_locale}
     fetches:
+      fetch:
+        - cudnn
       toolchain:
         - artifact: cyhunspell
           extract: false
         - artifact: kenlm
           extract: false
-        - cuda-toolkit-11
+        - cuda-toolkit
       "{provider}":
         - artifact: "{dataset_sanitized}.{src_locale}.zst"
           extract: false
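The inline exports above replace what bicleaner.sh used to set itself: the task now points bicleaner and TensorFlow/XLA at the fetched CUDA toolkit and cuDNN directly. As a standalone sketch (the fetch location is illustrative; the real tasks use the worker-provided `$MOZ_FETCHES_DIR`):

```shell
#!/usr/bin/env bash
set -eu

# Illustrative fetch location; real tasks get $MOZ_FETCHES_DIR from the worker.
MOZ_FETCHES_DIR=/builds/worker/fetches

# Expose the fetched CUDA toolkit's binaries and libraries, and tell XLA
# where the CUDA data directory lives.
export CUDA_DIR="$MOZ_FETCHES_DIR/cuda-toolkit"
export PATH="$PATH:$HOME/.local/bin:$CUDA_DIR/bin"
export XLA_FLAGS="--xla_gpu_cuda_data_dir=$CUDA_DIR"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$CUDA_DIR/lib64:$MOZ_FETCHES_DIR/cudnn/lib"

echo "$XLA_FLAGS"
```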

taskcluster/kinds/distillation-mono-src-translate/kind.yml

Lines changed: 7 additions & 2 deletions
@@ -101,11 +101,14 @@ tasks:

     worker-type: b-largegpu
     worker:
+      docker-image: {"in-tree": "train"}
       max-run-time: 2592000
+      volumes:
+        - /builds/worker/artifacts
       artifacts:
         - name: public/build
-          path: artifacts
-          type: directory
+          path: /builds/worker/artifacts
+          type: volume
       env:
         CUDA_DIR: fetches/cuda-toolkit
         CUDNN_DIR: fetches/cuda-toolkit
@@ -136,6 +139,7 @@ tasks:
         pip3 install -U pip==25.0.1 &&
         pip3 install -r $VCS_PATH/pipeline/translate/requirements/translate-ctranslate2.txt &&
         export PYTHONPATH=$PYTHONPATH:$VCS_PATH &&
+        export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$MOZ_FETCHES_DIR/cuda-toolkit/lib64" &&
         python3 $VCS_PATH/pipeline/translate/translate.py
           --input "$MOZ_FETCHES_DIR/file.{{this_chunk}}.zst"
           --models_glob "$MOZ_FETCHES_DIR/*.npz" "$MOZ_FETCHES_DIR/model*/*.npz"
@@ -152,3 +156,4 @@ tasks:
     fetches:
       toolchain:
         - marian
+        - cuda-toolkit

taskcluster/kinds/distillation-parallel-src-translate/kind.yml

Lines changed: 6 additions & 2 deletions
@@ -102,11 +102,14 @@ tasks:

     worker-type: b-largegpu-xlargedisk
     worker:
+      docker-image: {"in-tree": "train"}
       max-run-time: 2592000
+      volumes:
+        - /builds/worker/artifacts
       artifacts:
         - name: public/build
-          path: artifacts
-          type: directory
+          path: /builds/worker/artifacts
+          type: volume
       env:
         CUDA_DIR: fetches/cuda-toolkit
         CUDNN_DIR: fetches/cuda-toolkit
@@ -133,6 +136,7 @@ tasks:
         pip3 install -U pip==25.0.1 &&
         pip3 install -r $VCS_PATH/pipeline/translate/requirements/translate-ctranslate2.txt &&
         export PYTHONPATH=$PYTHONPATH:$VCS_PATH &&
+        export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$MOZ_FETCHES_DIR/cuda-toolkit/lib64" &&
         python3 $VCS_PATH/pipeline/translate/translate.py
           --input "$MOZ_FETCHES_DIR/file.{{this_chunk}}.zst"
           --models_glob "$MOZ_FETCHES_DIR/*.npz" "$MOZ_FETCHES_DIR/model*/*.npz"

0 commit comments