Commit 361ac5a

Split vocabs support (#1051)
* Split vocabs 1
* Split vocabs 2
* Fix export
* Fix test
* Add an option for split vocabs
* Remove todo
* Output a joint vocab
* Output a joint vocab
* Update docs
* Handle tied embeddings
* Update OpusTrainer
* Use tuple
* Fix formatting
* Use vocabs content for comparison
* Support single vocab for using pretrained models
* Fix linter issue
* Add logging
* Add extra export and var check
* Use artifacts dir
* Disable split vocabs
* Run linter
* Fix configuration
* Minor fixes
* Enable split vocabs
1 parent 26402fb commit 361ac5a

File tree: 61 files changed, +1300 −1013 lines


Taskfile.yml (1 addition, 1 deletion)

@@ -65,7 +65,7 @@ tasks:
       summary: |
         The models will be saved to: ./data/taskcluster-model
         Example: `task config-generator -- en fi`
-      deps: [poetry-install-utils]
+      deps: [poetry-install-utils, poetry-install-utils-docker]
       cmds:
         - >-
           PYTHONPATH=$(pwd) poetry run python -W ignore utils/config_generator.py {{.CLI_ARGS}}

docs/training/README.md (12 additions, 0 deletions)

@@ -192,6 +192,18 @@ for example [teacher.train.yml](https://github.com/mozilla/translations/tree/mai

 ### Model training

+#### Vocabulary
+
+Use separate SentencePiece vocabularies for the source and target languages if they use different scripts (for example, Latin and Cyrillic):
+```yaml
+spm-vocab-split: true
+```
+
+The default SentencePiece vocabulary size is 32k; increase it to 64k when using a joint vocabulary for CJK languages:
+```yaml
+spm-vocab-size: 64000
+```
+
 #### Teacher ensemble

 Change to 1 not to use an ensemble of two teachers. The ensemble is more expensive to train and run decoding for,
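The two new README options combine naturally. As an illustration, a config for a CJK pair that keeps a joint vocabulary but enlarges it might contain the following (a hedged sketch; the exact nesting of these keys depends on the experiment config schema):

```yaml
# Hypothetical excerpt of a training config for a CJK language pair:
# keep a single joint vocabulary, but enlarge it from the 32k default.
spm-vocab-split: false
spm-vocab-size: 64000
```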

docs/training/opus-trainer.md (4 additions, 3 deletions)

@@ -90,7 +90,8 @@ modifiers:
   custom_detok_trg: "icu:{trg}"
   augment: 1
   tag: 0
-  spm_vocab: {vocab}
+  spm_vocab_src: {vocab_src}
+  spm_vocab_trg: {vocab_trg}
 seed: 1111

 # parallel sentences + token alignments

@@ -101,8 +102,8 @@ num_fields: 3

 The `Tags` modifier requires whitespace-, Moses- or ICU-tokenized alignments as input.
 Marian requires SentencePiece-tokenized alignments and raw text input.
-To make them compatible, the `Tags` modifier can remap the alignments at the end using the passed SentencePiece model `spm_vocab: vocab.spm` (student model use case).
-If the `spm_vocab` argument is missing, the `Tags` modifier will remove alignments and output only the parallel sentences (teacher model use case).
+To make them compatible, the `Tags` modifier can remap the alignments at the end using the passed SentencePiece model `spm_vocab_*: vocab.spm` (student model use case).
+If the `spm_vocab_trg` argument is missing, the `Tags` modifier will remove alignments and output only the parallel sentences (teacher model use case).

 Currently, ICU-tokenized text and its alignments are passed to OpusTrainer (to work around CJK languages where whitespace-based tokenization doesn't make sense).
 Whitespace is represented with the special symbol "▁" to allow lossless text reconstruction on the OpusTrainer side.
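To make the two use cases concrete, a `Tags` modifier entry for the student case would pass both vocab paths so alignments are remapped, while the teacher case simply omits them. A hedged sketch, using only the options shown above (the `0.05` probability and the vocab filenames are hypothetical):

```yaml
modifiers:
- Tags: 0.05                    # hypothetical application probability
  custom_detok_trg: "icu:{trg}"
  augment: 1
  tag: 0
  # Student use case: both models are passed, so alignments are remapped
  # to SentencePiece tokenization before being handed to Marian.
  spm_vocab_src: vocab.src.spm
  spm_vocab_trg: vocab.trg.spm
```

For the teacher use case, drop the `spm_vocab_src`/`spm_vocab_trg` lines and the modifier removes alignments, emitting only the parallel sentences.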

pipeline/alignments/generate-shortlist.sh (7 additions, 6 deletions)

@@ -16,9 +16,10 @@ echo "###### Generating alignments and shortlist"
 [[ -z "${TRG}" ]] && echo "TRG is empty"

 corpus_prefix=$1
-vocab_path=$2
-output_dir=$3
-threads=$4
+vocab_src=$2
+vocab_trg=$3
+output_dir=$4
+threads=$5

 if [ "$threads" = "auto" ]; then
   threads=$(nproc)

@@ -36,11 +37,11 @@ corpus_trg="${corpus_prefix}.${TRG}.zst"

 echo "### Subword segmentation with SentencePiece"
 zstdmt -dc "${corpus_src}" |
-  parallel --no-notice --pipe -k -j "${threads}" --block 50M "${MARIAN}/spm_encode" --model "${vocab_path}" \
+  parallel --no-notice --pipe -k -j "${threads}" --block 50M "${MARIAN}/spm_encode" --model "${vocab_src}" \
   >"${dir}/corpus.spm.${SRC}"

 zstdmt -dc "${corpus_trg}" |
-  parallel --no-notice --pipe -k -j "${threads}" --block 50M "${MARIAN}/spm_encode" --model "${vocab_path}" \
+  parallel --no-notice --pipe -k -j "${threads}" --block 50M "${MARIAN}/spm_encode" --model "${vocab_trg}" \
   >"${dir}/corpus.spm.${TRG}"

 python3 align.py \

@@ -65,7 +66,7 @@ rm "${dir}/corpus.spm.${SRC}"
 rm "${output_dir}/corpus.aln"

 echo "### Shortlist pruning"
-"${MARIAN}/spm_export_vocab" --model="${vocab_path}" --output="${dir}/vocab.txt"
+"${MARIAN}/spm_export_vocab" --model="${vocab_trg}" --output="${dir}/vocab.txt"
 zstdmt -dc "${dir}/lex.s2t.zst" |
   grep -v NULL |
   python3 "prune_shortlist.py" 100 "${dir}/vocab.txt" |

pipeline/cefilter/score.sh (5 additions, 4 deletions)

@@ -15,9 +15,10 @@ test -v TRG
 test -v WORKSPACE

 model=$1
-vocab=$2
-corpus_prefix=$3
-output=$4
+vocab_src=$2
+vocab_trg=$3
+corpus_prefix=$4
+output=$5

 zstdmt --rm -d "${corpus_prefix}.${SRC}.zst"
 zstdmt --rm -d "${corpus_prefix}.${TRG}.zst"

@@ -27,7 +28,7 @@ mkdir -p "${dir}"

 "${MARIAN}/marian-scorer" \
   --model "${model}" \
-  --vocabs "${vocab}" "${vocab}" \
+  --vocabs "${vocab_src}" "${vocab_trg}" \
   --train-sets "${corpus_prefix}.${TRG}" "${corpus_prefix}.${SRC}" \
   --mini-batch 32 \
   --mini-batch-words 1500 \

pipeline/common/command_runner.py (1 addition, 1 deletion)

@@ -37,7 +37,7 @@ def apply_command_args(dict: dict[str, any]):
         if value is None:
             continue

-        if isinstance(value, list):
+        if isinstance(value, (list, tuple)):
             for v in value:
                 yield str(v)
             continue
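The change lets the helper expand tuples as well as lists into repeated CLI values, which is how the `(vocab_src, vocab_trg)` pair reaches Marian. A minimal sketch of the behavior (a simplified stand-in, not the repository's exact function):

```python
def apply_command_args(args: dict):
    """Yield CLI arguments from a dict: None values become bare flags,
    lists and tuples expand into multiple values after the flag."""
    for key, value in args.items():
        yield f"--{key}"
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            for v in value:
                yield str(v)
            continue
        yield str(value)
```

For example, `{"vocabs": ("src.spm", "trg.spm")}` yields `--vocabs src.spm trg.spm`.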

pipeline/data/requirements/data.in (1 addition, 1 deletion)

@@ -1,4 +1,4 @@
-opustrainer==0.3
+opustrainer==0.4
 simalign==0.4
 mtdata==0.4.1
 psutil==6.0.0

pipeline/data/requirements/data.txt (3 additions, 3 deletions)

@@ -489,9 +489,9 @@ opencc==1.1.9 \
     --hash=sha256:c6d5f9756ed08e67de36c53dc4d8f0bdc72889d6f57a8fc4d8b073d99c58d4dc \
     --hash=sha256:f4267b66ed6e656b5d8199f94e9673950ac39d49ebaf0e7927330801f06f038f
     # via -r pipeline/data/requirements/data.in
-opustrainer==0.3 \
-    --hash=sha256:75d10317ccf92c4ac8618debe23fe35d02b364ed69bd80c7815035c7d10dc5ad \
-    --hash=sha256:acf7050550d08409c12b634e26d1cee279aea8534161214232e6a826715f8a21
+opustrainer==0.4 \
+    --hash=sha256:0bdf4adbabd0cdc4e73c99b36d01c0e69178e237adfd28293498b413e26c415c \
+    --hash=sha256:bb973c52c7b4303e68ebc805cb8ad9e55518930131228a62ba112d2b2ab52ec6
     # via -r pipeline/data/requirements/data.in
 packaging==24.1 \
     --hash=sha256:026ed72c8ed3fcce5bf8950572258698927fd1dbda10a5e981cdf0ac37f4f002 \

pipeline/eval/eval.py (10 additions, 5 deletions)

@@ -117,10 +117,16 @@ def main(args_list: Optional[list[str]] = None) -> None:
         help="The Marian model (or models if its an ensemble) to use for translations",
     )
     parser.add_argument(
-        "--vocab",
+        "--vocab_src",
         required=False,
         type=str,
-        help="The path to a vocab file (optional)",
+        help="The path to a src vocab file (optional)",
+    )
+    parser.add_argument(
+        "--vocab_trg",
+        required=False,
+        type=str,
+        help="The path to a trg vocab file (optional)",
     )
     parser.add_argument(
         "--shortlist",

@@ -176,9 +182,8 @@ def main(args_list: Optional[list[str]] = None) -> None:
     elif not args.model_variant == "cpu":
         raise Exception(f"Unsupported model variant {args.model_variant}")

-    if args.vocab:
-        # Pass in the vocab twice as it's shared between the source and the target.
-        marian_extra_args = [*marian_extra_args, "--vocabs", args.vocab, args.vocab]
+    if args.vocab_src and args.vocab_trg:
+        marian_extra_args = [*marian_extra_args, "--vocabs", args.vocab_src, args.vocab_trg]

     if args.shortlist:
         # The final "false" argument tells Marian not to verify the correctness of the shortlist.
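The evaluation script now only appends `--vocabs` when both paths are supplied; with a joint vocabulary the same file is simply passed for both positions. That branch as a standalone helper (a hedged sketch with a hypothetical function name, not the script's actual structure):

```python
def build_vocab_args(vocab_src, vocab_trg):
    """Return the Marian --vocabs arguments, or nothing when either path
    is absent. A joint vocab can be passed as the same path twice."""
    if vocab_src and vocab_trg:
        return ["--vocabs", vocab_src, vocab_trg]
    return []
```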

pipeline/quantize/export.sh (17 additions, 7 deletions)

@@ -17,8 +17,9 @@ test -v BMT_MARIAN

 model_dir=$1
 shortlist=$2
-vocab=$3
-output_dir=$4
+vocab_src=$3
+vocab_trg=$4
+output_dir=$5

 mkdir -p "${output_dir}"

@@ -30,13 +31,22 @@ shortlist_bin="${output_dir}/lex.50.50.${SRC}${TRG}.s2t.bin"
 "${BMT_MARIAN}"/marian-conv \
   --shortlist "${shortlist}" 50 50 0 \
   --dump "${shortlist_bin}" \
-  --vocabs "${vocab}" "${vocab}"
+  --vocabs "${vocab_src}" "${vocab_trg}"
 pigz "${shortlist_bin}"

-vocab_out="${output_dir}/vocab.${SRC}${TRG}.spm"
-cp "${vocab}" "${vocab_out}"
-pigz "${vocab_out}"
-
+if cmp --silent "${vocab_src}" "${vocab_trg}"; then
+  echo "Vocab files are identical, output a joint vocab"
+  vocab_out="${output_dir}/vocab.${SRC}${TRG}.spm"
+  cp "${vocab_src}" "${vocab_out}"
+  pigz "${vocab_out}"
+else
+  vocab_src_out="${output_dir}/srcvocab.${SRC}${TRG}.spm"
+  vocab_trg_out="${output_dir}/trgvocab.${SRC}${TRG}.spm"
+  cp "${vocab_src}" "${vocab_src_out}"
+  cp "${vocab_trg}" "${vocab_trg_out}"
+  pigz "${vocab_src_out}"
+  pigz "${vocab_trg_out}"
+fi

 echo "### Export is completed. Results: ${output_dir}"
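The export step decides between a joint and a split vocab by byte-comparing the two files (`cmp --silent`), matching the "Use vocabs content for comparison" commit in the message above. The same decision can be sketched in Python with `filecmp` (a hypothetical helper mirroring the shell logic, not part of the pipeline):

```python
import filecmp
import shutil
from pathlib import Path

def export_vocabs(vocab_src: str, vocab_trg: str, output_dir: str,
                  src: str, trg: str) -> list:
    """Copy vocab files into output_dir: one joint vocab when the files
    are byte-identical, otherwise separate src/trg vocabs.
    Returns the list of output paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # shallow=False compares file contents, like `cmp --silent`.
    if filecmp.cmp(vocab_src, vocab_trg, shallow=False):
        joint = out / f"vocab.{src}{trg}.spm"
        shutil.copy(vocab_src, joint)
        return [str(joint)]
    src_out = out / f"srcvocab.{src}{trg}.spm"
    trg_out = out / f"trgvocab.{src}{trg}.spm"
    shutil.copy(vocab_src, src_out)
    shutil.copy(vocab_trg, trg_out)
    return [str(src_out), str(trg_out)]
```

Comparing contents rather than paths is what lets a pretrained model with a single shared vocab still produce the joint `vocab.{SRC}{TRG}.spm` artifact that downstream consumers expect.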
