Commit 8d8bab9

Add punctuation modifier (#1163)
* Update opus trainer
* Use punct modifier
* Add evaluation support
* Update config generation
* Update readme
1 parent 85ef370 commit 8d8bab9

17 files changed: 153 additions and 133 deletions


docs/training/opus-trainer.md

Lines changed: 7 additions & 2 deletions
@@ -14,8 +14,9 @@ Data augmentation helps make translation models more robust, which is especially
 OpusTrainer augments data on the fly, meaning it will generate unique data for each epoch of training.

 Supported augmentations:
-- **Upper case** - make some sentences from the dataset upper case
-- **Title case** - use title case for some sentences from the dataset
+- **UpperCase** - make some sentences from the dataset upper case
+- **TitleCase** - use title case for some sentences from the dataset
+- **RemoveEndPunct** - removes terminal punctuation mark from the source and target sentences if it matches by type (e.g. `.` and `。`)
 - **Typos** - add random typos in some words
 - **Noise** - insert lines with random unicode noise
 - **Tags (inline noise)** - add emojis and other random Unicode symbols in the source and target sentences in the appropriate positions
@@ -80,6 +81,7 @@ finetune:
 modifiers:
 - UpperCase: 0.1 # Apply randomly to 10% of sentences
 - TitleCase: 0.1
+- RemoveEndPunct: 0.2
 - Typos: 0.05
 - Noise: 0.0005
   min_word_length: 2 # Minimum word length for each word in the noisy sentence
@@ -146,6 +148,8 @@ For example:

 `aug-upper` - applies upper case to the whole dataset

+`aug-punct` - applies modification of punctuation
+
 `aug-noise` - generates extra lines with noise (1 line of noise for each line of the dataset, so the dataset becomes twice longer)

 `aug-inline-noise` - inserts the same random noise in the appropriate positions of the source and target sentences based on dynamically generated alignments.
@@ -168,6 +172,7 @@ so it should only be used on small evaluation datasets.
 - flores_aug-mix_devtest
 - flores_aug-title_devtest
 - flores_aug-upper_devtest
+- flores_aug-punct_devtest
 - flores_aug-typos_devtest
 - flores_aug-noise_devtest
 - flores_aug-inline-noise_devtest
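
The `RemoveEndPunct` description above ("removes terminal punctuation mark from the source and target sentences if it matches by type") can be illustrated with a minimal sketch. This is not OpusTrainer's implementation: the `PUNCT_TYPES` table and the `remove_end_punct` function are hypothetical names introduced here, and the real modifier may group punctuation differently.

```python
from typing import Tuple

# Hypothetical grouping of terminal punctuation by "type" (an assumption made
# for illustration; the real modifier may use a different table).
PUNCT_TYPES = {
    ".": "period",
    "。": "period",
    "?": "question",
    "？": "question",
    "!": "exclamation",
    "！": "exclamation",
}


def remove_end_punct(src: str, trg: str) -> Tuple[str, str]:
    """Strip the final punctuation mark from both sides when the types match."""
    src, trg = src.rstrip(), trg.rstrip()
    if not src or not trg:
        return src, trg
    src_type = PUNCT_TYPES.get(src[-1])
    trg_type = PUNCT_TYPES.get(trg[-1])
    # Only modify the pair when both sides end in the same kind of mark,
    # so the source and target stay consistent with each other.
    if src_type is not None and src_type == trg_type:
        return src[:-1].rstrip(), trg[:-1].rstrip()
    return src, trg


if __name__ == "__main__":
    print(remove_end_punct("Hello world.", "你好世界。"))
    print(remove_end_punct("Is it ready?", "It is ready."))
```

In this sketch a pair whose final marks differ in type (a question mark against a period) is left untouched, since removing only one side would create a mismatch between source and target.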

pipeline/data/parallel_importer.py

Lines changed: 4 additions & 0 deletions
@@ -20,6 +20,7 @@

 from opustrainer.modifiers.noise import NoiseModifier
 from opustrainer.modifiers.placeholders import PlaceholderTagModifier
+from opustrainer.modifiers.punctuation import RemoveEndPunctuationModifier
 from opustrainer.modifiers.surface import TitleCaseModifier, UpperCaseModifier
 from opustrainer.modifiers.typos import TypoModifier
 from opustrainer.types import Modifier
@@ -73,10 +74,12 @@ def get_typos_probs() -> Dict[str, float]:
     "aug-typos": lambda: TypoModifier(PROB_1, **get_typos_probs()),
     "aug-title": lambda: TitleCaseModifier(PROB_1),
     "aug-upper": lambda: UpperCaseModifier(PROB_1),
+    "aug-punct": lambda: RemoveEndPunctuationModifier(PROB_1),
     "aug-noise": lambda: NoiseModifier(PROB_1),
     "aug-inline-noise": lambda: PlaceholderTagModifier(NOISE_PROB, augment=1),
     "aug-mix": lambda: CompositeModifier(
         [
+            RemoveEndPunctuationModifier(MIX_PROB),
             TypoModifier(MIX_PROB, **get_typos_probs()),
             TitleCaseModifier(MIX_PROB),
             UpperCaseModifier(MIX_PROB),
@@ -86,6 +89,7 @@ def get_typos_probs() -> Dict[str, float]:
     ),
     "aug-mix-cjk": lambda: CompositeModifier(
         [
+            RemoveEndPunctuationModifier(MIX_PROB),
             NoiseModifier(MIX_PROB),
             PlaceholderTagModifier(NOISE_MIX_PROB, augment=1),
         ]
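
As a rough illustration of how a dataset name such as `flores_aug-punct_devtest` could be routed to the new `aug-punct` entry, here is a minimal sketch. The regular expression, the `split_augmented_name` helper, and the string stand-ins for the modifier classes are all assumptions made for this example; the actual parsing in `parallel_importer.py` is not shown in this diff and may differ.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Stand-in factories: strings replace the real OpusTrainer modifier objects so
# the sketch runs without the library. Like the lambdas in the diff, each entry
# is only evaluated when it is requested.
MODIFIER_FACTORIES: Dict[str, Callable[[], str]] = {
    "aug-upper": lambda: "UpperCaseModifier",
    "aug-punct": lambda: "RemoveEndPunctuationModifier",
    "aug-inline-noise": lambda: "PlaceholderTagModifier",
}

# Hypothetical pattern: <importer>_<aug-...>_<dataset name>
AUG_PATTERN = re.compile(r"^(?P<importer>[^_]+)_(?P<aug>aug-[a-z-]+)_(?P<name>.+)$")


def split_augmented_name(dataset: str) -> Tuple[str, Optional[str], str]:
    """Split a dataset key into (importer, augmentation or None, name)."""
    match = AUG_PATTERN.match(dataset)
    if match is None:
        importer, _, name = dataset.partition("_")
        return importer, None, name
    return match["importer"], match["aug"], match["name"]


if __name__ == "__main__":
    importer, aug, name = split_augmented_name("flores_aug-punct_devtest")
    print(importer, aug, name, MODIFIER_FACTORIES[aug]())
```

Keeping the map values as zero-argument callables means a modifier is constructed only for the augmentation that was actually requested.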

pipeline/data/requirements/data.in

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-opustrainer==0.4
+opustrainer==0.5
 simalign==0.4
 mtdata==0.4.1
 psutil==6.0.0

pipeline/data/requirements/data.txt

Lines changed: 3 additions & 3 deletions
@@ -489,9 +489,9 @@ opencc==1.1.9 \
     --hash=sha256:c6d5f9756ed08e67de36c53dc4d8f0bdc72889d6f57a8fc4d8b073d99c58d4dc \
     --hash=sha256:f4267b66ed6e656b5d8199f94e9673950ac39d49ebaf0e7927330801f06f038f
     # via -r pipeline/data/requirements/data.in
-opustrainer==0.4 \
-    --hash=sha256:0bdf4adbabd0cdc4e73c99b36d01c0e69178e237adfd28293498b413e26c415c \
-    --hash=sha256:bb973c52c7b4303e68ebc805cb8ad9e55518930131228a62ba112d2b2ab52ec6
+opustrainer==0.5 \
+    --hash=sha256:d8533040747d23c128859d1948e464bbe991d2ae60fd036f416536680c8f08ea \
+    --hash=sha256:e3c61b6ce1c3a7225b1ed927d317bc1d7c0314c0a7bb2278528f341d366ccc70
     # via -r pipeline/data/requirements/data.in
 packaging==24.1 \
     --hash=sha256:026ed72c8ed3fcce5bf8950572258698927fd1dbda10a5e981cdf0ac37f4f002 \

pipeline/train/configs/opustrainer/student.cjk.yml

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,8 @@ train:
 # The default values of the modifiers are taken from the paper https://arxiv.org/pdf/2311.14838.pdf
 # Please refer to docs/opus-trainer.md for further details
 modifiers:
+# Remove terminal punctuation to teach the model translate text without it
+- RemoveEndPunct: 0.2
 # Insert new sentences composed form Unicode noise
 - Noise: 0.0005
   min_word_length: 2 # Minimum word length for each word in the noisy sentence

pipeline/train/configs/opustrainer/student.yml

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@ modifiers:
 # boost upper case a little as we see that the models underperform on upper case dataset on evaluation
 - UpperCase: 0.07 # Apply randomly to 7% of sentences
 - TitleCase: 0.05
+# Remove terminal punctuation to teach the model translate text without it
+- RemoveEndPunct: 0.2
 # Introduce artificial typos in the source text
 - Typos: 0.05
 # Insert new sentences composed form Unicode noise

pipeline/train/configs/opustrainer/teacher.one-stage.cjk.yml

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ train:
 # The default values of the modifiers are taken from the paper https://arxiv.org/pdf/2311.14838.pdf
 # Please refer to docs/opus-trainer.md for further details
 modifiers:
+# Remove terminal punctuation to teach the model translate text without it
+- RemoveEndPunct: 0.2
 ## Insert new sentences composed form Unicode noise
 - Noise: 0.0005
   min_word_length: 2 # Minimum word length for each word in the noisy sentence

pipeline/train/configs/opustrainer/teacher.one-stage.yml

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ train:
 # The default values of the modifiers are taken from the paper https://arxiv.org/pdf/2311.14838.pdf
 # Please refer to docs/opus-trainer.md for further details
 modifiers:
+# Remove terminal punctuation to teach the model translate text without it
+- RemoveEndPunct: 0.2
 # boost upper case a little as we see that the models underperform on upper case dataset on evaluation
 - UpperCase: 0.07 # Apply randomly to 7% of sentences
 - TitleCase: 0.05

pipeline/train/configs/opustrainer/teacher.two-stage.cjk.yml

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,8 @@ finetune:
 # The default values of the modifiers are taken from the paper https://arxiv.org/pdf/2311.14838.pdf
 # Please refer to docs/opus-trainer.md for further details
 modifiers:
+# Remove terminal punctuation to teach the model translate text without it
+- RemoveEndPunct: 0.2
 ## Insert new sentences composed form Unicode noise
 - Noise: 0.0005
   min_word_length: 2 # Minimum word length for each word in the noisy sentence

pipeline/train/configs/opustrainer/teacher.two-stage.yml

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,8 @@ finetune:
 # The default values of the modifiers are taken from the paper https://arxiv.org/pdf/2311.14838.pdf
 # Please refer to docs/opus-trainer.md for further details
 modifiers:
+# Remove terminal punctuation to teach the model translate text without it
+- RemoveEndPunct: 0.2
 # boost upper case a little as we see that the models underperform on upper case dataset on evaluation
 - UpperCase: 0.07 # Apply randomly to 7% of sentences
 - TitleCase: 0.05
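
To make the per-modifier probabilities in these configs concrete (for example `RemoveEndPunct: 0.2`, read as "apply randomly to about 20% of sentences"), the following sketch applies a list of modifiers to each line with an independent random draw per modifier. It is a simplification under stated assumptions: `apply_modifiers` and `strip_final_period` are hypothetical helpers, the modifiers act on a single string instead of a source and target pair, and OpusTrainer's actual sampling logic may differ.

```python
import random
from typing import Callable, List, Tuple

Modifier = Callable[[str], str]


def strip_final_period(line: str) -> str:
    """Toy stand-in for RemoveEndPunct that handles only an ASCII period."""
    return line[:-1] if line.endswith(".") else line


def apply_modifiers(
    line: str, modifiers: List[Tuple[Modifier, float]], rng: random.Random
) -> str:
    """Apply each modifier in order, each with its own probability."""
    for modify, probability in modifiers:
        # rng.random() is in [0, 1), so probability 0 never fires and 1 always does.
        if rng.random() < probability:
            line = modify(line)
    return line


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same result
    config = [(strip_final_period, 0.2), (str.upper, 0.07), (str.title, 0.05)]
    lines = ["This is a sentence."] * 10_000
    modified = [apply_modifiers(line, config, rng) for line in lines]
    share = sum(not line.endswith(".") for line in modified) / len(modified)
    print(f"lines without a final period: {share:.1%}")
```

Because each draw is independent, a single sentence can receive several modifications at once, such as upper casing together with punctuation removal.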
