docs/training/opus-trainer.md
7 additions & 2 deletions
@@ -14,8 +14,9 @@ Data augmentation helps make translation models more robust, which is especially
 OpusTrainer augments data on the fly, meaning it will generate unique data for each epoch of training.
 
 Supported augmentations:
-- **Upper case** - make some sentences from the dataset upper case
-- **Title case** - use title case for some sentences from the dataset
+- **UpperCase** - make some sentences from the dataset upper case
+- **TitleCase** - use title case for some sentences from the dataset
+- **RemoveEndPunct** - removes terminal punctuation mark from the source and target sentences if it matches by type (e.g. `.` and `。`)
 - **Typos** - add random typos in some words
 - **Noise** - insert lines with random unicode noise
 - **Tags (inline noise)** - add emojis and other random Unicode symbols in the source and target sentences in the appropriate positions
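To make the "matches by type" condition in the RemoveEndPunct description concrete, here is a minimal Python sketch. It is not OpusTrainer's implementation: the function name `remove_end_punct` and the `PUNCT_TYPES` table are hypothetical, and the real modifier may recognise a different set of marks.

```python
from typing import Tuple

# Hypothetical grouping of terminal punctuation by type (assumption, not
# taken from OpusTrainer): marks in the same group count as a match.
PUNCT_TYPES = {
    ".": "period", "。": "period",
    "?": "question", "？": "question",
    "!": "exclamation", "！": "exclamation",
}


def remove_end_punct(source: str, target: str) -> Tuple[str, str]:
    """Strip the final punctuation mark from both sentences, but only when
    both end with a mark of the same type; otherwise return them unchanged."""
    src = source.rstrip()
    trg = target.rstrip()
    if not src or not trg:
        return source, target
    src_type = PUNCT_TYPES.get(src[-1])
    trg_type = PUNCT_TYPES.get(trg[-1])
    if src_type is not None and src_type == trg_type:
        return src[:-1].rstrip(), trg[:-1].rstrip()
    return source, target
```

Under these assumptions, a pair such as `("Hello.", "你好。")` loses both marks, while a pair whose marks differ in type, or where only one side ends in punctuation, is left as it is.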
@@ -80,6 +81,7 @@ finetune:
 modifiers:
 - UpperCase: 0.1 # Apply randomly to 10% of sentences
 - TitleCase: 0.1
+- RemoveEndPunct: 0.2
 - Typos: 0.05
 - Noise: 0.0005
   min_word_length: 2 # Minimum word length for each word in the noisy sentence
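As a rough illustration of what the per-modifier probabilities in this config mean, the sketch below applies each modifier to a sentence with its configured probability. The `apply_modifiers` helper is hypothetical and assumes modifiers are sampled independently for every sentence. It does not cover line-inserting modifiers such as Noise, and OpusTrainer's actual logic may differ.

```python
import random
from typing import Callable, List, Tuple

# A modifier here is just (function, probability); this pairing is an
# illustrative assumption, not OpusTrainer's internal representation.
Modifier = Tuple[Callable[[str], str], float]


def apply_modifiers(
    lines: List[str], modifiers: List[Modifier], rng: random.Random
) -> List[str]:
    """Apply each modifier to each line with its own probability."""
    out = []
    for line in lines:
        for fn, prob in modifiers:
            # rng.random() is uniform in [0, 1), so this fires with
            # probability `prob` for each (line, modifier) pair.
            if rng.random() < prob:
                line = fn(line)
        out.append(line)
    return out


# Usage: roughly 10% of sentences upper-cased and 10% title-cased,
# mirroring `UpperCase: 0.1` and `TitleCase: 0.1`.
rng = random.Random(0)
augmented = apply_modifiers(
    ["hello world"] * 1000, [(str.upper, 0.1), (str.title, 0.1)], rng
)
```

Because the sampling happens on every pass over the data, each epoch sees a different subset of modified sentences, which matches the "on the fly" behaviour described earlier in the doc.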
@@ -146,6 +148,8 @@ For example:
 
 `aug-upper` - applies upper case to the whole dataset
 
+`aug-punct` - applies modification of punctuation
+
 `aug-noise` - generates extra lines with noise (1 line of noise for each line of the dataset, so the dataset becomes twice longer)
 
 `aug-inline-noise` - inserts the same random noise in the appropriate positions of the source and target sentences based on dynamically generated alignments.
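The doubling effect described for `aug-noise` can be sketched as follows. This is only an illustration: the `add_noise_lines` helper, the choice of symbol range, and the decision to use the same noise string on both sides are assumptions, not the pipeline's actual noise generator.

```python
import random
from typing import List, Tuple


def add_noise_lines(
    pairs: List[Tuple[str, str]], rng: random.Random, length: int = 10
) -> List[Tuple[str, str]]:
    """Emit one extra noise pair after every original pair, so the
    output has exactly twice as many lines as the input."""
    out = []
    for src, trg in pairs:
        out.append((src, trg))
        # Hypothetical noise: random characters from the Unicode
        # "Miscellaneous Symbols" block (U+2600..U+26FF).
        noise = "".join(chr(rng.randint(0x2600, 0x26FF)) for _ in range(length))
        out.append((noise, noise))
    return out
```

Since every line gains a companion noise line, the cost of evaluation doubles as well, which is worth keeping in mind when picking datasets for this modifier.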
@@ -168,6 +172,7 @@ so it should only be used on small evaluation datasets.