
Commit 19ec4b0

Add url mismatch filter + fixes and cleanup (#1157)
* Remove experimental custom filters
* Add url mismatch filter
* Fix zh filters
* Update opus cleaner
* Remove old cleaning
* Add opus cleaner test
* Fix filter name
1 parent b765f6f commit 19ec4b0


43 files changed: 1717 additions & 1856 deletions

docs/data-and-cleaning/index.md

Lines changed: 18 additions & 45 deletions
@@ -2,49 +2,11 @@
 
 Making datasets less noisy to improve quality of translation.
 
-## Regular pipeline
-
-Config setting:
-```
-use-opuscleaner: false
-```
-
-### Dataset fixing
-
-Some datasets require fixes like detokenization.
-Dataset and language specific fixes are implemented in [https://github.com/mozilla/translations/tree/main/pipeline/clean/fixes](https://github.com/mozilla/translations/tree/main/pipeline/clean/fixes).
-Naming convention:
-- `<dataset_name>.sh` for parallel dataset cleaning
-- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
-- `/` in dataset name should be replaced with `_`
-
-### Cleaning scripts
-
-Make sure the language is present in [clean_parallel](https://github.com/mozilla/translations/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script.
-
-
-### Bicleaner
-
-It is recommended to use Bicleaner ML models to filter noisy data.
-See the [bicleaner documentation](bicleaner.md) for more details on how to configure it.
-
-
 ## OpusCleaner
 
-Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project.
-
-Config setting:
-```
-use-opuscleaner: "true"
-```
-
-To enable custom per-dataset filter configs add:
-```
-opuscleaner-mode: "custom"
-```
-
+We use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project.
 
-## Custom filter configs
+### Custom filter configs
 
 The idea behind OpusCleaner is customizing filter rules for each language pair and dataset
 to get a training corpus with less noise and train higher quality translation models.
@@ -74,9 +36,12 @@ or to
 
 `pipeline/clean/opuscleaner/configs/` for dataset specific filters that will apply to all language pairs.
 
-Make sure to replace the language codes to the template values `<src>` and `<trg>`. See examples in the directory.
+Make sure to replace the language codes to the template values `<src>` and `<trg>` and remove absolutes paths from the `"files"` section.
+See examples in the directory.
 
-### Default config
+### Default configs
+
+Set `opuscleaner-mode: custom` in the training config to use custom per-dataset and per-language pair configs.
 
 If no custom config was specified for the dataset,
 the [default config template](https://github.com/mozilla/translations/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.
@@ -85,7 +50,15 @@ Modify if needed. Some rules require specifying source or target language.
 The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair.
 The generated default config will be copied to the target dataset cleaning directory.
 
-### Running
+The config is chosen based on this search order:
+1. Dataset and language specific: `configs/<language-pair>/<dataset>.filter.json`
+2. Language specific: `configs/<language-pair>/default.filter.json`
+3. Dataset specific: `configs/<dataset>.filter.json`
+4. Default: `configs/default.filter.json`
+
+The first found config will be applied.
+
+## Bicleaner
 
-Enable OpusCleaner in the training pipeline config and run the pipeline as usual.
-OpusCleaner will replace the default [corpus-clean-parallel](https://github.com/mozilla/translations/tree/main/pipeline/clean/clean-parallel.sh) script.
+It is recommended to use Bicleaner ML models to filter noisy data.
+See the [bicleaner documentation](bicleaner.md) for more details on how to configure it.
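
The config search order and the `<src>`/`<trg>` substitution described in this documentation change can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline's own code: `find_filter_config` and `render_filter_config` are hypothetical helpers, the demo file contents are made up, and the `.filters.json` suffix is assumed from the config files touched elsewhere in this commit (the documentation text itself writes `.filter.json`).

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_filter_config(configs_dir: Path, src: str, trg: str, dataset: str) -> Optional[Path]:
    """Return the first existing config following the documented search order."""
    pair = f"{src}-{trg}"
    candidates = [
        configs_dir / pair / f"{dataset}.filters.json",  # 1. dataset and language specific
        configs_dir / pair / "default.filters.json",     # 2. language specific
        configs_dir / f"{dataset}.filters.json",         # 3. dataset specific
        configs_dir / "default.filters.json",            # 4. default
    ]
    return next((path for path in candidates if path.is_file()), None)


def render_filter_config(template_text: str, src: str, trg: str) -> dict:
    """Replace the <src>/<trg> placeholders and parse the result as JSON."""
    return json.loads(template_text.replace("<src>", src).replace("<trg>", trg))


# Demo on a throwaway directory tree (hypothetical contents).
template = (
    '{"files": [], "filters": [{"filter": "fasttext_filter", '
    '"parameters": {"LANG1": "<src>", "LANG2": "<trg>"}, "language": null}]}'
)
with tempfile.TemporaryDirectory() as tmp:
    configs = Path(tmp)
    (configs / "en-zh").mkdir()
    (configs / "default.filters.json").write_text(template)
    (configs / "en-zh" / "default.filters.json").write_text(template)

    chosen_zh = find_filter_config(configs, "en", "zh", "WikiMatrix-v1")
    chosen_ko = find_filter_config(configs, "en", "ko", "WikiMatrix-v1")
    chosen_zh_rel = chosen_zh.relative_to(configs).as_posix()
    chosen_ko_rel = chosen_ko.relative_to(configs).as_posix()
    rendered = render_filter_config(chosen_zh.read_text(), "en", "zh")

print(chosen_zh_rel)
print(chosen_ko_rel)
print(rendered["filters"][0]["parameters"])
```

In the demo, `en-zh` resolves to its language-specific default because no dataset-specific file exists for that pair, while `en-ko` falls through to the global default.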

docs/training/README.md

Lines changed: 1 addition & 6 deletions
@@ -155,12 +155,7 @@ For example:
 
 ## 4. Configure data cleaning
 
-The default configuration should work.
-
-OpusCleaner is enabled by default:
-```yaml
-use-opuscleaner: true
-```
+The default configuration should work.
 
 Enable OpusCleaner [custom cleaning rules](https://github.com/mozilla/translations/tree/main/pipeline/clean/opuscleaner/configs) if you want to use them for your language pair.
 ```yaml

pipeline/clean/clean-parallel.sh

Lines changed: 0 additions & 114 deletions
This file was deleted.

pipeline/clean/opuscleaner/clean-parallel.sh

Lines changed: 3 additions & 1 deletion
@@ -28,7 +28,9 @@ mkdir -p "${dir}"
 echo "Downloading FastText model."
 # pre-download fast text model as it's causing constant issues
 filters_dir="/builds/worker/.local/lib/python3.10/site-packages/opuscleaner/filters"
-wget -O "${filters_dir}/large.bin" https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
+if [ -d ${filters_dir} ]; then
+  wget -O "${filters_dir}/large.bin" https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
+fi
 
 echo "### Generating cleaning config: ${dataset}.${SRC}-${TRG}.filters.json"
 # save new filter to dataset output dir

pipeline/clean/opuscleaner/configs/default.filters.json

Lines changed: 7 additions & 0 deletions
@@ -100,6 +100,13 @@
       },
       "language": null
     },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
     {
       "filter": "fasttext_filter",
       "parameters": {
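
The diff registers `url_mismatch` in the config, but the filter's implementation is not shown on this page. Judging only by the name, one plausible behavior is to drop sentence pairs whose source and target sides contain different URLs. The Python sketch below illustrates that idea; the regular expression, the function names, and the exact matching rule are all assumptions, not the code added by the commit.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Hypothetical URL pattern; the real filter may detect URLs differently.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(text: str) -> List[str]:
    """Return the sorted URLs in text, without trailing sentence punctuation."""
    return sorted(url.rstrip(".,;:!?)") for url in URL_RE.findall(text))


def urls_match(src: str, trg: str) -> bool:
    """True when both sides contain exactly the same URLs (possibly none)."""
    return extract_urls(src) == extract_urls(trg)


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep only sentence pairs whose URLs agree on both sides."""
    for src, trg in pairs:
        if urls_match(src, trg):
            yield src, trg


pairs = [
    ("See https://example.com/docs.", "Siehe https://example.com/docs."),
    ("Visit https://example.com/en", "Besuchen Sie https://example.org/de"),
    ("No links here", "Keine Links hier"),
]
kept = list(filter_pairs(pairs))
print(kept)
```

Under this reading, the second pair is dropped because the two sides link to different addresses, and pairs without any URL pass through unchanged.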

pipeline/clean/opuscleaner/configs/en-ja/default.filters.json

Lines changed: 7 additions & 0 deletions
@@ -119,6 +119,13 @@
       },
       "language": null
     },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
     {
       "filter": "fasttext_filter",
       "parameters": {

pipeline/clean/opuscleaner/configs/en-ko/default.filters.json

Lines changed: 7 additions & 0 deletions
@@ -56,6 +56,13 @@
       },
       "language": null
     },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
     {
       "filter": "fasttext_filter",
       "parameters": {

pipeline/clean/opuscleaner/configs/en-zh/WikiMatrix-v1.filters.json

Lines changed: 120 additions & 0 deletions
@@ -3,6 +3,126 @@
   "files": [
   ],
   "filters": [
+    {
+      "filter": "remove_empty_lines",
+      "parameters": {},
+      "language": null
+    },
+    {
+      "filter": "normalize_whitespace",
+      "parameters": {
+        "COLLAPSE": true
+      },
+      "language": "<src>"
+    },
+    {
+      "filter": "deescape-special-chars",
+      "parameters": {
+        "LANG1": "other"
+      },
+      "language": "<src>"
+    },
+    {
+      "_comment": "Normalize to full-width punctuation",
+      "filter": "opus.RegExpSub",
+      "parameters": {
+        "patterns": [
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff!\uff01\uff1f\\?])\\s*\\?",
+            "\\1\uff1f",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff!\uff01\uff1f\\?])\\s*\\!",
+            "\\1\uff01",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff])\\.\\s*(?!\\s*[\\.a-zA-Z])",
+            "\\1\uff61",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff])\\s*:",
+            "\\1\uff1a",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff])\\s*;",
+            "\\1\uff1b",
+            0,
+            ""
+          ],
+          [
+            "\\(",
+            "\uff08",
+            0,
+            ""
+          ],
+          [
+            "\\)",
+            "\uff09",
+            0,
+            ""
+          ]
+        ]
+      },
+      "language": "<trg>"
+    },
+    {
+      "filter": "max_length",
+      "parameters": {
+        "MAXLENGTH": 150,
+        "MINLENGTH": 1
+      },
+      "language": null
+    },
+    {
+      "filter": "fix_wiki",
+      "parameters": {
+        "ALWAYS": false,
+        "FOOTNOTES": true,
+        "URLS": true,
+        "WIKILINKS": true,
+        "CODE": true,
+        "HEADINGS": true,
+        "REMOVEEMPTYLINES": true
+      },
+      "language": null
+    },
+    {
+      "filter": "alpha_ratio",
+      "parameters": {
+        "LANG1": "<src>",
+        "LANG2": "<trg>",
+        "SRCWORDRAT": 0.4,
+        "TRGWORDRAT": 0.0,
+        "SRCALPHARAT": 0.5,
+        "TRGALPHARAT": 0.0,
+        "DEBUG": false
+      },
+      "language": null
+    },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
+    {
+      "filter": "fasttext_filter",
+      "parameters": {
+        "FASTTEXT_MODEL_TYPE": "large",
+        "LANG1": "<src>",
+        "LANG2": "<trg>"
+      },
+      "language": null
+    },
     {
       "filter": "regexp",
       "parameters": {
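
To see what the `opus.RegExpSub` patterns added above do to a target-side sentence, they can be replayed with Python's `re.sub`. This is a rough sketch: it assumes each four-element entry means pattern, replacement, count, and flags (with `0` meaning "replace all" and the empty flags ignored), and that the patterns are applied in order. `normalize_punctuation` is a hypothetical helper, not part of the pipeline.

```python
import re

# CJK ranges used by the patterns in the diff.
CJK = r"\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff"

# (pattern, replacement) pairs transcribed from the "opus.RegExpSub" entry.
PATTERNS = [
    (rf"([{CJK}!\uff01\uff1f\?])\s*\?", "\\1\uff1f"),        # ? -> fullwidth question mark
    (rf"([{CJK}!\uff01\uff1f\?])\s*\!", "\\1\uff01"),        # ! -> fullwidth exclamation mark
    (rf"([{CJK}])\.\s*(?!\s*[\.a-zA-Z])", "\\1\uff61"),      # . -> U+FF61
    (rf"([{CJK}])\s*:", "\\1\uff1a"),                        # : -> fullwidth colon
    (rf"([{CJK}])\s*;", "\\1\uff1b"),                        # ; -> fullwidth semicolon
    (r"\(", "\uff08"),                                       # ( -> fullwidth left parenthesis
    (r"\)", "\uff09"),                                       # ) -> fullwidth right parenthesis
]


def normalize_punctuation(text: str) -> str:
    """Apply every substitution in order, replacing all occurrences (count=0)."""
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text, count=0)
    return text


for sample in ["你好?", "真的吗?!", "(注)", "结束.", "Hello?"]:
    print(sample, "->", normalize_punctuation(sample))
```

ASCII punctuation after a CJK character is rewritten, while the purely Latin `Hello?` is left alone. One detail worth checking against the author's intent: `\uff61` is the halfwidth ideographic full stop, whereas the ideographic full stop more commonly seen in Chinese text is U+3002; the diff alone does not say which was meant.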
