diff --git a/docs/data-and-cleaning/index.md b/docs/data-and-cleaning/index.md
Lines changed: 18 additions & 45 deletions
diff --git a/docs/training/README.md b/docs/training/README.md
Lines changed: 1 addition & 6 deletions
diff --git a/pipeline/clean/clean-parallel.sh b/pipeline/clean/clean-parallel.sh
Lines changed: 0 additions & 114 deletions
diff --git a/pipeline/clean/opuscleaner/clean-parallel.sh b/pipeline/clean/opuscleaner/clean-parallel.sh
Lines changed: 3 additions & 1 deletion
diff --git a/pipeline/clean/opuscleaner/configs/default.filters.json b/pipeline/clean/opuscleaner/configs/default.filters.json
Lines changed: 7 additions & 0 deletions
diff --git a/pipeline/clean/opuscleaner/configs/en-ja/default.filters.json b/pipeline/clean/opuscleaner/configs/en-ja/default.filters.json
Lines changed: 7 additions & 0 deletions
diff --git a/pipeline/clean/opuscleaner/configs/en-ko/default.filters.json b/pipeline/clean/opuscleaner/configs/en-ko/default.filters.json
Lines changed: 7 additions & 0 deletions
diff --git a/pipeline/clean/opuscleaner/configs/en-zh/WikiMatrix-v1.filters.json b/pipeline/clean/opuscleaner/configs/en-zh/WikiMatrix-v1.filters.json
Lines changed: 120 additions & 0 deletions
@@ -2,49 +2,11 @@
 
 Making datasets less noisy to improve quality of translation.
 
-## Regular pipeline
-
-Config setting:
-```
-  use-opuscleaner: false
-```
-
-### Dataset fixing
-
-Some datasets require fixes like detokenization.
-Dataset and language specific fixes are implemented in [https://github.com/mozilla/translations/tree/main/pipeline/clean/fixes](https://github.com/mozilla/translations/tree/main/pipeline/clean/fixes).
-Naming convention:
-- `<dataset_name>.sh` for parallel dataset cleaning
-- `<dataset_name>.<lang>.sh` for language specific cleaning of parallel or monolingual dataset
-- `/` in dataset name should be replaced with `_`
-
-### Cleaning scripts
-
-Make sure the language is present in [clean_parallel](https://github.com/mozilla/translations/tree/main/pipeline/clean/tools/clean_parallel.py#L19) script.
-
-
-### Bicleaner
-
-It is recommended to use Bicleaner ML models to filter noisy data.
-See the [bicleaner documentation](bicleaner.md) for more details on how to configure it.
-
-
 ## OpusCleaner
 
-Another option is to use an all-in-one cleaning tool [OpusCleaner](https://github.com/hplt-project/OpusCleaner) by HPLT project.
-
-Config setting:
-```
-  use-opuscleaner: "true"
-```
-
-To enable custom per-dataset filter configs add:
-```
-  opuscleaner-mode: "custom"
-```
-
+We use [OpusCleaner](https://github.com/hplt-project/OpusCleaner), an all-in-one cleaning tool by the HPLT project.
 
-## Custom filter configs
+### Custom filter configs
 
 The idea behind OpusCleaner is customizing filter rules for each language pair and dataset
 to get a training corpus with less noise and train higher quality translation models.
@@ -74,9 +36,12 @@ or to
 
 `pipeline/clean/opuscleaner/configs/` for dataset specific filters that will apply to all language pairs.
 
-Make sure to replace the language codes to the template values `<src>` and `<trg>`. See examples in the directory.
+Make sure to replace the language codes with the template values `<src>` and `<trg>`, and remove absolute paths from the `"files"` section.
+See examples in the directory.
 
-### Default config
+### Default configs
+
+Set `opuscleaner-mode: custom` in the training config to use custom per-dataset and per-language-pair configs.
 
 If no custom config was specified for the dataset, 
 the [default config template](https://github.com/mozilla/translations/tree/main/pipeline/clean/opuscleaner/configs/default.filters.json) will be used.
@@ -85,7 +50,15 @@ Modify if needed. Some rules require specifying source or target language.
 The `<src>` and `<trg>` in the template will be automatically replaced with the trained language pair.
 The generated default config will be copied to the target dataset cleaning directory.
 
-### Running 
+The config is chosen based on this search order:
+1. Dataset and language specific: `configs/<language-pair>/<dataset>.filters.json`
+2. Language specific: `configs/<language-pair>/default.filters.json`
+3. Dataset specific: `configs/<dataset>.filters.json`
+4. Default: `configs/default.filters.json`
+
+The first config found is applied.
+
+## Bicleaner
 
-Enable OpusCleaner in the training pipeline config and run the pipeline as usual. 
-OpusCleaner will replace the default [corpus-clean-parallel](https://github.com/mozilla/translations/tree/main/pipeline/clean/clean-parallel.sh) script.
+It is recommended to use Bicleaner ML models to filter noisy data.
+See the [bicleaner documentation](bicleaner.md) for more details on how to configure it.
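
The config search order described under "Default configs" can be sketched as a small lookup. This is only an illustration, not the pipeline's actual code: the function name `find_filter_config` is hypothetical, and it assumes the `.filters.json` suffix and the `<src>-<trg>` directory naming used by the config files touched in this change (for example `en-zh/WikiMatrix-v1.filters.json`).

```python
from pathlib import Path
from typing import Optional


def find_filter_config(configs_dir, src: str, trg: str, dataset: str) -> Optional[Path]:
    """Return the first existing config, most specific first (hypothetical helper)."""
    root = Path(configs_dir)
    pair = f"{src}-{trg}"  # assumed directory naming, e.g. "en-zh"
    candidates = [
        root / pair / f"{dataset}.filters.json",  # 1. dataset and language specific
        root / pair / "default.filters.json",     # 2. language specific
        root / f"{dataset}.filters.json",         # 3. dataset specific
        root / "default.filters.json",            # 4. default
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # nothing found, not even the default template
```

Listing the candidates from most to least specific keeps the precedence visible in one place, so a language-pair default always wins over a dataset-only config, as the documented order requires.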
@@ -155,12 +155,7 @@ For example:
 
 ## 4. Configure data cleaning
 
-The default configuration should work. 
-
-OpusCleaner is enabled by default:
-```yaml
-  use-opuscleaner: true
-```
+The default configuration should work.
 
 Enable OpusCleaner [custom cleaning rules](https://github.com/mozilla/translations/tree/main/pipeline/clean/opuscleaner/configs) if you want to use them for your language pair.
 ```yaml
 
@@ -28,7 +28,9 @@ mkdir -p "${dir}"
 echo "Downloading FastText model."
 # pre-download fast text model as it's causing constant issues
 filters_dir="/builds/worker/.local/lib/python3.10/site-packages/opuscleaner/filters"
-wget -O "${filters_dir}/large.bin" https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
+if [ -d "${filters_dir}" ]; then
+  wget -O "${filters_dir}/large.bin" https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
+fi
 
 echo "### Generating cleaning config: ${dataset}.${SRC}-${TRG}.filters.json"
 # save new filter to dataset output dir
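
The config generation announced by the `echo` line above fills the `<src>` and `<trg>` placeholders of a template with the language pair being trained. A minimal sketch of that substitution follows. It is an assumption about the idea, not a copy of what `clean-parallel.sh` does, and `render_filter_config` is a hypothetical name.

```python
import json

# A cut-down template in the shape of the filter configs in this change.
TEMPLATE = """
{
  "files": [],
  "filters": [
    {
      "filter": "fasttext_filter",
      "parameters": {
        "FASTTEXT_MODEL_TYPE": "large",
        "LANG1": "<src>",
        "LANG2": "<trg>"
      },
      "language": null
    }
  ]
}
"""


def render_filter_config(template_text: str, src: str, trg: str) -> dict:
    """Replace the placeholders, then parse to confirm the result is valid JSON."""
    rendered = template_text.replace("<src>", src).replace("<trg>", trg)
    return json.loads(rendered)


config = render_filter_config(TEMPLATE, "en", "zh")
print(config["filters"][0]["parameters"])
```

Parsing the rendered text with `json.loads` is a cheap sanity check: a template broken by a bad edit fails here rather than later inside the cleaning run.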
 
@@ -100,6 +100,13 @@
       },
       "language": null
     },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
     {
       "filter": "fasttext_filter",
       "parameters": {
 
@@ -119,6 +119,13 @@
       },
       "language": null
     },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
     {
       "filter": "fasttext_filter",
       "parameters": {
 
@@ -56,6 +56,13 @@
       },
       "language": null
     },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
     {
       "filter": "fasttext_filter",
       "parameters": {
 
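The `opus.RegExpSub` filter in the `en-zh/WikiMatrix-v1.filters.json` hunk that follows rewrites ASCII punctuation on the target side. The sketch below replays a subset of its patterns with Python's `re` module to show their effect. It assumes each four-element entry means (pattern, replacement, count, flags) and that entries are applied in order with `re.sub`; the real filter may differ in details. The `normalize` helper and the sample strings are illustrative only.

```python
import re

# CJK ranges shared by the patterns in the config.
CJK = r"\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff"

# (pattern, replacement) pairs copied from the config after JSON decoding.
PATTERNS = [
    (rf"([{CJK}!\uff01\uff1f\?])\s*\?", "\\1\uff1f"),
    (rf"([{CJK}!\uff01\uff1f\?])\s*\!", "\\1\uff01"),
    (rf"([{CJK}])\.\s*(?!\s*[\.a-zA-Z])", "\\1\uff61"),
    (r"\(", "\uff08"),
    (r"\)", "\uff09"),
]


def normalize(text: str) -> str:
    """Apply every substitution in order; count=0 replaces all matches."""
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text, count=0)
    return text


for sample in ["你好?", "真的 !", "(注)", "v1.2"]:
    print(repr(sample), "->", repr(normalize(sample)))
```

The question mark, exclamation mark, and period rules only fire after a CJK character (or, for the first two, after another question or exclamation mark), so Latin text such as `v1.2` passes through unchanged. The parenthesis rules are unconditional.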
@@ -3,6 +3,126 @@
   "files": [
   ],
   "filters": [
+    {
+      "filter": "remove_empty_lines",
+      "parameters": {},
+      "language": null
+    },
+    {
+      "filter": "normalize_whitespace",
+      "parameters": {
+        "COLLAPSE": true
+      },
+      "language": "<src>"
+    },
+    {
+      "filter": "deescape-special-chars",
+      "parameters": {
+        "LANG1": "other"
+      },
+      "language": "<src>"
+    },
+    {
+      "_comment": "Normalize to full-width punctuation",
+      "filter": "opus.RegExpSub",
+      "parameters": {
+        "patterns": [
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff!\uff01\uff1f\\?])\\s*\\?",
+            "\\1\uff1f",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff!\uff01\uff1f\\?])\\s*\\!",
+            "\\1\uff01",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff])\\.\\s*(?!\\s*[\\.a-zA-Z])",
+            "\\1\uff61",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff])\\s*:",
+            "\\1\uff1a",
+            0,
+            ""
+          ],
+          [
+            "([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff])\\s*;",
+            "\\1\uff1b",
+            0,
+            ""
+          ],
+          [
+            "\\(",
+            "\uff08",
+            0,
+            ""
+          ],
+          [
+            "\\)",
+            "\uff09",
+            0,
+            ""
+          ]
+        ]
+      },
+      "language": "<trg>"
+    },
+    {
+      "filter": "max_length",
+      "parameters": {
+        "MAXLENGTH": 150,
+        "MINLENGTH": 1
+      },
+      "language": null
+    },
+    {
+      "filter": "fix_wiki",
+      "parameters": {
+        "ALWAYS": false,
+        "FOOTNOTES": true,
+        "URLS": true,
+        "WIKILINKS": true,
+        "CODE": true,
+        "HEADINGS": true,
+        "REMOVEEMPTYLINES": true
+      },
+      "language": null
+    },
+    {
+      "filter": "alpha_ratio",
+      "parameters": {
+        "LANG1": "<src>",
+        "LANG2": "<trg>",
+        "SRCWORDRAT": 0.4,
+        "TRGWORDRAT": 0.0,
+        "SRCALPHARAT": 0.5,
+        "TRGALPHARAT": 0.0,
+        "DEBUG": false
+      },
+      "language": null
+    },
+    {
+      "filter": "url_mismatch",
+      "parameters": {
+        "DEBUG": false
+      },
+      "language": null
+    },
+    {
+      "filter": "fasttext_filter",
+      "parameters": {
+        "FASTTEXT_MODEL_TYPE": "large",
+        "LANG1": "<src>",
+        "LANG2": "<trg>"
+      },
+      "language": null
+    },
     {
       "filter": "regexp",
       "parameters": {