`docs/data-and-cleaning/datasets.md`

Data source | Prefix | Name examples | Type | Comments
--- | --- | --- | --- | ---
[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | parallel | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which dataset names and versions the links use.
[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | parallel | Official evaluation datasets available in the SacreBLEU tool. Recommended for use in the `datasets:test` config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module.
[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | parallel | Evaluation dataset from Facebook that supports 100 languages.
[NTREX-128](https://github.com/MicrosoftTranslator/NTREX) | ntrex | devtest | parallel | Evaluation dataset from Microsoft that supports 128 languages.
Custom parallel | url | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | parallel | A custom zst-compressed parallel dataset, for example one uploaded to GCS. Each language pair is split into two files; `[LANG]` is replaced with the `to` and `from` language codes.
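The prefixes in the table above are combined with a dataset name when listing datasets in the training config. A minimal sketch of a `datasets` section, assuming the `prefix_name` joining convention and an en-ru language pair (the specific dataset names here are illustrative, not a recommendation):

```yaml
# Hypothetical config fragment: each entry is <prefix>_<name>,
# using prefixes from the table above.
datasets:
  train:
    - opus_ParaCrawl/v7.1          # OPUS dataset, Moses-column name and version
  devtest:
    - flores_dev                   # Flores dev split for validation
  test:
    - sacrebleu_wmt20              # official evaluation set via SacreBLEU
    - ntrex_devtest                # NTREX-128 evaluation set
    - url_https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst
```

The `[LANG]` placeholder in the custom URL entry is expanded to the `to` and `from` language codes at download time, so the single entry covers both sides of the pair.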