Effective communication across languages is essential in today's globalized world. Yet low-resource languages face substantial barriers because so little data is available for them. They have limited access to automatic translation services, and some are not supported by leading online translation platforms such as Google Translate. Large, high-quality datasets, which traditional deep-learning approaches depend on for training, are also hard to obtain for these languages.
This project addresses a critical issue in machine translation: data scarcity for low-resource languages. We aim to evaluate and compare different data augmentation strategies for improving machine translation of these underrepresented languages.
Use Python 3.10.13 in your preferred environment and run:
pip install -r requirements.txt
We use Accelerate to leverage hardware accelerators for mixed-precision training and gradient accumulation. For logging, we use Weights & Biases.
cd src
Create the Accelerate config:
accelerate config
Run training:
accelerate launch -m training
We use Hydra to manage hyperparameters. For example, to change the languages the model trains on, you can run:
accelerate launch training.py data.l1=de data.l2=en
The same goes for all other parameters defined in src/conf/.
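To make the override syntax concrete, here is a minimal sketch of how dotted key=value arguments such as data.l1=de map onto a nested configuration. It is not Hydra's implementation (Hydra also handles typing, config groups, and validation), and the default values shown are hypothetical placeholders rather than the contents of src/conf/.

```python
from copy import deepcopy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key.path=value` overrides applied."""
    updated = deepcopy(config)
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = updated
        # Walk (or create) the nested dictionaries named by the key path.
        for name in parents:
            if not isinstance(node.get(name), dict):
                node[name] = {}
            node = node[name]
        # Treat the literal string "null" as "no value", as in YAML.
        node[leaf] = None if value == "null" else value
    return updated


# Hypothetical defaults, for illustration only.
defaults = {"data": {"l1": "fr", "l2": "en"}}
config = apply_overrides(defaults, ["data.l1=de", "data.l2=en"])
print(config["data"])
```

The original defaults are left untouched, so the same base configuration can be reused across runs with different overrides.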
We use W&B sweeps for hyperparameter tuning. Run
wandb sweep conf/sweep/<sweep_name>.yaml
which produces a command like this:
wandb agent ...
Running this command starts the tuning. The same agent command can be run on multiple machines at once to parallelize the tuning.
We use WMT 2014, a collection of datasets used in the shared tasks of the Ninth Workshop on Statistical Machine Translation. The English-to-German and English-to-French sets are among the most commonly used WMT 2014 datasets for machine translation.
We implement three text augmentation methods to expand our dataset in a low-resource setting. These methods aim to increase the diversity of the data and improve model generalization. To configure which augmentation method to use, specify the augmenter by its name: synonym, backtrans, antonym, or null if no augmentation is needed.
- Synonym Replacement Augmentation: This replaces words in the text with their synonyms while preserving the original meaning. The technique is inspired by the work on Character-level Convolutional Networks for Text. To specify the language for synonym replacement, include the lang argument (lang1 or lang2, as in the examples below) followed by the language code (e.g., eng for English).
  - To augment French text with synonyms:
    augmenter=synonym augmenter.lang1=fra
  - To augment English text with synonyms:
    augmenter=synonym augmenter.lang2=eng
- Back Translation Augmentation: Back translation translates the text into another language, which need not be the second language of the model, and then back into the original language. This can introduce variations in the text while retaining its semantic meaning. The method was first proposed in Improving Neural Machine Translation Models with Monolingual Data. To specify the language pair for back translation, use the from_model and to_model arguments followed by the corresponding model names. If one of the model names is set to null, only the other language will be augmented.
  - To augment French text using the specified translation models:
    augmenter=backtrans augmenter.from_model1=Helsinki-NLP/opus-mt-fr-en augmenter.to_model1=Helsinki-NLP/opus-mt-en-fr
  - To augment English text using the specified translation models:
    augmenter=backtrans augmenter.from_model2=facebook/wmt19-en-de augmenter.to_model2=facebook/wmt19-de-en
- Antonym Replacement Augmentation: This method replaces words in the text with their antonyms, altering the meaning while preserving the structure of the sentence. To specify the language for antonym replacement, include the lang argument (lang1 or lang2, as in the examples below) with the language code (e.g., eng for English). If one of the languages is set to null, only the other language will be augmented.
  - To augment French text with antonyms:
    augmenter=antonym augmenter.lang1=fra
  - To augment English text with antonyms:
    augmenter=antonym augmenter.lang2=eng
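The three methods and the null option can be illustrated with a small, self-contained sketch. It is not the code in this repository: the word lists are toy lexicons invented for the example, the two translators are stand-in lookups rather than real translation models, and make_augmenter and replace_words are hypothetical helper names.

```python
import random

# Toy lexicons, invented for illustration only.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad"]}
ANTONYMS = {"quick": ["slow"], "happy": ["sad"]}


def replace_words(sentence, lexicon, rng, prob=1.0):
    """Swap each word found in `lexicon` for one of its entries with probability `prob`."""
    out = []
    for word in sentence.split():
        options = lexicon.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def back_translate(sentence, to_pivot, from_pivot):
    """Round-trip a sentence through a pivot language using two translation callables."""
    return from_pivot(to_pivot(sentence))


def make_augmenter(name, rng=None, to_pivot=None, from_pivot=None):
    """Return a sentence -> sentence function for the given augmenter name."""
    rng = rng or random.Random(0)
    if name is None or name == "null":
        return lambda s: s  # no augmentation
    if name == "synonym":
        return lambda s: replace_words(s, SYNONYMS, rng)
    if name == "antonym":
        return lambda s: replace_words(s, ANTONYMS, rng)
    if name == "backtrans":
        return lambda s: back_translate(s, to_pivot, from_pivot)
    raise ValueError(f"unknown augmenter: {name}")


# Stand-in translators: simple lookups instead of real translation models.
fr_to_en = {"bonjour le monde": "hello world"}.get
en_to_fr = {"hello world": "salut le monde"}.get

print(make_augmenter("synonym")("a quick happy dog"))
print(make_augmenter("antonym")("a quick happy dog"))  # a slow sad dog
print(make_augmenter("backtrans", to_pivot=fr_to_en, from_pivot=en_to_fr)("bonjour le monde"))
```

Note that the antonym output reverses the meaning of the sentence, so an antonym-augmented source sentence no longer matches an unchanged target sentence; the sketch only shows the word-level mechanics.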