
CERT-ERNIE-Pytorch

This repository contains code built on Hugging Face's BERT implementation and uses the ERNIE weights that nghuyong2019 converted into Hugging Face's format.

Getting Started

You can download the ERNIE model that nghuyong2019 has already converted, load it directly through Hugging Face's transformers, or convert it yourself with nghuyong2019's code.
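For instance, here is a minimal sketch of loading the converted checkpoint directly through transformers, using the nghuyong/ernie-2.0-large-en hub name that the fine-tuning examples below rely on:

from transformers import AutoModel, AutoTokenizer

# Pull the converted ERNIE 2.0 checkpoint straight from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-2.0-large-en")
model = AutoModel.from_pretrained("nghuyong/ernie-2.0-large-en")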

First, please install all the packages needed for this task:

pip install -r requirements.txt

Contrastive Self-supervised Learning (CSSL) Pretraining

Data Augmentation

If the language in your task dataset is English, you can augment each input sentence x in the target task by first using an English-to-German machine translation model to translate x to y, and then using a German-to-English translation model to translate y back to x'. The sentence x' is regarded as an augmented version of x. Similarly, you can use an English-to-Chinese machine translation model and a Chinese-to-English machine translation model to obtain another augmented sentence x''.

Then save your augmented data into the augmented_data folder.
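As an illustration, here is a minimal back-translation sketch built on the Helsinki-NLP MarianMT models from the Hugging Face hub; the model names and the round_trip helper are our own illustrative choices, not part of this repository:

from transformers import MarianMTModel, MarianTokenizer

def round_trip(sentences, src_to_tgt, tgt_to_src):
    """Translate a batch of sentences to a pivot language and back."""
    for name in (src_to_tgt, tgt_to_src):
        tokenizer = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch)
        sentences = [tokenizer.decode(g, skip_special_tokens=True) for g in generated]
    return sentences

# English -> German -> English round trip produces the augmented sentence x'.
augmented = round_trip(
    ["The quick brown fox jumps over the lazy dog."],
    "Helsinki-NLP/opus-mt-en-de",
    "Helsinki-NLP/opus-mt-de-en",
)

Swapping in Helsinki-NLP/opus-mt-en-zh and Helsinki-NLP/opus-mt-zh-en gives the second augmented sentence x''.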

MoCo Task

We use Momentum Contrast (MoCo) to implement CSSL. The steps are as follows.

  • Build a new folder called moco_model to store your pretrained model:
mkdir moco_model
  • Change the number of negative samples in line 86 of MOCO.py.
    Notice: the amount of augmented data (negative samples) must be an integer multiple of batch_size.
  • Set your own parameters and run MOCO.py to start the pretraining process:
python MOCO.py \
  --lr 0.0001 \
  --batch-size 32 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed \
  --world-size 1 \
  --rank 0 
  • After training, you can extract encoder_q from the whole model with
python trans.py
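The exact checkpoint layout depends on what MOCO.py saves; as a rough sketch, assuming a standard MoCo-style checkpoint whose state_dict prefixes the query encoder's weights with module.encoder_q., the extraction could look like:

import torch
from collections import OrderedDict

# Hypothetical file names and key prefix: adjust to what MOCO.py actually writes.
checkpoint = torch.load("moco_model/checkpoint.pth.tar", map_location="cpu")
prefix = "module.encoder_q."
encoder_q = OrderedDict(
    (key[len(prefix):], value)
    for key, value in checkpoint["state_dict"].items()
    if key.startswith(prefix)
)
torch.save(encoder_q, "moco_model/moco.p")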

P.S. If you want to use an encoder other than ERNIE 2.0, you can change the encoder name or path in lines 26~38 of builder.py, line 21 of MOCO.py, and line 16 of trans.py to any model Hugging Face provides or any model that fits Hugging Face's format.

Fine-tune on GLUE tasks

The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.

Before running any of these GLUE tasks you should download the GLUE data by running this script and unpack it to some directory $GLUE_DIR.

You may also need to set the following environment variables:

  • GLUE_DIR: This should point to the location of the GLUE data.
  • TASK_NAME: Task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
  • STATE_DICT: An optional state dictionary (a collections.OrderedDict object) to use instead of the Google pre-trained models.
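How run_glue.py consumes STATE_DICT is defined by this repository's code; as a general sketch, transformers' from_pretrained accepts a state_dict argument, so loading the MoCo-pretrained weights could look like:

import torch
from transformers import AutoModelForSequenceClassification

# Illustrative only: run_glue.py's actual loading logic may differ.
state_dict = torch.load("./moco_model/moco.p", map_location="cpu")
model = AutoModelForSequenceClassification.from_pretrained(
    "nghuyong/ernie-2.0-large-en",
    state_dict=state_dict,
    num_labels=2,  # e.g. RTE is a two-class task
)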

Example 1: Fine-tuning from MOCO model

export GLUE_DIR=./glue_data
export STATE_DICT=./moco_model/moco.p
export TASK_NAME=RTE

python run_glue.py \
    --model_name_or_path nghuyong/ernie-2.0-large-en \
    --state_dict $STATE_DICT \
    --task_name $TASK_NAME \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --do_train \
    --do_eval \
    --do_predict \
    --evaluate_during_training \
    --per_device_train_batch_size 16 \
    --weight_decay 0 \
    --learning_rate 3e-5 \
    --num_train_epochs 5.0 \
    --save_steps 156 \
    --warmup_steps 78 \
    --logging_steps 39 \
    --eval_steps 39 \
    --seed 33333 \
    --output_dir /tmp/$TASK_NAME/

Example 2: Fine-tuning from ERNIE model

export GLUE_DIR=./glue_data
export TASK_NAME=RTE

python run_glue.py \
    --model_name_or_path nghuyong/ernie-2.0-large-en \
    --task_name $TASK_NAME \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --do_train \
    --do_eval \
    --do_predict \
    --evaluate_during_training \
    --per_device_train_batch_size 16 \
    --weight_decay 0 \
    --learning_rate 3e-5 \
    --num_train_epochs 5.0 \
    --save_steps 156 \
    --warmup_steps 78 \
    --logging_steps 39 \
    --eval_steps 39 \
    --seed 199733 \
    --output_dir /tmp/$TASK_NAME/

Example 3: Fine-tuning from BERT model

export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_device_eval_batch_size=8   \
    --per_device_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/


The dev set results will be written to the text file eval_results.txt in the specified output_dir. For MNLI, since there are two separate dev sets (matched and mismatched), a separate output folder '/tmp/MNLI-MM/' will be created in addition to '/tmp/MNLI/'.

You can take run_cert_rte.sh as an example.
