This repository builds on the Hugging Face implementation of BERT and uses the ERNIE weights converted to the Hugging Face format by nghuyong2019. You can download the converted ERNIE model directly, load it through Hugging Face's `transformers` library, or convert it yourself with nghuyong2019's conversion code.
First, install all the packages required for this task:

```
pip install -r requirements.txt
```
If your task dataset is in English, you can augment each input sentence x as follows: first use an English-to-German machine translation model to translate x into y, then use a German-to-English translation model to translate y into x'. The result x' is regarded as an augmented version of x. Similarly, you can use an English-to-Chinese and a Chinese-to-English machine translation model to obtain another augmented sentence x''.
Then, save your augmented data into the `augmented_data` folder.
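The back-translation procedure above can be sketched as follows. The `back_translate` helper and the toy `fake_en_de`/`fake_de_en` functions are illustrative stand-ins, not part of this repository: in practice you would plug in real machine translation models (for example MarianMT checkpoints from the Hugging Face hub) in place of the toy callables.

```python
# Sketch of back-translation augmentation: translate each sentence to a
# pivot language and back, and treat the round-tripped text as an
# augmented version of the original.
from typing import Callable, List

def back_translate(sentences: List[str],
                   forward: Callable[[str], str],
                   backward: Callable[[str], str]) -> List[str]:
    """Translate each sentence to a pivot language and back again."""
    return [backward(forward(s)) for s in sentences]

# Toy stand-ins that tag and untag the text instead of really
# translating it; replace them with real en->de / de->en MT models.
fake_en_de = lambda s: f"<de>{s}</de>"
fake_de_en = lambda s: s.replace("<de>", "").replace("</de>", "").lower()

augmented = back_translate(["The Movie Was Great"], fake_en_de, fake_de_en)
```

With real translation models, the round trip produces a paraphrase rather than the lowercased copy the toy functions return here.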
We use Momentum Contrast (MoCo) to implement CSSL. The steps are as follows:
- Build a new folder called `moco_model` to store your pretrained model: `mkdir moco_model`
- Change the number of negative samples in line 86 of `MOCO.py`.
Notice: the amount of augmented data (negative samples) must be an integer multiple of `batch_size`.
- Set your own parameters and run `MOCO.py` to start the pretraining process:
```
python MOCO.py \
    --lr 0.0001 \
    --batch-size 32 \
    --dist-url 'tcp://localhost:10001' \
    --multiprocessing-distributed \
    --world-size 1 \
    --rank 0
```
- After training, you can extract `encoder_q` from the whole model with `python trans.py`.
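Conceptually, extracting `encoder_q` amounts to keeping only the query-encoder weights from the MoCo checkpoint and stripping their prefix so they can be loaded into a plain encoder. The sketch below assumes the standard MoCo key layout (`encoder_q.*`, `encoder_k.*`, `queue`); the repository's exact key names may differ.

```python
# Sketch: filter a MoCo checkpoint down to the query encoder's weights,
# dropping the momentum encoder (encoder_k) and the negative queue.
from collections import OrderedDict

def extract_encoder_q(state_dict: "OrderedDict") -> "OrderedDict":
    """Keep only keys starting with 'encoder_q.' and strip the prefix."""
    prefix = "encoder_q."
    out = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            out[key[len(prefix):]] = value
    return out

# Miniature fake checkpoint standing in for torch.load("moco.p")[...]:
ckpt = OrderedDict([
    ("encoder_q.embeddings.weight", "q_emb"),
    ("encoder_k.embeddings.weight", "k_emb"),
    ("queue", "negative_queue"),
])
encoder_q = extract_encoder_q(ckpt)  # only the query-encoder weights remain
```

The resulting state dict is what you would pass to the downstream model (for example via the `STATE_DICT` variable used in fine-tuning below).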
P.S. If you want to use an encoder other than ERNIE 2.0, change the encoder name or path in lines 26 to 38 of `builder.py`, line 21 of `MOCO.py`, and line 16 of `trans.py` to any model that Hugging Face provides or that follows the Hugging Face format.
The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.
Before running any of these GLUE tasks, download the GLUE data by running this script and unpack it to some directory `$GLUE_DIR`.
You may also need to set the following environment variables:
- `GLUE_DIR`: points to the location of the GLUE data.
- `TASK_NAME`: one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
- `STATE_DICT`: an optional state dictionary (a `collections.OrderedDict` object) to use instead of the Google pre-trained models.
Example 1: Fine-tuning from the MoCo model
```
export GLUE_DIR=./glue_data
export STATE_DICT=./moco_model/moco.p
export TASK_NAME=RTE

python run_glue.py \
    --model_name_or_path nghuyong/ernie-2.0-large-en \
    --state_dict $STATE_DICT \
    --task_name $TASK_NAME \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --do_train \
    --do_eval \
    --do_predict \
    --evaluate_during_training \
    --per_device_train_batch_size 16 \
    --weight_decay 0 \
    --learning_rate 3e-5 \
    --num_train_epochs 5.0 \
    --save_steps 156 \
    --warmup_steps 78 \
    --logging_steps 39 \
    --eval_steps 39 \
    --seed 33333 \
    --output_dir /tmp/$TASK_NAME/
```
Example 2: Fine-tuning from the ERNIE model
```
export GLUE_DIR=./glue_data
export TASK_NAME=RTE

python run_glue.py \
    --model_name_or_path nghuyong/ernie-2.0-large-en \
    --task_name $TASK_NAME \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --do_train \
    --do_eval \
    --do_predict \
    --evaluate_during_training \
    --per_device_train_batch_size 16 \
    --weight_decay 0 \
    --learning_rate 3e-5 \
    --num_train_epochs 5.0 \
    --save_steps 156 \
    --warmup_steps 78 \
    --logging_steps 39 \
    --eval_steps 39 \
    --seed 199733 \
    --output_dir /tmp/$TASK_NAME/
```
Example 3: Fine-tuning from the BERT model
```
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_glue.py \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_device_eval_batch_size=8 \
    --per_device_train_batch_size=8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/
```
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be written to the text file `eval_results.txt` in the specified `output_dir`. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.
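If you want to read the metrics back programmatically, a minimal parser might look like this. It assumes the usual `key = value` format (one metric per line) that `run_glue.py` writes; the exact metric names depend on the task and script version.

```python
# Sketch: parse an eval_results.txt-style file into a dict of floats,
# assuming each metric is written as "name = value" on its own line.
def parse_eval_results(text: str) -> dict:
    results = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            results[key.strip()] = float(value.strip())
    return results

# Example with the kind of content the file typically contains:
sample = "acc = 0.7184\nloss = 0.5432\n"
metrics = parse_eval_results(sample)
```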
You can take `run_cert_rte.sh` as an example.