This is the repository for a Japanese ELECTRA model.
We trained the ELECTRA model based on the Google/ELECTRA code and followed the pipeline from yoheikikuta to switch to SentencePiece tokenization. Big thanks to the Google team and yoheikikuta for their efforts.
This README combines the documentation from Google and yoheikikuta with our own explanations.
We provide a pretrained ELECTRA model and a trained SentencePiece model for Japanese text.
The training data is the Japanese Wikipedia corpus from Wikimedia Downloads.
Please download all of the ELECTRA files from the following Google Drive into the data/models/electra_small/ directory, then move the vocab file to data/.
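To confirm the downloaded files are usable, here is a minimal sketch that loads the SentencePiece model and tokenizes a sentence (the `data/wiki-ja.model` path is an assumption based on the layout described below; adjust it to wherever you placed the model):

```python
import sentencepiece as spm

# Load the downloaded SentencePiece model (path is an assumption; adjust to
# wherever you placed wiki-ja.model).
sp = spm.SentencePieceProcessor(model_file="data/wiki-ja.model")

text = "日本語のテキストを分割します。"
print(sp.encode(text, out_type=str))  # subword pieces
print(sp.encode(text))                # corresponding token ids
```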
The training loss curve is shown below:
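For reference, this curve tracks the combined pre-training objective from the original ELECTRA paper (Clark et al., 2020): the generator's masked-language-modeling loss plus a weighted discriminator (replaced-token-detection) loss:

$$\min_{\theta_G,\,\theta_D} \sum_{x \in \mathcal{X}} \mathcal{L}_{\text{MLM}}(x, \theta_G) + \lambda\,\mathcal{L}_{\text{Disc}}(x, \theta_D)$$

where the paper uses λ = 50.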
These instructions pre-train a small ELECTRA model (12 layers, 256-dimensional hidden size). Pre-training ELECTRA-small took 6 days for 1M steps on a single Radeon VII 16GB GPU. All scripts for pretraining from scratch are provided; follow the instructions below.
Please download the jawiki data from this link and extract it. This is also the dataset we used to pretrain ELECTRA.
- Run `python pretrain/train-sentencepiece.py` to train the SentencePiece model (a sketch of what this step does is given after this list).
- Run `python extract_wiki_data.py` to extract the dataset.
- Place the vocab file and SentencePiece model at `data/wiki-ja.vocab` and `data/wiki-ja.model`.
- Run `python build_japanesewiki_pretrain_data.py --data-dir data/ --model-dir data/ --num-processes 4`. It pre-processes/tokenizes the data and outputs examples as tfrecord files under `data/pretrain_tfrecords` (a quick sanity check is sketched below).
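As a rough sketch of what `pretrain/train-sentencepiece.py` does (the input path and trainer options here are assumptions; see the script itself for the authoritative settings), the SentencePiece training step boils down to:

```python
import sentencepiece as spm

# Hypothetical sketch: train a SentencePiece model on the extracted corpus.
# The input path is an assumption; vocab_size matches the pretraining
# hparams used further below.
spm.SentencePieceTrainer.train(
    input="data/wiki-corpus.txt",  # extracted Japanese Wikipedia text (assumed path)
    model_prefix="data/wiki-ja",   # writes data/wiki-ja.model and data/wiki-ja.vocab
    vocab_size=32000,
)
```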
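After `build_japanesewiki_pretrain_data.py` finishes, a quick way to sanity-check the generated tfrecords (a sketch, assuming the output path above and an eager TF2 environment):

```python
import tensorflow as tf

# Read one serialized example from the generated pretraining data and list
# its feature keys.
files = tf.io.gfile.glob("data/pretrain_tfrecords/*")
for raw in tf.data.TFRecordDataset(files).take(1):
    example = tf.train.Example.FromString(raw.numpy())
    print(list(example.features.feature.keys()))
```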
This step is exactly the same as in the Google/ELECTRA documentation. You can refer to the script below:
```
python run_pretraining.py \
  --data-dir data \
  --model-name electra_small_japanese \
  --hparams '{"debug": false, "do_train": true, "do_eval": false, "vocab_file": "data/vocab.txt", "model_sentencepiece_path": "model_sentence_piece/wiki-ja.model", "model_size": "small", "vocab_size": 32000, "max_seq_length": 512, "num_train_steps": 1000000, "train_batch_size": 64}'
```
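Once pretraining finishes (or to inspect an intermediate checkpoint), the saved variables can be listed with standard TensorFlow utilities. A sketch, assuming the output directory is derived from `--model-name` above:

```python
import tensorflow as tf

# Output directory is an assumption based on --model-name above.
ckpt = tf.train.latest_checkpoint("data/models/electra_small_japanese")
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)
```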
