This directory contains the PyTorch ML reference implementation for the GPT-2 model.
GPT-2 is a decoder-only transformer-based model designed by OpenAI. It uses a stack of transformer blocks, each consisting of dot-product self-attention followed by a multi-layer perceptron (MLP) feed-forward network.
- Autoregressive language modeling: The model predicts the next token from the prior context at every position in the sequence (compare to BERT, which uses an autoencoding loss that predicts masked positions from the rest of the unmasked sequence). Autoregressive language modeling requires masking the future positions in the sequence.
- Layer norms in the transformer blocks are located inside the residual connections, ahead of the self-attention or feed-forward network (compare to BERT and GPT, which have layer norms outside of the residual block). The GPT-2 layer norm location allows the transformer blocks to elongate token embeddings through the depth of the model (i.e., potentially more representational capacity).
- GPT-2 does not add any auxiliary losses (compare to BERT, which uses next sentence prediction (NSP), or ALBERT which uses sentence-order prediction (SOP)).
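The first two architectural points above (causal masking of future positions, and layer norms inside the residual branch) can be sketched as follows. This is a minimal illustrative sketch, not the actual `Gpt2Model` implementation; the class and function names here are hypothetical.

```python
import torch
import torch.nn as nn

def causal_mask(seq_len: int) -> torch.Tensor:
    # Upper-triangular -inf mask: position i may only attend to positions
    # j <= i, which is what makes the model autoregressive.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

class PreNormBlock(nn.Module):
    """GPT-2-style transformer block: LayerNorm sits *inside* the residual
    branch, before each sublayer (post-norm models like BERT and GPT instead
    normalize after the residual addition)."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):  # x: (batch, seq_len, hidden_size)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal_mask(x.shape[1]))
        x = x + a                      # the residual stream is never normalized,
        x = x + self.mlp(self.ln2(x))  # so token embeddings can grow with depth
        return x

out = PreNormBlock(hidden_size=64, num_heads=4)(torch.randn(2, 10, 64))
```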
GPT-2, with 48 layers and a total of 1542M parameters, has more than an order of magnitude more parameters than GPT.
Reference: Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners.
In order to run any of the models in this directory, you must go through the following steps:
- Download and preprocess the data (see Prepare the data for more details)
- Run training for your desired model (see Run pre-training)
- `configs/`: YAML configuration files.
- `model.py`: The entry point to the model. Defines `Gpt2Model`, which supports GPT-2.
- `run.py`: Training script. Performs training and validation.
- `utils.py`: Miscellaneous scripts to parse the `params` dictionary from the YAML files.
You need to download the raw OWT data following these instructions and create preprocessed dataset files using `preprocess_data.py`.
If you want to use your own data loader with this example code, this section describes the input data format expected by the `Gpt2Model` class defined in `model.py`.
When you create your own custom GPT input function, you must ensure that your GPT input function produces a features dictionary as described in this section.
We recommend using `GptHDF5DataProcessor` for the input function of the GPT-2 model (for performance reasons). The instructions to create an HDF5 dataset can be found here: Data Preprocessing. We also support the following data processors:
- `HuggingFaceDataProcessorEli5`: An example of using the HuggingFace Eli5 dataset (Map-Style).
- `HuggingFaceIterableDataProcessorEli5`: An example of using the HuggingFace Eli5 dataset (Iterable).
- `DummyDataProcessor`: An example of using an arbitrary Map-Style PyTorch dataset.
- `DummyIterableDataProcessor`: An example of using an arbitrary Iterable PyTorch dataset.
NOTE: More information on using HuggingFace datasets can be found in this document: Using HuggingFace datasets for auto-regressive LM
The features dictionary has the following keys and values:
- `input_ids`: Input token IDs, padded with `0` to `max_sequence_length`.
  - Shape: `(batch_size, max_sequence_length)`
  - Type: `torch.int32`
- `attention_mask`: Mask for padded positions. Has values `0` on the padded positions and `1` elsewhere.
  - Shape: `(batch_size, max_sequence_length)`
  - Type: `torch.int32`
- `labels`: Labels for the language modeling pre-training task, padded with `0` to `max_sequence_length`.
  - Shape: `(batch_size, max_sequence_length)`
  - Type: `torch.int32`
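A custom input function producing this features dictionary might look like the following. This is a sketch only: `DummyGptDataset` and `MAX_SEQ_LEN` are hypothetical names, and the tokens are random dummy data rather than real preprocessed text.

```python
import torch
from torch.utils.data import Dataset, DataLoader

MAX_SEQ_LEN = 128  # should match train_input.max_sequence_length in the YAML config

class DummyGptDataset(Dataset):
    """Emits the features dictionary expected by Gpt2Model, with dummy tokens."""
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        seq_len = 100                  # pretend the real document has 100 tokens
        pad = MAX_SEQ_LEN - seq_len
        tokens = torch.randint(1, 50257, (seq_len,), dtype=torch.int32)
        zeros = torch.zeros(pad, dtype=torch.int32)
        return {
            "input_ids": torch.cat([tokens, zeros]),
            "attention_mask": torch.cat(
                [torch.ones(seq_len, dtype=torch.int32), zeros]),
            # In a real pipeline the labels would typically be the input
            # tokens shifted for next-token prediction; dummy data here.
            "labels": torch.cat([tokens, zeros]),
        }

loader = DataLoader(DummyGptDataset(), batch_size=8)
batch = next(iter(loader))  # batch["input_ids"] has shape (8, MAX_SEQ_LEN)
```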
IMPORTANT: See the following notes before proceeding further.
Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a pre-training run, make sure that in the YAML config file you are using:
- The `train_input.data_dir` parameter points to the correct dataset,
- The `train_input.max_sequence_length` parameter corresponds to the sequence length of the dataset, and
- The `model.max_position_embeddings` parameter corresponds to the maximum dimension of position embeddings.
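For example, the relevant fields of a config might look like this (illustrative values and placeholder path only; use the values from the actual config files in `configs/`):

```yaml
train_input:
    data_dir: "/path/to/preprocessed/dataset"  # placeholder path
    max_sequence_length: 1024                  # must match the preprocessed dataset
model:
    max_position_embeddings: 1024              # >= max_sequence_length
```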
YAML config files: Details on the configs for this model can be found in Configs included for this model
In the following example run commands, we use /path/to/yaml, /path/to/model_dir, and train as placeholders for user supplied inputs.
- `/path/to/yaml` is a path to the YAML config file with model parameters, such as one of the configurations described in Configs included for this model.
- `/path/to/model_dir` is a path to the directory where you would like to store the logs and other artifacts of the run.
- `--mode` specifies the desired mode to run the model in. Change to `--mode eval` to run in eval mode.
Please follow the instructions on our quickstart in the Developer Docs.
If running on a CPU or GPU, activate the environment from Python GPU Environment setup, and simply run:
```shell
python run.py {CPU,GPU} --mode train --params /path/to/yaml --model_dir /path/to/model_dir
```
In order to train the model, you need to provide a YAML config file. Some popular YAML config files are listed below for reference. Also, feel free to create your own following these examples:
The configs below are meant to be run in Pipeline mode.
- `params_gpt2_small.yaml`: The standard gpt2-base config with `hidden_size=768`, `num_hidden_layers=12`, `num_heads=12`, for Weight Streaming mode.
- `params_gpt2_medium.yaml`: A 345M parameter model with the standard gpt2-medium config with `hidden_size=1024`, `num_hidden_layers=24`, `num_heads=16`.
- `params_gpt2_large.yaml`: A 774M parameter model with the standard gpt2-large config with `hidden_size=1280`, `num_hidden_layers=36`, `num_heads=20`.
- `params_gpt2_xl.yaml`: A 1.3B parameter model with the standard gpt2-xl config with `hidden_size=1600`, `num_hidden_layers=48`, `num_heads=16`.
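As a rough sanity check on these sizes, the parameter count can be estimated from `hidden_size` and `num_hidden_layers`. This is a back-of-the-envelope sketch that assumes GPT-2's vocabulary size of 50257 and ignores biases and layer-norm parameters; `approx_gpt2_params` is a hypothetical helper, not part of this repository.

```python
def approx_gpt2_params(hidden_size, num_hidden_layers,
                       vocab_size=50257, max_position_embeddings=1024):
    # Per block: ~4*h^2 weights for attention (Q, K, V, output projection)
    # plus ~8*h^2 for the 4x-expansion MLP, i.e. ~12*h^2 per layer.
    per_layer = 12 * hidden_size ** 2
    # Token embedding table plus learned position embeddings.
    embeddings = (vocab_size + max_position_embeddings) * hidden_size
    return num_hidden_layers * per_layer + embeddings

print(approx_gpt2_params(768, 12))  # ~124M, the standard gpt2-base size
```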
