Welcome to gpt2-diy — a personal, from-scratch reproduction of GPT-2, inspired by Andrej Karpathy’s video Let's Reproduce GPT-2.
This project aims to deeply understand transformer-based language models by rebuilding GPT-2 independently, referencing original papers and consulting existing codebases only lightly when necessary.
The best way to learn is by doing.
gpt2-diy is a hands-on journey to build skill and acquire deep knowledge of modern language models. By independently reproducing GPT-2, the project aims to build a practical understanding of architectures, training dynamics, optimization strategies, and scaling laws.
- Let's Reproduce GPT-2 by Andrej Karpathy
- Attention is All You Need (Vaswani et al., 2017)
- Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) — GPT-1
- Language Models are Unsupervised Multitask Learners (Radford et al., 2019) — GPT-2
- Language Models are Few-Shot Learners (Brown et al., 2020) — GPT-3
A rough outline based on the "Let's Reproduce GPT-2" methodology:
- **Skim Papers**: Collect factual information about architecture design and training strategies.
- **Load OpenAI Model Weights**: Use Hugging Face Transformers to load the original GPT-2 weights, providing a reference for our own implementation.
- **Implement Model in PyTorch**: Write the model relying only on the papers. Use the Hugging Face repo solely to copy the correct weight names for compatibility.
- **Model Loading Method (`from_pretrained`)**: Add a method to load Hugging Face's GPT-2 weights into the custom model.
- **Implement Generate Functionality**: Add text generation to validate the loaded model weights (see the first sketch after this list).
- **Tiny Shakespeare Data Preparation**
- **Compute Loss (at Initialization as well)**
- **Optimization: Check on One Batch**
- **Data Loader Lite**
- **Paper Adjustments**
  - Weight tying between the embedding and output projection layers.
  - Correct weight initialization.
- **Speed Up Training** (see the training-step sketch after this list)
  - Enable TensorFloat32 (TF32) precision.
  - Use bf16 where available (Ampere+ GPUs).
  - Apply `torch.compile` for compilation speedup.
  - Integrate Flash Attention.
  - Tune batch sizes for hardware efficiency.
- **Optimization Settings from GPT-3**
  - Gradient accumulation to emulate a large effective batch size (~0.5M tokens).
- **Distributed Data Parallel (DDP)** (see the DDP sketch after this list)
- **Switch to a "Real" Dataset**: Move training from Tiny Shakespeare to a FineWeb sample.
- **Evaluation and Logging**: Set up evaluation tasks like HellaSwag for zero-shot benchmarks.
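The weight-loading and generation steps can be validated end to end against the reference model before the custom implementation exists. A minimal sketch, assuming the `torch` and `transformers` packages are installed; the prompt, seed, and top-k value are arbitrary choices:

```python
# Sanity-check the reference GPT-2 (124M) weights by sampling a continuation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

idx = tokenizer.encode("Hello, I'm a language model,", return_tensors="pt")  # (1, T)

# Manual top-k sampling loop (instead of model.generate) so the exact same loop
# can later validate the custom from_pretrained implementation.
torch.manual_seed(42)
for _ in range(30):
    with torch.no_grad():
        logits = model(idx).logits                          # (1, T, vocab_size)
    probs = torch.softmax(logits[:, -1, :], dim=-1)         # last position only
    topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)    # keep the top 50 tokens
    next_tok = topk_idx.gather(-1, torch.multinomial(topk_probs, 1))
    idx = torch.cat([idx, next_tok], dim=1)

print(tokenizer.decode(idx[0]))
```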
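The speed-up and GPT-3 batch-size items mostly reduce to a few PyTorch switches around the training step. A rough sketch of one optimizer step with gradient accumulation, assuming a CUDA GPU and assuming the custom `model`, `optimizer`, and `train_loader` (with a `next_batch()` method) are defined elsewhere; `model(x, y)` returning `(logits, loss)` is this project's convention, not a PyTorch API:

```python
import torch

torch.set_float32_matmul_precision("high")   # enable TF32 matmuls on Ampere+ GPUs
model = torch.compile(model)                 # graph capture / kernel fusion

grad_accum_steps = 32  # placeholder: chosen so B * T * grad_accum_steps is ~0.5M tokens
optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    x, y = x.to("cuda"), y.to("cuda")
    # bf16 autocast where supported; using F.scaled_dot_product_attention inside
    # the model lets PyTorch dispatch to a Flash Attention kernel when available.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)
    # divide so the accumulated gradient equals the mean over the full effective batch
    (loss / grad_accum_steps).backward()
optimizer.step()
```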
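For the DDP step, the standard `torchrun`-launched setup is roughly the following sketch; `GPT` and `GPTConfig` stand in for the custom model defined elsewhere, and `torchrun` is assumed to set the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
ddp_rank = int(os.environ["RANK"])              # global rank, used to shard the data
ddp_local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
device = f"cuda:{ddp_local_rank}"
torch.cuda.set_device(device)

model = GPT(GPTConfig()).to(device)  # placeholder: the custom model from the steps above
model = DDP(model, device_ids=[ddp_local_rank])
# Gradients are averaged across ranks automatically during backward();
# each rank should iterate over a disjoint shard of the dataset.

# ... training loop ...
dist.destroy_process_group()
```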
From the GPT-2 paper:
- GPT-2 is trained on a new dataset of millions of web pages (WebText).
- The capacity of the language model is essential to the success of zero-shot task transfer.
- This points toward building language processing systems that learn to perform tasks from their naturally occurring demonstrations.
- Data: a web scrape that emphasizes document quality and makes no assumptions about specific downstream tasks.
- The architecture largely follows the original Transformer and GPT-1.
From the GPT-1 paper:
- 12-layer decoder-only Transformer with masked self-attention heads.
- 768-dimensional states and 12 attention heads, so the per-head size is 64; the compatibility function (q @ k) and the weighted sum of values (att @ v) are computed for all heads independently and in parallel (see the sketch after this list).
- 3072-dimensional (768 × 4) inner state in the position-wise feed-forward networks.
- Weight initialization: N(0, 0.02).
- Residual, embedding, and attention dropout with p = 0.1.
- Decoupled (AdamW-style) L2 regularization with w = 0.01 on all non-bias and non-gain weights.
- GELU non-linearity.
- Learned position embeddings.
- Adam optimizer with a max learning rate of 2.5e-4.
- The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.
- 100 epochs.
- Batch size 64.
- Block (context) size 512.
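Taken together, these numbers map almost directly onto a model configuration and the attention arithmetic. A minimal sketch, not the final implementation; the class and field names below are this project's own choices, and the defaults use GPT-2's vocabulary and block size (noted in the modifications list that follows):

```python
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    block_size: int = 1024   # context length (GPT-1 used 512; GPT-2 uses 1024)
    vocab_size: int = 50257  # GPT-2 vocabulary (see the modifications below)
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768        # per-head size = 768 / 12 = 64
    dropout: float = 0.1

class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # q, k, v projections
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_size): all heads run independently, in parallel
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # compatibility function q @ k
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))             # causal (masked) attention
        att = F.softmax(att, dim=-1)
        y = att @ v                                              # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)  # 3072-dim inner state
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```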
GPT-2 modifications:
- Layer norm moved to the input of each sub-block (a block = attention + feed-forward), with an additional layer norm after the final self-attention block.
- Modified initialization: scale the weights of residual layers at initialization by a factor of 1/sqrt(N), where N is the number of residual layers (see the sketch after this list).
- Vocabulary expanded to 50,257 tokens.
- Block (context) size increased to 1024.
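A rough sketch of how these two structural changes look in code, reusing the `CausalSelfAttention` and `MLP` sketches above; treating N as 2 × n_layer (two residual additions per block) and tagging residual projections with a custom attribute are this project's interpretations, not something spelled out in the paper:

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm block: layer norm is applied at the input of each sub-block."""
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual path around pre-normed attention
        x = x + self.mlp(self.ln_2(x))   # residual path around pre-normed MLP
        return x

def init_weights(module, n_layer: int):
    """N(0, 0.02) init, with residual projection weights scaled by 1/sqrt(N)."""
    if isinstance(module, nn.Linear):
        std = 0.02
        # assumption: residual projections are tagged with a custom attribute
        if getattr(module, "is_residual_proj", False):
            std *= (2 * n_layer) ** -0.5   # N = 2 * n_layer residual layers
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# usage (hypothetical): model.apply(lambda m: init_weights(m, config.n_layer))
```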
From the GPT-3 paper:
- Adam with β1 = 0.9, β2 = 0.95, and ε = 10⁻⁸ (see the optimizer/schedule sketch after this list).
- Clip the global norm of the gradient at 1.0.
- Cosine decay of the learning rate down to 10% of its value over 260 billion tokens; after 260 billion tokens, training continues at 10% of the original learning rate.
- Linear LR warmup over the first 375 million tokens.
- All models were trained for a total of 300 billion tokens.
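These settings translate into an AdamW configuration plus a warmup-then-cosine learning-rate schedule. A sketch with step counts as placeholders (GPT-3's token budgets are far larger than anything this project will run); the peak learning rate and weight decay of 0.1 are the GPT-3 small-model values, and `model` is assumed to be defined elsewhere:

```python
import math
import torch

# GPT-3 optimizer settings: Adam(W) with beta1=0.9, beta2=0.95, eps=1e-8
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

max_lr = 6e-4           # peak LR for the 125M model in the GPT-3 paper
min_lr = 0.1 * max_lr   # cosine decays down to 10% of the peak
warmup_steps = 10       # placeholder: the steps covering ~375M warmup tokens
max_steps = 50          # placeholder: the steps covering the cosine-decay horizon

def get_lr(step: int) -> float:
    if step < warmup_steps:                       # linear warmup from ~0
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:                          # past the horizon: hold at 10%
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

# inside the training loop, before optimizer.step():
#   for group in optimizer.param_groups:
#       group["lr"] = get_lr(step)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global-norm clip at 1.0
```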
🚧 Work in Progress: This project is actively being built and improved step-by-step.