Welcome to gpt2-diy — a personal, from-scratch reproduction of GPT-2, inspired by Andrej Karpathy’s video Let's Reproduce GPT-2.
This project aims to deeply understand transformer-based language models by rebuilding GPT-2 independently, referencing original papers and consulting existing codebases only lightly when necessary.
The best way to learn is by doing.
gpt2-diy is a hands-on journey to build skill and acquire deep knowledge of modern language models. By independently reproducing GPT-2, the project aims to build a practical understanding of architectures, training dynamics, optimization strategies, and scaling laws.
- Let's Reproduce GPT-2 by Andrej Karpathy
- Attention is All You Need (Vaswani et al., 2017)
- Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) — GPT-1
- Language Models are Unsupervised Multitask Learners (Radford et al., 2019) — GPT-2
- Language Models are Few-Shot Learners (Brown et al., 2020) — GPT-3
A rough outline based on the "Let's Reproduce GPT-2" methodology:
- **Skim Papers**: Collect factual information about architecture design and training strategies.
- **Load OpenAI Model Weights**: Use Hugging Face Transformers to load the original GPT-2 weights, providing a reference for our own implementation.
- **Implement Model in PyTorch**: Write the model relying only on the papers. Use the Hugging Face repo solely to copy the correct weight names for compatibility.
- **Model Loading Method (`from_pretrained`)**: Add a method to load Hugging Face's GPT-2 weights into the custom model.
- **Implement Generate Functionality**: Add text generation to validate the loaded model weights (see the first sketch after this list).
- **Tiny Shakespeare Data Preparation**
- **Compute Loss (at Initialization as well)**
- **Optimization: Check on One Batch**
- **Data Loader Lite**
- **Paper Adjustments**
  - Weight tying between the embedding and output projection layers.
  - Correct weight initialization.
- **Speed Up Training** (see the training-step sketch after this list)
  - Enable TensorFloat32 (TF32) precision.
  - Use bf16 where available (Ampere+ GPUs).
  - Apply `torch.compile` for compilation speedup.
  - Integrate Flash Attention.
  - Tune batch sizes for hardware efficiency.
- **Optimization Settings from GPT-3**
  - Gradient accumulation to emulate a large effective batch size (~0.5M tokens).
- **Distributed Data Parallel (DDP)** (see the DDP sketch after this list)
- **Switch to a "Real" Dataset**: Move training from Tiny Shakespeare to a FineWeb sample.
- **Evaluation and Logging**: Set up evaluation tasks like HellaSwag for zero-shot benchmarks.
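The weight-loading and generation steps can be validated end to end against the reference model before the custom implementation exists. A minimal sketch, assuming the `torch` and `transformers` packages are installed; the prompt, seed, and top-k value are arbitrary choices:

```python
# Sanity-check the reference GPT-2 (124M) weights by sampling a continuation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

idx = tokenizer.encode("Hello, I'm a language model,", return_tensors="pt")  # (1, T)

# Manual top-k sampling loop (instead of model.generate) so the exact same loop
# can later validate the custom from_pretrained implementation.
torch.manual_seed(42)
for _ in range(30):
    with torch.no_grad():
        logits = model(idx).logits                          # (1, T, vocab_size)
    probs = torch.softmax(logits[:, -1, :], dim=-1)         # last position only
    topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)    # keep the top 50 tokens
    next_tok = topk_idx.gather(-1, torch.multinomial(topk_probs, 1))
    idx = torch.cat([idx, next_tok], dim=1)

print(tokenizer.decode(idx[0]))
```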
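The speed-up and GPT-3 batch-size items mostly reduce to a few PyTorch switches around the training step. A rough sketch of one optimizer step with gradient accumulation, assuming a CUDA GPU and assuming the custom `model`, `optimizer`, and `train_loader` (with a `next_batch()` method) are defined elsewhere; `model(x, y)` returning `(logits, loss)` is this project's convention, not a PyTorch API:

```python
import torch

torch.set_float32_matmul_precision("high")   # enable TF32 matmuls on Ampere+ GPUs
model = torch.compile(model)                 # graph capture / kernel fusion

grad_accum_steps = 32  # placeholder: chosen so B * T * grad_accum_steps is ~0.5M tokens
optimizer.zero_grad()
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    x, y = x.to("cuda"), y.to("cuda")
    # bf16 autocast where supported; using F.scaled_dot_product_attention inside
    # the model lets PyTorch dispatch to a Flash Attention kernel when available.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)
    # divide so the accumulated gradient equals the mean over the full effective batch
    (loss / grad_accum_steps).backward()
optimizer.step()
```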
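For the DDP step, the standard `torchrun`-launched setup is roughly the following sketch; `GPT` and `GPTConfig` stand in for the custom model defined elsewhere, and `torchrun` is assumed to set the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
ddp_rank = int(os.environ["RANK"])              # global rank, used to shard the data
ddp_local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
device = f"cuda:{ddp_local_rank}"
torch.cuda.set_device(device)

model = GPT(GPTConfig()).to(device)  # placeholder: the custom model from the steps above
model = DDP(model, device_ids=[ddp_local_rank])
# Gradients are averaged across ranks automatically during backward();
# each rank should iterate over a disjoint shard of the dataset.

# ... training loop ...
dist.destroy_process_group()
```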
From the GPT-2 paper:
- GPT-2 is trained on a new dataset of millions of web pages (WebText).
- The capacity of the language model is essential to the success of zero-shot task transfer.
- This points toward building language processing systems that learn to perform tasks from their naturally occurring demonstrations.
- Data: a web scrape that emphasizes document quality and makes no assumptions about specific downstream tasks.
- The architecture largely follows the original Transformer and GPT-1.
From the GPT-1 paper:
- 12-layer decoder-only Transformer with masked self-attention heads.
- 768-dimensional states and 12 attention heads, so the per-head size is 64; the compatibility function (q @ k) and the weighted sum of values (att @ v) are computed for all heads independently and in parallel (see the sketch after this list).
- 3072-dimensional (768 × 4) inner state in the position-wise feed-forward networks.
- Weight initialization: N(0, 0.02).
- Residual, embedding, and attention dropout with p = 0.1.
- Decoupled (AdamW-style) L2 regularization with w = 0.01 on all non-bias and non-gain weights.
- GELU non-linearity.
- Learned position embeddings.
- Adam optimizer with a max learning rate of 2.5e-4.
- The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.
- 100 epochs.
- Batch size 64.
- Block (context) size 512.
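Taken together, these numbers map almost directly onto a model configuration and the attention arithmetic. A minimal sketch, not the final implementation; the class and field names below are this project's own choices, and the defaults use GPT-2's vocabulary and block size (noted in the modifications list that follows):

```python
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    block_size: int = 1024   # context length (GPT-1 used 512; GPT-2 uses 1024)
    vocab_size: int = 50257  # GPT-2 vocabulary (see the modifications below)
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768        # per-head size = 768 / 12 = 64
    dropout: float = 0.1

class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # q, k, v projections
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_size): all heads run independently, in parallel
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # compatibility function q @ k
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))             # causal (masked) attention
        att = F.softmax(att, dim=-1)
        y = att @ v                                              # weighted sum of values
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)  # 3072-dim inner state
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```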
GPT-2 modifications:
- Layer norm moved to the input of each sub-block (a block = attention + feed-forward), with an additional layer norm after the final self-attention block.
- Modified initialization: scale the weights of residual layers at initialization by a factor of 1/sqrt(N), where N is the number of residual layers (see the sketch after this list).
- Vocabulary expanded to 50,257 tokens.
- Block (context) size increased to 1024.
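A rough sketch of how these two structural changes look in code, reusing the `CausalSelfAttention` and `MLP` sketches above; treating N as 2 × n_layer (two residual additions per block) and tagging residual projections with a custom attribute are this project's interpretations, not something spelled out in the paper:

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm block: layer norm is applied at the input of each sub-block."""
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual path around pre-normed attention
        x = x + self.mlp(self.ln_2(x))   # residual path around pre-normed MLP
        return x

def init_weights(module, n_layer: int):
    """N(0, 0.02) init, with residual projection weights scaled by 1/sqrt(N)."""
    if isinstance(module, nn.Linear):
        std = 0.02
        # assumption: residual projections are tagged with a custom attribute
        if getattr(module, "is_residual_proj", False):
            std *= (2 * n_layer) ** -0.5   # N = 2 * n_layer residual layers
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# usage (hypothetical): model.apply(lambda m: init_weights(m, config.n_layer))
```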
From the GPT-3 paper:
- Adam with β1 = 0.9, β2 = 0.95, and ε = 10⁻⁸ (see the optimizer/schedule sketch after this list).
- Clip the global norm of the gradient at 1.0.
- Cosine decay of the learning rate down to 10% of its value over 260 billion tokens; after 260 billion tokens, training continues at 10% of the original learning rate.
- Linear LR warmup over the first 375 million tokens.
- All models were trained for a total of 300 billion tokens.
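These settings translate into an AdamW configuration plus a warmup-then-cosine learning-rate schedule. A sketch with step counts as placeholders (GPT-3's token budgets are far larger than anything this project will run); the peak learning rate and weight decay of 0.1 are the GPT-3 small-model values, and `model` is assumed to be defined elsewhere:

```python
import math
import torch

# GPT-3 optimizer settings: Adam(W) with beta1=0.9, beta2=0.95, eps=1e-8
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

max_lr = 6e-4           # peak LR for the 125M model in the GPT-3 paper
min_lr = 0.1 * max_lr   # cosine decays down to 10% of the peak
warmup_steps = 10       # placeholder: the steps covering ~375M warmup tokens
max_steps = 50          # placeholder: the steps covering the cosine-decay horizon

def get_lr(step: int) -> float:
    if step < warmup_steps:                       # linear warmup from ~0
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:                          # past the horizon: hold at 10%
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

# inside the training loop, before optimizer.step():
#   for group in optimizer.param_groups:
#       group["lr"] = get_lr(step)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global-norm clip at 1.0
```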
🚧 Work in Progress: This project is actively being built and improved step-by-step.