A complete GPT (Generative Pre-trained Transformer) implementation from first principles using PyTorch. No external LLM libraries - just pure PyTorch implementations of attention, transformers, and language modeling.
This project implements the GPT architecture from scratch for educational purposes. It demonstrates a deep understanding of transformer models, attention mechanisms, and language model training - all implemented without relying on high-level LLM libraries.
- Multi-Head Self-Attention - The core mechanism enabling transformers
- Transformer Blocks - Pre-LN architecture as used in GPT-2
- Positional Embeddings - Learned positional encodings
- BPE Tokenizer - Byte Pair Encoding from scratch
- Training Pipeline - Mixed precision, gradient clipping, LR scheduling (see the training-step sketch below this list)
- Text Generation - Temperature, top-k, and nucleus sampling
- Web Interface - Streamlit demo for interactive generation
- REST API - FastAPI server for production deployment
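To make the training-pipeline item concrete, here is a minimal sketch of one epoch with automatic mixed precision, gradient clipping, and a learning-rate scheduler. It is illustrative only: `train_one_epoch` and its arguments are hypothetical stand-ins, not the actual code in `train.py`.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, scheduler, scaler, clip=1.0, device="cuda"):
    """One epoch with mixed precision, gradient clipping and LR scheduling (sketch)."""
    model.train()
    for x, y in loader:                                   # x, y: (B, T) token ids
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda"):          # mixed-precision forward pass
            logits = model(x)                             # (B, T, vocab)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)                        # clip the true (unscaled) gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
```

Here `scaler` would typically be a `torch.cuda.amp.GradScaler()`; on CPU-only runs the autocast/scaler machinery can simply be skipped.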
```
Input Tokens  [T₁, T₂, T₃, ..., Tₙ]
        │
        ▼
Token Embedding (vocab → d_model)  +  Positional Embedding (pos → d_model)
        │
        ▼
     Dropout
        │
        ▼
Transformer Block (×N)
  ├─ LayerNorm → Multi-Head Self-Attention → Residual Add
  └─ LayerNorm → Feed-Forward MLP (4×d_model) → Residual Add
        │
        ▼
Final LayerNorm
        │
        ▼
Linear Projection (d_model → vocab)   (weight tied with embedding)
        │
        ▼
Output Logits  [P(T₁), P(T₂), ..., P(Tᵥ)]
```
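The diagram maps directly onto a forward pass. Below is a condensed, hedged sketch of that flow; `MiniGPT` is a hypothetical stand-in that uses `nn.TransformerEncoderLayer` in place of the project's hand-written Pre-LN block (the real implementation lives in `gpt/model.py`).

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Condensed sketch of the forward flow shown in the diagram above."""

    def __init__(self, vocab_size, d_model, n_layers, n_heads, context, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding
        self.pos_emb = nn.Embedding(context, d_model)      # learned positional embedding
        self.drop = nn.Dropout(dropout)
        # Stand-in for the project's hand-written Pre-LN block (norm_first=True).
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, dropout,
                                       batch_first=True, norm_first=True)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)                  # final LayerNorm
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight          # weight tying

    def forward(self, idx):                                # idx: (B, T) token ids
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        # Boolean causal mask: True above the diagonal = "may not attend".
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), 1)
        for block in self.blocks:
            x = block(x, src_mask=causal)
        return self.lm_head(self.ln_f(x))                  # (B, T, vocab_size) logits
```

For example, `MiniGPT(vocab_size, d_model=384, n_layers=6, n_heads=6, context=256)` roughly mirrors the `mini` configuration in the table further down, with `vocab_size` set by whatever the BPE tokenizer produces.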
```
Input (B, T, C)
        │
        ├── Query Linear → reshape to (B, H, T, D)
        ├── Key Linear   → reshape to (B, H, T, D)
        └── Value Linear → reshape to (B, H, T, D)
        │
        ▼
QKᵀ / √d_k  →  Causal Mask (future = -∞)  →  Softmax
        │
        ▼
Attention weights × V
        │
        ▼
Concat heads → Output Linear
```
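In code, the diagram above translates into roughly the following module. This is a sketch, not the project's `model.py`: it fuses the Q/K/V projections into a single linear layer, whereas the diagram draws them separately.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Sketch of the flow above: project, split heads, mask, softmax, merge."""

    def __init__(self, d_model: int, n_heads: int, context: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)      # Q, K, V in one projection
        self.proj = nn.Linear(d_model, d_model)         # output linear
        self.drop = nn.Dropout(dropout)
        mask = torch.tril(torch.ones(context, context, dtype=torch.bool))
        self.register_buffer("causal_mask", mask)

    def forward(self, x):                               # x: (B, T, C)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, H, T, D)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)   # QKᵀ / √d_k
        att = att.masked_fill(~self.causal_mask[:T, :T], float("-inf"))
        att = self.drop(F.softmax(att, dim=-1))
        out = att @ v                                   # (B, H, T, D)
        out = out.transpose(1, 2).contiguous().view(B, T, C)       # concat heads
        return self.proj(out)
```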
| Config | Params | Layers | Heads | d_model | Context | Use Case |
|---|---|---|---|---|---|---|
| `nano` | ~1M | 3 | 3 | 192 | 64 | CPU testing |
| `micro` | ~3M | 4 | 4 | 256 | 128 | Quick experiments |
| `mini` | ~10M | 6 | 6 | 384 | 256 | Laptop training |
| `small` | ~124M | 12 | 12 | 768 | 1024 | GPT-2 Small |
| `medium` | ~350M | 24 | 16 | 1024 | 1024 | GPT-2 Medium |
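Configurations like these are typically captured in a small dataclass. The sketch below is hypothetical; field names and defaults are illustrative, and the real definitions live in `gpt/config.py`.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257    # illustrative default; the project's BPE vocab may differ
    n_layers: int = 6
    n_heads: int = 6
    d_model: int = 384
    context: int = 256         # maximum sequence length
    dropout: float = 0.1

# e.g. the "mini" row of the table above
mini = GPTConfig(n_layers=6, n_heads=6, d_model=384, context=256)
```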
```bash
# Clone repository
git clone https://github.com/ashwani65/gpt-from-scratch.git
cd gpt-from-scratch

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Train on Shakespeare (quick demo)
python train.py --config nano --epochs 10

# Train mini model on GPU
python train.py --config mini --epochs 20 --device cuda

# Custom training
python train.py --config micro --data data/your_text.txt --batch_size 32 --lr 3e-4
```

```bash
# Generate from checkpoint
python generate.py --checkpoint checkpoints/best_model.pt --prompt "To be or not to be"

# Interactive mode
python generate.py --checkpoint checkpoints/best_model.pt --interactive

# With sampling parameters
python generate.py --checkpoint checkpoints/best_model.pt \
    --prompt "Once upon a time" \
    --temperature 0.8 \
    --top_k 50 \
    --max_tokens 200
```

```bash
# Streamlit demo
streamlit run serving/streamlit_app.py

# FastAPI server
uvicorn serving.fastapi_server:app --reload --port 8000
```
```
gpt-from-scratch/
├── gpt/
│   ├── __init__.py
│   ├── config.py           # Model & training configurations
│   ├── model.py            # GPT model (attention, transformer, etc.)
│   ├── tokenizer.py        # BPE tokenizer implementation
│   └── dataset.py          # Data loading utilities
├── serving/
│   ├── streamlit_app.py    # Web UI demo
│   └── fastapi_server.py   # REST API server
├── notebooks/
│   ├── 01_attention.ipynb  # Understanding attention
│   └── 02_training.ipynb   # Training walkthrough
├── data/
│   └── shakespeare.txt     # Sample training data
├── checkpoints/            # Saved models
├── train.py                # Training script
├── generate.py             # Generation script
├── requirements.txt
├── Dockerfile
└── README.md
```
```
Attention(Q, K, V) = softmax(QKᵀ / √d_k) × V
```

The scaling factor 1/√d_k prevents the dot products from growing too large, which would push softmax into regions with extremely small gradients.
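The effect is easy to check empirically: dot products of random d_k-dimensional, unit-variance vectors have a standard deviation of roughly √d_k, and dividing by √d_k brings it back to about 1. A quick illustrative snippet (not project code):

```python
import torch

d_k = 64
q, k = torch.randn(10000, d_k), torch.randn(10000, d_k)
scores = (q * k).sum(dim=-1)          # 10,000 random query·key dot products
print(scores.std())                   # ≈ sqrt(d_k) = 8
print((scores / d_k ** 0.5).std())    # ≈ 1 after scaling
```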
For autoregressive language modeling, we mask future tokens to prevent the model from "cheating":
```python
# mask: lower-triangular (T, T) matrix; attn: raw attention scores of shape (B, H, T, T)
mask = torch.tril(torch.ones(T, T))
attn = attn.masked_fill(mask == 0, float('-inf'))
```

GPT-2 uses Pre-LN (LayerNorm before attention/MLP) instead of Post-LN:
```python
x = x + attention(layer_norm(x))   # attention sub-layer with residual connection
x = x + mlp(layer_norm(x))         # feed-forward sub-layer with residual connection
```

The token embedding weights are shared with the output projection layer, reducing parameters and often improving performance.
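In PyTorch, weight tying usually comes down to pointing the output head's weight at the embedding matrix. A minimal sketch (module names are illustrative, not necessarily those used in `gpt/model.py`):

```python
import torch.nn as nn

class TiedHead(nn.Module):
    """Toy example: share the token-embedding matrix with the output projection."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: both layers now reference the same (vocab_size, d_model) tensor.
        self.lm_head.weight = self.tok_emb.weight
```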
| Model | Training Time | Val Loss | Sample Quality |
|---|---|---|---|
| nano | ~5 min (CPU) | 1.8 | Basic patterns |
| micro | ~15 min (CPU) | 1.5 | Recognizable style |
| mini | ~30 min (GPU) | 1.3 | Good coherence |
Prompt: "To be or not to be"
Temperature 0.5 (focused):
To be or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune...
Temperature 1.0 (creative):
To be or not to be, what dreams may come
When we have shuffled off this mortal coil
Must give us pause: there's the respect...
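Temperature rescales the logits before softmax (lower values sharpen the distribution, as in the "focused" sample above), while top-k and top-p (nucleus) sampling truncate the candidate set before a token is drawn. The following is a self-contained sketch of one sampling step; `sample_next` is a hypothetical helper, not the code in `generate.py`.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from a (vocab_size,) logits vector."""
    logits = logits / max(temperature, 1e-8)               # temperature scaling
    if top_k is not None:
        top_k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")                # keep only the k best tokens
    if top_p is not None:
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cutoff = torch.cumsum(probs, dim=-1) > top_p        # outside the nucleus
        cutoff[1:] = cutoff[:-1].clone()                    # shift so the boundary token stays
        cutoff[0] = False                                   # always keep the top token
        logits[idx[cutoff]] = float("-inf")
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).item()
```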
| Aspect | This Implementation | GPT-2/3/4 |
|---|---|---|
| Architecture | Same core | Same + optimizations |
| Parameters | 1M - 350M | 124M - 1.7T |
| Training Data | 1MB - 1GB | 40GB - 570GB+ |
| Training Time | Minutes - Hours | Days - Months |
| Training Cost | Free (Colab) | $50K - $100M |
| Capabilities | Basic generation | SOTA performance |
Key Insight: The fundamental architecture is ~90% identical. The difference is scale: more parameters, more data, more compute, plus post-training alignment (RLHF).
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check |
| GET | `/info` | Model information |
| POST | `/generate` | Generate text |
| POST | `/tokenize` | Tokenize text |
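The `/generate` endpoint accepts the JSON body documented just below. As a sketch of how a client might call it using only the standard library (assuming the FastAPI server above is running on localhost:8000; the exact response fields depend on `serving/fastapi_server.py`):

```python
import json
import urllib.request

payload = {"prompt": "Once upon a time", "max_tokens": 200,
           "temperature": 0.8, "top_k": 50, "top_p": 0.95}
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```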
`POST /generate`

```json
{
  "prompt": "Once upon a time",
  "max_tokens": 200,
  "temperature": 0.8,
  "top_k": 50,
  "top_p": 0.95
}
```

This project was inspired by and builds upon:
- Attention Is All You Need - Original Transformer paper
- GPT-2 Paper
- Karpathy's nanoGPT - Minimal GPT implementation
- Raschka's LLMs from Scratch - Comprehensive book
MIT License - see LICENSE file.
Ashwani Singh
- GitHub: @ashwani65
- LinkedIn: Ashwani Singh
Built as part of M.Tech in AI studies at IIT Jodhpur.