A step-by-step implementation of a GPT-like large language model from the ground up — covering everything from tokenization and attention mechanisms to pretraining, finetuning, and alignment. Built entirely in PyTorch with no high-level wrappers, so every component is transparent and hackable.
Beyond the core GPT implementation, this repo also includes from-scratch builds of Llama 3.2, Qwen3, and Gemma 3, plus modern techniques like LoRA, DPO, KV Cache, GQA, MoE, and Sliding Window Attention.
| # | Chapter | Description |
|---|---|---|
| 1 | Understanding Large Language Models | LLM fundamentals, development lifecycle, and landscape overview |
| 2 | Working with Text Data | Tokenization, byte-pair encoding, data loaders, and embedding layers |
| 3 | Coding Attention Mechanisms | Self-attention, causal masking, and multi-head attention from scratch |
| 4 | Implementing a GPT Model | Full GPT architecture — transformer blocks, layer norm, text generation |
| 5 | Pretraining on Unlabeled Data | Training loop, loss computation, weight loading, and text generation |
| 6 | Finetuning for Classification | Adapting the pretrained model for spam classification |
| 7 | Finetuning to Follow Instructions | Instruction tuning with Alpaca-style datasets |
| # | Appendix | Description |
|---|---|---|
| A | Introduction to PyTorch | PyTorch fundamentals: tensors, autograd, datasets, and training |
| B | References and Further Reading | Curated reading list and references |
| C | Exercise Solutions | Solutions for chapter exercises |
| D | Adding Bells and Whistles to the Training Loop | Cosine annealing, warmup, gradient clipping, and more |
| E | Parameter-efficient Finetuning with LoRA | Low-Rank Adaptation implemented from scratch |
Each chapter includes bonus notebooks and scripts that go well beyond the basics:
Chapter 2 — Working with Text Data
- Byte-pair encoder comparison (custom vs tiktoken)
- Embedding layers vs matrix multiplication equivalence
- Data loader intuition deep-dive
- BPE tokenizer from scratch
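The from-scratch BPE tokenizer in the last bullet centers on one loop: repeatedly merge the most frequent adjacent pair of token ids into a new id. A minimal plain-Python sketch of that idea (toy helpers on a toy string, not the repo's implementation):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most common one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# "Train" two merges on a toy byte sequence
ids = list("aaabdaaabac".encode("utf-8"))
merges = {}
for new_id in (256, 257):
    pair = most_frequent_pair(ids)
    merges[pair] = new_id
    ids = merge(ids, pair, new_id)
```

Encoding new text then replays the learned `merges` in order; real tokenizers (tiktoken included) add byte-level pretokenization and special-token handling on top of this core.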
Chapter 3 — Coding Attention Mechanisms
- Efficient multi-head attention variants
- Understanding PyTorch buffers
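All of these variants build on the same core operation. As a framework-free refresher, single-head scaled dot-product attention with a causal mask can be sketched in a few lines of plain Python (the chapter's actual implementations use PyTorch tensors):

```python
import math

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.
    q, k, v are lists of equal-length vectors, one per position."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Causal mask: position i may only attend to positions 0..i
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Output is the attention-weighted average of the visible values
        out.append([sum(e / total * v[j][c] for j, e in enumerate(exps))
                    for c in range(len(v[0]))])
    return out
```

Multi-head attention runs several of these in parallel on learned projections of the input and concatenates the results.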
Chapter 4 — Implementing a GPT Model
- Performance analysis and FLOPS estimation
- KV Cache implementation
- Grouped-Query Attention (GQA)
- Multi-Head Latent Attention (MLA)
- Sliding Window Attention (SWA)
- Mixture-of-Experts (MoE)
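Several of these bonus builds, the KV cache especially, rest on a simple idea: store past keys and values so each decoding step only computes attention for the newest token. A minimal single-head sketch in plain Python (a hypothetical helper, not the repo's PyTorch version):

```python
import math

def attend_with_cache(new_q, new_k, new_v, cache):
    """One decoding step: append this token's key/value to the cache,
    then attend the new query over all cached positions."""
    cache["k"].append(new_k)
    cache["v"].append(new_v)
    d = len(new_q)
    scores = [sum(a * b for a, b in zip(new_q, k)) / math.sqrt(d)
              for k in cache["k"]]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, cache["v"]))
            for c in range(len(new_v))]

cache = {"k": [], "v": []}  # grows by one entry per generated token
```

GQA and MLA are then refinements of the same cache: GQA shares each key/value head across several query heads, and MLA stores a compressed latent in place of the full keys and values.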
Chapter 5 — Pretraining on Unlabeled Data
- Alternative weight loading strategies
- Pretraining on Project Gutenberg data
- Learning rate schedulers and gradient clipping
- Hyperparameter tuning
- GPT → Llama 3.2 architecture conversion
- Memory-efficient weight loading
- Extending tokenizers
- LLM training speed optimizations
- Qwen3 from scratch (0.6B dense + 30B-A3B MoE)
- Gemma 3 from scratch (1B)
- Interactive chat UI
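The text-generation loop behind the chat UI boils down to greedy autoregressive decoding: crop the context, score the next token, append the argmax, repeat. A framework-free sketch (the `toy` logits function below is a stand-in for a real model, not anything in the repo):

```python
def greedy_generate(logits_fn, ids, max_new_tokens, context_size):
    """Greedy autoregressive decoding."""
    for _ in range(max_new_tokens):
        context = ids[-context_size:]   # crop to the model's context window
        logits = logits_fn(context)     # one score per vocabulary token
        ids.append(max(range(len(logits)), key=logits.__getitem__))
    return ids

# Toy stand-in for a model: always favors token (last_id + 1) mod 4
toy = lambda ctx: [1.0 if t == (ctx[-1] + 1) % 4 else 0.0 for t in range(4)]
```

Swapping the argmax for temperature-scaled sampling with top-k filtering gives the more interesting generation behavior covered in the chapter.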
Chapter 6 — Finetuning for Classification
- Additional experiments (last vs first token, extended context)
- IMDb movie review classification
- Interactive classification UI
Chapter 7 — Finetuning to Follow Instructions
- Dataset preparation utilities
- Model evaluation (local Llama 3 + GPT-4 API)
- Direct Preference Optimization (DPO) from scratch
- Synthetic dataset generation
- Interactive instruction-following UI
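The DPO bonus material reduces alignment to a single loss on preference pairs, computed from log-probabilities under the trainable policy and a frozen reference model. A plain-Python sketch of that loss (the β value and inputs are illustrative):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair. Inputs are summed log-probs of the
    chosen/rejected responses under the policy and the frozen reference model."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid)
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss falls as the policy favors the chosen response more than the reference does, which is exactly the "alignment without a reward model" property noted below.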
```
├── ch01/ - ch07/            # Main chapters (notebooks + scripts)
├── appendix-A/ - E/         # Supplementary material
├── pkg/llms_from_scratch/   # Installable Python package
│   ├── ch02.py - ch07.py    # Chapter implementations as modules
│   ├── llama3.py            # Llama 3 architecture
│   ├── qwen3.py             # Qwen3 architecture
│   ├── generate.py          # Text generation utilities
│   ├── kv_cache/            # KV cache implementations
│   └── tests/               # Test suite
├── setup/                   # Environment setup guides
├── reasoning-from-scratch/  # Reasoning model experiments
├── pyproject.toml           # Project config & dependencies
├── pixi.toml                # Conda-based environment manager
└── requirements.txt         # Pip requirements (quick reference)
```
- Python >=3.10, <3.14
- PyTorch >=2.2.2
Option 1 — pip (recommended for most users)

```bash
pip install -r requirements.txt
```

Option 2 — Install as a package

```bash
pip install -e ./pkg
```

Then import anywhere:

```python
from llms_from_scratch.ch04 import GPTModel
from llms_from_scratch.generate import generate
```

Option 3 — pixi (reproducible conda environment)

```bash
pixi install
pixi run jupyter lab
```

Open any chapter notebook (e.g. `ch02/01_main-chapter-code/ch02.ipynb`) and run cells sequentially.
| Category | Stack |
|---|---|
| Deep Learning | PyTorch |
| Tokenization | tiktoken, sentencepiece, Hugging Face tokenizers |
| Model Formats | safetensors, PyTorch state_dict |
| Notebooks | JupyterLab |
| Testing | pytest, nbval |
| APIs | OpenAI, Hugging Face Hub |
| Interactive UI | Chainlit |
| Data & Viz | numpy, pandas, matplotlib, scikit-learn |
| Model | Type | Location |
|---|---|---|
| GPT-2 (124M) | Dense transformer | Ch04 – Ch07 (core) |
| Llama 3.2 | Dense transformer + RoPE + GQA | Ch05 bonus |
| Qwen3 0.6B | Dense transformer | Ch05 bonus |
| Qwen3 30B-A3B | Mixture-of-Experts | Ch05 bonus |
| Gemma 3 1B | Dense transformer + SWA | Ch05 bonus |
- KV Cache — inference-time memory optimization for autoregressive generation
- Grouped-Query Attention (GQA) — reduces KV heads for efficiency
- Multi-Head Latent Attention (MLA) — compressed latent KV projections
- Sliding Window Attention (SWA) — bounded context for long sequences
- Mixture-of-Experts (MoE) — sparse activation with expert routing
- LoRA — parameter-efficient finetuning via low-rank adapters
- DPO — alignment without a reward model
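Of these, LoRA is perhaps the easiest to convey in a few lines: the frozen weight W gets a low-rank additive update B·A scaled by α/r, with B initialized to zero so training starts from the pretrained behavior. A dependency-free sketch (toy matrix helpers, not the repo's implementation):

```python
def lora_linear(x, W, A, B, alpha=16, r=None):
    """y = x·W + (alpha/r)·(x·A)·B, the LoRA forward path.
    W: frozen d_in×d_out weight; A: d_in×r, B: r×d_out (B starts at zero)."""
    r = r if r is not None else len(B)

    def matvec(v, M):  # row vector v times matrix M
        return [sum(vi * M[i][j] for i, vi in enumerate(v))
                for j in range(len(M[0]))]

    base = matvec(x, W)                 # frozen pretrained path
    delta = matvec(matvec(x, A), B)     # trainable low-rank path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]
```

Because only A and B are trained, the parameter count scales with r·(d_in + d_out) instead of d_in·d_out, which is what makes the Appendix E finetuning runs cheap.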
| Environment | Notes |
|---|---|
| Local (CPU) | Works on a laptop — no GPU required |
| Local (GPU) | CUDA or Apple MPS for faster training |
| Google Colab | Free tier works for most chapters |
| Docker | DevContainer config included in setup/ |
| AWS SageMaker | CloudFormation template in setup/ |
This project is licensed under the Apache License 2.0.