
LLMs from Scratch

A step-by-step implementation of a GPT-like large language model from the ground up — covering everything from tokenization and attention mechanisms to pretraining, finetuning, and alignment. Built entirely in PyTorch with no high-level wrappers, so every component is transparent and hackable.

Beyond the core GPT implementation, this repo also includes from-scratch builds of Llama 3.2, Qwen3, and Gemma 3, plus modern techniques like LoRA, DPO, KV Cache, GQA, MoE, and Sliding Window Attention.


Table of Contents

| # | Chapter | Description |
| --- | --- | --- |
| 1 | Understanding Large Language Models | LLM fundamentals, development lifecycle, and landscape overview |
| 2 | Working with Text Data | Tokenization, byte-pair encoding, data loaders, and embedding layers |
| 3 | Coding Attention Mechanisms | Self-attention, causal masking, and multi-head attention from scratch |
| 4 | Implementing a GPT Model | Full GPT architecture: transformer blocks, layer norm, text generation |
| 5 | Pretraining on Unlabeled Data | Training loop, loss computation, weight loading, and text generation |
| 6 | Finetuning for Classification | Adapting the pretrained model for spam classification |
| 7 | Finetuning to Follow Instructions | Instruction tuning with Alpaca-style datasets |

| # | Appendix | Description |
| --- | --- | --- |
| A | Introduction to PyTorch | PyTorch fundamentals: tensors, autograd, datasets, and training |
| B | References and Further Reading | Curated reading list and references |
| C | Exercise Solutions | Solutions for chapter exercises |
| D | Adding Bells and Whistles to the Training Loop | Cosine annealing, warmup, gradient clipping, and more |
| E | Parameter-efficient Finetuning with LoRA | Low-Rank Adaptation implemented from scratch |

Bonus Content

Most chapters include bonus notebooks and scripts that go well beyond the basics:

Chapter 2 — Working with Text Data
  • Byte-pair encoder comparison (custom vs tiktoken)
  • Embedding layers vs matrix multiplication equivalence
  • Data loader intuition deep-dive
  • BPE tokenizer from scratch
Chapter 3 — Coding Attention Mechanisms
  • Efficient multi-head attention variants
  • Understanding PyTorch buffers
Chapter 4 — Implementing a GPT Model
  • Performance analysis and FLOPS estimation
  • KV Cache implementation
  • Grouped-Query Attention (GQA)
  • Multi-Head Latent Attention (MLA)
  • Sliding Window Attention (SWA)
  • Mixture-of-Experts (MoE)
Chapter 5 — Pretraining on Unlabeled Data
  • Alternative weight loading strategies
  • Pretraining on Project Gutenberg data
  • Learning rate schedulers and gradient clipping
  • Hyperparameter tuning
  • GPT → Llama 3.2 architecture conversion
  • Memory-efficient weight loading
  • Extending tokenizers
  • LLM training speed optimizations
  • Qwen3 from scratch (0.6B dense + 30B-A3B MoE)
  • Gemma 3 from scratch (1B)
  • Interactive chat UI
Chapter 6 — Finetuning for Classification
  • Additional experiments (last vs first token, extended context)
  • IMDb movie review classification
  • Interactive classification UI
Chapter 7 — Finetuning to Follow Instructions
  • Dataset preparation utilities
  • Model evaluation (local Llama 3 + GPT-4 API)
  • Direct Preference Optimization (DPO) from scratch
  • Synthetic dataset generation
  • Interactive instruction-following UI
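As a rough illustration of the core idea behind the Chapter 2 BPE-from-scratch notebook (this is a sketch of the algorithm, not the notebook's actual code), a single byte-pair-encoding merge step counts adjacent symbol pairs and fuses the most frequent one into a new token:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair, new_symbol):
    """Replace every occurrence of `pair` with a single merged symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)       # ('l', 'o') wins here (ties broken by first occurrence)
tokens = merge_pair(tokens, pair, "lo")
```

Repeating this step a fixed number of times yields the merge table that tokenizers such as tiktoken apply at encoding time.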

Project Structure

├── ch01/ - ch07/           # Main chapters (notebooks + scripts)
├── appendix-A/ - E/        # Supplementary material
├── pkg/llms_from_scratch/  # Installable Python package
│   ├── ch02.py - ch07.py   # Chapter implementations as modules
│   ├── llama3.py           # Llama 3 architecture
│   ├── qwen3.py            # Qwen3 architecture
│   ├── generate.py         # Text generation utilities
│   ├── kv_cache/           # KV cache implementations
│   └── tests/              # Test suite
├── setup/                  # Environment setup guides
├── reasoning-from-scratch/ # Reasoning model experiments
├── pyproject.toml          # Project config & dependencies
├── pixi.toml               # Conda-based environment manager
└── requirements.txt        # Pip requirements (quick reference)

Getting Started

Requirements

  • Python >=3.10, <3.14
  • PyTorch >=2.2.2

Installation

Option 1 — pip (recommended for most users)

pip install -r requirements.txt

Option 2 — Install as a package

pip install -e ./pkg

Then import anywhere:

from llms_from_scratch.ch04 import GPTModel
from llms_from_scratch.generate import generate

Option 3 — pixi (reproducible conda environment)

pixi install
pixi run jupyter lab

Running the Notebooks

jupyter lab

Open any chapter notebook (e.g. ch02/01_main-chapter-code/ch02.ipynb) and run cells sequentially.


Key Technologies

| Category | Stack |
| --- | --- |
| Deep Learning | PyTorch |
| Tokenization | tiktoken, sentencepiece, Hugging Face tokenizers |
| Model Formats | safetensors, PyTorch state_dict |
| Notebooks | JupyterLab |
| Testing | pytest, nbval |
| APIs | OpenAI, Hugging Face Hub |
| Interactive UI | Chainlit |
| Data & Viz | numpy, pandas, matplotlib, scikit-learn |

Model Architectures Implemented

| Model | Type | Location |
| --- | --- | --- |
| GPT-2 (124M) | Dense transformer | Ch04 – Ch07 (core) |
| Llama 3.2 | Dense transformer + RoPE + GQA | Ch05 bonus |
| Qwen3 0.6B | Dense transformer | Ch05 bonus |
| Qwen3 30B-A3B | Mixture-of-Experts | Ch05 bonus |
| Gemma 3 1B | Dense transformer + SWA | Ch05 bonus |
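The GQA column above refers to grouped-query attention, where several query heads share one key/value head. A minimal NumPy sketch of the shape bookkeeping (the sizes are illustrative assumptions, not the repo's actual configs):

```python
import numpy as np

# Illustrative sizes: 8 query heads share 2 key/value heads,
# so each KV head serves a group of 4 query heads.
n_q_heads, n_kv_heads, seq_len, head_dim = 8, 2, 4, 16
group_size = n_q_heads // n_kv_heads

k = np.random.randn(n_kv_heads, seq_len, head_dim)

# Expand KV heads so every query head in a group attends to the same keys.
k_expanded = np.repeat(k, group_size, axis=0)  # shape: (8, 4, 16)
```

The KV cache only ever stores the `n_kv_heads` tensors, which is where the memory saving over standard multi-head attention comes from.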

Advanced Techniques

  • KV Cache — inference-time memory optimization for autoregressive generation
  • Grouped-Query Attention (GQA) — reduces KV heads for efficiency
  • Multi-Head Latent Attention (MLA) — compressed latent KV projections
  • Sliding Window Attention (SWA) — bounded context for long sequences
  • Mixture-of-Experts (MoE) — sparse activation with expert routing
  • LoRA — parameter-efficient finetuning via low-rank adapters
  • DPO — alignment without a reward model
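The core trick behind LoRA can be sketched in a few lines of NumPy (dimensions are illustrative; the actual Appendix E implementation wraps PyTorch layers): instead of finetuning a full d×d weight matrix, train two small low-rank factors and add their scaled product to the frozen weight.

```python
import numpy as np

d, r, alpha = 768, 8, 16          # hidden size, LoRA rank, scaling (illustrative)

W = np.random.randn(d, d)         # frozen pretrained weight
A = np.random.randn(r, d) * 0.01  # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection, zero-initialized
                                  # so training starts from the pretrained model

def lora_forward(x):
    # Frozen path plus scaled low-rank update (W itself is never modified).
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = np.random.randn(2, d)
full_params = d * d               # 589,824 if W itself were finetuned
lora_params = 2 * d * r           # 12,288 trainable parameters instead
```

With these sizes the trainable parameter count drops by roughly 48x, which is why LoRA makes finetuning feasible on modest hardware.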

Supported Environments

| Environment | Notes |
| --- | --- |
| Local (CPU) | Works on a laptop; no GPU required |
| Local (GPU) | CUDA or Apple MPS for faster training |
| Google Colab | Free tier works for most chapters |
| Docker | DevContainer config included in setup/ |
| AWS SageMaker | CloudFormation template in setup/ |

License

This project is licensed under the Apache License 2.0.
