A small, readable educational project that implements a decoder-only Transformer from scratch in PyTorch.
It covers:
- Decoder-only Transformer
- Mixture of Experts (MoE)
- Dense MoE vs Sparse MoE
- Token-level routing
- Routing collapse + load-balancing loss
- Next-token prediction with sampling
- Temperature
- KV cache
- Grouped-Query Attention (GQA)
- Latent KV cache compression
- Pretraining
- Finetuning
- Preference tuning
- Objectives for pretraining / finetuning / preference tuning
- Data mixtures
- Number of parameters
- Vocabulary size
- Number of training tokens
- Supervised finetuning (SFT)
- LoRA finetuning
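To make the routing-collapse item concrete, here is a minimal sketch of a load-balancing auxiliary loss for top-1 routing, in the style of the Switch Transformer. It is written in plain Python (no PyTorch) so the arithmetic is visible; the function names are illustrative assumptions and the loss in this project's code may be defined differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits):
    """Auxiliary loss for top-1 routing (hypothetical helper, not the repo's API).

    router_logits: one list of per-expert logits per token.
    Returns num_experts * sum_i f_i * p_i, where
      f_i = fraction of tokens whose top-1 expert is i
      p_i = mean router probability assigned to expert i
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Count how many tokens pick each expert as their top-1 choice.
    counts = [0] * num_experts
    for row in probs:
        counts[row.index(max(row))] += 1

    f = [c / num_tokens for c in counts]
    p = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens, four experts: each token prefers a different expert.
balanced = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
# Routing collapse: every token prefers expert 0.
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

With evenly spread routing the loss sits at its minimum of 1.0, and it grows toward the number of experts as tokens pile onto a single expert. Adding it to the training loss with a small weight therefore pushes the router away from collapse.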
Also included:
- Tokenizer from scratch
- Attention equation
- Positional embeddings
- RoPE
- LayerNorm / RMSNorm
- Cross-entropy loss
- DPO-style preference tuning
- Simple generation
- Debug-friendly code
We build a tiny GPT-like model and train it in three stages:
Stage 1: Pretraining
Tiny raw text -> next-token prediction
Stage 2: Supervised Fine-Tuning (SFT)
Instruction-answer examples -> assistant behavior
Stage 3: Preference tuning
Chosen vs rejected answers -> prefer better responses
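As an illustration of Stage 3, the sketch below computes a DPO-style loss for a single chosen/rejected pair in plain Python. It assumes each argument is the summed log-probability of a full response under either the trainable policy or a frozen reference model; the function name, argument names, and the default `beta` are assumptions for this example, not taken from `train.py`.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss for one preference pair (hypothetical helper).

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: no preference yet, loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy raises the chosen answer and lowers the rejected one: loss drops.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
# Policy prefers the rejected answer: loss rises.
worse = dpo_loss(-12.0, -8.0, -10.0, -10.0)
print(neutral, improved, worse)
```

The loss only depends on how far the policy has moved relative to the reference, which is why the reference model is kept frozen during this stage.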
This project is intentionally small so you can run it on a laptop.
pip install torch tqdm
CPU is enough. GPU is optional.
python train.py --stage all
python train.py --stage pretrain
python train.py --stage sft
python train.py --stage dpo
python generate.py --prompt "Question: What is a transformer?"
Use a tiny run:
python train.py --stage all --max_steps 30 --batch_size 8 --block_size 64
Print model size:
python inspect_model.py
src/tokenizer.py Character tokenizer from scratch
src/model.py Decoder-only Transformer, attention, GQA, MoE, LoRA, KV cache
src/data.py Tiny datasets and batching
src/train_utils.py Losses, training loop, generation helpers
train.py Pretraining, SFT, DPO
generate.py Generate text
inspect_model.py Count parameters and explain config
A real LLM uses BPE/SentencePiece tokenization. Here we use character-level tokenization because:
- it is fully from scratch
- it needs no external files
- it is easy to debug
- it runs on a laptop
The idea is the same:
text -> token ids -> embeddings -> transformer -> logits -> probabilities
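The first and last steps of that pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `src/tokenizer.py`: the class name `CharTokenizer`, its methods, and the helper `softmax_with_temperature` are assumptions made for this example.

```python
import math


class CharTokenizer:
    """Character-level tokenizer: one token id per distinct character."""

    def __init__(self, text):
        chars = sorted(set(text))  # deterministic vocabulary order
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self):
        return len(self.stoi)

    def encode(self, text):
        # Raises KeyError for characters that were not in the training text.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tok = CharTokenizer("hello world")
ids = tok.encode("hello")          # text -> token ids
text = tok.decode(ids)             # token ids -> text (round trip)

# logits -> probabilities, at two temperatures.
warm = softmax_with_temperature([1.0, 0.0], temperature=1.0)
cold = softmax_with_temperature([1.0, 0.0], temperature=0.5)
print(ids, text, warm, cold)
```

A BPE or SentencePiece tokenizer would replace only the `encode`/`decode` step; the rest of the pipeline stays the same.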
This is a tiny educational model, not ChatGPT. After a short run, it should learn simple patterns from the included dataset and produce rough instruction-style answers.
The value of this project is not final performance. The value is that the code clearly shows the full LLM lifecycle:
pretraining -> SFT -> preference tuning -> inference optimization