Mini Transformer LLM From Scratch in PyTorch

A small, readable educational project that implements a decoder-only Transformer from scratch in PyTorch.

It covers:

Decoder-only Transformer
Mixture of Experts (MoE)
Dense MoE vs Sparse MoE
Token-level routing
Routing collapse + load-balancing loss
Next-token prediction with sampling
Temperature
KV cache
Grouped-Query Attention (GQA)
Latent KV cache compression
Pretraining
Finetuning
Preference tuning
Objectives for pretraining / finetuning / preference tuning
Data mixtures
Number of parameters
Vocabulary size
Number of training tokens
Supervised finetuning (SFT)
LoRA finetuning
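
Two of the topics above, next-token sampling and temperature, fit in a few lines. The following is a minimal sketch in plain Python rather than PyTorch, so it runs without any dependencies; the function names and the example logits are illustrative assumptions and are not taken from the project's `src/` code.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by a temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up scores for a 3-token vocabulary (illustration only).
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution
token_id = sample_next_token(logits, temperature=0.8, rng=random.Random(0))
```

A temperature below 1 concentrates probability on the highest-scoring token, while a temperature above 1 spreads it out, so `cold[0]` is larger than `hot[0]` here.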

Also included:

Tokenizer from scratch
Attention equation
Positional embeddings
RoPE
LayerNorm / RMSNorm
Cross-entropy loss
DPO-style preference tuning
Simple generation
Debug-friendly code

Project idea

We build a tiny GPT-like model and train it in three stages:

Stage 1: Pretraining
Tiny raw text -> next-token prediction

Stage 2: Supervised Fine-Tuning (SFT)
Instruction-answer examples -> assistant behavior

Stage 3: Preference tuning
Chosen vs rejected answers -> prefer better responses
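
The three stages differ mainly in their loss. As a rough sketch, assuming the usual formulations (next-token cross-entropy for pretraining, the same loss with prompt positions masked out for SFT, and the standard DPO objective for preference tuning), the objectives can be written in plain Python as follows. The function names, the `loss_mask` convention, and `beta=0.1` are assumptions for illustration; the project's own versions live in `src/train_utils.py` and may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_loss(logits_per_position, token_ids, loss_mask=None):
    """Mean cross-entropy of predicting token_ids[t + 1] from position t.

    logits_per_position[t] holds the vocabulary scores produced at position t.
    loss_mask[t] == 0 skips position t (e.g. prompt tokens during SFT).
    """
    targets = token_ids[1:]  # targets are the inputs shifted left by one
    if loss_mask is None:
        loss_mask = [1] * len(targets)
    total, count = 0.0, 0
    # zip stops at the shortest input, so the last position (no target) is ignored
    for logits, target, keep in zip(logits_per_position, targets, loss_mask):
        if keep:
            total += -log_softmax(logits)[target]
            count += 1
    return total / max(count, 1)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), computed in an overflow-safe form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Stage 1: every position contributes. Uniform logits over 3 tokens give log(3).
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pretrain_loss = next_token_loss(uniform, token_ids=[0, 2, 1])

# Stage 2: only the answer position contributes (mask out the prompt position).
sft_loss = next_token_loss(uniform, token_ids=[0, 2, 1], loss_mask=[0, 1])

# Stage 3: the loss drops below log(2) once the policy prefers the chosen answer
# more strongly than the frozen reference model does.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The log-probabilities passed to `dpo_loss` are sums over the answer tokens of each sequence, computed once under the model being trained and once under a frozen reference copy.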

This project is intentionally small so you can run it on a laptop.

Install

pip install torch tqdm

CPU is enough. GPU is optional.

Run everything

python train.py --stage all

Run stages separately

python train.py --stage pretrain
python train.py --stage sft
python train.py --stage dpo
python generate.py --prompt "Question: What is a transformer?"

Useful debug commands

Use a tiny run:

python train.py --stage all --max_steps 30 --batch_size 8 --block_size 64

Print model size:

python inspect_model.py
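
For a rough cross-check of a reported model size, the parameter count of a dense decoder-only Transformer with GQA can be estimated by hand. The sketch below is not what `inspect_model.py` does; it is an estimate under stated assumptions: no bias terms, a two-matrix MLP, two RMSNorm weight vectors per layer plus a final norm, no MoE experts, and an output head tied to the embedding table. All argument names and the example configuration are hypothetical.

```python
def estimate_params(vocab_size, d_model, n_layers, n_heads, n_kv_heads, d_ff,
                    tie_embeddings=True):
    """Hand estimate of parameter count for a dense decoder-only Transformer."""
    head_dim = d_model // n_heads
    embed = vocab_size * d_model
    # GQA: queries and the output projection use all heads,
    # while keys and values use only n_kv_heads heads.
    attn = (d_model * d_model                        # W_q
            + 2 * d_model * n_kv_heads * head_dim    # W_k and W_v
            + d_model * d_model)                     # W_o
    mlp = 2 * d_model * d_ff                         # up and down projections
    norms = 2 * d_model                              # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    final_norm = d_model
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return embed + n_layers * per_layer + final_norm + lm_head

# Hypothetical tiny configuration, chosen only for illustration.
total = estimate_params(vocab_size=100, d_model=64, n_layers=2,
                        n_heads=4, n_kv_heads=2, d_ff=256)
print(f"estimated parameters: {total:,}")
```

Setting `n_kv_heads` equal to `n_heads` recovers standard multi-head attention, which makes the parameter saving from GQA easy to see.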

Files

src/tokenizer.py       Character tokenizer from scratch
src/model.py           Decoder-only Transformer, attention, GQA, MoE, LoRA, KV cache
src/data.py            Tiny datasets and batching
src/train_utils.py     Losses, training loop, generation helpers
train.py               Pretraining, SFT, DPO
generate.py            Generate text
inspect_model.py       Count parameters and explain config

Why a character tokenizer?

Production LLMs typically use subword tokenization such as BPE or SentencePiece. Here we use character-level tokenization because:

it is fully from scratch
it needs no external files
it is easy to debug
it runs on a laptop

The idea is the same:

text -> token ids -> embeddings -> transformer -> logits -> probabilities
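
The first step of that pipeline, text -> token ids, can be sketched as below. This is a minimal character tokenizer written for illustration; the class name `CharTokenizer` and its methods are assumptions and may not match what `src/tokenizer.py` actually defines.

```python
class CharTokenizer:
    """Maps each distinct character in a corpus to an integer id."""

    def __init__(self, text):
        chars = sorted(set(text))  # sorted so ids are deterministic
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self):
        return len(self.stoi)

    def encode(self, text):
        # Raises KeyError for characters that were not in the corpus.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

corpus = "hello world"
tok = CharTokenizer(corpus)
ids = tok.encode("hello")
print(tok.vocab_size, ids, tok.decode(ids))
```

The vocabulary here is just the set of characters seen in the corpus, which is why no external vocabulary file is needed. The trade-off is longer sequences than a subword tokenizer would produce for the same text.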

Expected result

This is a tiny educational model, not ChatGPT. After a short run, it should learn simple patterns from the included dataset and produce rough instruction-style answers.

The value of this project is not final performance but that the code clearly shows the full LLM lifecycle:

pretraining -> SFT -> preference tuning -> inference optimization
