
baby-gpt

Minimal char-level GPT in PyTorch, for educational purposes: config, data utils, a small causal Transformer, train.py, generate.py, and an interactive chat.py.

Setup

pip install -r requirements.txt

Workflow

  1. Put text in data/input.txt (the model only learns from this file).
  2. Train (writes a checkpoint — path depends on profile, see below):
python train.py

Quick dry run: MAX_ITERS=100 python train.py (PowerShell: $env:MAX_ITERS="100"; python train.py).
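MAX_ITERS is picked up from the environment; a minimal sketch of how such an override can work (the real handling lives in this repo's config.py / train.py and may differ):

# Hypothetical environment override for the training step count.
import os
MAX_ITERS = int(os.environ.get("MAX_ITERS", "5000"))   # the 5000 default is an assumption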

~30 minutes on a laptop (CPU, no GPU)

You cannot train the full-size model for thousands of steps in 30 minutes on a typical CPU-only laptop, but you can train a smaller model for a few hundred steps, which is enough to see noticeably clearer text than a 3-step smoke test.

  1. Enable the fast profile with BABY_GPT_FAST=1 (smaller width/depth, block_size 64, checkpoint saved to models/baby_gpt_fast.pt).
  2. Set the step count from a quick timing run: $env:BABY_GPT_FAST="1"; $env:MAX_ITERS="10"; python train.py, then divide the elapsed seconds by 10 to get seconds per step. For a ~30-minute budget, MAX_ITERS ≈ floor(1800 / seconds_per_step), often ~250–500 on a laptop CPU (yours will vary); see the worked example below.
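
A tiny worked example of that arithmetic, with a made-up timing (substitute your own measurement):

# Pick MAX_ITERS to fit a ~30-minute budget from a 10-step timing run.
elapsed_seconds = 45.0                      # example: the 10-step run took 45 s (made-up number)
seconds_per_step = elapsed_seconds / 10
budget_seconds = 30 * 60
max_iters = int(budget_seconds // seconds_per_step)
print(max_iters)                            # -> 400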

PowerShell (example):

$env:BABY_GPT_FAST="1"
$env:MAX_ITERS="400"
python train.py

Chat / generate after a fast train — use the same profile so the default checkpoint path matches:

$env:BABY_GPT_FAST="1"
python chat.py

Or explicitly: python chat.py --checkpoint models/baby_gpt_fast.pt

Realistic expectation: output will not match a big GPU run or ChatGPT; you are trading quality for time. For the big model + long runs, unset BABY_GPT_FAST and use a GPU or run overnight.
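
How the fast profile plausibly fits together, as a sketch only (the real field names and non-fast defaults are in config.py and may differ):

# Hypothetical BABY_GPT_FAST switch. The sizes and the default checkpoint path are
# assumptions, except block_size 64 vs 128 and models/baby_gpt_fast.pt, which the README states.
import os

FAST = os.environ.get("BABY_GPT_FAST") == "1"

block_size = 64 if FAST else 128
n_layer = 2 if FAST else 4
n_embd = 64 if FAST else 128
checkpoint_path = "models/baby_gpt_fast.pt" if FAST else "models/baby_gpt.pt"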

  3. Generate one continuation (prompt must use characters that appear in data/input.txt):
python generate.py --prompt "Part " --max-new 200
  4. Interactive “chat” (really repeated text continuation, not a real dialogue model):
python chat.py

If a prompt contains a character that never appeared in the training text, you’ll get an error for that input line.
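
That error comes from the usual char-level setup: the vocabulary is built only from data/input.txt, so an unseen character has no index. A minimal sketch of that mapping (not the repo's exact code):

# Hypothetical char-level vocabulary; the repo's actual encoding may differ.
text = open("data/input.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[ch] for ch in s]   # KeyError if a character never appeared in training

encode("Part ")   # works only if every character here occurs in input.txt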

Data

More text → better statistics (especially for “grammar-like” surface patterns). Put plain UTF-8 in data/input.txt (path in config.py).

Download a large public-domain corpus (recommended)

This repo includes scripts/fetch_corpus.py, which pulls several English novels from Project Gutenberg (with polite delays and a User-Agent header). Only use it if that matches how you want to source text.

python scripts/fetch_corpus.py

Optionally cap the corpus size (in characters) for faster experiments:

python scripts/fetch_corpus.py --max-chars 1500000

An existing data/input.txt is copied to data/input.txt.bak before it is overwritten.
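
The “polite delays + User-Agent” part boils down to something like this sketch (the URL, delay, and output handling are illustrative, not the script's actual values):

# Hypothetical polite downloader; not the repo's actual fetch_corpus.py.
import time
import requests

BOOK_URLS = ["https://www.gutenberg.org/cache/epub/1342/pg1342.txt"]   # example URL only
HEADERS = {"User-Agent": "baby-gpt corpus fetcher (contact: you@example.com)"}

texts = []
for url in BOOK_URLS:
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    texts.append(resp.text)
    time.sleep(2.0)   # polite delay between requests

# The real script backs up an existing data/input.txt first (see above).
open("data/input.txt", "w", encoding="utf-8").write("\n".join(texts))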

Training defaults in config.py assume a large corpus (block_size 128, a deeper model). If you only have a tiny file, lower block_size, n_layer, and n_embd; otherwise training may be slow for little gain.

Copy-paste your own books

You can paste econometrics texts or other material into data/input.txt, but copyright is your responsibility (keep such copies out of git if unsure). utils.clean_text (enabled via Config.clean_corpus) normalizes whitespace and Unicode from messy PDF exports.
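
What “normalizes whitespace and Unicode” typically means, as a sketch (the real utils.clean_text may do more or less than this):

# Hypothetical cleanup along the lines of utils.clean_text; not the repo's exact code.
import re
import unicodedata

def clean_text(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw)   # fold odd Unicode forms (ligatures, etc.)
    text = text.replace("\u00ad", "")           # drop soft hyphens common in PDF exports
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap consecutive blank lines
    return text.strip()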
