Educational implementation of a GPT-style language model in PyTorch, with training, instruction fine-tuning, and basic inference utilities.
- GPT-style Transformer model and layers (src/models/)
- Tokenization and dataloaders (src/data/)
- Training loop, evaluation, and text generation (src/training/)
- YAML-based configuration (configs/)
- TensorBoard logging (runs/) and checkpoint outputs (artifacts/)
- Python 3.11+
- uv for environment and dependency management
Install dependencies:
uv sync

The training scripts expect text data under the top-level data/ directory.
- Pretraining data: scripts/train.py reads data/fineweb_samples.txt.
- Instruction data: scripts/train_instruct.py reads data/alpaca_data.json.
To generate a FineWeb text file via Hugging Face Datasets:
uv run python scripts/data/generate_data.py

This generates data/fineweb.txt, whereas scripts/train.py reads data/fineweb_samples.txt. To use the generated data without editing code, create data/fineweb_samples.txt yourself, for example by copying or sampling from data/fineweb.txt.
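One way to do the sampling step is sketched below. It is not part of the repository: the function name `sample_lines`, the sample size, and the seed are illustrative choices, and it assumes data/fineweb.txt can be treated as line-oriented text. It uses reservoir sampling so the source file is read once and never held in memory in full.

```python
import random
from pathlib import Path


def sample_lines(src: Path, dst: Path, k: int, seed: int = 0) -> int:
    """Write up to k uniformly sampled lines from src to dst; return the count written."""
    rng = random.Random(seed)
    reservoir: list[str] = []
    with src.open(encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Normalize a final line that lacks a trailing newline.
            if not line.endswith("\n"):
                line += "\n"
            if i < k:
                reservoir.append(line)
            else:
                # Reservoir sampling: keep line i with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line
    dst.parent.mkdir(parents=True, exist_ok=True)
    with dst.open("w", encoding="utf-8") as f:
        f.writelines(reservoir)
    return len(reservoir)


if __name__ == "__main__":
    source = Path("data/fineweb.txt")
    if source.exists():
        # k=10_000 is an arbitrary illustrative size, not a project default.
        written = sample_lines(source, Path("data/fineweb_samples.txt"), k=10_000)
        print(f"wrote {written} lines")
```

If the whole file fits your training budget, a plain copy of data/fineweb.txt to data/fineweb_samples.txt works just as well.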
Run the default training configuration:
uv run python -m scripts.train

Select a config file via the CFG environment variable:
CFG=configs/gpt_124m.yaml uv run python -m scripts.train

Outputs:
- Checkpoints: artifacts/
- TensorBoard logs: runs/
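For readers unfamiliar with selecting a config through an environment variable, the CFG mechanism mentioned above can be pictured roughly as follows. This is a minimal sketch, not the code in scripts/train.py: the helper name `resolve_config_path` and the fallback path `configs/default.yaml` are hypothetical, and the actual default is whatever the script defines.

```python
import os
from pathlib import Path

# Hypothetical fallback for illustration; check scripts/train.py for the real default.
DEFAULT_CFG = "configs/default.yaml"


def resolve_config_path(env=None, default: str = DEFAULT_CFG) -> Path:
    """Return the config path named by CFG, or the default when CFG is unset or empty."""
    env = os.environ if env is None else env
    return Path(env.get("CFG") or default)


if __name__ == "__main__":
    print(resolve_config_path())
```

Because the variable is set inline on the command line, it applies to that single run and does not persist in the shell.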
View TensorBoard:
uv run tensorboard --logdir runs

The instruction fine-tuning entrypoint is scripts/train_instruct.py. It expects:
- Instruction dataset at data/alpaca_data.json
- A base model checkpoint in artifacts/ (see the path defined in scripts/train_instruct.py)
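Since both inputs must exist before fine-tuning starts, a small pre-flight check can save a failed run. The sketch below is an illustration, not a repository script: `missing_inputs` is a hypothetical helper, and the checkpoint path is taken from the command line because the real one is defined inside scripts/train_instruct.py.

```python
import sys
from pathlib import Path


def missing_inputs(paths) -> list[Path]:
    """Return the subset of paths that do not exist as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Pass the checkpoint path(s) used by scripts/train_instruct.py as arguments.
    required = ["data/alpaca_data.json", *sys.argv[1:]]
    for path in missing_inputs(required):
        print(f"missing: {path}")
```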
Run:
CFG=configs/gpt2_35m_4heads_12layers_finetuning.yaml uv run python -m scripts.train_instruct

The inference entrypoint is scripts/infere.py. It loads a checkpoint from artifacts/ (see the path defined in the script).
Run:
uv run python -m scripts.infere

- configs/: model and training YAML configurations
- scripts/: CLI entrypoints (training, fine-tuning, inference)
- src/: library code (model, data, training, utilities)
- notebooks/: experiments and analysis notebooks
- tests/: ad hoc test and utility scripts

Format and lint with Ruff:
uv run ruff format .
uv run ruff check .