This project pretrains an LLM with the GPT-2 124M architecture from scratch on a small dataset stored in ./data. The trained model can generate text up to a given number of tokens.
So that it can be trained and run on low-end, CPU-only machines, the GPT-2 architecture is pretrained with a context size of 256 on a small short-story text, ./data/the-verdict.txt. Alternatively, you can use the OpenAI GPT-2 pretrained weights, as described in the section below. The model architecture configuration is given in model_info.txt.
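For orientation, here is a minimal sketch of what such a configuration might look like. The values are the commonly published GPT-2 124M hyperparameters with the context length shortened to 256 as described above. The key names and the head_dim helper are hypothetical, and the actual contents of model_info.txt may differ.

```python
# Hypothetical configuration sketch; key names are assumptions, not taken
# from model_info.txt. Values are the standard GPT-2 124M hyperparameters,
# except context_length, which this project reduces to 256.
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # GPT-2 BPE vocabulary size
    "context_length": 256,  # shortened from GPT-2's original 1024
    "emb_dim": 768,         # embedding dimension
    "n_heads": 12,          # attention heads per layer
    "n_layers": 12,         # transformer blocks
}


def head_dim(config):
    """Return the per-head dimension, checking that the heads divide evenly."""
    emb_dim, n_heads = config["emb_dim"], config["n_heads"]
    if emb_dim % n_heads != 0:
        raise ValueError("emb_dim must be divisible by n_heads")
    return emb_dim // n_heads


print("per-head dimension:", head_dim(GPT_CONFIG_124M))
```

Shortening the context length reduces the size of the positional embedding table and the cost of attention, which is what makes CPU-only training practical here.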
The dataset consists of 5145 tokens, of which 4608 are used in the training set.
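One way these numbers could arise is sketched below. It assumes a 90/10 train/validation split and non-overlapping windows of 256 tokens, where each window needs one extra token for the shifted target. Neither assumption is stated in this README, and train.py may do it differently; the sketch only shows that these choices reproduce the 4608 figure.

```python
TOTAL_TOKENS = 5145     # from the README
CONTEXT_LENGTH = 256    # from the README
TRAIN_RATIO = 0.90      # assumption: not stated in the README
STRIDE = CONTEXT_LENGTH # assumption: non-overlapping windows


def count_windows(num_tokens, context_length, stride):
    """Count input windows that still leave room for a one-token-shifted target."""
    return len(range(0, num_tokens - context_length, stride))


split_idx = int(TRAIN_RATIO * TOTAL_TOKENS)

train_windows = count_windows(split_idx, CONTEXT_LENGTH, STRIDE)
val_windows = count_windows(TOTAL_TOKENS - split_idx, CONTEXT_LENGTH, STRIDE)

train_tokens = train_windows * CONTEXT_LENGTH
val_tokens = val_windows * CONTEXT_LENGTH

print("training tokens:", train_tokens)
print("validation tokens:", val_tokens)
```

Under these assumptions, the tokens at the end of each split that do not fill a whole window are simply dropped, which is why the training count is a multiple of 256.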
Prerequisites are Python <= 3.13 and the uv package manager; setup instructions can be found here.
- Clone this repository
  Either use the download-as-zip option or run the following command in a CLI tool:
  git clone https://github.com/lukmanulhakeem97/llm-pretraining.git
- Create a Python environment and install dependencies
  Create the environment (the name is optional):
  uv venv [name]
  Navigate to the cloned repo directory and install the dependencies listed in the pyproject.toml file:
  cd llm-pretraining
  uv sync
- Activate the venv
  .\.venv\Scripts\activate
Generate text:
- Download the pretrained model.pth from my Hugging Face Hub and place it in the cloned llm-pretraining directory.
- Run inference.py with any starting prompt
  Using model.pth:
  uv run inference.py "Be now, then will be "
  Using the OpenAI GPT-2 pretrained weights:
  uv run inference.py "Be now, then will be " --load_openaigpt2_weight="yes"
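To make the command-line shape above concrete, here is a minimal sketch of how a script could parse a positional prompt plus the --load_openaigpt2_weight option using argparse. It is not taken from inference.py: the helper names, the default of "no", and the restriction to "yes"/"no" are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the commands shown above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Generate text from a prompt.")
    parser.add_argument("prompt", help="starting text for generation")
    parser.add_argument(
        "--load_openaigpt2_weight",
        default="no",            # assumption: local model.pth is the default
        choices=["yes", "no"],   # assumption: only these two values are accepted
        help="use the OpenAI GPT-2 pretrained weights instead of model.pth",
    )
    return parser


def parse_cli(argv):
    """Return (prompt, use_openai_weights) for a list of CLI arguments."""
    args = build_parser().parse_args(argv)
    return args.prompt, args.load_openaigpt2_weight == "yes"


if __name__ == "__main__":
    prompt, use_openai = parse_cli(
        ["Be now, then will be ", "--load_openaigpt2_weight=yes"]
    )
    print(repr(prompt), use_openai)
```

Note that the shell strips the quotes around "yes", so the script receives --load_openaigpt2_weight=yes as a single argument.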
Pretraining:
- Run the following command, which generates model.pth:
  uv run train.py