## This project is for learning the nanochat concept
Inspired by Karpathy's amazing work on building a complete LLM pipeline.
- Original Project: karpathy/nanochat
- Goal: Replicate the end-to-end training (Pretrain -> SFT -> RL) for ~$100 in 4 hours.
Learning by building...

Please use uv as the package manager; it is fast and easy to use.

    pip install uv
    uv venv nanochat
    source nanochat/bin/activate

### Basic Tokenizer
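For orientation, here is a minimal sketch of what a byte-level BPE tokenizer with `train()`, `encode()`, and `decode()` might look like. The class name `SketchTokenizer`, the helper functions, and the `train(text, vocab_size)` signature are assumptions made for illustration; they are not taken from `tokenizor/bpe.py`, which may work differently.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


class SketchTokenizer:
    """Hypothetical stand-in for BasicTokenizor (assumed API, illustration only)."""

    def __init__(self):
        self.merges = {}                                  # (id, id) -> merged id
        self.vocab = {i: bytes([i]) for i in range(256)}  # id -> raw bytes

    def train(self, text, vocab_size):
        """Learn up to (vocab_size - 256) merges from the UTF-8 bytes of `text`."""
        ids = list(text.encode("utf-8"))
        for new_id in range(256, vocab_size):
            counts = pair_counts(ids)
            if not counts:
                break  # fewer than two tokens left, nothing to merge
            pair = counts.most_common(1)[0][0]
            ids = merge(ids, pair, new_id)
            self.merges[pair] = new_id
            self.vocab[new_id] = self.vocab[pair[0]] + self.vocab[pair[1]]

    def encode(self, text):
        """Turn text into token ids by replaying merges in the order learned."""
        ids = list(text.encode("utf-8"))
        for pair, new_id in self.merges.items():  # dicts keep insertion order
            ids = merge(ids, pair, new_id)
        return ids

    def decode(self, ids):
        """Turn token ids back into text."""
        data = b"".join(self.vocab[i] for i in ids)
        return data.decode("utf-8", errors="replace")


tok = SketchTokenizer()
tok.train("aaabdaaabac", vocab_size=259)  # 256 byte tokens + 3 learned merges
ids = tok.encode("aaabdaaabac")
print(len(ids), tok.decode(ids))
```

Starting from raw bytes means any input (English, Chinese, emoji, punctuation) can be encoded and decoded without an unknown-token fallback; the merges only make the sequence shorter.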
Example usage:

    tokenizor = BasicTokenizor()
    text = """
    en/chinese/emoji/.../[]/:/,/./?/>/</|/{/}/+/_ /etc.....
    """

Methods:
- encode()
- decode()

Train the tokenizer:

    tokenizor.train()

Run the script with uv:

    uv run tokenizor/bpe.py

### Regex Tokenizer
This tokenizer uses the GPT-4 split pattern to split the text into chunks first, which helps preserve the meaning of the sequence.

Core patterns:

    GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
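The chunking idea can be sketched with the standard library alone. The patterns above use `\p{L}` and `\p{N}`, which Python's built-in `re` module does not support, so they are normally compiled with the third-party `regex` package. To keep this example dependency-free, it uses `SIMPLE_SPLIT_PATTERN`, a simplified approximation of the GPT-2 pattern that is an assumption of this sketch, as are the two helper function names. It shows that byte pairs are counted inside each chunk only, so a merge can never join the end of one word to the space or punctuation that follows it.

```python
import re
from collections import Counter

# ASSUMPTION: simplified stand-in for GPT2_SPLIT_PATTERN that works with stdlib `re`.
# [^\W\d_] approximates \p{L} (letters) and \d approximates \p{N} (numbers).
SIMPLE_SPLIT_PATTERN = (
    r"""'(?:[sdmt]|ll|ve|re)| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"""
)


def split_chunks(text):
    """Split text into chunks; joining the chunks gives back the original text."""
    return re.findall(SIMPLE_SPLIT_PATTERN, text)


def pair_counts_within_chunks(chunks):
    """Count adjacent byte pairs inside each chunk, never across chunk boundaries."""
    counts = Counter()
    for chunk in chunks:
        ids = list(chunk.encode("utf-8"))
        counts.update(zip(ids, ids[1:]))
    return counts


text = "Hello world123 how's it?"
chunks = split_chunks(text)
counts = pair_counts_within_chunks(chunks)

print(chunks)
print(counts[(ord("o"), ord(" "))])  # pair spanning "Hello" and " world"
```

With these counts in place of whole-text counts, the rest of BPE training stays the same: pick the most frequent pair, merge it inside every chunk, and repeat.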