nano-chat-study

Nanochat study

## A project for learning the concepts behind nanochat

Inspired by Karpathy's amazing work on building a complete LLM pipeline.

Reference

  • Original Project: karpathy/nanochat
  • Goal: Replicate the end-to-end training (Pretrain -> SFT -> RL) for ~$100 in 4 hours.

Learning by building...


Please use uv as the package manager; it is very fast and easy to use.

pip install uv
uv venv nanochat
source nanochat/bin/activate


Basic Tokenizer

tokenizor = BasicTokenizor()
text = """
en/chinese/emoji/.../[]/:/,/./?/>/</|/{/}/+/_ /etc.....

"""

- encode(): convert text into a list of token ids
- decode(): convert a list of token ids back into text


tokenizor.train()
uv run tokenizor/bpe.py

Run this command to train your basic tokenizer.
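To make the train / encode / decode cycle concrete, here is a minimal sketch of a byte-level BPE tokenizer in the style of minbpe. It is not the code in tokenizor/bpe.py: the class name `SketchTokenizer`, the helpers `count_pairs` and `merge`, and the `vocab_size` argument are all assumptions made for illustration.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


class SketchTokenizer:
    """Hypothetical byte-level BPE tokenizer, for illustration only."""

    def __init__(self):
        self.merges = {}  # (id, id) -> new id, stored in training order
        self.vocab = {i: bytes([i]) for i in range(256)}  # id -> raw bytes

    def train(self, text, vocab_size):
        # Start from raw UTF-8 bytes, so any text (English, Chinese,
        # emoji, punctuation) is representable without special cases.
        ids = list(text.encode("utf-8"))
        for new_id in range(256, vocab_size):
            pairs = count_pairs(ids)
            if not pairs:
                break  # nothing left to merge
            pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
            ids = merge(ids, pair, new_id)
            self.merges[pair] = new_id
            self.vocab[new_id] = self.vocab[pair[0]] + self.vocab[pair[1]]

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        while len(ids) >= 2:
            pairs = count_pairs(ids)
            # Apply merges in the order they were learned during training.
            pair = min(pairs, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break  # no learned merge applies any more
            ids = merge(ids, pair, self.merges[pair])
        return ids

    def decode(self, ids):
        data = b"".join(self.vocab[i] for i in ids)
        return data.decode("utf-8", errors="replace")


tok = SketchTokenizer()
sample = "hello hello 你好 你好 🙂🙂 {a+b} {a+b}"
tok.train(sample, vocab_size=256 + 10)  # learn 10 merges on top of 256 bytes
ids = tok.encode(sample)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(tok.decode(ids) == sample)
```

Working on bytes rather than characters keeps the base vocabulary fixed at 256 entries, and the round trip through `encode()` and `decode()` returns the original text.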


Regex Tokenizer

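This tokenizer uses the GPT-4 split pattern to break the text into chunks before merging, so that merges stay within meaningful units of the sequence (words, numbers, punctuation).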

minbpe

core

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
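The patterns above rely on `\p{L}` and `\p{N}`, which Python's built-in `re` module does not support, so they are normally compiled with the third-party `regex` package. The sketch below shows the chunking idea using only the standard library, with an approximation of the GPT-2 pattern: `[^\W\d_]` stands in for `\p{L}` and `\d` for `\p{N}`. The names `APPROX_GPT2_SPLIT_PATTERN` and `split_chunks` are hypothetical, and the approximation may differ from the real pattern on some Unicode characters.

```python
import re

# Stdlib-only approximation of GPT2_SPLIT_PATTERN (an assumption for this
# sketch): [^\W\d_] ~ \p{L}, \d ~ \p{N}, (?:[^\s\w]|_) ~ [^\s\p{L}\p{N}]
APPROX_GPT2_SPLIT_PATTERN = (
    r"""'(?:[sdmt]|ll|ve|re)| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"""
)
_compiled = re.compile(APPROX_GPT2_SPLIT_PATTERN)


def split_chunks(text):
    """Split text into chunks; BPE merges are then applied inside each chunk."""
    return _compiled.findall(text)


chunks = split_chunks("Hello world 123, it's")
print(chunks)
# Each chunk is converted to bytes on its own, so a merge can never join
# the end of one chunk to the start of the next.
chunk_bytes = [list(chunk.encode("utf-8")) for chunk in chunks]
print(chunk_bytes[0])
```

Because the chunks concatenate back to the original text, splitting loses no information; it only restricts which byte pairs are allowed to merge.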