lora-sys/nanochat-studay

Nanochat study

## This project is for learning the nanochat concept

Inspired by Karpathy's amazing work on building a complete LLM pipeline.

Reference

  • Original Project: karpathy/nanochat
  • Goal: Replicate the end-to-end training (Pretrain -> SFT -> RL) for ~$100 in 4 hours.

Learning by building...


Please use uv as the package manager; it is fast and easy to use.

pip install uv
uv venv nanochat
source nanochat/bin/activate


Basic Tokenizer

tokenizor = BasicTokenizor()
text = """
en/chinese/emoji/.../[]/:/,/./?/>/</|/{/}/+/_ /etc.....

"""

The tokenizer provides three methods:

- train()
- encode()
- decode()

To train your basic tokenizer, run:

uv run tokenizor/bpe.py
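
As a rough illustration of what train(), encode(), and decode() do in a byte-level BPE tokenizer, here is a minimal sketch. The class name `SketchTokenizer`, the helpers, and the `vocab_size` parameter are assumptions made for this example; the code in tokenizor/bpe.py may be organized differently.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


class SketchTokenizer:
    """Hypothetical byte-level BPE tokenizer, for illustration only."""

    def __init__(self):
        self.merges = {}  # (id, id) -> new id, kept in the order learned
        self.vocab = {i: bytes([i]) for i in range(256)}  # id -> bytes

    def train(self, text, vocab_size):
        ids = list(text.encode("utf-8"))
        for new_id in range(256, vocab_size):
            counts = get_pair_counts(ids)
            if not counts:
                break  # nothing left to merge
            pair = max(counts, key=counts.get)  # most frequent pair
            ids = merge(ids, pair, new_id)
            self.merges[pair] = new_id
            self.vocab[new_id] = self.vocab[pair[0]] + self.vocab[pair[1]]

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        # Replay the merges in the order they were learned.
        for pair, new_id in self.merges.items():
            ids = merge(ids, pair, new_id)
        return ids

    def decode(self, ids):
        data = b"".join(self.vocab[i] for i in ids)
        return data.decode("utf-8", errors="replace")


tok = SketchTokenizer()
sample = "hello hello 你好 🙂 hello"
tok.train(sample, vocab_size=260)  # 256 byte tokens + 4 merges
assert tok.decode(tok.encode(sample)) == sample
```

Working on raw UTF-8 bytes means English, Chinese, emoji, and punctuation all round-trip without an "unknown token" case, which is why a mixed sample text is a good training input.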


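Regex Tokenizer

This tokenizer uses the GPT-4 split pattern to split the text into chunks before BPE merges are applied. Merges never cross chunk boundaries, which helps each token keep a coherent meaning within the sequence.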

Reference: minbpe

Core split patterns:

GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
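
Both patterns above use `\p{L}` and `\p{N}`, which Python's standard `re` module does not support; they need the third-party `regex` package. To show the chunking idea with the standard library only, the sketch below uses an approximation of GPT2_SPLIT_PATTERN. The name `APPROX_SPLIT_PATTERN` and both helper functions are assumptions made for this example, and the character classes only roughly match the Unicode categories of the original.

```python
import re
from collections import Counter

# Approximation (assumption): [^\W\d_] stands in for \p{L}, \d for \p{N}.
APPROX_SPLIT_PATTERN = re.compile(
    r"'(?:[sdmt]|ll|ve|re)| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)


def split_chunks(text):
    """Split text into chunks; joining the chunks gives back the text."""
    return APPROX_SPLIT_PATTERN.findall(text)


def chunked_pair_counts(text):
    """Count adjacent byte pairs inside each chunk, never across chunks."""
    counts = Counter()
    for chunk in split_chunks(text):
        ids = list(chunk.encode("utf-8"))
        counts.update(zip(ids, ids[1:]))
    return counts


chunks = split_chunks("Hello world, it's 2024!")
print(chunks)  # ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Because pairs are counted per chunk, a merge such as "letter followed by space" across a word boundary can never be learned; the leading space stays attached to the start of the next word instead.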
