The homebrew-scale LLM
Warning
Work in progress; nothing useful here yet.
I'd like to gain practical experience with transformers, particularly by understanding their architecture and real-world applications, with a focus on small-scale LLMs. To achieve this, I decided to create a tiny LLM. My goal is to integrate it into web applications, games, and iOS apps that interest me.
Build a small language model that generates grammatically correct sentences.
Keep it simple, but not simplistic.
Name | License | Notes |
---|---|---|
Wikimedia Wikipedia | CC BY-SA 3.0 | |
livedoor News Corpus (livedoor ニュースコーパス) | CC BY-ND 2.1 JP | Data source: llm-book/livedoor-news-corpus |
The tokenizer is trained with SentencePiece + Unigram on the Japanese and English corpora.
byte_fallback=True to avoid OOV (out-of-vocabulary) tokens
vocab_size is 32,000
(1) Generate the corpus
Running the following command generates build/corpus.txt.
$ poetry run nue build-corpus
(2) Train the tokenizer
Running the following command generates build/tokenizer.model and build/tokenizer.vocab.
$ poetry run nue train-tokenizer
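The internals of this command are not shown here. As a rough sketch, the same settings could be passed to the SentencePiece Python API as below. The function names `tokenizer_training_args` and `train_tokenizer` are hypothetical, and the paths are assumptions taken from the file names mentioned in this README; the actual `nue` implementation may differ.

```python
from pathlib import Path


def tokenizer_training_args(corpus: str = "build/corpus.txt",
                            model_prefix: str = "build/tokenizer") -> dict:
    """Collect the SentencePiece options described in this README."""
    return {
        "input": corpus,               # assumed: output of the build-corpus step
        "model_prefix": model_prefix,  # yields <prefix>.model and <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": 32000,
        "byte_fallback": True,         # fall back to UTF-8 byte pieces for OOV text
    }


def train_tokenizer(**overrides) -> None:
    """Train the tokenizer; requires the `sentencepiece` package."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    args = {**tokenizer_training_args(), **overrides}
    Path(args["model_prefix"]).parent.mkdir(parents=True, exist_ok=True)
    spm.SentencePieceTrainer.train(**args)
```

Calling `train_tokenizer()` with an existing build/corpus.txt should write build/tokenizer.model and build/tokenizer.vocab; keyword overrides such as `vocab_size=8000` can be handy for quick experiments on a small corpus.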
$ poetry run nue train
Apache License 2.0
The name Nue (鵺, pronounced "noo-eh") originates from Japanese folklore, referring to a mythical creature with the head of a monkey, body of a tanuki (raccoon dog), legs of a tiger, and tail of a snake.
Inspired by this creature, this project serves as a personal exploration into building language models, combining newly learned, diverse techniques into one cohesive yet eclectic model. Additionally, "Nue" can be read similarly to "new," symbolizing this as a new, homebrew-scale LLM crafted from scratch.
Excellent articles and papers that I've read: