ishikawa/nue

nue

The homebrew-scale LLM 🐵🦝🐯🐍

Warning

Work in progress: nothing useful here yet.

I want practical experience with transformers: understanding their architecture and real-world applications, with a focus on small-scale LLMs. To that end, I am building a tiny LLM, and I plan to integrate it into the web applications, games, and iOS apps that interest me.

Goal

Build a small language model that generates grammatically correct sentences.

Philosophy

Keep it simple, but not simplistic.

Dataset

| Name | License | Notes |
| --- | --- | --- |
| Wikimedia Wikipedia | CC BY-SA 3.0 | |
| livedoor News Corpus (livedoor ニュースコーパス) | CC BY-ND 2.1 JP | Data source: llm-book/livedoor-news-corpus |

Tokenizer

The tokenizer is trained with SentencePiece + Unigram on a Japanese and English corpus.

  • byte_fallback=True to avoid OOV (out-of-vocabulary) tokens
  • vocab_size is 32,000
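
As a rough illustration of what byte_fallback=True provides: when a character is not covered by the learned vocabulary, SentencePiece can emit its UTF-8 bytes as reserved pieces of the form <0xNN> instead of a single unknown token. The helper below is hypothetical and only mimics that decomposition in plain Python. It does not call SentencePiece and is not code from this repository.

```python
def byte_fallback_pieces(text: str) -> list[str]:
    """Spell out text as byte pieces in the <0xNN> form used for byte fallback."""
    # Each UTF-8 byte maps to one reserved piece, so any input stays representable.
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


# A rare kanji that a small vocabulary may not contain (assumed for illustration).
print(byte_fallback_pieces("鵺"))  # ['<0xE9>', '<0xB5>', '<0xBA>']
```

The trade-off is sequence length: a character that falls back costs one token per byte (three for most kanji) rather than one token in total.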

(1) Generate the corpus

Running the following command generates build/corpus.txt.

$ poetry run nue build-corpus
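
How build-corpus assembles the file is not described here. SentencePiece reads plain text with one sentence per line, so a minimal sketch of that output step might look like the following. The function name write_corpus, its whitespace handling, and the in-memory documents argument are all assumptions for illustration, not the project's actual implementation.

```python
from pathlib import Path
from typing import Iterable


def write_corpus(documents: Iterable[str], path: str = "build/corpus.txt") -> int:
    """Write one non-empty text line per corpus line and return the line count."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create build/ if it is missing
    count = 0
    with out.open("w", encoding="utf-8") as f:
        for doc in documents:
            for line in doc.splitlines():
                line = " ".join(line.split())  # collapse runs of whitespace
                if line:  # skip blank lines
                    f.write(line + "\n")
                    count += 1
    return count
```

Calling write_corpus(texts) with an iterable of article strings would write them to build/corpus.txt. Streaming from an iterable keeps memory use flat even for a Wikipedia-sized input.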

(2) Train the tokenizer

Running the following command generates build/tokenizer.model and build/tokenizer.vocab.

$ poetry run nue train-tokenizer
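
For readers who want to reproduce this step without the CLI, the settings in this section (Unigram, byte_fallback=True, vocab_size 32,000) map onto the sentencepiece Python package roughly as sketched below. The two function names are hypothetical, and the command's real implementation may pass additional options that this README does not mention.

```python
def tokenizer_training_options(corpus: str = "build/corpus.txt",
                               model_prefix: str = "build/tokenizer") -> dict:
    """Collect SentencePiece trainer options matching the settings in this section."""
    return {
        "input": corpus,
        "model_prefix": model_prefix,  # trainer writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": 32000,
        "byte_fallback": True,
    }


def train_tokenizer(**overrides) -> None:
    """Train the tokenizer; requires the sentencepiece package to be installed."""
    # Imported lazily so the options can be inspected without the dependency.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**{**tokenizer_training_options(), **overrides})
```

Calling train_tokenizer() after the corpus exists should produce the same two files as the command above. Keyword overrides such as train_tokenizer(vocab_size=8000) are convenient for quick experiments on a small corpus.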

Training

$ poetry run nue train

License

Apache License 2.0

Name origin 🐵🦝🐯🐍

The name Nue (鵺, pronounced "noo-eh") originates from Japanese folklore, referring to a mythical creature with the head of a monkey, body of a tanuki (raccoon dog), legs of a tiger, and tail of a snake.

Inspired by this creature, this project serves as a personal exploration into building language models, combining newly learned, diverse techniques into one cohesive yet eclectic model. Additionally, "Nue" can be read similarly to "new," symbolizing this as a new, homebrew-scale LLM crafted from scratch.
