BPE Tokenizer for LLMs.
The tokenizer was trained on BookCorpus text. It lowercases all input, so the resulting model is uncased.
Unique vocabulary size: 35,000
See bpe_train.py for the training script.
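The exact contents of bpe_train.py are not shown here, but a minimal sketch of training an uncased BPE tokenizer with the Hugging Face `tokenizers` library might look like the following. The corpus file name and special tokens are assumptions for illustration, not necessarily what the script uses.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers

# Build a BPE tokenizer that lowercases input (uncased behavior).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on a plain-text dump of BookCorpus (file name assumed).
trainer = trainers.BpeTrainer(
    vocab_size=35_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["bookcorpus.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")
```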
Versions:
0.001 : Base tokenizer trained on BookCorpus; 150,000 unique tokens.
next : Even larger vocabulary.
next : Domain-adaptive versions.
Only a small number of tests have been run so far. See test.ipynb for examples.
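Assuming the tokenizer was saved in the Hugging Face `tokenizers` JSON format as sketched above, a quick encoding test like those in test.ipynb could look like this (the file name is an assumption):

```python
from tokenizers import Tokenizer

# Load the trained tokenizer (path assumed).
tokenizer = Tokenizer.from_file("tokenizer.json")

enc = tokenizer.encode("The Quick Brown Fox")
print(enc.tokens)  # lowercased subword pieces, since the model is uncased
print(enc.ids)     # corresponding vocabulary ids
```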
This tokenizer was built mainly for tineeBERT, my current project. If you find issues, please report them in the Issues tab.