BPE Tokenizer for LLMs.
The tokenizer was trained on BookCorpus text. It lowercases all input, so the resulting model is uncased.
Unique vocabulary size: 35,000
See bpe_train.py for the training script.
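The exact contents of bpe_train.py are not shown here, but a minimal sketch of training an uncased BPE tokenizer with the Hugging Face `tokenizers` library might look like the following. The corpus file name and special tokens are assumptions for illustration, not necessarily what the script uses.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers

# Build a BPE tokenizer that lowercases input (uncased behavior).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on a plain-text dump of BookCorpus (file name assumed).
trainer = trainers.BpeTrainer(
    vocab_size=35_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["bookcorpus.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")
```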
Versions:
0.001 : Base tokenizer trained on BookCorpus; 150,000 unique tokens.
next : Even larger vocabulary.
next : Domain-adaptive versions.
Only a small number of tests have been run so far. See test.ipynb for examples.
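Assuming the tokenizer was saved in the Hugging Face `tokenizers` JSON format as sketched above, a quick encoding test like those in test.ipynb could look like this (the file name is an assumption):

```python
from tokenizers import Tokenizer

# Load the trained tokenizer (path assumed).
tokenizer = Tokenizer.from_file("tokenizer.json")

enc = tokenizer.encode("The Quick Brown Fox")
print(enc.tokens)  # lowercased subword pieces, since the model is uncased
print(enc.ids)     # corresponding vocabulary ids
```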
This tokenizer was built mainly for tineeBERT, my current project. If you find issues, please report them in the Issues tab.