 Implements: **cleaning → fidel decomposition → BPE training/application → detokenization**, with a **Cython core for speed**.

 ---
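The decomposed tokens in the usage example (for instance `ሰእወኢ` for `ስዊ`) suggest that each fidel is split into its first-order consonant plus a vowel letter from the `አ` row. The following is a minimal sketch of that idea based only on Unicode code point arithmetic. The function name `decompose` and the range constants are assumptions made for illustration; the package's Cython core may handle this differently, especially for labialized forms.

```python
# Hypothetical sketch of fidel decomposition; not the package's implementation.
ETHIOPIC_START = 0x1200  # first Ethiopic syllable (ሀ)
ETHIOPIC_END = 0x135A    # last syllable in the basic Ethiopic block
VOWEL_ROW = 0x12A0       # አ row: its first seven orders serve as vowel markers


def decompose(text: str) -> str:
    """Split each fidel into a base consonant followed by a vowel marker."""
    out = []
    for ch in text:
        cp = ord(ch)
        order = (cp - ETHIOPIC_START) % 8  # position within the 8-slot row
        in_vowel_row = VOWEL_ROW <= cp < VOWEL_ROW + 8
        if ETHIOPIC_START <= cp <= ETHIOPIC_END and not in_vowel_row and order < 7:
            out.append(chr(cp - order))         # first-order (base) consonant
            out.append(chr(VOWEL_ROW + order))  # vowel marker for this order
        else:
            # Bare vowels, eighth-slot forms, punctuation, and other scripts
            # pass through unchanged in this sketch.
            out.append(ch)
    return "".join(out)


print(decompose("ስዊድን"))
```

Treating the eighth slot of each row as pass-through keeps the sketch short; a full implementation would need an explicit rule for those characters.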
-## What's new in v0.2.2
+## What's new in v0.2.3
 1. **Pretrained tokenizer loading**

 - You can now load a pretrained tokenizer directly:

 ```python
 from amharic_tokenizer import AmharicTokenizer
-tok = AmharicTokenizer.load("amh_bpe_v0.2.2")
+tok = AmharicTokenizer.load("amh_bpe_v0.2.3")
 ```
-This version includes a pretrained model (`amh_bpe_v0.2.2`) that can be used immediately without any additional setup and training.
+This version includes a pretrained model (`amh_bpe_v0.2.3`) that can be used immediately without any additional setup and training.

 2. **Full token-to-ID and ID-to-token functionality**
 - Added complete round-trip processing methods:
 ```python
 tokens = tok.tokenize(text)
-ids = tok.convert_tokens_to_ids(tokens)
-tokens_from_ids = tok.convert_ids_to_tokens(ids)
+ids = tok.encode(tokens)
 detokenized = tok.detokenize(tokens)
 ```
 The tokenizer now supports seamless conversion between tokens and IDs, ensuring full consistency between tokenization and detokenization.
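To make the round-trip claim concrete, here is a small sketch of a token-to-ID mapping whose `encode` and `decode` are exact inverses. `ToyVocab` is a hypothetical stand-in written for this illustration; only the method names mirror the calls shown in this diff, and the IDs it assigns are unrelated to the pretrained model's vocabulary.

```python
from typing import Dict, List


class ToyVocab:
    """Hypothetical stand-in for a tokenizer vocabulary; not the package's class."""

    def __init__(self, tokens: List[str]) -> None:
        # Deduplicate while keeping first-seen order so IDs are stable.
        self.id_to_token: List[str] = list(dict.fromkeys(tokens))
        self.token_to_id: Dict[str, int] = {
            tok: idx for idx, tok in enumerate(self.id_to_token)
        }

    def encode(self, tokens: List[str]) -> List[int]:
        """Map each token to its integer ID."""
        return [self.token_to_id[tok] for tok in tokens]

    def decode(self, ids: List[int]) -> List[str]:
        """Map each ID back to its token."""
        return [self.id_to_token[idx] for idx in ids]


tokens = ["ሰእወኢ", "ደ", "እነ", "እ", "<eow>", " ", "ከአ"]
vocab = ToyVocab(tokens)
ids = vocab.encode(tokens)
print(ids)
print(vocab.decode(ids) == tokens)
```

Because the two lookup tables are built from the same ordered list, decoding the encoded IDs always returns the original tokens, which is the property the round trip relies on.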
@@ -45,21 +44,20 @@ text = "ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አ

 tokens = tok.tokenize(text)
 ids = tok.convert_tokens_to_ids(tokens)
-tokens_from_ids = tok.convert_ids_to_tokens(ids)
+tokens = tok.decode(ids)
 detokenized = tok.detokenize(tokens)

 print("Tokens:", tokens)
 print("IDs:", ids)
-print("Tokens from IDs:", tokens_from_ids)
 print("Detokenized:", detokenized)

 Output:
 Tokens:
-['ሰእወኢ', '##ደ', '##እነ', '##እ', '<eow>', ' ', 'ከአ', '##ኢተእየኦጰእ', '##የ', '##ኣ', '<eow>', ' ', 'ገኣ', '##ረ', '##እ', '<eow>', ...]
+['ሰእወኢ', 'ደ', 'እነ', 'እ', '<eow>', ' ', 'ከአ', 'ኢተእየኦጰእ', 'የ', 'ኣ', '<eow>', ' ', 'ገኣ', 'ረ', 'እ', '<eow>', ...]
 IDs:
 [56252, 191975, 123541, 121977, 9863, 4, 134750, 119975, 156339, 120755, ...]
 Tokens from IDs:
-['ሰእወኢ', '##ደ', '##እነ', '##እ', '<eow>', ...]
+['ሰእወኢ', 'ደ', 'እነ', 'እ', '<eow>', ...]
 Detokenized:
 ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አስመልክቶ አዲስ የትብብር ስልት መነደፉን አምባሳደሩ ገልጸዋል
 ```
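The sample output suggests how detokenization could work: concatenate the tokens, drop the `<eow>` markers, and merge each base consonant with the vowel letter that follows it. The sketch below is a guess at that logic, written only from the tokens shown above. The helper names (`compose`, `detokenize`, `is_base`, `is_vowel_marker`) are hypothetical and this is not the package's implementation.

```python
from typing import List

ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x135A  # basic Ethiopic syllables
VOWEL_ROW = 0x12A0                             # አ row, used as vowel markers
EOW = "<eow>"                                  # end-of-word marker in the output


def is_base(ch: str) -> bool:
    """True for a first-order consonant (row start), excluding አ itself."""
    cp = ord(ch)
    in_range = ETHIOPIC_START <= cp <= ETHIOPIC_END
    return in_range and (cp - ETHIOPIC_START) % 8 == 0 and cp != VOWEL_ROW


def is_vowel_marker(ch: str) -> bool:
    """True for the seven vowel letters አ ኡ ኢ ኣ ኤ እ ኦ."""
    return VOWEL_ROW <= ord(ch) <= VOWEL_ROW + 6


def compose(text: str) -> str:
    """Merge each base consonant + vowel marker pair back into one fidel."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_base(ch) and i + 1 < len(text) and is_vowel_marker(text[i + 1]):
            out.append(chr(ord(ch) + ord(text[i + 1]) - VOWEL_ROW))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def detokenize(tokens: List[str]) -> str:
    """Join tokens, remove end-of-word markers, then recompose fidels."""
    return compose("".join(tokens).replace(EOW, ""))


sample = ["ሰእወኢ", "ደ", "እነ", "እ", "<eow>", " ", "ከአ", "ኢተእየኦጰእ", "የ", "ኣ", "<eow>"]
print(detokenize(sample))
```

Joining before composing matters here because a consonant and its vowel marker can land in different tokens, as with `ደ` and `እነ` in the sample.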
@@ -126,7 +124,7 @@ tokenizer = AmharicTokenizer.load("amh_bpe_model")
 from amharic_tokenizer import AmharicTokenizer

 # Load a trained model
-tok = AmharicTokenizer.load("amh_bpe_v0.2.2")
+tok = AmharicTokenizer.load("amh_bpe_v0.2.3")

 text = "ኢትዮጵያ ጥሩ ናት።"
