Skip to content

Latest: v0.2.5

Latest

Choose a tag to compare

@sefineh-ai sefineh-ai released this 11 Nov 18:02
· 2 commits to main since this release

What's new in v0.2.5

  • Vocab size: 100000 tokens
  • Trained on a larger and more diverse Amharic corpus
  • Improved tokenization quality and detokenization accuracy
  • Better handling of edge cases and rare words
  1. Pretrained tokenizer loading
  • You can now load a pretrained tokenizer directly:
from amharic_tokenizer import AmharicTokenizer
tok = AmharicTokenizer.load("amh_bpe_v0.2.5")

This version includes a pretrained model (amh_bpe_v0.2.5) that can be used immediately without any additional setup and training.

  1. Full token-to-ID and ID-to-token functionality
  • Added complete round-trip processing methods:
tokens = tok.tokenize(text)
ids = tok.encode(tokens)
detokenized = tok.detokenize(tokens)

The tokenizer now supports seamless conversion between tokens and IDs, ensuring full consistency between tokenization and detokenization.


Test Script: test_roundtrip_basic.py

from amharic_tokenizer import AmharicTokenizer
def test_roundtrip_basic():
    """Load a trained tokenizer, tokenize text, convert to IDs, and detokenize."""
    tok = AmharicTokenizer.load("amh_bpe_v0.2.5")
    text = (
        "የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ" )

    tokens = tok.tokenize(text)
    ids = tok.encode(text)
    detokenized = tok.detokenize(tokens)
    print("Original Text: ", text)
    print("Tokens: ", tokens)
    print("IDs: ", ids)
    print("Detokenized Text: ", detokenized)
    assert text == detokenized, "Detokenized text does not match the original."
if __name__ == "__main__":
    test_roundtrip_basic()

Output:    
    Tokenizer state loaded from amh_bpe_v0.2.4.json
    Original Text:  የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ
    Tokens:  ['የአከኦ', 'ረኢደአረእ<eow>', 'ለእመኣተእ<eow>', 'ገአ', 'ፀ', 'አ<eow>', 'በአረአ', 'ከአተእ<eow>', 'የአሀኦነ', 'ኣቸአወእ<eow>', 'የአ', 'ከአተአመኣ', 'ቸእነእ<eow>', 'ሰአፈአረ', 'ኦቸእ<eow>', 'በአ', 'ነአወኣረኢወኦቸእ<eow>', 'አነእደአ', 'በአተእ<eow>', 'በአሰአ', 'ዓተእ<eow>', '2', '0', '9', '<eow>', 'ከኢለኦ<eow>', 'መኤተእረእ<eow>', 'የአመኢ', 'ጓ', 'ዘ', 'አወእ<eow>', 'አወ', 'እለኦ<eow>', 'ነእ', 'ፈኣ', 'ሰእ<eow>', 'ከአ', 'ጀኣ', 'መኣየእ', 'ከኣ<eow>', 'ቀአጠእለኦ<eow>', 'ከኡ', 'በኣ<eow>', 'ደአረእሰ', 'ኡኣለእ<eow>', 'ጠአቀእለኣየእ<eow>']
    IDs:  [2794, 4229, 1136, 66, 37, 79, 711, 1556, 1480, 116, 43, 1467, 1162, 4664, 68, 45, 1618, 2182, 219, 1831, 879, 1, 1, 1, 0, 2824, 2684, 95, 1, 27, 58, 46, 4373, 67, 206, 83, 62, 1083, 4653, 230, 3916, 191, 202, 1221, 477, 496]
    Detokenized Text:  የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ