16 | 16 | Implements: **cleaning → fidel decomposition → BPE training/application → detokenization**, with a **Cython core for speed**. |
17 | 17 |
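For a quick end-to-end pass through this pipeline, here is a minimal sketch. It uses the `AmharicTokenizer.from_default()` convenience constructor mentioned in the changelog below; the sample sentence is arbitrary, and the exact subwords depend on that instance's minimal training.

```python
from amharic_tokenizer import AmharicTokenizer

# Minimally trained instance intended for quick experiments.
tok = AmharicTokenizer.from_default()

# Cleaning, fidel decomposition, and BPE application all happen inside tokenize().
tokens = tok.tokenize("ሰላም ለዓለም")
print(tokens)

# Detokenization strips the internal markers and restores the text.
print(tok.detokenize(tokens))
```
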
18 | 18 | --- |
| 19 | +## What's new in v0.2.0 |
| 20 | +1. **Pretrained tokenizer loading** |
| 21 | + |
| 22 | + - You can now load a pretrained tokenizer directly: |
| 23 | + |
| 24 | + ```python |
| 25 | + from amharic_tokenizer import AmharicTokenizer |
| 26 | + tok = AmharicTokenizer.load("amh_bpe_v0.2.0") |
| 27 | + ``` |
| 28 | + This version ships with a pretrained model (`amh_bpe_v0.2.0`) that can be used immediately, without any additional setup or training.
| 29 | + |
| 30 | +2. **Full token-to-ID and ID-to-token functionality** |
| 31 | + - Added complete round-trip processing methods: |
| 32 | + ```python |
| 33 | + tokens = tok.tokenize(text)                        # text -> subword tokens
| 34 | + ids = tok.convert_tokens_to_ids(tokens)            # tokens -> vocabulary IDs
| 35 | + tokens_from_ids = tok.convert_ids_to_tokens(ids)   # IDs -> tokens (inverse mapping)
| 36 | + detokenized = tok.detokenize(tokens)               # tokens -> plain text
| 37 | + ``` |
| 38 | + Token-to-ID and ID-to-token conversion now round-trip exactly, and detokenizing the token stream reproduces the original input.
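
A quick way to check that round trip (a minimal sketch; it assumes `tok` is the pretrained tokenizer loaded above and `text` is an already-clean Amharic string, so detokenization reproduces it exactly):

```python
# Round-trip check: IDs must map back to the same tokens,
# and detokenizing the tokens must reproduce the input text.
tokens = tok.tokenize(text)
ids = tok.convert_tokens_to_ids(tokens)
assert tok.convert_ids_to_tokens(ids) == tokens
assert tok.detokenize(tokens) == text
```
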
| 39 | +--- |
19 | 40 |
20 | | -## What's new in 0.1.2 |
| 41 | +### Example |
21 | 42 |
22 | | -- WordPiece-style continuation prefixes: non-initial subwords are now prefixed with `##` during tokenization. |
23 | | - - Example: `Going` → `['G', '##o', '##i', '##n', '##g', '</w>']` |
24 | | - - Amharic example: |
25 | | - Input: `የተባለ ውን የሚያደርገው ም በዚህ ምክንያት ነው` |
26 | | - Tokens: |
27 | | - ``` |
28 | | - ['የአተአ', '##በ', '##ኣለ', '##አ', '</w>', ' ', 'ወእ', '##ነ', '##እ', '</w>', ' ', 'የአመኢየኣ', '##ደ', '##አረ', '##እ', '##ገ', '##አወእ', '</w>', ' ', 'መእ', '</w>', ' ', 'በአ', '##ዘኢ', '##ሀ', '##እ', '</w>', ' ', 'መእ', '##ከ', '##እነእ', '##የኣ', '##ተእ', '</w>', ' ', 'ነ', '##አወእ', '</w>'] |
29 | | - ``` |
30 | | - Detokenization matches the input. |
31 | | -- Detokenization fixes (sketched after this list):
32 | | - - Strips `##` correctly and handles embedded `</w>` markers without leaking into text. |
33 | | - - Avoids extra spaces resulting from end-of-word handling. |
34 | | -- Developer ergonomics: `AmharicTokenizer.from_default()` returns a minimally trained instance for quick experiments. |
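
To make the `##` / `</w>` handling concrete, here is a minimal sketch of this detokenization scheme. It illustrates only the marker rules above, not the library's implementation, which additionally reverses the fidel decomposition.

```python
def detokenize_sketch(tokens):
    """Strip '##' continuations and drop '</w>' end-of-word markers."""
    pieces = []
    for t in tokens:
        if t == "</w>":
            continue  # standalone end-of-word marker: never emitted
        if t.startswith("##"):
            t = t[2:]  # continuation subword: glue onto the previous piece
        pieces.append(t.replace("</w>", ""))  # embedded markers must not leak
    return "".join(pieces)  # spaces survive as literal ' ' tokens

print(detokenize_sketch(['G', '##o', '##i', '##n', '##g', '</w>']))  # -> Going
```
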
| 43 | +```python |
| 44 | +text = "ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አስመልክቶ አዲስ የትብብር ስልት መነደፉን አምባሳደሩ ገልጸዋል" |
| 45 | + |
| 46 | +tokens = tok.tokenize(text) |
| 47 | +ids = tok.convert_tokens_to_ids(tokens) |
| 48 | +tokens_from_ids = tok.convert_ids_to_tokens(ids) |
| 49 | +detokenized = tok.detokenize(tokens) |
35 | 50 |
36 | | -> Note: The `</w>` token remains an internal end-of-word marker in the token stream; it is never emitted in detokenized text. |
| 51 | +print("Tokens:", tokens) |
| 52 | +print("IDs:", ids) |
| 53 | +print("Tokens from IDs:", tokens_from_ids) |
| 54 | +print("Detokenized:", detokenized) |
37 | 55 |
```

Output:

```
| 57 | + Tokens: |
| 58 | + ['ሰእወኢ', '##ደ', '##እነ', '##እ', '</w>', ' ', 'ከአ', '##ኢተእየኦጰእ', '##የ', '##ኣ', '</w>', ' ', 'ገኣ', '##ረ', '##እ', '</w>', ... ] |
| 59 | + IDs: |
| 60 | + [56252, 191975, 123541, 121977, 9863, 4, 134750, 119975, 156339, 120755, ...] |
| 61 | + Tokens from IDs: |
| 62 | + ['ሰእወኢ', '##ደ', '##እነ', '##እ', '</w>', ...] |
| 63 | + Detokenized: |
| 64 | + ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አስመልክቶ አዲስ የትብብር ስልት መነደፉን አምባሳደሩ ገልጸዋል |
| 65 | +``` |
| 66 | +### Additional Improvements |
| 67 | +* Added a `vocab_size` property for inspecting the size of the model's vocabulary (see the snippet after this list).
| 68 | +* Added `test_roundtrip_basic.py` example script for verifying tokenizer round-trip behavior. |
| 69 | +* Internal `</w>` token remains an end-of-word marker and is excluded from final detokenized output. |
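
For example, the new property can be read directly (a small usage sketch; assumes the pretrained `tok` loaded above):

```python
# Size of the loaded BPE vocabulary, i.e. the number of distinct token IDs.
print(tok.vocab_size)
```
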
38 | 70 | --- |
39 | 71 |
| 72 | + |
40 | 73 | ## Installation |
41 | 74 |
42 | 75 | ### From PyPI (recommended) |