
Commit f888189 ("rename")
Parent: 8ec0357

File tree

4 files changed: +10 −12 lines


README.md

Lines changed: 8 additions & 10 deletions
@@ -16,23 +16,22 @@
 Implements: **cleaning → fidel decomposition → BPE training/application → detokenization**, with a **Cython core for speed**.

 ---
-## What's new in v0.2.2
+## What's new in v0.2.3
 1. **Pretrained tokenizer loading**

 - You can now load a pretrained tokenizer directly:

 ```python
 from amharic_tokenizer import AmharicTokenizer
-tok = AmharicTokenizer.load("amh_bpe_v0.2.2")
+tok = AmharicTokenizer.load("amh_bpe_v0.2.3")
 ```
-This version includes a pretrained model (`amh_bpe_v0.2.2`) that can be used immediately without any additional setup and training.
+This version includes a pretrained model (`amh_bpe_v0.2.3`) that can be used immediately without any additional setup and training.

 2. **Full token-to-ID and ID-to-token functionality**
 - Added complete round-trip processing methods:
 ```python
 tokens = tok.tokenize(text)
-ids = tok.convert_tokens_to_ids(tokens)
-tokens_from_ids = tok.convert_ids_to_tokens(ids)
+ids = tok.encode(tokens)
 detokenized = tok.detokenize(tokens)
 ```
 The tokenizer now supports seamless conversion between tokens and IDs, ensuring full consistency between tokenization and detokenization.
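The hunk above renames `convert_tokens_to_ids`/`convert_ids_to_tokens` to `encode`/`decode`. As an editorial aside (not part of this commit), a hypothetical backward-compatibility mixin could keep the old names working during a migration; the `ToyTokenizer` below is an assumed stand-in for the real class, used only to exercise the aliases:

```python
# Hypothetical compatibility mixin: delegates the old method names removed in
# this commit to the new encode()/decode() API. Not part of the actual diff.

class CompatMixin:
    def convert_tokens_to_ids(self, tokens):
        return self.encode(tokens)   # old name -> new encode()

    def convert_ids_to_tokens(self, ids):
        return self.decode(ids)      # old name -> new decode()


# Toy stand-in for AmharicTokenizer (assumption), just to exercise the mixin.
class ToyTokenizer(CompatMixin):
    vocab = {"ሰእወኢ": 0, "<eow>": 1}
    inv = {0: "ሰእወኢ", 1: "<eow>"}

    def encode(self, tokens):
        return [self.vocab[t] for t in tokens]

    def decode(self, ids):
        return [self.inv[i] for i in ids]


tok = ToyTokenizer()
print(tok.convert_tokens_to_ids(["ሰእወኢ", "<eow>"]))  # [0, 1]
```

Old call sites keep working unchanged while new code moves to `encode`/`decode`.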
@@ -45,21 +44,20 @@ text = "ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አ

 tokens = tok.tokenize(text)
 ids = tok.convert_tokens_to_ids(tokens)
-tokens_from_ids = tok.convert_ids_to_tokens(ids)
+tokens = tok.decode(ids)
 detokenized = tok.detokenize(tokens)

 print("Tokens:", tokens)
 print("IDs:", ids)
-print("Tokens from IDs:", tokens_from_ids)
 print("Detokenized:", detokenized)

 Output:
 Tokens:
-['ሰእወኢ', '##', '##እነ', '##', '<eow>', ' ', 'ከአ', '##ኢተእየኦጰእ', '##', '##', '<eow>', ' ', 'ገኣ', '##', '##', '<eow>', ... ]
+['ሰእወኢ', '', 'እነ', '', '<eow>', ' ', 'ከአ', 'ኢተእየኦጰእ', '', '', '<eow>', ' ', 'ገኣ', '', '', '<eow>', ... ]
 IDs:
 [56252, 191975, 123541, 121977, 9863, 4, 134750, 119975, 156339, 120755, ...]
 Tokens from IDs:
-['ሰእወኢ', '##', '##እነ', '##', '<eow>', ...]
+['ሰእወኢ', '', 'እነ', '', '<eow>', ...]
 Detokenized:
 ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አስመልክቶ አዲስ የትብብር ስልት መነደፉን አምባሳደሩ ገልጸዋል
 ```
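The README example above (tokenize → encode → decode → detokenize) relies on `encode` and `decode` being exact inverses over the vocabulary. A minimal sketch of that invariant, using a toy vocabulary rather than the real `amh_bpe_v0.2.3` model (an assumption for illustration):

```python
# Sketch of the encode/decode round-trip invariant with a toy vocabulary.
# The real pretrained model is NOT loaded here; tokens are sample strings.

def build_maps(tokens):
    """Assign each distinct token a stable integer ID."""
    tok2id = {}
    for t in tokens:
        tok2id.setdefault(t, len(tok2id))
    id2tok = {i: t for t, i in tok2id.items()}
    return tok2id, id2tok

tokens = ["ሰእወኢ", "እነ", "<eow>", " ", "ከአ", "<eow>"]
tok2id, id2tok = build_maps(tokens)

ids = [tok2id[t] for t in tokens]      # encode: token -> ID
roundtrip = [id2tok[i] for i in ids]   # decode: ID -> token

assert roundtrip == tokens             # round trip is lossless
print(ids)                             # [0, 1, 2, 3, 4, 2]
```

Any token that maps to an ID maps back to itself, which is what "full consistency between tokenization and detokenization" requires.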
@@ -126,7 +124,7 @@ tokenizer = AmharicTokenizer.load("amh_bpe_model")
 from amharic_tokenizer import AmharicTokenizer

 # Load a trained model
-tok = AmharicTokenizer.load("amh_bpe_v0.2.2")
+tok = AmharicTokenizer.load("amh_bpe_v0.2.3")

 text = "ኢትዮጵያ ጥሩ ናት።"

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ python_files = ["test_*.py"]

 [project]
 name = "amharic-tokenizer"
-version = "0.2.2"
+version = "0.2.3"
 description = "Amharic tokenizer with BPE-like merges over decomposed fidel (Cython)"
 readme = "README.md"
 requires-python = ">=3.8"

tests/test_basic.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@

 def test_roundtrip_basic():
     """Load a trained tokenizer, tokenize text, convert to IDs, and detokenize."""
-    tok = AmharicTokenizer.load("amh_bpe_v0.2.2")
+    tok = AmharicTokenizer.load("amh_bpe_v0.2.3")
     text = (
         "የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ" )
