Commit 810d743

committed: Update README with new release version and update test to use new version

1 parent 8da6556

1 file changed: +48 −15 lines changed

README.md

Implements: **cleaning → fidel decomposition → BPE training/application → detokenization**, with a **Cython core for speed**.
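The BPE training/application step of this pipeline can be sketched in plain Python. This is a minimal illustration of how ranked merge rules are applied greedily to a character sequence; the merge table below is a toy example, not the trained Amharic model, and the function name is hypothetical:

```python
def apply_bpe(word, merges):
    """Greedily apply ranked merge rules to a character sequence.

    `merges` maps a symbol pair to its priority (lower = merged earlier),
    mirroring how BPE replays learned merges at tokenization time.
    """
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        mergeable = [p for p in pairs if p in merges]
        if not mergeable:
            break
        best = min(mergeable, key=merges.get)   # highest-priority pair
        i = pairs.index(best)                   # leftmost occurrence
        symbols[i:i + 2] = [best[0] + best[1]]  # fuse the pair in place
    return symbols

merges = {("l", "o"): 0, ("lo", "w"): 1}  # toy merge table
print(apply_bpe("low", merges))  # -> ['low']
```

The real tokenizer performs this loop in its Cython core for speed; the logic is the same.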

---
## What's new in v0.2.0

1. **Pretrained tokenizer loading**

   - You can now load a pretrained tokenizer directly:

   ```python
   from amharic_tokenizer import AmharicTokenizer

   tok = AmharicTokenizer.load("amh_bpe_v0.2.0")
   ```

   This version includes a pretrained model (`amh_bpe_v0.2.0`) that can be used immediately, without any additional setup or training.
2. **Full token-to-ID and ID-to-token functionality**

   - Added complete round-trip processing methods:

   ```python
   tokens = tok.tokenize(text)
   ids = tok.convert_tokens_to_ids(tokens)
   tokens_from_ids = tok.convert_ids_to_tokens(ids)
   detokenized = tok.detokenize(tokens)
   ```

   The tokenizer now supports seamless conversion between tokens and IDs, ensuring full consistency between tokenization and detokenization.
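The round-trip guarantee can be illustrated with a standalone sketch. The vocabulary and IDs below are made up for illustration; the real mapping comes from the trained `amh_bpe_v0.2.0` model:

```python
# Toy illustration of lossless token<->ID conversion: a vocabulary dict
# and its inverse. The real tokenizer builds this from the trained BPE model.
vocab = {"የአ": 0, "##ተአ": 1, "##በ": 2, "</w>": 3, " ": 4}
inv_vocab = {i: t for t, i in vocab.items()}

def convert_tokens_to_ids(tokens):
    return [vocab[t] for t in tokens]

def convert_ids_to_tokens(ids):
    return [inv_vocab[i] for i in ids]

tokens = ["የአ", "##ተአ", "##በ", "</w>"]
ids = convert_tokens_to_ids(tokens)          # [0, 1, 2, 3]
assert convert_ids_to_tokens(ids) == tokens  # round trip is lossless
```

Because the mapping is a bijection, converting tokens to IDs and back always reproduces the original token sequence.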

---

### Example

```python
text = "ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አስመልክቶ አዲስ የትብብር ስልት መነደፉን አምባሳደሩ ገልጸዋል"

tokens = tok.tokenize(text)
ids = tok.convert_tokens_to_ids(tokens)
tokens_from_ids = tok.convert_ids_to_tokens(ids)
detokenized = tok.detokenize(tokens)

print("Tokens:", tokens)
print("IDs:", ids)
print("Tokens from IDs:", tokens_from_ids)
print("Detokenized:", detokenized)
```

Output:

```
Tokens:
['ሰእወኢ', '##ደ', '##እነ', '##እ', '</w>', ' ', 'ከአ', '##ኢተእየኦጰእ', '##የ', '##ኣ', '</w>', ' ', 'ገኣ', '##ረ', '##እ', '</w>', ... ]

IDs:
[56252, 191975, 123541, 121977, 9863, 4, 134750, 119975, 156339, 120755, ...]

Tokens from IDs:
['ሰእወኢ', '##ደ', '##እነ', '##እ', '</w>', ...]

Detokenized:
ስዊድን ከኢትዮጵያ ጋር ያላትን ግንኙነት አስመልክቶ አዲስ የትብብር ስልት መነደፉን አምባሳደሩ ገልጸዋል
```
### Additional Improvements

* Added `vocab_size` property for inspecting model vocabulary.
* Added `test_roundtrip_basic.py` example script for verifying tokenizer round-trip behavior.
* Internal `</w>` token remains an end-of-word marker and is excluded from final detokenized output.
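The marker handling described above can be sketched as follows. This is a simplified standalone illustration of how `##` continuation prefixes and the internal `</w>` marker are stripped; the function name is hypothetical, and the real `detokenize` additionally recomposes the decomposed fidels back into Ethiopic syllables:

```python
def strip_markers(tokens):
    # Join subwords: drop the "##" continuation prefix and skip the
    # "</w>" end-of-word marker, which never reaches the output text.
    out = []
    for tok in tokens:
        if tok == "</w>":
            continue  # internal end-of-word marker only
        out.append(tok[2:] if tok.startswith("##") else tok)
    return "".join(out)

print(strip_markers(["ነ", "##አወእ", "</w>", " ", "መእ", "</w>"]))  # -> ነአወእ መእ
```

Note how the space token passes through unchanged, so word boundaries in the output come from the token stream itself, not from `</w>`.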

---

## Installation

### From PyPI (recommended)
