Commit 7cbcfff

v0.2.5 and Amharic training data is included
1 parent 19ac678 commit 7cbcfff

File tree

8 files changed, +40214 −8 lines changed


.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -44,7 +44,7 @@ env/
 htmlcov/
 .tox/
 .mypy_cache/
-data_crawler
+# data_crawler
 scripts/
 amh_bpe_sample.json
 # Local config
```

README.md

Lines changed: 7 additions & 3 deletions
````diff
@@ -16,16 +16,20 @@
 Implements: **cleaning → fidel decomposition → BPE training/application → detokenization**, with a **Cython core for speed**.
 
 ---
-## What's new in v0.2.4
+## What's new in v0.2.5
+- Vocab size: 10000 tokens
+- Trained on a larger and more diverse Amharic corpus
+- Improved tokenization quality and detokenization accuracy
+- Better handling of edge cases and rare words
 1. **Pretrained tokenizer loading**
 
 - You can now load a pretrained tokenizer directly:
 
 ```python
 from amharic_tokenizer import AmharicTokenizer
-tok = AmharicTokenizer.load("amh_bpe_v0.2.4")
+tok = AmharicTokenizer.load("amh_bpe_v0.2.5")
 ```
-This version includes a pretrained model (`amh_bpe_v0.2.4`) that can be used immediately without any additional setup and training.
+This version includes a pretrained model (`amh_bpe_v0.2.5`) that can be used immediately without any additional setup and training.
 
 2. **Full token-to-ID and ID-to-token functionality**
 - Added complete round-trip processing methods:
````
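The diff mentions round-trip token-to-ID and ID-to-token methods but does not show their names. As a minimal sketch of what such a round trip involves (all names below are illustrative, not the `amharic_tokenizer` API):

```python
# Illustrative sketch of token<->ID round-tripping; the actual method
# names in amharic_tokenizer are not shown in this commit's diff.
vocab = {"ሰላም": 0, "ኢትዮጵያ": 1, "<unk>": 2}    # token -> ID table
inv_vocab = {i: t for t, i in vocab.items()}   # ID -> token table

def tokens_to_ids(tokens):
    # Tokens missing from the vocab fall back to the <unk> ID.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

def ids_to_tokens(ids):
    return [inv_vocab[i] for i in ids]

ids = tokens_to_ids(["ሰላም", "ኢትዮጵያ"])
assert ids_to_tokens(ids) == ["ሰላም", "ኢትዮጵያ"]  # lossless round trip
```

The round trip is lossless only for in-vocabulary tokens; anything mapped to `<unk>` cannot be recovered, which is why BPE tokenizers keep the merge table small enough to cover rare words via subword pieces.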
