Commit 741df61

Release 0.0.3a1 (#2)
* Merge pull request #1 from TigreGotico/espeak_crf
  Espeak + CRF
* Increment Version to 0.0.3a1
* Update Changelog

---------

Co-authored-by: JarbasAI <33701864+JarbasAl@users.noreply.github.com>
Co-authored-by: JarbasAl <JarbasAl@users.noreply.github.com>
2 parents 034c9c1 + 7ab6646 commit 741df61

10 files changed

Lines changed: 644 additions & 26 deletions

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Changelog
+
+## [0.0.3a1](https://github.com/TigreGotico/mwl_phonemizer/tree/0.0.3a1) (2025-10-02)
+
+[Full Changelog](https://github.com/TigreGotico/mwl_phonemizer/compare/0.0.2...0.0.3a1)
+
+**Merged pull requests:**
+
+- Espeak + CRF [\#1](https://github.com/TigreGotico/mwl_phonemizer/pull/1) ([JarbasAl](https://github.com/JarbasAl))
+
+
+
+\* *This Changelog was automatically generated by [github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)*

README.md

Lines changed: 11 additions & 12 deletions
@@ -20,19 +20,15 @@ This repository contains a Python-based Mirandese phonemizer, designed to conver
 ## **Usage**
 
 ```python
-# pick one
-from mwl_phonemizer.crf_mwl import CRFPhonemizer
-from mwl_phonemizer.epitran_mwl import EpitranMWL
-from mwl_phonemizer.espeak_mwl import EspeakMWL
-from mwl_phonemizer.ngram_mwl import NgramMWLPhonemizer
-from mwl_phonemizer.orthography_hand_rules import OrthographyRulesMWL
+from mwl_phonemizer.crf_espeak_mwl import CRFEspeakCorrector
+
 
 sample_texts = [
     "Muitas lhénguas ténen proua de ls sous pergaminos antigos, de la lhiteratura screbida hai cientos d'anhos i de scritores hai muito afamados, hoije bandeiras dessas lhénguas. Mas outras hai que nun puoden tener proua de nada desso, cumo ye l causo de la lhéngua mirandesa.",
     "Todos ls seres houmanos nácen lhibres i eiguales an honra i an dreitos. Dotados de rezon i de cuncéncia, dében de se dar bien uns culs outros i cumo armano",
 ]
 
-phonemizer = EpitranMWL()
+phonemizer = CRFEspeakCorrector()
 for text in sample_texts:
     print(f"Original: {text}")
     print(f"Phonemized: {phonemizer.phonemize_sentence(text)}\n")
@@ -68,12 +64,15 @@ print(f"Stress-Agnostic IPA: {stress_agnostic_ipa}")
 ## **Phonemizer Comparison**
 
 | Phonemizer | PER (Full IPA, Stress) | PER (Stress-Agnostic) | Words Incorrect (ED>0) | Notes |
-| --------------------- |------------------------|-----------------------| ---------------------- | --------------------------------------------------------- |
-| **CRF** | 20.25% | 20.76% | 117 | Character-level CRF trained on aligned word–phoneme pairs |
+|-----------------------|------------------------|-----------------------|------------------------|-----------------------------------------------------------|
+| **Espeak + CRF** | 59.98% → 3.72% | 39.51% → 4.26% | 35 | Espeak output corrected with a CRF model |
+| **Epitran + CRF** | 51.37% → 16.54% | 44.89% → 18.97% | 110 | Epitran output corrected with a CRF model |
+| **CRF** | 15.36% | 17.06% | 117 | Character-level CRF trained on aligned word–phoneme pairs |
+| **Orthography Rules** | 39.04% | 31.99% | 136 | Hand-crafted orthographic rules |
+| **Epitran + Rules** | 51.37% → 47.26% | 44.89% → 40.07% | 137 | Epitran output corrected with hand-crafted rules |
+| **Espeak + Rules** | 59.98% → 52.35% | 39.51% → 30.30% | 73 | Espeak output corrected with hand-crafted rules |
 | **N-gram (n=4)** | 43.93% | 30.98% | 141 | Statistical N-gram model for G2P conversion |
-| **Orthography Rules** | 39.04% | 31.99% | 136 | Handcrafted orthographic rules for all dialects |
-| **Epitran** | 51.37% → 47.26% | 44.89% → 40.07% | 145 | Epitran output corrected with Mirandese-specific rules |
-| **Espeak** | 59.98% → 52.35% | 39.51% → 30.30% | 145 | Espeak IPA output corrected with rules |
+| **Character lookup** | 43.84% | 36.92% | 142 | Simple letter/digraph to phoneme lookup table |
 
 **Notes:**

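The PER columns above are corpus-level rates: the `__main__` block of `char_lookup_mwl.py` in this commit multiplies `avg_edit_distance` by `counts` and divides by the summed length of the gold transcriptions. The sketch below spells that calculation out on two made-up word pairs. It assumes a plain character-level Levenshtein distance and a hypothetical `strip_stress` that only drops the IPA stress marks; the project's own `evaluate_on_gold` and `strip_stress` may work differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def strip_stress(ipa: str) -> str:
    # Assumption: "stress-agnostic" means dropping the primary and
    # secondary stress marks and nothing else.
    return ipa.replace("ˈ", "").replace("ˌ", "")


def phoneme_error_rate(pairs, ignore_stress: bool = False) -> float:
    """Total edit distance divided by total reference length.

    Lengths are counted in code points, as len() does on a str, so a
    combining diacritic counts as its own symbol.
    """
    total_ed = 0
    total_ref = 0
    for gold, predicted in pairs:
        if ignore_stress:
            gold, predicted = strip_stress(gold), strip_stress(predicted)
        total_ed += levenshtein(gold, predicted)
        total_ref += len(gold)
    return total_ed / total_ref if total_ref else 0.0


def words_incorrect(pairs) -> int:
    """Number of words whose prediction differs from gold (ED > 0)."""
    return sum(1 for gold, predicted in pairs if levenshtein(gold, predicted) > 0)


# Made-up (gold, predicted) pairs for illustration only, not from the gold set.
toy_pairs = [("ˈkazɐ", "kazɐ"), ("ˈaɡwɐ", "ˈaɡwa")]
per_full = phoneme_error_rate(toy_pairs)
per_no_stress = phoneme_error_rate(toy_pairs, ignore_stress=True)
```

Because the rate is normalised by reference length rather than by word count, a phonemizer can have fewer incorrect words and still a higher PER if its errors are concentrated in long words.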
mwl_phonemizer/__init__.py

Lines changed: 5 additions & 1 deletion
@@ -3,6 +3,10 @@
 from mwl_phonemizer.espeak_mwl import EspeakMWL
 from mwl_phonemizer.ngram_mwl import NgramMWLPhonemizer
 from mwl_phonemizer.orthography_hand_rules import OrthographyRulesMWL
+from mwl_phonemizer.crf_espeak_mwl import CRFEspeakCorrector
+from mwl_phonemizer.crf_epitran_mwl import CRFEpitranCorrector
+from mwl_phonemizer.char_lookup_mwl import LookupTableMWL
+
 
 
 if __name__ == "__main__":
@@ -26,7 +30,7 @@
 L furdes ber, talbéç que stéia muôrto!"""
     ]
 
-    phonemizer = EpitranMWL()
+    phonemizer = CRFEspeakCorrector()
     for text in sample_texts:
         print(f"Original: {text}")
         print(f"Phonemized: {phonemizer.phonemize_sentence(text)}\n")

mwl_phonemizer/char_lookup_mwl.py

Lines changed: 158 additions & 0 deletions
@@ -0,0 +1,158 @@
+from mwl_phonemizer.base import MirandesePhonemizer
+
+
+class LookupTableMWL(MirandesePhonemizer):
+    TILDE = "̃"  # ◌̃
+
+    LETTERS = {
+        "a": ["a", "ɐ"],
+        "b": ["b", "β"],
+        "c": ["k", "s"],
+        "ç": ["s", "z"],
+        "d": ["d", "ð"],
+        "e": ["ɨ"],
+        "é": ["ɛ"],
+        "f": ["f"],
+        "g": ["ɣ"],
+        "h": [""],  # silent
+        "i": ["i", "j"],
+        "j": ["ʒ"],
+        "l": ["l", "ɫ"],
+        "m": ["m", TILDE, ],
+        "n": ["n", "ŋ", TILDE],
+        "o": ["u", "o", "ʊ"],
+        "ó": ["ɔ"],
+        "p": ["p"],
+        "q": ["k"],
+        "r": ["ɾ"],
+        "s": ["s̺", "z̺"],
+        "t": ["t"],
+        "u": ["u", "w", "ũ"],
+        "x": ["ʃ"],
+        "y": ["j"],
+        "z": ["z"],
+
+        "A": ["ɐ̃ŋ"],
+        "E": ["ẽŋ", "ɨ̃"],
+        "I": ["ĩŋ"],
+        "O": ["õŋ"],
+        "R": ["r"],
+        "S": ["s̺"],
+        "U": ["ũŋ", "ʊ̃ŋ"],
+        "Q": ["k"],
+        "G": ["g"],
+        "Ç": ["sɛ", "sɨ"],
+        "C": ["s̻i"],
+        "W": ["wo"],
+        "Z": ["sk"],
+        # "I": ["ɨ̃j̃"], # SENDINESE
+    }
+
+    @staticmethod
+    def normalize(sentence: str):
+        # normalize short/long pauses to " " and "."
+        sentence = (sentence.lower()
+                    .replace("\t", " ")
+                    .replace("-", " ")
+                    .replace(",", " ")
+                    .replace(";", " ")
+                    .replace(".", ".")
+                    .replace("!", ".")
+                    .replace("?", "."))
+
+        # temp representation of digraphs as individual letters
+        DIMAP = {
+            "an": "A",
+            "en": "E",
+            "in": "I",
+            "on": "O",
+            "un": "U",
+            "rr": "R",
+            "ss": "S",
+            "lh": "ʎ",
+            "nh": "ɲ",
+            "qu": "Q",
+            "gu": "G",
+            "gue": "G",
+            "Ge": "G",
+            "ce": "Ç",
+            "ci": "C",
+            "uo": "W",
+            "çc": "Z",
+            "ge": "ʒɨ",
+        }
+
+        # normalize digraphs
+        for di, n in DIMAP.items():
+            sentence = sentence.replace(di, n)
+        return sentence
+
+    # -------------------------
+    # Phonemizer interface
+    # -------------------------
+    def phonemize(self, word: str, lookup_word: bool = True) -> str:
+        """Phonemize a single Mirandese word via a letter/digraph lookup table."""
+        if lookup_word and word.lower() in self.GOLD:
+            return self.GOLD[word.lower()]
+        word = self.normalize(word)
+        phonemes = ""
+        for idx, char in enumerate(word):
+            if char in self.LETTERS:
+                pho = self.LETTERS[char][0]
+                phonemes += pho
+            else:
+                phonemes += char
+        return phonemes
+
+
+if __name__ == "__main__":
+
+    pho = LookupTableMWL()
+
+    stats = pho.evaluate_on_gold(limit=None, detailed=False, show_changes=False)
+
+    # --- Compute PER (Phoneme Error Rate) --- # TODO - move this to evaluate_on_gold
+    total_ref_len_stress = sum(len(v) for v in pho.GOLD.values())
+    total_ref_len_no_stress = sum(len(pho.strip_stress(v)) for v in pho.GOLD.values())
+
+    per = stats['avg_edit_distance'] * stats['counts'] / total_ref_len_stress
+
+    per_no_stress = stats['avg_edit_distance_no_stress'] * stats['counts'] / total_ref_len_no_stress
+
+    # --- Print Summary Metrics ---
+    print("\n" + "=" * 50)
+    print(" Mirandese Phonemizer Rule Evaluation")
+    print("=" * 50)
+    print(f"Total Words Evaluated: {stats['counts']}\n")
+
+    print("## Phoneme Error Rate (PER, Full IPA Match, includes stress)")
+    print(f"PER: {per:.2%}")
+
+    print("\n## Phoneme Error Rate (PER, Stress-Agnostic)")
+    print(f"PER: {per_no_stress:.2%}")
+
+    # --- Print only 'wrong' words (ED > 0) ---
+    print("\n--- Incorrectly Phonemized Words (Full IPA Match ED > 0) ---")
+    wrong_words = stats.get("details", [])
+
+    if wrong_words:
+        print(f"Total Incorrect: {len(wrong_words)} words\n")
+
+        # Print a header for the detailed list
+        print(f"{'Word':<20} | {'Gold':<15} | {'Phonemized':<15} | {'ED After':<8}")
+        print("-" * 75)
+
+        # Print the detailed list
+        for d in wrong_words:
+            print(
+                f"{d['word']:<20} | {d['gold']:<15} | {d['phonemes']:<15} | {d['ed']:<8}")
+    else:
+        print("All words achieved an exact match (100% Accuracy)!")
+
+    sample_texts = [
+        "Muitas lhénguas ténen proua de ls sous pergaminos antigos, de la lhiteratura screbida hai cientos d'anhos i de scritores hai muito afamados, hoije bandeiras dessas lhénguas. Mas outras hai que nun puoden tener proua de nada desso, cumo ye l causo de la lhéngua mirandesa.",
+        "Todos ls seres houmanos nácen lhibres i eiguales an honra i an dreitos. Dotados de rezon i de cuncéncia, dében de se dar bien uns culs outros i cumo armano",
+        "Hai más fuogo alhá, i ye deimingo!"
+    ]
+    for t in sample_texts:
+        print(pho.phonemize_sentence(t))
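`normalize` applies the `DIMAP` replacements in insertion order, so `gu` is rewritten to `G` before `gue` can ever match, and the `Ge` entry then finishes the job. One way to make digraph handling independent of entry order is to scan left to right and always take the longest grapheme that matches. The sketch below is an illustrative alternative, not the code in this commit: `GRAPHEMES` and `phonemize_greedy` are hypothetical names, and the table is a small subset that maps graphemes directly to the first-choice phonemes from `LETTERS`.

```python
# Hypothetical subset: multi-letter graphemes and single letters in one table,
# each mapped to the first-choice phoneme the lookup table above would produce.
GRAPHEMES = {
    "gue": "g",
    "gu": "g",
    "qu": "k",
    "rr": "r",
    "lh": "ʎ",
    "nh": "ɲ",
    "a": "a",
    "b": "b",
    "e": "ɨ",
    "g": "ɣ",
    "h": "",   # silent
    "l": "l",
    "o": "u",
    "r": "ɾ",
    "u": "u",
}


def phonemize_greedy(word: str, table=GRAPHEMES) -> str:
    """Map graphemes to phonemes, always preferring the longest match."""
    word = word.lower()
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No grapheme matched: pass the character through unchanged,
            # as phonemize() does for characters missing from LETTERS.
            out.append(word[i])
            i += 1
    return "".join(out)


example = phonemize_greedy("guerra")
```

With this table, `gue` wins over `gu` regardless of how the dictionary is ordered, so no intermediate `Ge` rule is needed.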
