Release 0.4.0a1 #31
Open
github-actions[bot] wants to merge 20 commits into
refactor: upstream some logic to dependencies
Add a docs set (quickstart, api, advanced, tokenizer) and six runnable example scripts covering dialect phonemization, regional accents, number normalization, the token tree, and accent serialization.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Align CI to OpenVoiceOS/gh-automations@dev and add the standard pipeline: build-tests, coverage, license_check, lint, pip_audit, repo-health, plus a conventional-release-labels job. release_workflow and publish_stable now reference @dev instead of TigreGotico/gh-automations@master and pass version_file: tugaphone/version.py.
Migrate packaging to pyproject.toml (pyproject-only): dynamic version from tugaphone/version.py, real description, classifiers, dependencies and a test extra; drop setup.py and requirements.txt.
Add the Apache-2.0 LICENSE the README declares, a .gitignore covering egg-info/pycache/build/coverage, and replace the duplicate conventional-label.yaml. Export __version__ from the package.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- as_dict serializes every configured rule and raises on an unmapped rule instead of silently dropping it; complete RULE_MAP with the five preset rules it was missing, so as_dict/from_dict round-trips losslessly (#8).
- retain_ou_diphthong drops the bogus 'boa' fixed mapping (a word with no <ou> grapheme must not gain an /ow/ diphthong); the circumflex guard runs first and the tonic slice uses len() rather than a fixed offset (#9).
- conservative_o_nasal_retention slices the final nasal ending by its matched length rather than a hardcoded codepoint count, so combining diacritics no longer corrupt the result (#10).
- palatal_affrication_ch affricates only the <ch>-derived fricative (at most one per <ch> digraph), leaving a syllable-final coda fricative untouched (#11).
- epenthetic_j_before_palatal fires on both stressed and unstressed vowels before a palatal consonant (#12).
Add a tests/ suite covering all five fixes and the regional layer.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
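The round-trip behaviour described in the first bullet can be sketched as follows. This is a toy illustration, not the tugaphone code: only the names RULE_MAP, as_dict and from_dict come from the commit message, while the `Accent` class, the two stand-in rules and the choice of KeyError are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Rule = Callable[[str], str]


def retain_ou(ipa: str) -> str:
    """Hypothetical stand-in rule; leaves the transcription unchanged."""
    return ipa


def affricate_ch(ipa: str) -> str:
    """Hypothetical stand-in rule; affricates at most one fricative."""
    return ipa.replace("ʃ", "tʃ", 1)


# Every rule that may appear in a configuration needs a stable name here.
RULE_MAP: Dict[str, Rule] = {
    "retain_ou": retain_ou,
    "affricate_ch": affricate_ch,
}
_NAME_OF: Dict[Rule, str] = {fn: name for name, fn in RULE_MAP.items()}


class Accent:
    """Hypothetical container for an ordered list of rules."""

    def __init__(self, rules: Iterable[Rule]) -> None:
        self.rules: List[Rule] = list(rules)

    def as_dict(self) -> Dict[str, List[str]]:
        names = []
        for rule in self.rules:
            if rule not in _NAME_OF:
                # Fail loudly: silently dropping the rule would make the
                # serialized form differ from the configured accent.
                raise KeyError(f"rule {rule!r} has no entry in RULE_MAP")
            names.append(_NAME_OF[rule])
        return {"rules": names}

    @classmethod
    def from_dict(cls, data: Dict[str, List[str]]) -> "Accent":
        return cls(RULE_MAP[name] for name in data["rules"])
```

With this shape, `Accent.from_dict(accent.as_dict())` reproduces the same rule list in the same order, and a rule missing from the map surfaces as an error at serialization time rather than as a silently shorter list.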
…ifier (#25)
- stress detection delegates to the declarative StressRules of the dialect's orthography2ipa spec (written accents incl. circumflex > oxytone endings > paroxytone default), correcting the overbroad bare -m/-n oxytone triggers (homem, falam are paroxytone) and the missed circumflex marker (lâmpada); the hand-tuned fallback remains for dialects whose spec carries no stress block
- SilabificadorSyllabifier registers in the orthography2ipa.syllabify group so the base library's own stress detection syllabifies Portuguese with silabificador instead of its naive splitter
- TugaphoneG2PPlugin exposes the full pipeline behind the shared G2PPlugin interface (lazy tagger/lexicon load); tugaphone consumes orthography2ipa and owns the Portuguese pipeline
- fix: conservative_o_nasal_retention emitted a decomposed nasal o that never matched the precomposed form callers compare against
- fix: drop the license classifier that PEP 639 setuptools rejects alongside the SPDX license expression
…ation in the rule cascade (#26)
* feat: source stress rules from orthography2ipa and ship its pt syllabifier
- stress detection delegates to the declarative StressRules of the
dialect's orthography2ipa spec (written accents incl. circumflex >
oxytone endings > paroxytone default), correcting the overbroad bare
-m/-n oxytone triggers (homem, falam are paroxytone) and the missed
circumflex marker (lâmpada); the hand-tuned fallback remains for
dialects whose spec carries no stress block
- SilabificadorSyllabifier registers in the orthography2ipa.syllabify
group so the base library's own stress detection syllabifies
Portuguese with silabificador instead of its naive splitter
- TugaphoneG2PPlugin exposes the full pipeline behind the shared
G2PPlugin interface (lazy tagger/lexicon load); tugaphone consumes
orthography2ipa and owns the Portuguese pipeline
- fix: conservative_o_nasal_retention emitted a decomposed nasal o
that never matched the precomposed form callers compare against
- fix: drop the license classifier that PEP 639 setuptools rejects
alongside the SPDX license expression
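The decomposed-versus-precomposed mismatch behind the conservative_o_nasal_retention fix can be reproduced with the standard library alone. The sketch below only illustrates the Unicode behaviour; the helper name `to_precomposed` is hypothetical and the actual fix in tugaphone may be implemented differently.

```python
import unicodedata

DECOMPOSED_NASAL_O = "o\u0303"   # "o" followed by COMBINING TILDE (two code points)
PRECOMPOSED_NASAL_O = "\u00f5"   # LATIN SMALL LETTER O WITH TILDE (one code point)


def to_precomposed(text: str) -> str:
    """Normalize to NFC so base letter + combining mark become one code point."""
    return unicodedata.normalize("NFC", text)


# The two forms render identically but compare unequal and differ in length,
# which is also why slicing by a hardcoded code point count is fragile.
assert DECOMPOSED_NASAL_O != PRECOMPOSED_NASAL_O
assert len(DECOMPOSED_NASAL_O) == 2 and len(PRECOMPOSED_NASAL_O) == 1
assert to_precomposed(DECOMPOSED_NASAL_O) == PRECOMPOSED_NASAL_O
```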
* fix: positional r/s realisation, prevocalic glides and coda palatalization
- r distribution: empty prev_char (first char in grapheme) was matching
`prev_char in "lns"` via Python substring semantics (`"" in "lns"` is `True`),
causing every single r that starts a grapheme token to produce ʁ instead
of the tap ɾ. Fix: require prev_letter to be non-empty; all r in
non-initial, non-post-{l,n,s} positions now correctly return ɾ.
- c/g before front vowel: next_char was None for the first char in a
single-char grapheme, so the front-vowel check never fired and c always
produced k. Fix: use cross-grapheme next_letter (via suffix) for c/g.
- Positional context properties (is_intervocalic, is_between_consonant_vowel,
is_between_vowel_consonant) now use word-level prefix/suffix to cross
grapheme boundaries instead of the intra-grapheme prev_char/next_char,
which were always None for single-char graphemes.
- Intervocalic s→z now fires correctly across grapheme boundaries.
- Coda s→ʃ palatalization added for pt-PT: word-final s and s before
voiceless consonants produce ʃ.
- Stressed plain e/o default changed from open-mid ɛ/ɔ to closed-mid e/o;
only diacritically marked é/ó force the open variants.
- Unstressed i before a vowel → palatal glide j.
PER trajectory (--limit 300 --strip-stress --broad):
baseline: 0.178
after fix: 0.065
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
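The empty-string pitfall from the r-distribution bullet is easy to reproduce in isolation. The functions below are a deliberately reduced toy, not the tugaphone rule: they model only the post-{l,n,s} check, ignore word-initial r and rr, and their names are hypothetical.

```python
def r_phone_buggy(prev_letter: str) -> str:
    # `"" in "lns"` is True because the empty string is a substring of
    # every string, so a missing previous letter is treated as l/n/s.
    return "ʁ" if prev_letter in "lns" else "ɾ"


def r_phone_fixed(prev_letter: str) -> str:
    # Require a non-empty previous letter before testing membership.
    return "ʁ" if prev_letter and prev_letter in "lns" else "ɾ"


assert r_phone_buggy("") == "ʁ"   # the bug: no context still yields ʁ
assert r_phone_fixed("") == "ɾ"   # fixed: no context falls through to the tap
assert r_phone_fixed("n") == "ʁ"  # genuine post-n context is unaffected
```

An alternative guard is to test membership in a set, for example `prev_letter in {"l", "n", "s"}`, since set membership compares whole elements rather than substrings.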
* fix: accept IRREGULAR_WORDS kwarg in AngolanPortuguese, MozambicanPortuguese, TimoresePortuguese
All three African/Timorese constructors only accepted positional args; the
g2p_bench TugaphoneRulesEngine adapter passes IRREGULAR_WORDS=<sentinel>
to disable lexicon lookup for honest rules-only evaluation. Without the
fix the sentinel was silently ignored, the lexicon was loaded anyway, and
coverage was reported as 0 for pt-AO/MZ/TL.
Standardise all five dialect constructors to the same
(dialect_code=None, IRREGULAR_WORDS=None, **kwargs) signature so the
adapter works uniformly; guard uses `if IRREGULAR_WORDS is not None`
to distinguish the empty-dict sentinel from the missing-kwarg case.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
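The `is not None` guard matters because an empty dict is falsy. A minimal sketch of the standardised signature, assuming a hypothetical `DialectSketch` class and a placeholder `DEFAULT_LEXICON` (neither is taken from the tugaphone source):

```python
from typing import Dict, Optional

# Placeholder lexicon; the real irregular-word data is not reproduced here.
DEFAULT_LEXICON: Dict[str, str] = {"word": "placeholder-ipa"}


class DialectSketch:
    def __init__(
        self,
        dialect_code: Optional[str] = None,
        IRREGULAR_WORDS: Optional[Dict[str, str]] = None,
        **kwargs,
    ) -> None:
        self.dialect_code = dialect_code
        if IRREGULAR_WORDS is not None:
            # Caller supplied a lexicon, possibly the empty-dict sentinel
            # that disables lookup for rules-only evaluation.
            self.irregular_words = dict(IRREGULAR_WORDS)
        else:
            # Keyword omitted: fall back to the bundled lexicon.
            self.irregular_words = dict(DEFAULT_LEXICON)


rules_only = DialectSketch("pt-AO", IRREGULAR_WORDS={})
with_lexicon = DialectSketch("pt-AO")
```

Had the guard been written as `if IRREGULAR_WORDS:`, the empty-dict sentinel would take the fallback branch and the lexicon would be loaded anyway.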
* feat: per-dialect vowel reduction and Brazilian palatalization in G2P cascade
Vowel reduction rules previously applied European-style reduction for all
dialects; now gated by dialect code:
a: pt-PT → [ɐ] unstressed | pt-TL → [ə] (Tetum schwa)
pt-AO/BR → stays [a] | pt-MZ → [ɐ] (Bantu substrate retains EP)
o: pt-PT/MZ → [u] unstressed | pt-AO/TL/BR → stays [o]
e: pt-PT → [ɨ] (unchanged) | pt-BR final unstressed → [ɪ]
Brazilian palatalization extended to cover final unstressed -te/-de
(suffix == "e" guard) so "abacate" → [abakatʃɪ] and "abade" → [abadʒɪ].
Final unstressed 'o' in pt-BR raises to [ʊ]: "abadejo" → [abadeʒʊ].
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
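The gating listed above amounts to a lookup keyed by dialect code. The sketch below encodes only the mappings stated in this commit message; the function name, its signature, and the behaviour for cases the message does not mention (such as unstressed e outside pt-PT and final pt-BR) are assumptions, and the real cascade is certainly more involved.

```python
# Unstressed vowel realisations per dialect, as listed in the commit message.
UNSTRESSED_A = {"pt-PT": "ɐ", "pt-MZ": "ɐ", "pt-TL": "ə", "pt-AO": "a", "pt-BR": "a"}
UNSTRESSED_O = {"pt-PT": "u", "pt-MZ": "u", "pt-AO": "o", "pt-TL": "o", "pt-BR": "o"}


def reduce_unstressed(letter: str, dialect: str, is_final: bool = False) -> str:
    """Return the phone for an unstressed a/o/e in the given dialect."""
    if letter == "a":
        return UNSTRESSED_A.get(dialect, "a")
    if letter == "o":
        if dialect == "pt-BR" and is_final:
            return "ʊ"  # final unstressed o raises in pt-BR
        return UNSTRESSED_O.get(dialect, "o")
    if letter == "e":
        if dialect == "pt-PT":
            return "ɨ"
        if dialect == "pt-BR" and is_final:
            return "ɪ"
        return "e"  # assumption: cases not covered by the commit stay unreduced
    return letter
```

For instance, `reduce_unstressed("a", "pt-TL")` gives the schwa and `reduce_unstressed("o", "pt-BR", is_final=True)` gives the raised final vowel, matching the "abadejo" example.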
* test: add dialect quality tests for regional presets and cross-dialect phonology
One gold case per regional preset exercising its documented signature
feature (Coimbra ou-retention, Minho centralization resistance, Braga
nasal-glide palatalization, Famalicão õ-retention, Transmontano ch-
affrication, Porto rising-diphthong o, Fafe nasal diphthongization e).
Cross-dialect sanity checks: BR t/d palatalization, coda-l vocalization,
unstressed-a non-reduction; AO/MZ/TL absence of uvular [ʁ]; TL schwa.
Rules-only path used where lexicon lookup would shadow cascade behaviour.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Human review requested!