Release 0.4.0a1 #31

Open
github-actions[bot] wants to merge 20 commits into master from release-0.4.0a1

@github-actions
Human review requested!

JarbasAl and others added 20 commits February 25, 2026 20:50
refactor: upstream some logic to dependencies
Add a docs set (quickstart, api, advanced, tokenizer) and six runnable
example scripts covering dialect phonemization, regional accents, number
normalization, the token tree, and accent serialization.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Align CI to OpenVoiceOS/gh-automations@dev and add the standard pipeline:
build-tests, coverage, license_check, lint, pip_audit, repo-health, plus a
conventional-release-labels job. release_workflow and publish_stable now
reference @dev instead of TigreGotico/gh-automations@master and pass
version_file: tugaphone/version.py.

Migrate packaging to pyproject.toml (pyproject-only): dynamic version from
tugaphone/version.py, real description, classifiers, dependencies and a test
extra; drop setup.py and requirements.txt. Add the Apache-2.0 LICENSE the
README declares, a .gitignore covering egg-info/pycache/build/coverage, and
replace the duplicate conventional-label.yaml. Export __version__ from the
package.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- as_dict serializes every configured rule and raises on an unmapped rule
  instead of silently dropping it; complete RULE_MAP with the five preset
  rules it was missing, so as_dict/from_dict round-trips losslessly (#8).
- retain_ou_diphthong drops the bogus 'boa' fixed mapping (a word with no
  <ou> grapheme must not gain an /ow/ diphthong); the circumflex guard runs
  first and the tonic slice uses len() rather than a fixed offset (#9).
- conservative_o_nasal_retention slices the final nasal ending by its matched
  length rather than a hardcoded codepoint count, so combining diacritics no
  longer corrupt the result (#10).
- palatal_affrication_ch affricates only the <ch>-derived fricative (at most
  one per <ch> digraph), leaving a syllable-final coda fricative untouched (#11).
- epenthetic_j_before_palatal fires on both stressed and unstressed vowels
  before a palatal consonant (#12).

Add a tests/ suite covering all five fixes and the regional layer.
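
The length-based slicing described for #10 can be sketched as follows. This is an illustration only, not tugaphone's code: the helper name `strip_ending` and the ending list are hypothetical. It shows why slicing by the length of the matched ending survives combining diacritics where a fixed code point count does not.

```python
import unicodedata

# Hypothetical nasal endings, chosen only for illustration.
NASAL_ENDINGS = ("ão", "õe")


def strip_ending(word):
    """Split off a final nasal ending, slicing by the matched length."""
    for form in ("NFC", "NFD"):
        for ending in NASAL_ENDINGS:
            candidate = unicodedata.normalize(form, ending)
            if word.endswith(candidate):
                # len(candidate) is 2 for the precomposed form and 3 for
                # the decomposed one, so the cut always lands on the stem.
                return word[: -len(candidate)], candidate
    return word, ""


precomposed = unicodedata.normalize("NFC", "pão")  # 3 code points
decomposed = unicodedata.normalize("NFD", "pão")   # 4 code points

# A hardcoded two-code-point slice leaves the base vowel on the stem
# and strands the combining tilde at the start of the ending.
fixed_offset_stem = decomposed[:-2]

stem_nfc, _ = strip_ending(precomposed)
stem_nfd, _ = strip_ending(decomposed)
```

With either normalization form the stem comes out as `p`, while the fixed-offset slice of the decomposed input yields `pa`.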

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ifier (#25)

- stress detection delegates to the declarative StressRules of the
  dialect's orthography2ipa spec (written accents incl. circumflex >
  oxytone endings > paroxytone default), correcting the overbroad bare
  -m/-n oxytone triggers (homem, falam are paroxytone) and the missed
  circumflex marker (lâmpada); the hand-tuned fallback remains for
  dialects whose spec carries no stress block
- SilabificadorSyllabifier registers in the orthography2ipa.syllabify
  group so the base library's own stress detection syllabifies
  Portuguese with silabificador instead of its naive splitter
- TugaphoneG2PPlugin exposes the full pipeline behind the shared
  G2PPlugin interface (lazy tagger/lexicon load); tugaphone consumes
  orthography2ipa and owns the Portuguese pipeline
- fix: conservative_o_nasal_retention emitted a decomposed nasal o
  that never matched the precomposed form callers compare against
- fix: drop the license classifier that PEP 639 setuptools rejects
  alongside the SPDX license expression
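
The stress priority named above (written accent, then oxytone endings, then paroxytone default) can be sketched like this. It is a simplified illustration and not the orthography2ipa `StressRules` API: the function name `stressed_index` and the `OXYTONE_ENDINGS` tuple are assumptions, and the tilde is deliberately ignored.

```python
import unicodedata

# Combining acute and circumflex: the written accents treated as stress marks
# in this sketch.
STRESS_MARKS = {"\u0301", "\u0302"}

# Hypothetical oxytone endings. Bare -m/-n are intentionally absent, so that
# words such as "homem" and "falam" fall through to the paroxytone default.
OXYTONE_ENDINGS = ("r", "l", "z", "i", "u", "im", "um")


def stressed_index(syllables):
    """Return the index of the stressed syllable in a syllabified word."""
    # 1. A written accent wins over everything else.
    for i, syllable in enumerate(syllables):
        decomposed = unicodedata.normalize("NFD", syllable)
        if any(ch in STRESS_MARKS for ch in decomposed):
            return i
    # 2. Oxytone endings stress the last syllable.
    word = "".join(syllables)
    if len(syllables) > 1 and word.endswith(OXYTONE_ENDINGS):
        return len(syllables) - 1
    # 3. Otherwise default to the penultimate syllable.
    return max(len(syllables) - 2, 0)
```

For example, `stressed_index(["ho", "mem"])` and `stressed_index(["lâm", "pa", "da"])` both return 0, while `stressed_index(["fa", "lar"])` returns 1.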
…ation in the rule cascade (#26)

* feat: source stress rules from orthography2ipa and ship its pt syllabifier

* fix: positional r/s realisation, prevocalic glides and coda palatalization

- r distribution: empty prev_char (first char in grapheme) was matching
  `prev_char in "lns"` via Python substring semantics (`"" in "lns"` evaluates
  to True), causing every single r that starts a grapheme token to produce ʁ
  instead of the tap ɾ. Fix: require prev_letter to be non-empty; all r in
  non-initial, non-post-{l,n,s} positions now correctly return ɾ.
- c/g before front vowel: next_char was None for the first char in a
  single-char grapheme, so the front-vowel check never fired and c always
  produced k.  Fix: use cross-grapheme next_letter (via suffix) for c/g.
- Positional context properties (is_intervocalic, is_between_consonant_vowel,
  is_between_vowel_consonant) now use word-level prefix/suffix to cross
  grapheme boundaries instead of the intra-grapheme prev_char/next_char
  which was always None for single-char graphemes.
- Intervocalic s→z now fires correctly across grapheme boundaries.
- Coda s→ʃ palatalization added for pt-PT: word-final s and s before
  voiceless consonants produce ʃ.
- Stressed plain e/o default changed from open-mid ɛ/ɔ to closed-mid e/o;
  only diacritically marked é/ó force the open variants.
- Unstressed i before a vowel → palatal glide j.
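
The substring pitfall behind the r fix is easy to reproduce in isolation. The sketch below uses hypothetical helper names (`follows_lns_buggy`, `follows_lns_fixed`) and is not the project's code; it only demonstrates the membership test and one way to guard it.

```python
def follows_lns_buggy(prev_char):
    # `in` on a str is a substring test, and the empty string is a
    # substring of every string, so "" passes this check.
    return prev_char in "lns"


def follows_lns_fixed(prev_letter):
    # Require a non-empty letter first, then test membership against
    # individual letters rather than a substring.
    return bool(prev_letter) and prev_letter in ("l", "n", "s")


empty_is_substring = "" in "lns"  # True
```

Testing against a tuple of single letters also rules out multi-character matches such as `"ln" in "lns"`.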

PER trajectory (--limit 300 --strip-stress --broad):
  baseline: 0.178
  after fix: 0.065
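
The commit does not say how PER is computed. A common definition, assumed here, is the total phoneme-level edit distance divided by the total number of reference phonemes. A minimal sketch under that assumption, with hypothetical function names:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs):
    """pairs: iterable of (reference, hypothesis) phoneme sequences."""
    pairs = list(pairs)
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total if total else 0.0


# Two four-phoneme words with a single substitution (z vs s): 1 / 8.
example = [(list("kazɐ"), list("kasɐ")), (list("ʁatu"), list("ʁatu"))]
rate = phoneme_error_rate(example)
```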

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* fix: accept IRREGULAR_WORDS kwarg in AngolanPortuguese, MozambicanPortuguese, TimoresePortuguese

All three African/Timorese constructors only accepted positional args; the
g2p_bench TugaphoneRulesEngine adapter passes IRREGULAR_WORDS=<sentinel>
to disable lexicon lookup for honest rules-only evaluation.  Without the
fix the sentinel was silently ignored, the lexicon was loaded anyway, and
coverage was reported as 0 for pt-AO/MZ/TL.

Standardise all five dialect constructors to the same
(dialect_code=None, IRREGULAR_WORDS=None, **kwargs) signature so the
adapter works uniformly; guard uses `if IRREGULAR_WORDS is not None`
to distinguish the empty-dict sentinel from the missing-kwarg case.
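
A minimal sketch of that signature and guard, using a hypothetical stand-in class (`DialectSketch`) and a made-up default lexicon rather than the real dialect classes:

```python
class DialectSketch:
    # Hypothetical default lexicon, for illustration only.
    DEFAULT_LEXICON = {"muito": "ˈmũj.tu"}

    def __init__(self, dialect_code=None, IRREGULAR_WORDS=None, **kwargs):
        self.dialect_code = dialect_code or "pt-PT"
        # `is not None` keeps an explicitly passed empty dict, which a plain
        # truthiness check (`IRREGULAR_WORDS or default`) would discard.
        if IRREGULAR_WORDS is not None:
            self.irregular_words = dict(IRREGULAR_WORDS)
        else:
            self.irregular_words = dict(self.DEFAULT_LEXICON)
        self.extra = kwargs


rules_only = DialectSketch("pt-AO", IRREGULAR_WORDS={})
with_lexicon = DialectSketch("pt-AO")
```

Here `rules_only.irregular_words` stays empty, so lookups fall through to the rules, while `with_lexicon` loads the default entries.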

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: per-dialect vowel reduction and Brazilian palatalization in G2P cascade

Vowel reduction rules previously applied European-style reduction for all
dialects; now gated by dialect code:

  a:  pt-PT → [ɐ] unstressed  |  pt-TL → [ə] (Tetum schwa)
      pt-AO/BR → stays [a]    |  pt-MZ → [ɐ] (Bantu substrate retains EP)
  o:  pt-PT/MZ → [u] unstressed  |  pt-AO/TL/BR → stays [o]
  e:  pt-PT → [ɨ] (unchanged)    |  pt-BR final unstressed → [ɪ]

Brazilian palatalization extended to cover final unstressed -te/-de
(suffix == "e" guard) so "abacate" → [abakatʃɪ] and "abade" → [abadʒɪ].
Final unstressed 'o' in pt-BR raises to [ʊ]: "abadejo" → [abadeʒʊ].
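
The gating above can be pictured as a lookup keyed by vowel and dialect code. The sketch below is not the cascade's implementation: the table and function names are hypothetical, and leaving a vowel unchanged where the commit names no outcome (for example unstressed e outside pt-PT and pt-BR final position) is an assumption.

```python
# Unstressed outcomes taken from the table in the commit message.
UNSTRESSED = {
    "a": {"pt-PT": "ɐ", "pt-MZ": "ɐ", "pt-TL": "ə", "pt-AO": "a", "pt-BR": "a"},
    "o": {"pt-PT": "u", "pt-MZ": "u", "pt-AO": "o", "pt-TL": "o", "pt-BR": "o"},
    "e": {"pt-PT": "ɨ"},
}

# pt-BR raises final unstressed e and o.
BR_FINAL = {"e": "ɪ", "o": "ʊ"}


def reduce_vowel(vowel, dialect, stressed=False, word_final=False):
    """Return the reduced realisation of a plain vowel for a dialect."""
    if stressed:
        return vowel
    if dialect == "pt-BR" and word_final and vowel in BR_FINAL:
        return BR_FINAL[vowel]
    # Assumption: combinations the commit does not list stay unchanged.
    return UNSTRESSED.get(vowel, {}).get(dialect, vowel)
```

For instance, `reduce_vowel("a", "pt-TL")` gives `ə` and `reduce_vowel("o", "pt-BR", word_final=True)` gives `ʊ`.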

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: add dialect quality tests for regional presets and cross-dialect phonology

One gold case per regional preset exercising its documented signature
feature (Coimbra ou-retention, Minho centralization resistance, Braga
nasal-glide palatalization, Famalicão õ-retention, Transmontano ch-
affrication, Porto rising-diphthong o, Fafe nasal diphthongization e).

Cross-dialect sanity checks: BR t/d palatalization, coda-l vocalization,
unstressed-a non-reduction; AO/MZ/TL absence of uvular [ʁ]; TL schwa.
Rules-only path used where lexicon lookup would shadow cascade behaviour.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>