Release 1.1.0a1 by github-actions[bot] · Pull Request #41 · TigreGotico/orthography2ipa

github-actions · 2026-06-11T12:58:35Z

Human review requested!

* feat: positional graphemes
* feat: positional graphemes
* feat: positional graphemes

* reconstruct latin graphemes
* fix(pt-PT): model 4-way sibilant distinction
* feat: add pt-BR
* feat: more positional contexts
* refactor: drop redundant "default" from positional mappings
* allow trema in pt-BR graphemes

* feat: asturian
* feat: galician
* feat: galician

…fix type hints and metadata

- Create PLAN.md: architecture overview, planned phases, data roadmap
- Create TODO.md: prioritised task list (blocking → low)
- Create QUICK_FACTS.md: package identity, key classes, quick usage examples
- Create AUDIT.md: known issues with file:line citations, CI gaps, tech debt
- Create SUGGESTIONS.md: 10 proposals for refactors and enhancements
- feats.py:39: add Dict, Optional to typing imports
- feats.py:56: annotate phone_features: Dict[str, List[Optional[bool]]]
- json_loader.py:115: clarify self-reference cycle comment (was TODO - error log, illegal)
- pyproject.toml:8: update description from "20+ languages" to "308+ language codes"

All 7375 tests pass. No behavioral changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds tests/test_iberian.py with extensive per-language test classes for all Iberian Peninsula languages, plus a per-language coverage reporter in conftest.py.

Languages covered (15 classes, 485 tests):

- es-ES (86 tests): graphemes, allophones, positional rules, isoglosses
- pt-PT (62 tests): null graphemes, sandhi, /v/ preservation, schwa
- ca (71 tests): ela geminada, vowel reduction, digraphs, diphthongs
- gl (57 tests): seseo, nasal vowels, null lh/nh/ç, apical s
- eu (47 tests): sibilant contrast (s̺/s̻), affricates, phonemic h
- ast (50 tests): distinción, x→ʃ isogloss, ll→ʎ, aspirated h notation
- an (59 tests): /v/ preservation, seseo, affricates ts/dz, ix→ʃ
- mwl (13 tests): inheritance from ast-PT-x-medieval, ancestry
- dialects (40 tests): es-AR, es-ES-x-andalusia-e, ca-x-valencia, ca-x-balear, pt-BR, gl-x-occidental

Cross-language isogloss tests (10): distinción vs seseo, /v/ preservation, ll realisation, ch realisation, rr uvular vs alveolar, h silent vs phonemic, apical vs predorsal s, ast x→ʃ vs es x→ks, phonological distance clustering, Basque isolation.

Coverage reporter (conftest.py): pytest_terminal_summary prints a per-language pass/fail/total/% table at the end of any run touching test_iberian.py.

Run with: pytest tests/test_iberian.py -v --tb=short

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
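The per-language table described for the coverage reporter can be sketched as a plain aggregation step. This is an illustration only, not the project's conftest.py: the function name `summarize_by_language` and the convention of keying on the test class name inside a pytest node ID are assumptions. In a real conftest.py, a function like this would be fed from the `pytest_terminal_summary` hook.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Row = Tuple[str, int, int, int, float]  # (group, passed, failed, total, percent)


def summarize_by_language(results: Iterable[Tuple[str, bool]]) -> List[Row]:
    """Aggregate (node_id, passed) pairs into one row per test class.

    Assumption: node IDs look like 'tests/test_iberian.py::TestEsES::test_x',
    so the second '::' component identifies the language's test class.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [passed, failed]
    for node_id, passed in results:
        parts = node_id.split("::")
        group = parts[1] if len(parts) > 1 else "unknown"
        counts[group][0 if passed else 1] += 1

    rows: List[Row] = []
    for group in sorted(counts):
        passed_n, failed_n = counts[group]
        total = passed_n + failed_n
        rows.append((group, passed_n, failed_n, total, round(100 * passed_n / total, 1)))
    return rows


def format_table(rows: List[Row]) -> str:
    """Render the rows as a fixed-width pass/fail/total/% table."""
    lines = [f"{'class':<12}{'pass':>6}{'fail':>6}{'total':>7}{'%':>8}"]
    for group, passed_n, failed_n, total, pct in rows:
        lines.append(f"{group:<12}{passed_n:>6}{failed_n:>6}{total:>7}{pct:>8.1f}")
    return "\n".join(lines)
```

Keeping the aggregation separate from the pytest hook makes it testable without running a pytest session.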

…sources

- Add LinguisticSource frozen dataclass to types.py
- Add sources: Tuple[LinguisticSource, ...] to LanguageSpec
- Update json_loader.py to parse sources array
- Update SCHEMA.md to document sources field
- Add sources arrays to 33 Germanic language JSON files: en-GB, en-US, en-AU, en-CA, en-IE, en-ZA, en-GB-x-scotland, de-DE, de-AT, de-CH, nl, nl-NL, nl-BE, sv, sv-x-rikssvenska, nb, nn, no, da, da-x-copenhagen, is, fo, af, nds, enm, ang, non, osx, goh, gem, gem-x-ingvaeonic, gem-x-north, gem-x-northwest
- Add tests/test_sources.py (marked @pytest.mark.linguistic)
- Create docs/bibliography.md with Phase 1 sources
- Update PLAN.md and TODO.md with audit phase tracking
- Update MAINTENANCE_REPORT.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
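A frozen dataclass plus a tuple-typed `sources` field, as listed above, might look roughly like the sketch below. The field names (`title`, `author`, `year`, `url`) and the `parse_sources` helper are assumptions made for illustration; the actual fields are whatever SCHEMA.md documents.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class LinguisticSource:
    # Hypothetical fields; the real schema may differ.
    title: str
    author: Optional[str] = None
    year: Optional[int] = None
    url: Optional[str] = None


def parse_sources(raw_spec: Dict[str, Any]) -> Tuple[LinguisticSource, ...]:
    """Turn an optional 'sources' JSON array into an immutable tuple."""
    return tuple(
        LinguisticSource(
            title=entry["title"],
            author=entry.get("author"),
            year=entry.get("year"),
            url=entry.get("url"),
        )
        for entry in raw_spec.get("sources", [])
    )
```

Freezing the dataclass and storing a tuple keeps a loaded spec hashable and safe to cache, which fits a loader that resolves many language files.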

…dDistance, positional_divergence

Part A: bug fixes and hardening
- segment_distance(strict=True) raises ValueError for unknown IPA segments
- _build_ancestor_graph() detects circular ancestry and raises ValueError
- _get_ancestry_weights_by_code() cached with lru_cache(maxsize=256) via thin wrapper

Part B: new metrics
- phoneme_coverage(spec_native, spec_target) -> float (asymmetric L2 transfer estimate)
- WeightedDistance frozen dataclass added to types.py
- weighted_full_distance() single entry-point with configurable w_inventory/grapheme/allophone/ancestry
- positional_divergence() measures positional-override divergence between two specs

Part C: tests & docs
- 13 new tests in tests/test_distance.py (all pass, no regressions)
- docs/distance.md extended with sections for all new functions and weight-tuning guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
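Detecting circular ancestry and raising ValueError, as described in Part A, is a standard depth-first search with a "currently visiting" set. The sketch below is not the package's `_build_ancestor_graph()`; the function name `check_acyclic` and the `parents` mapping shape are assumptions.

```python
from typing import Dict, List, Set


def check_acyclic(parents: Dict[str, List[str]]) -> None:
    """Raise ValueError if any language code is its own ancestor.

    `parents` maps a language code to the codes it inherits from
    (a hypothetical shape chosen for this sketch).
    """
    visiting: Set[str] = set()  # codes on the current DFS path
    done: Set[str] = set()      # codes already proven cycle-free

    def visit(code: str, path: List[str]) -> None:
        if code in done:
            return
        if code in visiting:
            raise ValueError("circular ancestry: " + " -> ".join(path + [code]))
        visiting.add(code)
        for parent in parents.get(code, []):
            visit(parent, path + [code])
        visiting.discard(code)
        done.add(code)

    for code in parents:
        visit(code, [])
```

Including the offending path in the error message makes a broken spec file easy to locate.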

New test files covering 9 language families (956 tests total):

- test_germanic.py: de-DE, de-AT, Bavarian, nl-NL, nl-BE, af, sv, da, nb, is
- test_celtic.py: cy, ga, gd, br, gv, kw
- test_slavic.py: ru, pl, cs, bg, sk, uk, be, hr/sl/sr/mk
- test_romance_extended2.py: Italian dialects, Romanian, Sardinian, Aranese, Caribbean Spanish, Medieval Spanish, Brazilian/Portuguese dialects
- test_indo_iranian.py: hi, sa, fa, fa-x-tehran, fa-AF, tr
- test_arabic.py: arb, ar-x-mashriqi, ar-x-maghrebi, ar-MA, ar-x-gulf, ar-IQ
- test_other_languages.py

Also add germanic/celtic/slavic pytest markers to conftest.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- AUDIT.md: mark resolved items (feats.py type annotations, json_loader.py comment, pyproject.toml description, en-GB.json, LinguisticSource); restructure open issues; update date to 2026-03-17.
- MAINTENANCE_REPORT.md: add transparency report for multi-family language test suites session (Germanic/Celtic/Slavic/Romance/Indo-Iranian/Arabic, +956 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Removes stale en/es/fr/pt-BR sections that exercised the old dict-based grapheme API; replaces with a minimal pt-PT demo compatible with the current list-based grapheme structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- 02_distance_metrics.py: segment features, inventory/grapheme/allophone distances, ancestry similarity, phoneme_coverage, weighted_full_distance, pairwise matrix
- 03_tokenizer.py: PhonetokTokenizer with maximal-munch segmentation, TokenKind, ipa_beam with allophone expansion, multi-language comparison
- 04_dialect_transforms.py: DIALECT_PROFILES inspection, apply_transform, debias_lisbon, cross-dialect word comparison for Portuguese
- 05_script_distance.py: SCRIPT_REGISTRY, ScriptFeatures, pairwise script distance matrix, closest/farthest pairs, feature analysis
- 06_sandhi.py: SandhiEngine, French liaison rules, obligatory_only mode, custom Sanskrit sandhi rules, languages-with-sandhi survey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
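Maximal-munch segmentation, which the tokenizer example mentions, always takes the longest known grapheme at the current position. The sketch below shows the general technique only and does not reproduce PhonetokTokenizer; the function name and the pass-through behaviour for unknown characters are assumptions.

```python
from typing import Iterable, List


def maximal_munch(word: str, graphemes: Iterable[str]) -> List[str]:
    """Split `word` into the longest matching graphemes, left to right."""
    known = set(graphemes)
    max_len = max((len(g) for g in known), default=1)
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest window first, shrinking until something matches.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in known:
                pieces.append(chunk)
                i += size
                break
        else:
            # Assumption: unknown characters pass through one at a time.
            pieces.append(word[i])
            i += 1
    return pieces
```

With a Portuguese-like inventory, digraphs such as `lh` and `ch` win over their single-letter parts, so `filho` segments as `f`, `i`, `lh`, `o`.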

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…an (6 files) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reconcile the Aragonese test class with the expert-revised an.json grapheme and positional data: open vowels written è/ò, betacism on v, ʝ-initial y, x/tʃ for g before front vowels, and dental fricative realisations without seseo. Drop assertions for graphemes and positional rules no longer present in the spec.

Replace the dead العربية_الجزائرية URL with the verified live لهجة_جزائرية article and record the correction in the link audit.

…xports (#17)

- lm.py gains the future-annotations import its PEP 585 hints require under the declared python_requires >=3.9
- types.py union hints use Optional[...] like the rest of the codebase
- plugin discovery logs a warning identifying the entry point when a G2P plugin fails to load instead of silently skipping it
- get_plugin, G2PPlugin, WordContext and SandhiEngine join the public API exports
- package description matches the shipped language-code count
- new test module guards annotation compatibility, failure logging and the export surface
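The "log a warning instead of silently skipping" behaviour can be sketched independently of any entry-point machinery. In this illustration the entry points are modelled as plain `(name, loader)` pairs, which is an assumption; the function name `load_plugins` and the logger name are also hypothetical.

```python
import logging
from typing import Any, Callable, Dict, Iterable, Tuple

log = logging.getLogger("plugin_discovery_sketch")


def load_plugins(entry_points: Iterable[Tuple[str, Callable[[], Any]]]) -> Dict[str, Any]:
    """Load every plugin that can be loaded; warn about the ones that cannot."""
    loaded: Dict[str, Any] = {}
    for name, loader in entry_points:
        try:
            loaded[name] = loader()
        except Exception as exc:  # a broken plugin must not break discovery
            log.warning("G2P plugin entry point %r failed to load: %s", name, exc)
    return loaded
```

Naming the entry point in the warning is the useful part: a missing optional dependency then shows up in logs rather than as a language that quietly falls back to default behaviour.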

…dia links (#20)

- oc gains a full Lengadocian-norm inventory (56 graphemes, 29 allophones, positional lenition/final rules) sourced from Alibert and Wheeler; it was the only living language resolving to an empty spec
- every spec now declares an explicit quality tier; the extinct metadata-only placeholders got, mxi and xsb are tiered stub
- the 21 dead wikipedia links kept for review are replaced with MediaWiki-API-verified live articles and the audit tables updated
- new guard tests: explicit tier on every spec, and non-stub specs must resolve to non-empty grapheme and allophone inventories

#18)

* feat(registry): resolve bare language tags and nearest-match fallbacks
  - bare primary tags resolve to a curated reference variety (pt -> pt-PT, en -> en-GB, ...), matching the ISO 639-3 alias convention rather than langcodes' population-based default
  - unregistered regional tags fall back to the nearest registered code by language distance via ovos-spec-tools (en-NZ -> en-GB)
  - public resolve() exposes the resolution without loading a spec
  - _resolve_code is cached; a data-driven test asserts every registered code's bare primary subtag stays loadable
* ci: fix coverage install_extras and exclude self from license check
  - install_extras:'test' was passed as a bare package name instead of '.[test]'; the gh-automations coverage workflow treats it as a pip install target
  - the license check now excludes orthography2ipa itself (pilosus reports Error for packages installed from source builds)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
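The tag-resolution order described for the registry (exact match, then curated reference variety) can be sketched as follows. Note the substitution: the commit says unregistered regional tags fall back by language distance via ovos-spec-tools, whereas this sketch simply reuses the reference variety of the primary subtag. The tables `REGISTERED` and `REFERENCE_VARIETY` are tiny hypothetical stand-ins.

```python
from functools import lru_cache

# Hypothetical registry contents, for illustration only.
REGISTERED = frozenset({"pt-PT", "pt-BR", "en-GB", "en-US"})
REFERENCE_VARIETY = {"pt": "pt-PT", "en": "en-GB"}


@lru_cache(maxsize=None)
def resolve(tag: str) -> str:
    """Map a language tag to a registered code without loading any spec."""
    if tag in REGISTERED:
        return tag
    primary = tag.split("-")[0].lower()
    # Simplification: the real fallback ranks candidates by language distance.
    if primary in REFERENCE_VARIETY:
        return REFERENCE_VARIETY[primary]
    raise KeyError(f"no registered language code for {tag!r}")
```

Caching is safe here because resolution depends only on the tag and on tables that do not change at runtime.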

…rity dispatch (#21)

- WordContext gains defaulted fields: orthographic prev/next word, sentence-final flag, word index/count and the resolved language code
- G2PPlugin gains non-abstract hooks: normalize (pre-G2P text preparation), post_process (context-aware IPA adjustment) and a priority property used as dispatch tie-break
- plugin discovery keeps the highest-priority plugin when several claim the same language code, regardless of registration order
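Keeping the highest-priority plugin per language code, independent of registration order, reduces to a single pass with a comparison. The `PluginStub` class and `select_plugins` function below are hypothetical stand-ins for illustration, not the package's G2PPlugin API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PluginStub:
    # Hypothetical minimal plugin description.
    name: str
    lang: str
    priority: int = 50


def select_plugins(plugins: Iterable[PluginStub]) -> Dict[str, PluginStub]:
    """Return one plugin per language code: the one with the highest priority."""
    best: Dict[str, PluginStub] = {}
    for plugin in plugins:
        current = best.get(plugin.lang)
        if current is None or plugin.priority > current.priority:
            best[plugin.lang] = plugin
    return best
```

In this sketch, two plugins with exactly equal priority are still resolved by first-seen order; a stricter implementation would need a secondary key such as the plugin name.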

…#23)

* feat(schema): declarative stress rules with detection and IPA marking
  - StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep
  - new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ
  - pt-PT and pt-BR seeded with Acordo Ortografico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress
  - consumers with a real syllabifier pass their own syllable list

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
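End-anchored alignment means the stressed syllable is located by its distance from the end of the word, so the mark still lands sensibly when the orthographic and IPA syllable counts differ. The sketch below is an assumption-laden illustration: the function signature, the vowel set, and the IPA strings in the usage note are all hypothetical and do not reproduce the package's `apply_stress_mark`.

```python
import re
from typing import List

# Assumed vowel set for a naive vowel-group count (Portuguese-like).
VOWELS = "aeiouáéíóúâêôãõàü"
_VOWEL_GROUP = re.compile(f"[{VOWELS}]+")


def count_vowel_groups(word: str) -> int:
    """Naive syllable count: one syllable per run of vowel letters."""
    return len(_VOWEL_GROUP.findall(word.lower()))


def apply_stress_mark(word: str, stressed_index: int, ipa_syllables: List[str]) -> str:
    """Insert the IPA primary-stress mark, aligning syllables from the end.

    `stressed_index` counts orthographic syllables from the start of `word`.
    """
    if not ipa_syllables:
        return ""
    from_end = count_vowel_groups(word) - stressed_index  # 1 = last syllable
    pos = len(ipa_syllables) - from_end
    pos = min(max(pos, 0), len(ipa_syllables) - 1)  # clamp into range
    return "".join(ipa_syllables[:pos]) + "ˈ" + "".join(ipa_syllables[pos:])
```

For example, the naive count sees two vowel groups in `poeta` (`oe`, `a`) while an IPA rendering may have three syllables; anchoring from the end still places the mark on the penultimate IPA syllable.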

* feat(plugin): per-language syllabifier entry-point group
  - SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify entry-point group, priority-resolved like G2P plugins
  - detect_stress accepts lang= and uses the registered syllabifier for that language (silabificador for Portuguese, pycotovia for Galician) before falling back to the naive vowel-group splitter
  - plugin output is trusted only when its syllables rebuild the word
  - get_syllabifier exported from the package root

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
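The rule that plugin output is trusted only when its syllables rebuild the word is a cheap round-trip check. The sketch below assumes a plugin is just a callable returning a list of syllables, and uses a hypothetical `naive_split` as the fallback; neither mirrors the SyllabifierPlugin ABC.

```python
import re
from typing import Callable, List, Optional

V = "aeiouáéíóúâêôãõàü"  # assumed vowel set
_SYLLABLE = re.compile(f"[^{V}]*[{V}]+(?:[^{V}]+$)?")


def naive_split(word: str) -> List[str]:
    """Fallback: one syllable per vowel group, trailing consonants go last."""
    return _SYLLABLE.findall(word) or [word]


def syllabify(word: str, plugin: Optional[Callable[[str], List[str]]] = None) -> List[str]:
    """Use the plugin when its output concatenates back to `word`."""
    if plugin is not None:
        try:
            syllables = list(plugin(word))
        except Exception:
            syllables = []
        if syllables and "".join(syllables) == word:
            return syllables
    return naive_split(word)
```

The check guards against plugins that normalise case, strip accents, or drop characters, any of which would misalign later stress marking.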

* fix(data): accented vowel graphemes for portuguese and spanish trema
  - pt-PT and pt-BR map the acute and circumflex vowels (á/é/í/ó/ú + â/ê/ô) that modern orthography requires; words like 'olá' and 'café' previously dropped the accented vowel entirely
  - es-ES maps the diaeresis ü of güe/güi sequences
* feat: top-level G2P engine with greedy and beam search
  - G2P class composes the package pipeline: normalize hook -> tokenizer word split with pausal flags -> per-word greedy/beam candidate search -> stress marking -> word-context pass with plugin post_process -> sandhi -> dialect transform
  - module-level transcribe(text, lang) one-call API; per-word beaming avoids whole-sentence combinatorial growth; greedy == beam(1)
  - registered G2P plugins take over their languages automatically (normalize/transcribe_word/post_process driven by full WordContext); use_plugins=False forces the data-driven path
  - CLI transcribe rides the engine: --search greedy|beam, --beam-width, --dialect-profile, --no-plugins; README quick-start leads with transcribe()
  - invariant test: specs declaring stress rules must map their marked vowels as graphemes
* feat(plugin): sentence-level plugin dispatch
  - G2PPlugin.sentence_level property (default False): when True the engine hands the full normalized text to plugin.transcribe instead of driving transcribe_word per word, for plugins whose quality depends on sentence-wide state (POS tagging, clitic joining)
  - the plugin then owns context effects and sandhi; the engine still applies the dialect transform and aligns per-word IPA best-effort
* Revert "feat(plugin): sentence-level plugin dispatch"
  - This reverts commit 676ce5f.
* refactor(g2p): self-contained data-driven engine
  - the engine never loads external G2P implementations: downstream engines (arbtok, tugaphone) consume this library and own their own pipelines
  - drop plugin dispatch, use_plugins and the per-word context pass; normalize is a caller-supplied callable
  - WordTranscription loses the source field; CLI drops --no-plugins
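Per-word beam search over grapheme candidates, with greedy search as the width-1 case, can be sketched as below. This is a generic illustration under stated assumptions: each grapheme has a list of `(phone, probability)` candidates, scores are summed log-probabilities, and the candidate table in the usage note is invented. It does not reproduce the G2P class.

```python
import math
from typing import Dict, List, Sequence, Tuple

Candidates = Dict[str, List[Tuple[str, float]]]


def beam_transcribe(
    graphemes: Sequence[str], candidates: Candidates, beam_width: int = 3
) -> List[str]:
    """Return up to `beam_width` transcriptions of one word, best first."""
    beams: List[Tuple[float, str]] = [(0.0, "")]  # (log-probability, IPA so far)
    for grapheme in graphemes:
        # Assumption: an unknown grapheme maps to itself with probability 1.
        options = candidates.get(grapheme, [(grapheme, 1.0)])
        expanded = [
            (score + math.log(prob), ipa + phone)
            for score, ipa in beams
            for phone, prob in options
        ]
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]  # prune to the best partial results
    return [ipa for _, ipa in beams]


def greedy_transcribe(graphemes: Sequence[str], candidates: Candidates) -> str:
    """Greedy search is beam search with a beam of one."""
    return beam_transcribe(graphemes, candidates, beam_width=1)[0]
```

Because each word is searched separately, the number of hypotheses is bounded by the beam width per word rather than multiplying across the whole sentence.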

…26)

* refactor!: remove external G2P loading and the bundled arabic plugin
  - orthography2ipa never loads full G2P engines: the orthography2ipa.g2p entry-point group, plugin discovery and get_plugin are removed, and PhonetokTokenizer no longer takes a plugin to delegate to
  - the bundled arabic_g2p plugin and the tashkeel stub are removed; Arabic transcribes through the data-driven engine, and arbtok is the downstream Arabic engine built on this library
  - G2PPlugin and WordContext remain exported as the base types downstream engines implement; component plugins that slot into the engine's own logic keep their dedicated groups (orthography2ipa.syllabify)
* docs: describe the consumer-engine architecture in the README

…y + mwl j/cedilla fix (#28)

* feat(data): stress rules for gl, mwl and barranquenho; galician positional phonology
  - gl: add stress block (Cotovia/GTM rules: written accent wins; vowel/n/s-final → penultimate; consonant-final r/l/z/x/d → oxytone); add Cotovia source entry; expand positional_graphemes with word_initial plosive realisations for b/d/g and word_final ŋ for n (Galician velarisation)
  - mwl: add stress block (western Iberian paroxytone default; r/l/z/ç/im/um/ns/ão endings → oxytone; tilde vowels as accent-bearers); fix j→ʒ and ç→s̻ overriding the erroneous ast-inherited values ʝ/t͡s
  - ext-PT-x-barrancos: add stress block derived from g2p_barranquenho _stressed_index() logic (accent override → paroxytone; vowel/vowel+s-final → paroxytone; other consonant-final → oxytone)
  - tests: extend test_stress.py with gold cases for gl (10), mwl (8 incl. divergence checks), and barranquenho (8)

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(data): ground mwl and barranquenho stress rules in the orthographic conventions
  - mwl: final nasal endings are written -n in Mirandese (Asturleonese trait): use in/un/on (camin, naçon < Lat. -ōnem) instead of the Portuguese-only im/um; keep word-final ç (rapaç, lhuç) as an oxytone ending; document the six-sibilant system (apical s̺/z̺, laminal s̻/z̻, postalveolar ʃ/ʒ) motivating j→ʒ and ç→s̻; add Vasconcelos 1900 and Convenção Ortográfica da Língua Mirandesa 1999 sources
  - ext-PT-x-barrancos: notes grounded in the Portuguese accentuation norms adopted by the Convenção Ortográfica do Barranquenho; unmarked -em/-am stay paroxytone
  - tests: mwl gold cases for rapaç/camin/naçon and -n vs -m ending assertions; barranquenho homem/falam paroxytone cases

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
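The Galician rule order quoted above (written accent wins; r/l/z/x/d-final words are oxytone; everything else is penultimate) can be sketched as a small precedence chain. This is a simplified illustration, not the shipped stress block: the function name is hypothetical, the syllable list is supplied by the caller, and treating every non-listed ending as paroxytone is an assumption.

```python
from typing import List

ACCENTED = set("áéíóú")
OXYTONE_FINALS = set("rlzxd")  # consonant endings named in the commit message


def detect_stress_gl(syllables: List[str]) -> int:
    """Return the index of the stressed syllable for a Galician-like word."""
    if not syllables:
        raise ValueError("empty syllable list")
    # 1. A written accent always wins.
    for index, syllable in enumerate(syllables):
        if any(ch in ACCENTED for ch in syllable.lower()):
            return index
    # 2. Monosyllables have nowhere else to go.
    if len(syllables) == 1:
        return 0
    # 3. Listed consonant endings attract final stress.
    if syllables[-1][-1].lower() in OXYTONE_FINALS:
        return len(syllables) - 1
    # 4. Default (vowel/n/s-final and, by assumption, anything else): penultimate.
    return len(syllables) - 2
```

Checking the written accent first matters because accented words are exactly the ones that break the ending-based defaults.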

JarbasAl and others added 30 commits February 21, 2026 16:34

feat: positional graphemmes (#4)

6438b2d

* feat: positional graphemes
* feat: positional graphemes
* feat: positional graphemes

Increment Version to

94959b7

Update Changelog

bbee58b

refactor to json

a7ab6da

Increment Version to

44ba8a6

Update Changelog

270c78d

Delete dump/tests directory

f42d5aa

Increment Version to

c751f91

Update Changelog

14cf27d

feat: ast+gl (#8)

51d7b0e

* feat: asturian
* feat: galician
* feat: galician

Increment Version to

f4f0ad0

Update Changelog

81ebbc1

feat: castillian + aragonese

051bc62

fix: extremaduran / cantabrian

2899f13

fix: leonese

609be4e

chore: add langcodes to requirements.txt; add uv.lock

55f05cc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Afroasiatic (5 files)

db30134

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Aragonese (1 file)

ad4fbd9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Asturleonese (12 files)

95890bc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Austroasiatic + Austronesi…

af894fc

…an (6 files) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Celtic (15 files)

fcd7ee9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JarbasAl and others added 30 commits May 30, 2026 00:24

fix(data): point ar-DZ wikipedia link to live Algerian Arabic article

72eff7c

Replace the dead العربية_الجزائرية URL with the verified live لهجة_جزائرية article and record the correction in the link audit.

docs: add ROADMAP and TODO

656b39b

chore: stop tracking TODO.md and ROADMAP.md (local planning only)

a9f8ead

Increment Version to 0.2.1a1

42a4dc1

Update Changelog

67a84e5

Increment Version to 0.2.1a2

8a1d37a

Update Changelog

e7ea189

Increment Version to 0.3.0a1

e26633c

Update Changelog

b3f0b6e

Increment Version to 0.4.0a1

2552f1e

Increment Version to 0.5.0a1

a44d933

Update Changelog

750187c

Increment Version to 0.6.0a1

d3a4e2a

Update Changelog

6b338d9

Increment Version to 0.7.0a1

ea0c880

Update Changelog

ebbf9b7

Increment Version to 1.0.0a1

7c27549

Update Changelog

ab8a4fa

Increment Version to 1.1.0a1

cb253a4

Update Changelog

2bdc170

github-actions[bot] wants to merge 122 commits into master from release-1.1.0a1

2 participants
