Release 1.2.0a1 #45
Open
github-actions[bot] wants to merge 134 commits into
* feat: positional graphemmes * feat: positional graphemes * feat: positional graphemes
* reconstruct latin graphemes * fix(pt-PT):model 4-way sibilant distinction * feat: add pt-BR * feat: more positional contexts * refactor: drop redundant "default" from positional mappings * allow trema in pt-BR graphemes
* feat: asturian * feat: galician * feat: galician
…fix type hints and metadata
- Create PLAN.md: architecture overview, planned phases, data roadmap
- Create TODO.md: prioritised task list (blocking → low)
- Create QUICK_FACTS.md: package identity, key classes, quick usage examples
- Create AUDIT.md: known issues with file:line citations, CI gaps, tech debt
- Create SUGGESTIONS.md: 10 proposals for refactors and enhancements
- feats.py:39: add Dict, Optional to typing imports
- feats.py:56: annotate phone_features: Dict[str, List[Optional[bool]]]
- json_loader.py:115: clarify self-reference cycle comment (was TODO - error log, illegal)
- pyproject.toml:8: update description from "20+ languages" to "308+ language codes"
All 7375 tests pass. No behavioral changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/test_iberian.py with extensive per-language test classes for
all Iberian Peninsula languages, plus a per-language coverage reporter
in conftest.py.
Languages covered (15 classes, 485 tests):
  es-ES     86 tests  graphemes, allophones, positional rules, isoglosses
  pt-PT     62 tests  null graphemes, sandhi, /v/ preservation, schwa
  ca        71 tests  ela geminada, vowel reduction, digraphs, diphthongs
  gl        57 tests  seseo, nasal vowels, null lh/nh/ç, apical s
  eu        47 tests  sibilant contrast (s̺/s̻), affricates, phonemic h
  ast       50 tests  distinción, x→ʃ isogloss, ll→ʎ, aspirated h notation
  an        59 tests  /v/ preservation, seseo, affricates ts/dz, ix→ʃ
  mwl       13 tests  inheritance from ast-PT-x-medieval, ancestry
  dialects  40 tests  es-AR, es-ES-x-andalusia-e, ca-x-valencia, ca-x-balear, pt-BR, gl-x-occidental
Cross-language isogloss tests (10):
distinción vs seseo, /v/ preservation, ll realisation, ch realisation,
rr uvular vs alveolar, h silent vs phonemic, apical vs predorsal s,
ast x→ʃ vs es x→ks, phonological distance clustering, Basque isolation
Coverage reporter (conftest.py):
pytest_terminal_summary prints a per-language pass/fail/total/% table
at the end of any run touching test_iberian.py.
Run with: pytest tests/test_iberian.py -v --tb=short
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
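For readers unfamiliar with the hook, here is a minimal sketch of how a per-class pass/fail table could be produced from pytest_terminal_summary. It is not the code in this PR's conftest.py: the helper names tally_by_class and format_rows, the grouping by test class name, and the column layout are all assumptions made for illustration. Only the hook name and the terminalreporter.stats, write_sep and write_line attributes come from pytest itself.

```python
from collections import defaultdict


def tally_by_class(stats, filename="test_iberian.py"):
    """Return {class_name: [passed, failed]} for node ids inside `filename`.

    `stats` is expected to look like pytest's terminalreporter.stats:
    a dict mapping an outcome name to a list of reports with a `nodeid`.
    """
    table = defaultdict(lambda: [0, 0])
    for column, outcome in enumerate(("passed", "failed")):
        for report in stats.get(outcome, []):
            parts = report.nodeid.split("::")
            # Keep only "file::Class::test" ids that belong to the target file.
            if len(parts) < 3 or not parts[0].endswith(filename):
                continue
            table[parts[1]][column] += 1
    return dict(table)


def format_rows(table):
    """Render one fixed-width row per class: name, pass, fail, total, %."""
    rows = []
    for name in sorted(table):
        passed, failed = table[name]
        total = passed + failed
        pct = 100.0 * passed / total
        rows.append(f"{name:<24}{passed:>6}{failed:>6}{total:>6}{pct:>7.1f}%")
    return rows


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    """pytest hook: print the table only when the target file was run."""
    table = tally_by_class(terminalreporter.stats)
    if not table:
        return
    terminalreporter.write_sep("-", "per-class results (illustrative)")
    for row in format_rows(table):
        terminalreporter.write_line(row)
```

Keeping the tally and the formatting as plain functions means they can be checked without starting a pytest session; the hook itself only wires them to the reporter.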
…sources
- Add LinguisticSource frozen dataclass to types.py
- Add sources: Tuple[LinguisticSource, ...] to LanguageSpec
- Update json_loader.py to parse sources array
- Update SCHEMA.md to document sources field
- Add sources arrays to 33 Germanic language JSON files: en-GB, en-US, en-AU, en-CA, en-IE, en-ZA, en-GB-x-scotland, de-DE, de-AT, de-CH, nl, nl-NL, nl-BE, sv, sv-x-rikssvenska, nb, nn, no, da, da-x-copenhagen, is, fo, af, nds, enm, ang, non, osx, goh, gem, gem-x-ingvaeonic, gem-x-north, gem-x-northwest
- Add tests/test_sources.py (marked @pytest.mark.linguistic)
- Create docs/bibliography.md with Phase 1 sources
- Update PLAN.md and TODO.md with audit phase tracking
- Update MAINTENANCE_REPORT.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dDistance, positional_divergence
Part A: bug fixes and hardening
- segment_distance(strict=True) raises ValueError for unknown IPA segments
- _build_ancestor_graph() detects circular ancestry and raises ValueError
- _get_ancestry_weights_by_code() cached with lru_cache(maxsize=256) via thin wrapper
Part B: new metrics
- phoneme_coverage(spec_native, spec_target) -> float (asymmetric L2 transfer estimate)
- WeightedDistance frozen dataclass added to types.py
- weighted_full_distance() single entry-point with configurable w_inventory/grapheme/allophone/ancestry
- positional_divergence() measures positional-override divergence between two specs
Part C: tests & docs
- 13 new tests in tests/test_distance.py (all pass, no regressions)
- docs/distance.md extended with sections for all new functions and weight-tuning guide
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
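The circular-ancestry check in Part A can be pictured with a small depth-first walk. This is a sketch only, under the assumption that ancestry is available as a mapping from a language code to its parent codes; the function name ancestry_order and the error message format are hypothetical and do not describe _build_ancestor_graph() itself.

```python
from typing import Dict, List, Tuple


def ancestry_order(parents: Dict[str, Tuple[str, ...]]) -> List[str]:
    """Return codes ordered ancestors-first; raise ValueError on a cycle.

    `parents` maps a code to the codes it inherits from (hypothetical shape).
    """
    IN_PROGRESS, DONE = 1, 2
    state: Dict[str, int] = {}
    order: List[str] = []

    def visit(code: str, path: List[str]) -> None:
        seen = state.get(code)
        if seen == DONE:
            return
        if seen == IN_PROGRESS:
            # `code` is already on the current walk, so the path loops back.
            cycle = path[path.index(code):] + [code]
            raise ValueError("circular ancestry: " + " -> ".join(cycle))
        state[code] = IN_PROGRESS
        for parent in parents.get(code, ()):
            visit(parent, path + [code])
        state[code] = DONE
        order.append(code)

    for code in parents:
        visit(code, [])
    return order
```

Raising at graph-build time, rather than while computing a distance, means a bad ancestry declaration fails once and names the offending chain.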
New test files covering 9 language families (956 tests total):
- test_germanic.py: de-DE, de-AT, Bavarian, nl-NL, nl-BE, af, sv, da, nb, is
- test_celtic.py: cy, ga, gd, br, gv, kw
- test_slavic.py: ru, pl, cs, bg, sk, uk, be, hr/sl/sr/mk
- test_romance_extended2.py: Italian dialects, Romanian, Sardinian, Aranese, Caribbean Spanish, Medieval Spanish, Brazilian/Portuguese dialects
- test_indo_iranian.py: hi, sa, fa, fa-x-tehran, fa-AF, tr
- test_arabic.py: arb, ar-x-mashriqi, ar-x-maghrebi, ar-MA, ar-x-gulf, ar-IQ
- test_other_languages.py
Also add germanic/celtic/slavic pytest markers to conftest.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AUDIT.md: mark resolved items (feats.py type annotations, json_loader.py comment, pyproject.toml description, en-GB.json, LinguisticSource); restructure open issues; update date to 2026-03-17.
MAINTENANCE_REPORT.md: add transparency report for multi-family language test suites session (Germanic/Celtic/Slavic/Romance/Indo-Iranian/Arabic, +956 tests).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes stale en/es/fr/pt-BR sections that exercised the old dict-based grapheme API; replaces with a minimal pt-PT demo compatible with the current list-based grapheme structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
02_distance_metrics.py: segment features, inventory/grapheme/allophone distances, ancestry similarity, phoneme_coverage, weighted_full_distance, pairwise matrix
03_tokenizer.py: PhonetokTokenizer: maximal-munch segmentation, TokenKind, ipa_beam with allophone expansion, multi-language comparison
04_dialect_transforms.py: DIALECT_PROFILES inspection, apply_transform, debias_lisbon, cross-dialect word comparison for Portuguese
05_script_distance.py: SCRIPT_REGISTRY, ScriptFeatures, pairwise script distance matrix, closest/farthest pairs, feature analysis
06_sandhi.py: SandhiEngine, French liaison rules, obligatory_only mode, custom Sanskrit sandhi rules, languages-with-sandhi survey
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…an (6 files) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rity dispatch (#21)
- WordContext gains defaulted fields: orthographic prev/next word, sentence-final flag, word index/count and the resolved language code
- G2PPlugin gains non-abstract hooks: normalize (pre-G2P text preparation), post_process (context-aware IPA adjustment) and a priority property used as dispatch tie-break
- plugin discovery keeps the highest-priority plugin when several claim the same language code, regardless of registration order
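The tie-break described in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the discovery code in the package: DemoPlugin, its fields and resolve_plugins are hypothetical stand-ins, and ties between equal priorities are resolved here by keeping the first plugin seen, which the commit message does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class DemoPlugin:
    """Hypothetical stand-in for a plugin: a name, claimed codes, a priority."""
    name: str
    languages: Tuple[str, ...]
    priority: int = 0


def resolve_plugins(plugins: Iterable[DemoPlugin]) -> Dict[str, DemoPlugin]:
    """Map each language code to the highest-priority plugin that claims it."""
    chosen: Dict[str, DemoPlugin] = {}
    for plugin in plugins:
        for lang in plugin.languages:
            current = chosen.get(lang)
            # Replace only on strictly higher priority, so the outcome does
            # not depend on registration order when priorities differ.
            if current is None or plugin.priority > current.priority:
                chosen[lang] = plugin
    return chosen


low = DemoPlugin("generic", ("pt-PT", "pt-BR"), priority=0)
high = DemoPlugin("specialised", ("pt-PT",), priority=10)
forward = resolve_plugins([low, high])
backward = resolve_plugins([high, low])
```

Both forward and backward select the specialised plugin for pt-PT and the generic one for pt-BR.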
…#23)
* feat(schema): declarative stress rules with detection and IPA marking
- StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep
- new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ
- pt-PT and pt-BR seeded with Acordo Ortografico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress
- consumers with a real syllabifier pass their own syllable list
* ci: fix coverage install_extras and exclude self from license check
install_extras:'test' was passed as a bare package name instead of '.[test]'; the gh-automations coverage workflow treats it as a pip install target. The license check now excludes orthography2ipa itself.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
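The precedence order named above (marked vowels, then oxytone endings, then penult endings, then the default position) and the end-anchored marking can be sketched like this. It is a simplified illustration, not the package's stress module: the signatures, the constants and the convention of counting the stressed syllable from the end of the word are assumptions, and the ending lists contain only the endings quoted in the commit message.

```python
from typing import Sequence, Tuple

# Illustrative subsets taken from the commit message, not the shipped rules.
MARKED_VOWELS = frozenset("áéíóúâêô")
OXYTONE_ENDINGS: Tuple[str, ...] = ("im", "om", "um", "ns", "i", "u")
PENULT_ENDINGS: Tuple[str, ...] = ("em", "am")


def detect_stress(syllables: Sequence[str], default_from_end: int = 2) -> int:
    """Return the stressed syllable counted from the end (1 = last)."""
    count = len(syllables)
    # 1. A written accent wins over every ending rule.
    for index, syllable in enumerate(syllables):
        if any(ch in MARKED_VOWELS for ch in syllable):
            return count - index
    word = "".join(syllables)
    # 2. Endings that attract stress to the final syllable.
    if word.endswith(OXYTONE_ENDINGS):
        return 1
    # 3. Endings that keep stress on the penultimate syllable.
    if word.endswith(PENULT_ENDINGS):
        return min(2, count)
    # 4. Language default, clamped for short words.
    return min(default_from_end, count)


def apply_stress_mark(ipa_syllables: Sequence[str], from_end: int) -> str:
    """Prefix the stress mark to the syllable `from_end` places from the end.

    Counting from the end keeps the mark aligned when the IPA form has a
    different number of syllables than the spelling.
    """
    if not ipa_syllables:
        return ""
    index = max(0, len(ipa_syllables) - from_end)
    marked = list(ipa_syllables)
    marked[index] = "ˈ" + marked[index]
    return "".join(marked)
```

With these toy rules, detect_stress(["ho", "mem"]) gives 2 and detect_stress(["jar", "dim"]) gives 1, matching the homem and -im cases in the commit message.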
* feat(schema): declarative stress rules with detection and IPA marking - StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep - new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ - pt-PT and pt-BR seeded with Acordo Ortografico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress - consumers with a real syllabifier pass their own syllable list * feat(plugin): per-language syllabifier entry-point group - SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify entry-point group, priority-resolved like G2P plugins - detect_stress accepts lang= and uses the registered syllabifier for that language (silabificador for Portuguese, pycotovia for Galician) before falling back to the naive vowel-group splitter - plugin output is trusted only when its syllables rebuild the word - get_syllabifier exported from the package root * ci: fix coverage install_extras and exclude self from license check install_extras:'test' was passed as a bare package name instead of '.[test]'; the gh-automations coverage workflow treats it as a pip install target. The license check now excludes orthography2ipa itself. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
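The rule that plugin output is trusted only when its syllables rebuild the word can be sketched as a simple round-trip check. The function names, the vowel set and the regular expression below are assumptions for illustration; the package's naive splitter and its SyllabifierPlugin interface may differ.

```python
import re
from typing import Callable, List, Optional, Sequence

# Illustrative vowel set; the real splitter's inventory is not shown here.
VOWELS = "aeiouáéíóúâêôãõ"
# One vowel group per chunk, with any word-final consonants attached.
_CHUNK = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+(?:[^{VOWELS}]+$)?")


def naive_syllabify(word: str) -> List[str]:
    """Split on vowel groups; fall back to the whole word if none is found."""
    chunks = _CHUNK.findall(word)
    return chunks if chunks else [word]


def syllabify(
    word: str,
    plugin: Optional[Callable[[str], Sequence[str]]] = None,
) -> List[str]:
    """Use the plugin only if its syllables concatenate back to `word`."""
    if plugin is not None:
        syllables = list(plugin(word))
        if syllables and "".join(syllables) == word:
            return syllables
    return naive_syllabify(word)
```

The round-trip check is cheap and guards against a plugin that normalises, lowercases or drops characters, which would otherwise misalign the stress position against the original spelling.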
* fix: py3.9 annotation compatibility, plugin-failure logging, public exports - lm.py gains the future-annotations import its PEP 585 hints require under the declared python_requires >=3.9 - types.py union hints use Optional[...] like the rest of the codebase - plugin discovery logs a warning identifying the entry point when a G2P plugin fails to load instead of silently skipping it - get_plugin, G2PPlugin, WordContext and SandhiEngine join the public API exports - package description matches the shipped language-code count - new test module guards annotation compatibility, failure logging and the export surface * feat(registry): resolve bare language tags and nearest-match fallbacks - bare primary tags resolve to a curated reference variety (pt -> pt-PT, en -> en-GB, ...), matching the ISO 639-3 alias convention rather than langcodes' population-based default - unregistered regional tags fall back to the nearest registered code by language distance via ovos-spec-tools (en-NZ -> en-GB) - public resolve() exposes the resolution without loading a spec - _resolve_code is cached; a data-driven test asserts every registered code's bare primary subtag stays loadable * fix(data): occitan phonology, explicit quality tiers, verified wikipedia links - oc gains a full Lengadocian-norm inventory (56 graphemes, 29 allophones, positional lenition/final rules) sourced from Alibert and Wheeler; it was the only living language resolving to an empty spec - every spec now declares an explicit quality tier; the extinct metadata-only placeholders got, mxi and xsb are tiered stub - the 21 dead wikipedia links kept for review are replaced with MediaWiki-API-verified live articles and the audit tables updated - new guard tests: explicit tier on every spec, and non-stub specs must resolve to non-empty grapheme and allophone inventories * feat(plugin): word context fields, normalize/post_process hooks, priority dispatch - WordContext gains defaulted fields: orthographic prev/next word, 
sentence-final flag, word index/count and the resolved language code - G2PPlugin gains non-abstract hooks: normalize (pre-G2P text preparation), post_process (context-aware IPA adjustment) and a priority property used as dispatch tie-break - plugin discovery keeps the highest-priority plugin when several claim the same language code, regardless of registration order * feat(schema): declarative stress rules with detection and IPA marking - StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep - new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ - pt-PT and pt-BR seeded with Acordo Ortografico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress - consumers with a real syllabifier pass their own syllable list * feat(plugin): per-language syllabifier entry-point group - SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify entry-point group, priority-resolved like G2P plugins - detect_stress accepts lang= and uses the registered syllabifier for that language (silabificador for Portuguese, pycotovia for Galician) before falling back to the naive vowel-group splitter - plugin output is trusted only when its syllables rebuild the word - get_syllabifier exported from the package root * fix(data): accented vowel graphemes for portuguese and spanish trema - pt-PT and pt-BR map the acute and circumflex vowels (a/e/i/o/u + a/e/o) that modern orthography requires; words like 'ola' and 'cafe' previously dropped the accented vowel entirely - es-ES maps the diaeresis u of gue/gui sequences * feat: top-level G2P engine with greedy and beam search - G2P class composes the package pipeline: normalize hook -> tokenizer word split with pausal flags -> 
per-word greedy/beam candidate search -> stress marking -> word-context pass with plugin post_process -> sandhi -> dialect transform - module-level transcribe(text, lang) one-call API; per-word beaming avoids whole-sentence combinatorial growth; greedy == beam(1) - registered G2P plugins take over their languages automatically (normalize/transcribe_word/post_process driven by full WordContext); use_plugins=False forces the data-driven path - CLI transcribe rides the engine: --search greedy|beam, --beam-width, --dialect-profile, --no-plugins; README quick-start leads with transcribe() - invariant test: specs declaring stress rules must map their marked vowels as graphemes * feat(plugin): sentence-level plugin dispatch - G2PPlugin.sentence_level property (default False): when True the engine hands the full normalized text to plugin.transcribe instead of driving transcribe_word per word, for plugins whose quality depends on sentence-wide state (POS tagging, clitic joining) - the plugin then owns context effects and sandhi; the engine still applies the dialect transform and aligns per-word IPA best-effort * Revert "feat(plugin): sentence-level plugin dispatch" This reverts commit 676ce5f. * refactor(g2p): self-contained data-driven engine - the engine never loads external G2P implementations: downstream engines (arbtok, tugaphone) consume this library and own their own pipelines - drop plugin dispatch, use_plugins and the per-word context pass; normalize is a caller-supplied callable - WordTranscription loses the source field; CLI drops --no-plugins
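The statement that greedy search equals a beam of width 1 can be illustrated with a toy per-word search. This is a sketch under assumptions, not the engine's implementation: the table format, the scores and beam_transcribe are hypothetical, and the grapheme values are made-up toy numbers rather than data from any language spec.

```python
import math
from typing import Dict, List, Tuple

# Toy table: grapheme -> [(IPA candidate, illustrative probability)].
TOY_TABLE: Dict[str, List[Tuple[str, float]]] = {
    "c": [("k", 0.7), ("s", 0.3)],
    "ch": [("ʃ", 1.0)],
    "a": [("a", 1.0)],
    "s": [("s", 0.6), ("z", 0.4)],
}


def beam_transcribe(word: str, table=TOY_TABLE, beam_width: int = 4) -> List[str]:
    """Return up to `beam_width` IPA strings for one word, best first."""
    longest = max(len(grapheme) for grapheme in table)
    beam = [(0.0, "")]  # (log-probability, IPA so far)
    pos = 0
    while pos < len(word):
        # Maximal munch: take the longest known grapheme at this position.
        size = next(
            (n for n in range(min(longest, len(word) - pos), 0, -1)
             if word[pos:pos + n] in table),
            0,
        )
        if size == 0:
            pos += 1  # unknown character: skip it in this sketch
            continue
        candidates = table[word[pos:pos + size]]
        expanded = [
            (logp + math.log(prob), ipa + phone)
            for logp, ipa in beam
            for phone, prob in candidates
        ]
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_width]  # prune to the beam width
        pos += size
    return [ipa for _, ipa in beam]


def greedy_transcribe(word: str, table=TOY_TABLE) -> str:
    """Greedy search is the beam search with a single surviving hypothesis."""
    return beam_transcribe(word, table, beam_width=1)[0]
```

Because each word is searched on its own, the number of hypotheses is bounded by the beam width per word rather than growing with sentence length.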
…26)
* refactor!: remove external G2P loading and the bundled arabic plugin
- orthography2ipa never loads full G2P engines: the orthography2ipa.g2p entry-point group, plugin discovery and get_plugin are removed, and PhonetokTokenizer no longer takes a plugin to delegate to
- the bundled arabic_g2p plugin and the tashkeel stub are removed; Arabic transcribes through the data-driven engine, and arbtok is the downstream Arabic engine built on this library
- G2PPlugin and WordContext remain exported as the base types downstream engines implement; component plugins that slot into the engine's own logic keep their dedicated groups (orthography2ipa.syllabify)
* docs: describe the consumer-engine architecture in the README
…y + mwl j/cedilla fix (#28)
* feat(data): stress rules for gl, mwl and barranquenho; galician positional phonology
- gl: add stress block (Cotovia/GTM rules: written accent wins; vowel/n/s-final → penultimate; consonant-final r/l/z/x/d → oxytone); add Cotovia source entry; expand positional_graphemes with word_initial plosive realisations for b/d/g and word_final ŋ for n (Galician velarisation)
- mwl: add stress block (western Iberian paroxytone default; r/l/z/ç/im/um/ns/ão endings → oxytone; tilde vowels as accent-bearers); fix j→ʒ and ç→s̻ overriding the erroneous ast-inherited values ʝ/t͡s
- ext-PT-x-barrancos: add stress block derived from g2p_barranquenho _stressed_index() logic (accent override → paroxytone; vowel/vowel+s-final → paroxytone; other consonant-final → oxytone)
- tests: extend test_stress.py with gold cases for gl (10), mwl (8 incl. divergence checks), and barranquenho (8)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(data): ground mwl and barranquenho stress rules in the orthographic conventions
- mwl: final nasal endings are written -n in Mirandese (Asturleonese trait), so use in/un/on (camin, naçon < Lat. -ōnem) instead of the Portuguese-only im/um; keep word-final ç (rapaç, lhuç) as an oxytone ending; document the six-sibilant system (apical s̺/z̺, laminal s̻/z̻, postalveolar ʃ/ʒ) motivating j→ʒ and ç→s̻; add Vasconcelos 1900 and Convenção Ortográfica da Língua Mirandesa 1999 sources
- ext-PT-x-barrancos: notes grounded in the Portuguese accentuation norms adopted by the Convenção Ortográfica do Barranquenho; unmarked -em/-am stay paroxytone
- tests: mwl gold cases for rapaç/camin/naçon and -n vs -m ending assertions; barranquenho homem/falam paroxytone cases
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…view (#31) * fix: py3.9 annotation compatibility, plugin-failure logging, public exports - lm.py gains the future-annotations import its PEP 585 hints require under the declared python_requires >=3.9 - types.py union hints use Optional[...] like the rest of the codebase - plugin discovery logs a warning identifying the entry point when a G2P plugin fails to load instead of silently skipping it - get_plugin, G2PPlugin, WordContext and SandhiEngine join the public API exports - package description matches the shipped language-code count - new test module guards annotation compatibility, failure logging and the export surface * feat(registry): resolve bare language tags and nearest-match fallbacks - bare primary tags resolve to a curated reference variety (pt -> pt-PT, en -> en-GB, ...), matching the ISO 639-3 alias convention rather than langcodes' population-based default - unregistered regional tags fall back to the nearest registered code by language distance via ovos-spec-tools (en-NZ -> en-GB) - public resolve() exposes the resolution without loading a spec - _resolve_code is cached; a data-driven test asserts every registered code's bare primary subtag stays loadable * fix(data): occitan phonology, explicit quality tiers, verified wikipedia links - oc gains a full Lengadocian-norm inventory (56 graphemes, 29 allophones, positional lenition/final rules) sourced from Alibert and Wheeler; it was the only living language resolving to an empty spec - every spec now declares an explicit quality tier; the extinct metadata-only placeholders got, mxi and xsb are tiered stub - the 21 dead wikipedia links kept for review are replaced with MediaWiki-API-verified live articles and the audit tables updated - new guard tests: explicit tier on every spec, and non-stub specs must resolve to non-empty grapheme and allophone inventories * feat(plugin): word context fields, normalize/post_process hooks, priority dispatch - WordContext gains defaulted fields: orthographic 
prev/next word, sentence-final flag, word index/count and the resolved language code - G2PPlugin gains non-abstract hooks: normalize (pre-G2P text preparation), post_process (context-aware IPA adjustment) and a priority property used as dispatch tie-break - plugin discovery keeps the highest-priority plugin when several claim the same language code, regardless of registration order * feat(schema): declarative stress rules with detection and IPA marking - StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep - new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ - pt-PT and pt-BR seeded with Acordo Ortografico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress - consumers with a real syllabifier pass their own syllable list * feat(plugin): per-language syllabifier entry-point group - SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify entry-point group, priority-resolved like G2P plugins - detect_stress accepts lang= and uses the registered syllabifier for that language (silabificador for Portuguese, pycotovia for Galician) before falling back to the naive vowel-group splitter - plugin output is trusted only when its syllables rebuild the word - get_syllabifier exported from the package root * fix(data): accented vowel graphemes for portuguese and spanish trema - pt-PT and pt-BR map the acute and circumflex vowels (a/e/i/o/u + a/e/o) that modern orthography requires; words like 'ola' and 'cafe' previously dropped the accented vowel entirely - es-ES maps the diaeresis u of gue/gui sequences * feat: top-level G2P engine with greedy and beam search - G2P class composes the package pipeline: normalize hook -> tokenizer word split with pausal 
flags -> per-word greedy/beam candidate search -> stress marking -> word-context pass with plugin post_process -> sandhi -> dialect transform - module-level transcribe(text, lang) one-call API; per-word beaming avoids whole-sentence combinatorial growth; greedy == beam(1) - registered G2P plugins take over their languages automatically (normalize/transcribe_word/post_process driven by full WordContext); use_plugins=False forces the data-driven path - CLI transcribe rides the engine: --search greedy|beam, --beam-width, --dialect-profile, --no-plugins; README quick-start leads with transcribe() - invariant test: specs declaring stress rules must map their marked vowels as graphemes * feat(plugin): sentence-level plugin dispatch - G2PPlugin.sentence_level property (default False): when True the engine hands the full normalized text to plugin.transcribe instead of driving transcribe_word per word, for plugins whose quality depends on sentence-wide state (POS tagging, clitic joining) - the plugin then owns context effects and sandhi; the engine still applies the dialect transform and aligns per-word IPA best-effort * Revert "feat(plugin): sentence-level plugin dispatch" This reverts commit 676ce5f. 
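The per-word greedy/beam candidate search described in the engine commit above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `GRAPHEMES` table, the `beam_transcribe` name, and the cost model (sum of candidate ranks) are hypothetical and are not taken from the package's source. The sketch only shows why greedy search is the same thing as a beam of width 1.

```python
from typing import Dict, List, Tuple

# Hypothetical toy table: grapheme -> IPA candidates, most likely first.
# This is illustrative data, not the package's real spec.
GRAPHEMES: Dict[str, List[str]] = {
    "c": ["k", "s"],
    "a": ["a"],
    "s": ["s", "z"],
    "ch": ["ʃ"],
}


def beam_transcribe(word: str, table: Dict[str, List[str]],
                    beam_width: int = 1) -> List[str]:
    """Return up to beam_width IPA candidates for one word, best first."""
    max_len = max(len(g) for g in table)
    # Each hypothesis is (cost, ipa_so_far); cost is the sum of the ranks
    # of the candidates chosen so far (assumed scoring, lower is better).
    beams: List[Tuple[int, str]] = [(0, "")]
    i = 0
    while i < len(word):
        # Longest-match grapheme lookup at position i.
        for n in range(min(max_len, len(word) - i), 0, -1):
            g = word[i:i + n]
            if g in table:
                break
        else:
            g = word[i]  # unknown character: pass it through unchanged
        cands = table.get(g, [g])
        expanded = [(cost + rank, ipa + c)
                    for cost, ipa in beams
                    for rank, c in enumerate(cands)]
        expanded.sort(key=lambda h: h[0])  # stable sort keeps earlier beams first
        beams = expanded[:beam_width]      # prune to the beam width
        i += len(g)
    return [ipa for _, ipa in beams]


greedy = beam_transcribe("casa", GRAPHEMES, beam_width=1)
wide = beam_transcribe("casa", GRAPHEMES, beam_width=3)
```

Because the beam is pruned after every grapheme of a single word, the number of live hypotheses never exceeds `beam_width`, which is the property the commit message relies on when it says per-word beaming avoids whole-sentence combinatorial growth.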
* refactor(g2p): self-contained data-driven engine - the engine never loads external G2P implementations: downstream engines (arbtok, tugaphone) consume this library and own their own pipelines - drop plugin dispatch, use_plugins and the per-word context pass; normalize is a caller-supplied callable - WordTranscription loses the source field; CLI drops --no-plugins * refactor!: remove external G2P loading and the bundled arabic plugin - orthography2ipa never loads full G2P engines: the orthography2ipa.g2p entry-point group, plugin discovery and get_plugin are removed, and PhonetokTokenizer no longer takes a plugin to delegate to - the bundled arabic_g2p plugin and the tashkeel stub are removed; Arabic transcribes through the data-driven engine, and arbtok is the downstream Arabic engine built on this library - G2PPlugin and WordContext remain exported as the base types downstream engines implement; component plugins that slot into the engine's own logic keep their dedicated groups (orthography2ipa.syllabify) * docs: describe the consumer-engine architecture in the README * feat(data): stress rules for gl, mwl and barranquenho; galician positional phonology - gl: add stress block (Cotovia/GTM rules — written accent wins; vowel/n/s-final → penultimate; consonant-final r/l/z/x/d → oxytone); add Cotovia source entry; expand positional_graphemes with word_initial plosive realisations for b/d/g and word_final ŋ for n (Galician velarisation) - mwl: add stress block (western Iberian paroxytone default; r/l/z/ç/im/um/ns/ão endings → oxytone; tilde vowels as accent-bearers); fix j→ʒ and ç→s̻ overriding the erroneous ast-inherited values ʝ/t͡s - ext-PT-x-barrancos: add stress block derived from g2p_barranquenho _stressed_index() logic (accent override → paroxytone; vowel/vowel+s-final → paroxytone; other consonant-final → oxytone) - tests: extend test_stress.py with gold cases for gl (10), mwl (8 incl. 
divergence checks), and barranquenho (8) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): ground mwl and barranquenho stress rules in the orthographic conventions - mwl: final nasal endings are written -n in Mirandese (Asturleonese trait) — use in/un/on (camin, naçon < Lat. -ōnem) instead of the Portuguese-only im/um; keep word-final ç (rapaç, lhuç) as an oxytone ending; document the six-sibilant system (apical s̺/z̺, laminal s̻/z̻, postalveolar ʃ/ʒ) motivating j→ʒ and ç→s̻; add Vasconcelos 1900 and Convenção Ortográfica da Língua Mirandesa 1999 sources - ext-PT-x-barrancos: notes grounded in the Portuguese accentuation norms adopted by the Convenção Ortográfica do Barranquenho; unmarked -em/-am stay paroxytone - tests: mwl gold cases for rapaç/camin/naçon and -n vs -m ending assertions; barranquenho homem/falam paroxytone cases Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): seseo varieties no longer inherit castilian c->theta positional rule 13 Latin American and regional Spanish specs (es-419, es-AR, es-BO, es-CL, es-CO, es-CO-x-costa, es-CO-x-paisa, es-CR, es-CU, es-DO, es-EC, es-ES-x-canarias) add positional_graphemes.c = {before_e:[s], before_i:[s]} to override the c->θ rule inherited via positional_graphemes_base=es-ES (itself from es-ES-x-medieval). Western Andalusian (es-ES-x-andalusia-w) gets ["s","θ"] to reflect its seseo/ceceo variation documented in its notes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove orphaned palatal-lateral allophones in yeista varieties es-BO and es-EC had allophones["ʎ"] entries but ll->["ʝ"] only, making ʎ unreachable. Both now map ll->["ʝ","ʎ"] so the allophone table is reachable. es-EC additionally adds the characteristic Ecuadorian retroflex [ʒ] to the ʎ allophone set. es-ES-x-cantabria gets a graphemes block with ll->["ʎ","ʝ"] since Cantabrian preserves partial distinction. 
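The orphaned-allophone problem fixed above (an `allophones["ʎ"]` entry that no grapheme can ever produce) lends itself to a simple reachability check. The sketch below is an assumption about how such a check could look: the function name and the allophone values are hypothetical, and it inspects only the base `graphemes` table, ignoring `positional_graphemes` and inherited tables.

```python
from typing import Dict, List, Set


def unreachable_allophone_keys(graphemes: Dict[str, List[str]],
                               allophones: Dict[str, List[str]]) -> Set[str]:
    """Allophone keys that no grapheme candidate can ever produce."""
    produced = {phone for cands in graphemes.values() for phone in cands}
    return set(allophones) - produced


# Hypothetical allophone values; only the keys matter for this check.
allophones = {"ʎ": ["ʎ", "ʝ"], "ʝ": ["ʝ"]}

before = unreachable_allophone_keys({"ll": ["ʝ"]}, allophones)
after = unreachable_allophone_keys({"ll": ["ʝ", "ʎ"]}, allophones)
```

With `ll` mapped to `["ʝ"]` only, the `ʎ` key is reported as unreachable; once `ll` maps to `["ʝ", "ʎ"]`, the set is empty.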
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): modern Spanish acute marks are stress-only, é->e / ó->o in es-ES Fixes es-419 and all 31 es-* descendants: é was resolving to [ɛ] and ó to [ɔ] (Medieval Spanish vowel-quality values), but modern Spanish has a strict 5-vowel system; the acute accent is a stress/disambiguation diacritic only. Added é->["e"] and ó->["o"] to es-ES.json (keeping ü->["w"]); the medieval spec is untouched because [ts]/[dz] sibilant values belong to a separate fix. Updated test_iberian.py: test_accented_e/test_accented_o now assert the corrected values. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Galician (gl) distinción, nh->ŋ, remove nasal-vowel digraphs RAG-norm Galician uses distinción (c/z→θ), not seseo. Previous spec encoded c→[k,s] / z→[z,s] which was wrong. Now: c→["k","θ"], z→["θ"]; positional c before_e/i → ["θ"]; positional z block deleted (uniform θ from base grapheme). nh now maps to ["ŋ"] as in RAG standard (previously null). Nasal-vowel digraph graphemes (an/en/in/on/un/ã/ão/ãi) and their allophone keys deleted — not in RAG orthography. gl-ES gets nh->["ŋ","ɲ"] (covers both RAG and reintegrationist readers). Updated test_iberian.py: all gl seseo/nasal tests updated to correct distinción values; test_distincion_vs_seseo moves gl to the distinción set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove EP spirantization from Brazilian varieties, fix pt-PT/pt-BR-x-* Seven pt-BR-x-* dialect specs (bahia, brasilia, caipira, ce, fluminense, mg, norte) had allophones b->["b","β"] and ɡ->["ɡ","ɣ"] copied from European Portuguese — intervocalic spirantization is an EP feature absent from Brazilian Portuguese. Allophones corrected to b->["b"] and ɡ->["ɡ"]. pt-BR-x-bahia and pt-BR-x-norte: Salvador/Belém speech does palatalise /t,d/ before /i/; notes claiming 'no palatalisation' were wrong. Added t->["t","tʃ"] and d->["d","dʒ"] allophones. 
pt-BR-x-recife: resolved internal contradiction (t/d graphemes listed ["t","tʃ"]/["d","dʒ"] but allophones did the palatalisation — graphemes simplified to ["t"]/["d"]). pt-BR: coda r now ["ʁ","ɾ"] (not sole deletion "∅"); word_final r adds "∅" as variant since coda r is audible in careful/formal speech. pt-PT: coda s -> ["ʃ","ʒ"] (ʃ first as canonical EP value); pretonic e -> ["ɨ"] (correct central vowel, not ["i"]). pt-PT-x-madeira: replaced speculative stop-aspiration allophones with the two attested features: l->["l","ʎ"] and i->["i","ɐj"]. pt-PT-x-minho/trasosmontes: added betacism v->["v","b","β"]; Transmontano also gets graphemes["ch"]=["t͡ʃ","ʃ"] for preserved affricate. Updated test_romance_extended2.py: b_has_beta/g_has_gamma/t_no_palatal/ d_no_palatal tests inverted to correct assertions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): missing core graphemes — accented vowels (it-IT/co/pap/is/fo), fy û/ú swap it-IT and its seven regional sub-specs (abruzzo, calabria, marche, puglia, roma, toscana, umbria): accented vowels à/è/é/ì/ò/ó/ù added to graphemes (they inherit independently and none overrides locally). co (Corsican): added grave-accented vowels à/è/ì/ò/ù; added trigraphs chj->["c"] and ghj->["ɟ"] with matching allophones (the defining palatal graphemes of standard Corsican orthography). pap (Papiamento): added è->ɛ, ò->ɔ, ù->ʏ, ü->y with allophones ʏ/y. is (Icelandic): added missing æ->["ai"]; ó corrected from ["oː"] to ["ou"] (Icelandic ó is a diphthong per Árnason 2011); allophone oː removed, ou added. fo (Faroese): ð corrected to ["∅","j","v","w"] (not a consonant — always deleted or glides); þ removed entirely (not in Faroese alphabet); allophone θ removed; ei->["aɪ"] and oy->["ɔɪ"] fixed. fy (West Frisian): û and ú had swapped values; now û->["uː"] and ú->["yː"]. Updated test_germanic.py (o_acute_is_long_mid->diphthong, u_circumflex, g_includes_velar_fricative) and test_celtic.py (circumflex_y_gives_long_schwa). 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): wrong inherited vowel values in gallo-italic varieties (pms/lij o->u) pms (Piedmontese) and lij (Ligurian/Genoese) inherited "o"->["o","ɔ"] from la-x-galloitalic, but both languages raise this vowel to /u/ in standard orthography. Both specs now override o->["u"], adding ò->["ɔ"] as the explicit open-mid variant. pms also gets eu->["ø"], nasal digraphs changed to ŋ sequences (an->["aŋ"], en->["ɛŋ"], on->["uŋ"], in->["iŋ"]), and nasal-vowel allophones removed. lij gets eu/êu->["ø"], æ->["ɛː"], ç->["s"], and adds allophone ɛː. roa-x-galaicopt: deleted spurious positional_graphemes blocks for a, e, o that were forcing wrong vowel qualities (coda ɐ, stressed-only ɛ/ɔ). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): missing core graphemes in Romance/Italic varieties ca: added tz->["dz"] and ts->["ts"] digraphs (dotze, setze, organitzar). sc (Sardinian): x corrected to ["ʒ"] (not ʃ); tz->["ts"] (was malformed "tː s"); j->["j","dʒ"] (glide first as standard value). vec: added ł->["ɰ","l","∅"] (L tajà/evanescent l, a Venetian signature). rm: tg corrected to ["tɕ"] (not dʒ); gh->["ɡ"] added. fur: ç->["tʃ"], gh->["ɡ"], ss->["s"] (official Grafie ufficiale graphemes). lld (Ladin): j corrected to ["ʒ"] (not ["dʒ"]); allophone ʒ added. lad (Judeo-Spanish): v restored to ["v"] (preserves labiodental); sh->["ʃ"], ny->["ɲ"] added (AY official orthography graphemes). oc-x-aranes: ó corrected to ["u"] (classical Occitan ó reads [u]). ext: g before_e/i corrected from ["s"] to ["h","x"] (Extremaduran velar fricative, not the nonexistent sibilant); added intervocalic ["ɣ"]. an-x-oriental: added positional_graphemes for g (before_e/i->["dʒ"]) and c (before_e/i->["s"]) — the spec's defining Aragonese sibilant features. osc: í->["eː","ɪ"] and ú->["oː","ʊ"] (native Oscan vowel quality). la: added ae->["aj"] and oe->["oj"] diphthongs. 
la-x-gallia: deleted ce/ci entries (positional rule already handles ce/ci->ts). pcd (Picard): ch corrected to ["ʃ"]; tch->["tʃ"] added. wa (Walloon): å->["ɔː"] and tch->["tʃ"] added. Updated test_iberian_extended.py: ext g-before-e test corrected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Asturian/Leonese dialect feature misattributions and missing rules ast-x-occidental: deleted misattributed f-word-initial aspiration rule and h-phonemic override (F-aspiration belongs to Eastern Asturian; Western Asturian conserves Latin F-); deleted allophones block. Base ast handling restored. ast-x-oriental: added positional_graphemes f.word_initial->["h","f"] (F-aspiration is the Eastern dialect's defining feature, not Western). ast-x-leon: added j->["ʒ","d͡ʒ"]; positional g before_e/i->["ʒ"] with intervocalic ɣ restated (child-overwrites-parent shallow merge); allophones ʒ/d͡ʒ added. ast-x-sanabria: added distinción graphemes (c/z/ç->θ, positional c->θ); null overrides for positional s/z to block inherited seseo-like rules. mwl (Mirandese): added c/z graphemes with seseo values (s̻/z̻) and positional rules; added g/j->ʒ and positional g before_e/i->ʒ with intervocalic ɣ. Updated test_iberian_extended.py: occidental h-phonemic and f-word-initial tests corrected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Germanic phonology corrections (af/de-DE/enm/nl/sv/nds/lb) af: g->["x","ɡ"] (Afrikaans g is a voiceless velar/uvular fricative, not ɣ). de-DE: ch positional corrected: after_vowel->["x","ç"], word_initial->["k","ç"], default->["ç"]; replaces the incorrect after_front_vowel/after_back_vowel distinctions, which don't exist in the GraphemePosition enum. enm (Middle English): removed IPA-symbol consonant keys (tʃ/dʒ/θ/ð/ʃ/ʒ/ŋ/x were not graphemes); added orthographic entries th/þ/ȝ/gh/sch/sh/c/ch/qu/wh; e corrected from ["ə"] to ["ɛ","e","ə"] (the most frequent ME vowel). 
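The "child-overwrites-parent shallow merge" noted for ast-x-leon explains why the intervocalic ɣ rule had to be restated: a child block for a grapheme replaces the parent's block wholesale. A minimal sketch of that behaviour follows. The function name and the parent values are hypothetical, and treating a null (`None`) block as "drop the inherited rule" is an assumption based on the ast-x-sanabria note, not a confirmed description of the loader.

```python
from typing import Dict, List, Optional

Rules = Dict[str, List[str]]  # position name -> IPA candidates


def merge_positional(parent: Dict[str, Rules],
                     child: Dict[str, Optional[Rules]]) -> Dict[str, Rules]:
    """Shallow merge: a child's block for a grapheme replaces the parent's."""
    merged = dict(parent)
    for grapheme, rules in child.items():
        if rules is None:
            merged.pop(grapheme, None)  # assumed: null blocks an inherited rule
        else:
            merged[grapheme] = rules    # whole block replaced, no deep merge
    return merged


# Hypothetical parent values, for illustration only.
parent = {
    "g": {"before_e": ["x"], "before_i": ["x"], "intervocalic": ["ɣ"]},
    "s": {"intervocalic": ["z"]},
}

# Without restating it, the child loses the inherited intervocalic rule.
lossy = merge_positional(parent, {"g": {"before_e": ["ʒ"], "before_i": ["ʒ"]}})

# Restating the rule in the child keeps it after the merge.
restated = merge_positional(parent, {
    "g": {"before_e": ["ʒ"], "before_i": ["ʒ"], "intervocalic": ["ɣ"]},
    "s": None,
})
```

Under these assumptions a deep merge would have kept `intervocalic` automatically, at the cost of making it impossible for a child to drop a single position without an explicit null.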
nl/nl-NL: added sch positional block — word_final->["s"] (so -isch/-sch words get [s], not [sx]); default remains ["sx"]. sv: deleted k.before_o (Swedish k does not soften before back vowel o); g.word_final corrected from ["j"] to ["ɡ","j","∅"] (majority value first). nds: sp->["sp","ʃp"] and st->["st","ʃt"] (Northern Low Saxon has both). lb (Luxembourgish): added long vowels aa/ee/oo/ii/uu and diphthongs éi/ou/ue/ie/äi; fixed ei->["ɑɪ"], added ai->["ɑɪ"]; removed äu. Updated test_germanic.py accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): rename xsb -> gem-x-suebi (xsb is ISO 639-3 for Sambal, Philippines) The code 'xsb' is assigned to Sambal (Austronesian, Philippines) in ISO 639-3. Suebi/Suevi has no ISO 639-3 code; correct private-use code is gem-x-suebi matching the gem (Germanic) parent and the X-private namespace convention used by other reconstructed/proto specs. Updated all ancestor references (gl.json, roa-x-galaicopt.json, etc.). Updated test_language_integrity.py exclusion list: gem-x-suebi replaces xsb (stub with empty graphemes). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Arabic/Iranian/Indic script and grapheme corrections arb: added haraka-first vowel digraph keys (correct Unicode order for long vowels اَ→اَ); added hamza carriers ؤ/ئ->["ʔ"]. ar-TD: reclassified parent/bases to ar-x-mashriqi (Chadian Arabic is Sudanic/Eastern Arabic, not Gulf). fa: added hamza letters ء/أ/ؤ/ئ->["ʔ"]. fa-AF (Dari): ی and و now include mater lectionis readings ["j","iː","eː"] / ["v","uː","oː"]; allophones eː/oː added. fa-x-early: corrected script->Latin/script_type->alphabet (spec covers scholarly transliteration, not Perso-Arabic script). ps (Pashto): ي->["j","i"], ی->["j","i","ai"] (five-ye distinction); ه->["h","a","ə"], ۀ->["ə"] added. peo (Old Persian): added ç->["ç"] and j->["dʒ"]; allophone ʒ added. 
ur: ے corrected to ["eː","ɛː"] (baṛī ye is word-final /eː/, not /j/); و/ی gain mater lectionis readings; allophones eː/ɛː/oː/ɔː added. pa: added tone_inventory dict (Punjabi is phonemically tonal). pa-PK: removed graphemes_base='pa' (was inheriting Gurmukhi table into Shahmukhi/Arabic-script spec; Shahmukhi table deferred to a separate PR). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Slavic/Armenian/Baltic/Yiddish grapheme and allophone corrections ru: deleted г.before_e/before_i->["v"] positional rules (г before е/и is not /v/; that reading only applies to historical г-on-the-page = v in specific words; the default [ɡ] already handles it). cs: added Czech soft-reading digraphs dě/tě/ně/mě/di/ti/ni/dí/tí/ní. sk: added ä->["ɛ","æ"] and Slovak soft-reading digraphs de/te/ne/le/ di/ti/ni/li/dí/tí/ní/lí with palatal-first ordering. be: deleted дь/ть/рь graphemes (these digraphs don't exist in Belarusian orthography; soft sign is written differently); allophone rʲ removed. uk: removed spurious Russian-pattern final devoicing allophones b/d/dʲ/ɡ/z/zʲ/ʒ (Ukrainian does not have final devoicing). cu (Old Church Slavonic): renamed "льь"→"ль" and "рьь"→"рь" (doubled soft-sign typos); є corrected to ["e"] (OCS value); ѥ->["je"] added. rue: г corrected to ["ɦ","ɣ"] (Rusyn г is the fricative, like Ukrainian, not the stop; ґ remains ["ɡ"]). hy: added օ->["o"] (Armenian letter, distinct from latin o). lt: added ch->["x"]; h->["ɣ","x"]. yi (Yiddish): א corrected to ["","a"] (silent primary); י->["j","i"]; added YIVO graphemes ײַ/טש/דזש/זש/יִ/ײ/װ. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): South/Southeast Asian script metadata and grapheme corrections ks (Kashmiri): script corrected to Latin (spec covers romanization); voiced aspirates remapped to plain (Kashmiri deaspiration reflex); added ts/tsh dental affricates. sd (Sindhi): script corrected to Latin; added missing nasals n/ṇ/ñ/ṅ. 
bho (Bhojpuri): inherent_vowel corrected from "ə" to "a" (अ->["a"]). as (Assamese): added Assamese-specific letters ৰ->["r"] and ৎ->["t"]. bn (Bengali): added khanda ta ৎ->["t"] (word-final /t̪/). mr (Marathi): च/ज/झ now each have two candidates: palatal first (tɕ/dʑ/dʑʱ), dental second (ts/dz/dzʱ); dental allophones added. or (Odia): added ଢ଼->["ɽʱ"] (the missing aspirated retroflex flap). sa-x-vedic: deleted ॾ->["ɖ"] (U+097E is not a standard Devanagari letter). ml (Malayalam): ഴ corrected from ["z"] to ["ɻ"] (retroflex approximant, not a sibilant); chillu letters ൺ/ൻ/ർ/ൽ/ൾ/ൿ added. kn (Kannada): ಱ corrected to ["r"]; archaic ೞ->["ɻ"] added. te (Telugu): ఱ corrected from ["r̝"] to ["r"]. tcy (Tulu)/brx (Bodo)/mni (Meitei): script corrected to Latin, script_type to alphabet, inherent_vowel removed (specs cover Latin romanization, not native scripts). sat (Santali): script corrected to Latin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Celtic/Celtic-area/reconstructed language grapheme corrections br (Breton): added eu->["œ","ø"] (core vowel grapheme); nasal vowels añ/eñ/iñ/oñ/uñ added with corresponding allophones. cy (Welsh): ŷ corrected from ["əː"] to ["ɨː","iː"] (long clear y, North Welsh); allophone əː removed; ngh->["ŋ̊"] added (nasal mutation of c). ga (Irish): added vowel digraphs ao/eo/ia/ua/ae/aoi/eoi; added eclipsis (urú) digraphs mb/gc/nd/bp/dt/bhf/ts. gd (Scottish Gaelic): added ao/aoi->["ɯː"]; allophone ɯː added. gv (Manx): added çh->["tʃ"] (the distinctive Manx digraph); allophone tʃ added. se (Northern Sami): added c->["ts"] and z->["dz"] (two of the 29 standard letters were missing). rup (Aromanian): added lj->["ʎ"] and nj->["ɲ"] (palatal sonorants that distinguish Aromanian from Romanian). xbr (Common Brythonic): ll->["lː"] (not ɬ — Proto-Brythonic had geminate lateral, not voiceless; voiceless ɬ and r̥ are Western Brythonic innovations); rh removed; allophones ɬ/r̥ removed, lː added. 
xtg/xcg (Gaulish/Cisalpine Gaulish): added p->["p"] (P-Celtic *kʷ>p). xga (Galatian): script corrected to Latin. sem (Proto-Semitic): deleted f (reconstructed *p, not *f). Updated test_celtic.py: circumflex-y test updated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): miscellaneous phonology corrections across diverse languages fi (Finnish): removed coda keys from positional_graphemes for k/p/t (consonant gradation is a morphophonological alternation, not a surface positional rule; coda context alone doesn't determine gradation). hu (Hungarian): removed word_final/coda devoicing positional rules (Hungarian does not have systematic final obstruent devoicing). ts (Xitsonga): deleted xi digraph (yields wrong [ʃ] — x alone handles it); added hl->["ɬ"], ndz->["ndz"], n'w->["ŋw"]. sw (Swahili): ng->["ŋɡ"] (not ["ŋ"]); positional ng block deleted (redundant once base is fixed). ny (Chichewa): ng->["ŋɡ"] (default is prenasalized, not bare ŋ). ff (Fula): bh->["ɓ"] and dh->["ɗ"] (implosives, not fricatives β/ð); spurious positional b/dh blocks removed; allophones ð/β removed. tet (Tetum): added apostrophe->["ʔ"] (ASCII and typographic, INL official). csb (Kashubian): added ł->["w"], ż->["ʒ"], ã->["ã"], ò->["wɛ"]. szl (Silesian): added ż->["ʒ"], ã->["ã"], õ->["õ"]. hsb (Upper Sorbian): added ó->["o","ʊ"] and ř->["ʃ"]. tr (Turkish): deleted k.before_o from positional_graphemes (o is a back vowel; k before o is plain [k], not the palatal [c]). ira (Proto-Iranian): added č->["tʃ"] and ǰ->["dʒ"] with allophones. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): eu ts->ts̺, rename mcm->mzs, kha grapheme fixes eu: ts corrected to ["ts̺"] (apical affricate, distinct from ts̻ which is ⟨tz⟩); allophone ts̺ added. Updated test_iberian.py: test_ts_laminal-> test_ts_apical. mcm: renamed to mzs.json — 'mcm' is ISO 639-3 for Mochica (extinct pre-Columbian Peruvian language); the spec describes Macanese Creole whose correct code is mzs. 
Updated pt-MO.json ancestor reference. kha (Khasi): added j->["dʒ"], ñ->["ɲ"], ï->["j"], ph->["pʰ"], th->["tʰ"] (core Khasi Latin-alphabet graphemes); y corrected to ["ə","ʔ"] (presyllable schwa/glottal, not palatal glide); allophones dʒ/ɲ/pʰ/tʰ/ə/ʔ added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): CJK grapheme corrections (zh/ko/ja) zh: renamed "en-GB" key to "en" (was a JSON parse artifact); added iao->["iau"] and uai->["uai"] (standard pinyin finals missing from spec). ko: ㄺ corrected to ["k"] and ㄿ to ["p"] (final-cluster jamo values that contradict standard Korean codaification; ㄺ final clusters simplify to /k/, not /l/). ja: added 33 katakana yōon digraphs (キャ/キュ/キョ…リョ) mirroring the existing hiragana yōon entries; katakana is standard Japanese orthography and was entirely absent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): stage xsb deletion and pt-MO ancestor reference update Delete xsb.json (superseded by gem-x-suebi.json); update pt-MO.json ancestor reference from mcm to mzs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…ore its period sibilants (#32) * fix: py3.9 annotation compatibility, plugin-failure logging, public exports - lm.py gains the future-annotations import its PEP 585 hints require under the declared python_requires >=3.9 - types.py union hints use Optional[...] like the rest of the codebase - plugin discovery logs a warning identifying the entry point when a G2P plugin fails to load instead of silently skipping it - get_plugin, G2PPlugin, WordContext and SandhiEngine join the public API exports - package description matches the shipped language-code count - new test module guards annotation compatibility, failure logging and the export surface * feat(registry): resolve bare language tags and nearest-match fallbacks - bare primary tags resolve to a curated reference variety (pt -> pt-PT, en -> en-GB, ...), matching the ISO 639-3 alias convention rather than langcodes' population-based default - unregistered regional tags fall back to the nearest registered code by language distance via ovos-spec-tools (en-NZ -> en-GB) - public resolve() exposes the resolution without loading a spec - _resolve_code is cached; a data-driven test asserts every registered code's bare primary subtag stays loadable * fix(data): occitan phonology, explicit quality tiers, verified wikipedia links - oc gains a full Lengadocian-norm inventory (56 graphemes, 29 allophones, positional lenition/final rules) sourced from Alibert and Wheeler; it was the only living language resolving to an empty spec - every spec now declares an explicit quality tier; the extinct metadata-only placeholders got, mxi and xsb are tiered stub - the 21 dead wikipedia links kept for review are replaced with MediaWiki-API-verified live articles and the audit tables updated - new guard tests: explicit tier on every spec, and non-stub specs must resolve to non-empty grapheme and allophone inventories * feat(plugin): word context fields, normalize/post_process hooks, priority dispatch - WordContext gains defaulted fields: 
orthographic prev/next word, sentence-final flag, word index/count and the resolved language code - G2PPlugin gains non-abstract hooks: normalize (pre-G2P text preparation), post_process (context-aware IPA adjustment) and a priority property used as a dispatch tie-break - plugin discovery keeps the highest-priority plugin when several claim the same language code, regardless of registration order * feat(schema): declarative stress rules with detection and IPA marking - StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep - new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ - pt-PT and pt-BR seeded with Acordo Ortográfico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress - consumers with a real syllabifier pass their own syllable list * feat(plugin): per-language syllabifier entry-point group - SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify entry-point group, priority-resolved like G2P plugins - detect_stress accepts lang= and uses the registered syllabifier for that language (silabificador for Portuguese, pycotovia for Galician) before falling back to the naive vowel-group splitter - plugin output is trusted only when its syllables rebuild the word - get_syllabifier exported from the package root * fix(data): accented vowel graphemes for portuguese and spanish trema - pt-PT and pt-BR map the acute and circumflex vowels (á/é/í/ó/ú + â/ê/ô) that modern orthography requires; words like 'olá' and 'café' previously dropped the accented vowel entirely - es-ES maps the diaeresis ü of güe/güi sequences * feat: top-level G2P engine with greedy and beam search - G2P class composes the package pipeline: normalize hook -> tokenizer word split 
with pausal flags -> per-word greedy/beam candidate search -> stress marking -> word-context pass with plugin post_process -> sandhi -> dialect transform - module-level transcribe(text, lang) one-call API; per-word beaming avoids whole-sentence combinatorial growth; greedy == beam(1) - registered G2P plugins take over their languages automatically (normalize/transcribe_word/post_process driven by full WordContext); use_plugins=False forces the data-driven path - CLI transcribe rides the engine: --search greedy|beam, --beam-width, --dialect-profile, --no-plugins; README quick-start leads with transcribe() - invariant test: specs declaring stress rules must map their marked vowels as graphemes * feat(plugin): sentence-level plugin dispatch - G2PPlugin.sentence_level property (default False): when True the engine hands the full normalized text to plugin.transcribe instead of driving transcribe_word per word, for plugins whose quality depends on sentence-wide state (POS tagging, clitic joining) - the plugin then owns context effects and sandhi; the engine still applies the dialect transform and aligns per-word IPA best-effort * Revert "feat(plugin): sentence-level plugin dispatch" This reverts commit 676ce5f. 
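The stress-detection priority stated in the feat(schema) commit above (marked vowels > oxytone endings > penult endings > default position) can be sketched as follows. This is a hedged illustration, not the module's code: the function names echo the commit message, but the signatures, the constant names, the vowel set, and the choice of the penultimate syllable as the default position are all assumptions. The ending lists contain only the Portuguese endings the commit message names.

```python
import re
from typing import List, Optional, Sequence

# Assumed, simplified data for illustration (not the real pt-PT spec).
VOWELS = "aeiouáéíóúâêôãõ"
MARKED = set("áéíóúâêô")                           # written accent wins
OXYTONE_ENDINGS = ("im", "om", "um", "ns", "i", "u")
PENULT_ENDINGS = ("em", "am")

_SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]+")


def naive_syllables(word: str) -> List[str]:
    """Naive vowel-group split: one chunk per run of vowels."""
    parts = _SYLLABLE.findall(word)
    if not parts:
        return [word]
    # Trailing consonants belong to the last syllable.
    parts[-1] += word[sum(len(p) for p in parts):]
    return parts


def detect_stress(word: str, syllables: Optional[Sequence[str]] = None) -> int:
    """Return the 0-based index of the stressed syllable."""
    word = word.lower()
    sylls = list(syllables) if syllables is not None else naive_syllables(word)
    last = len(sylls) - 1
    # 1. A vowel carrying a written accent decides stress outright.
    for i, syll in enumerate(sylls):
        if any(ch in MARKED for ch in syll):
            return i
    # 2. Oxytone endings put stress on the final syllable.
    if word.endswith(OXYTONE_ENDINGS):
        return last
    # 3. Explicit penult endings, then 4. the (assumed) penult default.
    if word.endswith(PENULT_ENDINGS):
        return max(last - 1, 0)
    return max(last - 1, 0)
```

The optional `syllables` argument mirrors the note that consumers with a real syllabifier pass their own syllable list. The naive splitter only needs to get the syllable count right for this purpose, so its boundaries (for example `["ja", "rdim"]` for "jardim") are not linguistically meaningful.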
* refactor(g2p): self-contained data-driven engine - the engine never loads external G2P implementations: downstream engines (arbtok, tugaphone) consume this library and own their own pipelines - drop plugin dispatch, use_plugins and the per-word context pass; normalize is a caller-supplied callable - WordTranscription loses the source field; CLI drops --no-plugins * refactor!: remove external G2P loading and the bundled arabic plugin - orthography2ipa never loads full G2P engines: the orthography2ipa.g2p entry-point group, plugin discovery and get_plugin are removed, and PhonetokTokenizer no longer takes a plugin to delegate to - the bundled arabic_g2p plugin and the tashkeel stub are removed; Arabic transcribes through the data-driven engine, and arbtok is the downstream Arabic engine built on this library - G2PPlugin and WordContext remain exported as the base types downstream engines implement; component plugins that slot into the engine's own logic keep their dedicated groups (orthography2ipa.syllabify) * docs: describe the consumer-engine architecture in the README * feat(data): stress rules for gl, mwl and barranquenho; galician positional phonology - gl: add stress block (Cotovia/GTM rules — written accent wins; vowel/n/s-final → penultimate; consonant-final r/l/z/x/d → oxytone); add Cotovia source entry; expand positional_graphemes with word_initial plosive realisations for b/d/g and word_final ŋ for n (Galician velarisation) - mwl: add stress block (western Iberian paroxytone default; r/l/z/ç/im/um/ns/ão endings → oxytone; tilde vowels as accent-bearers); fix j→ʒ and ç→s̻ overriding the erroneous ast-inherited values ʝ/t͡s - ext-PT-x-barrancos: add stress block derived from g2p_barranquenho _stressed_index() logic (accent override → paroxytone; vowel/vowel+s-final → paroxytone; other consonant-final → oxytone) - tests: extend test_stress.py with gold cases for gl (10), mwl (8 incl. 
divergence checks), and barranquenho (8) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): ground mwl and barranquenho stress rules in the orthographic conventions - mwl: final nasal endings are written -n in Mirandese (Asturleonese trait) — use in/un/on (camin, naçon < Lat. -ōnem) instead of the Portuguese-only im/um; keep word-final ç (rapaç, lhuç) as an oxytone ending; document the six-sibilant system (apical s̺/z̺, laminal s̻/z̻, postalveolar ʃ/ʒ) motivating j→ʒ and ç→s̻; add Vasconcelos 1900 and Convenção Ortográfica da Língua Mirandesa 1999 sources - ext-PT-x-barrancos: notes grounded in the Portuguese accentuation norms adopted by the Convenção Ortográfica do Barranquenho; unmarked -em/-am stay paroxytone - tests: mwl gold cases for rapaç/camin/naçon and -n vs -m ending assertions; barranquenho homem/falam paroxytone cases Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): seseo varieties no longer inherit castilian c->theta positional rule 13 Latin American and regional Spanish specs (es-419, es-AR, es-BO, es-CL, es-CO, es-CO-x-costa, es-CO-x-paisa, es-CR, es-CU, es-DO, es-EC, es-ES-x-canarias) add positional_graphemes.c = {before_e:[s], before_i:[s]} to override the c->θ rule inherited via positional_graphemes_base=es-ES (itself from es-ES-x-medieval). Western Andalusian (es-ES-x-andalusia-w) gets ["s","θ"] to reflect its seseo/ceceo variation documented in its notes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove orphaned palatal-lateral allophones in yeista varieties es-BO and es-EC had allophones["ʎ"] entries but ll->["ʝ"] only, making ʎ unreachable. Both now map ll->["ʝ","ʎ"] so the allophone table is reachable. es-EC additionally adds the characteristic Ecuadorian retroflex [ʒ] to the ʎ allophone set. es-ES-x-cantabria gets a graphemes block with ll->["ʎ","ʝ"] since Cantabrian preserves partial distinction. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): modern Spanish acute marks are stress-only, é->e / ó->o in es-ES Fixes es-419 and all 31 es-* descendants: é was resolving to [ɛ] and ó to [ɔ] (Medieval Spanish vowel-quality values), but modern Spanish has a strict 5-vowel system; the acute accent is a stress/disambiguation diacritic only. Added é->["e"] and ó->["o"] to es-ES.json (keeping ü->["w"]); the medieval spec is untouched because [ts]/[dz] sibilant values belong to a separate fix. Updated test_iberian.py: test_accented_e/test_accented_o now assert the corrected values. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Galician (gl) distinción, nh->ŋ, remove nasal-vowel digraphs RAG-norm Galician uses distinción (c/z→θ), not seseo. Previous spec encoded c→[k,s] / z→[z,s] which was wrong. Now: c→["k","θ"], z→["θ"]; positional c before_e/i → ["θ"]; positional z block deleted (uniform θ from base grapheme). nh now maps to ["ŋ"] as in RAG standard (previously null). Nasal-vowel digraph graphemes (an/en/in/on/un/ã/ão/ãi) and their allophone keys deleted — not in RAG orthography. gl-ES gets nh->["ŋ","ɲ"] (covers both RAG and reintegrationist readers). Updated test_iberian.py: all gl seseo/nasal tests updated to correct distinción values; test_distincion_vs_seseo moves gl to the distinción set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove EP spirantization from Brazilian varieties, fix pt-PT/pt-BR-x-* Seven pt-BR-x-* dialect specs (bahia, brasilia, caipira, ce, fluminense, mg, norte) had allophones b->["b","β"] and ɡ->["ɡ","ɣ"] copied from European Portuguese — intervocalic spirantization is an EP feature absent from Brazilian Portuguese. Allophones corrected to b->["b"] and ɡ->["ɡ"]. pt-BR-x-bahia and pt-BR-x-norte: Salvador/Belém speech does palatalise /t,d/ before /i/; notes claiming 'no palatalisation' were wrong. Added t->["t","tʃ"] and d->["d","dʒ"] allophones. 
pt-BR-x-recife: resolved internal contradiction (t/d graphemes listed ["t","tʃ"]/["d","dʒ"] but allophones did the palatalisation — graphemes simplified to ["t"]/["d"]). pt-BR: coda r now ["ʁ","ɾ"] (not sole deletion "∅"); word_final r adds "∅" as variant since coda r is audible in careful/formal speech. pt-PT: coda s -> ["ʃ","ʒ"] (ʃ first as canonical EP value); pretonic e -> ["ɨ"] (correct central vowel, not ["i"]). pt-PT-x-madeira: replaced speculative stop-aspiration allophones with the two attested features: l->["l","ʎ"] and i->["i","ɐj"]. pt-PT-x-minho/trasosmontes: added betacism v->["v","b","β"]; Transmontano also gets graphemes["ch"]=["t͡ʃ","ʃ"] for preserved affricate. Updated test_romance_extended2.py: b_has_beta/g_has_gamma/t_no_palatal/ d_no_palatal tests inverted to correct assertions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): missing core graphemes — accented vowels (it-IT/co/pap/is/fo), fy û/ú swap it-IT and its seven regional sub-specs (abruzzo, calabria, marche, puglia, roma, toscana, umbria): accented vowels à/è/é/ì/ò/ó/ù added to graphemes (they inherit independently and none overrides locally). co (Corsican): added grave-accented vowels à/è/ì/ò/ù; added trigraphs chj->["c"] and ghj->["ɟ"] with matching allophones (the defining palatal graphemes of standard Corsican orthography). pap (Papiamento): added è->ɛ, ò->ɔ, ù->ʏ, ü->y with allophones ʏ/y. is (Icelandic): added missing æ->["ai"]; ó corrected from ["oː"] to ["ou"] (Icelandic ó is a diphthong per Árnason 2011); allophone oː removed, ou added. fo (Faroese): ð corrected to ["∅","j","v","w"] (not a consonant — always deleted or glides); þ removed entirely (not in Faroese alphabet); allophone θ removed; ei->["aɪ"] and oy->["ɔɪ"] fixed. fy (West Frisian): û and ú had swapped values; now û->["uː"] and ú->["yː"]. Updated test_germanic.py (o_acute_is_long_mid->diphthong, u_circumflex, g_includes_velar_fricative) and test_celtic.py (circumflex_y_gives_long_schwa). 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): wrong inherited vowel values in gallo-italic varieties (pms/lij o->u) pms (Piedmontese) and lij (Ligurian/Genoese) inherited "o"->["o","ɔ"] from la-x-galloitalic, but both languages raise this vowel to /u/ in standard orthography. Both specs now override o->["u"], adding ò->["ɔ"] as the explicit open-mid variant. pms also gets eu->["ø"], nasal digraphs changed to ŋ sequences (an->["aŋ"], en->["ɛŋ"], on->["uŋ"], in->["iŋ"]), and nasal-vowel allophones removed. lij gets eu/êu->["ø"], æ->["ɛː"], ç->["s"], and adds allophone ɛː. roa-x-galaicopt: deleted spurious positional_graphemes blocks for a, e, o that were forcing wrong vowel qualities (coda ɐ, stressed-only ɛ/ɔ). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): missing core graphemes in Romance/Italic varieties ca: added tz->["dz"] and ts->["ts"] digraphs (dotze, setze, organitzar). sc (Sardinian): x corrected to ["ʒ"] (not ʃ); tz->["ts"] (was malformed "tː s"); j->["j","dʒ"] (glide first as standard value). vec: added ł->["ɰ","l","∅"] (L tajà/evanescent l, a Venetian signature). rm: tg corrected to ["tɕ"] (not dʒ); gh->["ɡ"] added. fur: ç->["tʃ"], gh->["ɡ"], ss->["s"] (official Grafie ufficiale graphemes). lld (Ladin): j corrected to ["ʒ"] (not ["dʒ"]); allophone ʒ added. lad (Judeo-Spanish): v restored to ["v"] (preserves labiodental); sh->["ʃ"], ny->["ɲ"] added (AY official orthography graphemes). oc-x-aranes: ó corrected to ["u"] (classical Occitan ó reads [u]). ext: g before_e/i corrected from ["s"] to ["h","x"] (Extremaduran velar fricative, not the nonexistent sibilant); added intervocalic ["ɣ"]. an-x-oriental: added positional_graphemes for g (before_e/i->["dʒ"]) and c (before_e/i->["s"]) — the spec's defining Aragonese sibilant features. osc: í->["eː","ɪ"] and ú->["oː","ʊ"] (native Oscan vowel quality). la: added ae->["aj"] and oe->["oj"] diphthongs. 
la-x-gallia: deleted ce/ci entries (positional rule already handles ce/ci->ts). pcd (Picard): ch corrected to ["ʃ"]; tch->["tʃ"] added. wa (Walloon): å->["ɔː"] and tch->["tʃ"] added. Updated test_iberian_extended.py: ext g-before-e test corrected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Asturian/Leonese dialect feature misattributions and missing rules ast-x-occidental: deleted misattributed f-word-initial aspiration rule and h-phonemic override (F-aspiration belongs to Eastern Asturian; Western Asturian conserves Latin F-); deleted allophones block. Base ast handling restored. ast-x-oriental: added positional_graphemes f.word_initial->["h","f"] (F-aspiration is the Eastern dialect's defining feature, not Western). ast-x-leon: added j->["ʒ","d͡ʒ"]; positional g before_e/i->["ʒ"] with intervocalic ɣ restated (child-overwrites-parent shallow merge); allophones ʒ/d͡ʒ added. ast-x-sanabria: added distinción graphemes (c/z/ç->θ, positional c->θ); null overrides for positional s/z to block inherited seseo-like rules. mwl (Mirandese): added c/z graphemes with seseo values (s̻/z̻) and positional rules; added g/j->ʒ and positional g before_e/i->ʒ with intervocalic ɣ. Updated test_iberian_extended.py: occidental h-phonemic and f-word-initial tests corrected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Germanic phonology corrections (af/de-DE/enm/nl/sv/nds/lb) af: g->["x","ɡ"] (Afrikaans g is voiceless velar/uvular fricative, not ɣ). de-DE: ch positional corrected — after_vowel->["x","ç"], word_initial-> ["k","ç"], default->["ç"]; replaces the incorrect after_front_vowel/ after_back_vowel distinctions which don't exist in GraphemePosition enum. enm (Middle English): removed IPA-symbol consonant keys (tʃ/dʒ/θ/ð/ʃ/ʒ/ ŋ/x were not graphemes); added orthographic entries th/þ/ȝ/gh/sch/sh/c/ch/ qu/wh; e corrected from ["ə"] to ["ɛ","e","ə"] (the most frequent ME vowel). 
nl/nl-NL: added sch positional block — word_final->["s"] (so -isch/-sch words get [s], not [sx]); default remains ["sx"]. sv: deleted k.before_o (Swedish k does not soften before back vowel o); g.word_final corrected from ["j"] to ["ɡ","j","∅"] (majority value first). nds: sp->["sp","ʃp"] and st->["st","ʃt"] (Northern Low Saxon has both). lb (Luxembourgish): added long vowels aa/ee/oo/ii/uu and diphthongs éi/ou/ue/ie/äi; fixed ei->["ɑɪ"], added ai->["ɑɪ"]; removed äu. Updated test_germanic.py accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): rename xsb -> gem-x-suebi (xsb is ISO 639-3 for Sambal, Philippines) The code 'xsb' is assigned to Sambal (Austronesian, Philippines) in ISO 639-3. Suebi/Suevi has no ISO 639-3 code; correct private-use code is gem-x-suebi matching the gem (Germanic) parent and the X-private namespace convention used by other reconstructed/proto specs. Updated all ancestor references (gl.json, roa-x-galaicopt.json, etc.). Updated test_language_integrity.py exclusion list: gem-x-suebi replaces xsb (stub with empty graphemes). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Arabic/Iranian/Indic script and grapheme corrections arb: added haraka-first vowel digraph keys (correct Unicode order for long vowels اَ→اَ); added hamza carriers ؤ/ئ->["ʔ"]. ar-TD: reclassified parent/bases to ar-x-mashriqi (Chadian Arabic is Sudanic/Eastern Arabic, not Gulf). fa: added hamza letters ء/أ/ؤ/ئ->["ʔ"]. fa-AF (Dari): ی and و now include mater lectionis readings ["j","iː","eː"] / ["v","uː","oː"]; allophones eː/oː added. fa-x-early: corrected script->Latin/script_type->alphabet (spec covers scholarly transliteration, not Perso-Arabic script). ps (Pashto): ي->["j","i"], ی->["j","i","ai"] (five-ye distinction); ه->["h","a","ə"], ۀ->["ə"] added. peo (Old Persian): added ç->["ç"] and j->["dʒ"]; allophone ʒ added. 
ur: ے corrected to ["eː","ɛː"] (baṛī ye is word-final /eː/, not /j/); و/ی gain mater lectionis readings; allophones eː/ɛː/oː/ɔː added. pa: added tone_inventory dict (Punjabi is phonemically tonal). pa-PK: removed graphemes_base='pa' (was inheriting Gurmukhi table into Shahmukhi/Arabic-script spec; Shahmukhi table deferred to a separate PR). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Slavic/Armenian/Baltic/Yiddish grapheme and allophone corrections ru: deleted г.before_e/before_i->["v"] positional rules (г before е/и is not /v/; that reading only applies to historical г-on-the-page = v in specific words; the default [ɡ] already handles it). cs: added Czech soft-reading digraphs dě/tě/ně/mě/di/ti/ni/dí/tí/ní. sk: added ä->["ɛ","æ"] and Slovak soft-reading digraphs de/te/ne/le/ di/ti/ni/li/dí/tí/ní/lí with palatal-first ordering. be: deleted дь/ть/рь graphemes (these digraphs don't exist in Belarusian orthography; soft sign is written differently); allophone rʲ removed. uk: removed spurious Russian-pattern final devoicing allophones b/d/dʲ/ɡ/z/zʲ/ʒ (Ukrainian does not have final devoicing). cu (Old Church Slavonic): renamed "льь"→"ль" and "рьь"→"рь" (doubled soft-sign typos); є corrected to ["e"] (OCS value); ѥ->["je"] added. rue: г corrected to ["ɦ","ɣ"] (Rusyn г is the fricative, like Ukrainian, not the stop; ґ remains ["ɡ"]). hy: added օ->["o"] (Armenian letter, distinct from latin o). lt: added ch->["x"]; h->["ɣ","x"]. yi (Yiddish): א corrected to ["","a"] (silent primary); י->["j","i"]; added YIVO graphemes ײַ/טש/דזש/זש/יִ/ײ/װ. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): South/Southeast Asian script metadata and grapheme corrections ks (Kashmiri): script corrected to Latin (spec covers romanization); voiced aspirates remapped to plain (Kashmiri deaspiration reflex); added ts/tsh dental affricates. sd (Sindhi): script corrected to Latin; added missing nasals n/ṇ/ñ/ṅ. 
bho (Bhojpuri): inherent_vowel corrected from "ə" to "a" (अ->["a"]). as (Assamese): added Assamese-specific letters ৰ->["r"] and ৎ->["t"]. bn (Bengali): added khanda ta ৎ->["t"] (word-final /t̪/). mr (Marathi): च/ज/झ now each have two candidates: palatal first (tɕ/dʑ/dʑʱ), dental second (ts/dz/dzʱ); dental allophones added. or (Odia): added ଢ଼->["ɽʱ"] (the missing aspirated retroflex flap). sa-x-vedic: deleted ॾ->["ɖ"] (U+097E is not a standard Devanagari letter). ml (Malayalam): ഴ corrected from ["z"] to ["ɻ"] (retroflex approximant, not a sibilant); chillu letters ൺ/ൻ/ർ/ൽ/ൾ/ൿ added. kn (Kannada): ಱ corrected to ["r"]; archaic ೞ->["ɻ"] added. te (Telugu): ఱ corrected from ["r̝"] to ["r"]. tcy (Tulu)/brx (Bodo)/mni (Meitei): script corrected to Latin, script_type to alphabet, inherent_vowel removed (specs cover Latin romanization, not native scripts). sat (Santali): script corrected to Latin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Celtic/Celtic-area/reconstructed language grapheme corrections br (Breton): added eu->["œ","ø"] (core vowel grapheme); nasal vowels añ/eñ/iñ/oñ/uñ added with corresponding allophones. cy (Welsh): ŷ corrected from ["əː"] to ["ɨː","iː"] (long clear y, North Welsh); allophone əː removed; ngh->["ŋ̊"] added (nasal mutation of c). ga (Irish): added vowel digraphs ao/eo/ia/ua/ae/aoi/eoi; added eclipsis (urú) digraphs mb/gc/nd/bp/dt/bhf/ts. gd (Scottish Gaelic): added ao/aoi->["ɯː"]; allophone ɯː added. gv (Manx): added çh->["tʃ"] (the distinctive Manx digraph); allophone tʃ added. se (Northern Sami): added c->["ts"] and z->["dz"] (two of the 29 standard letters were missing). rup (Aromanian): added lj->["ʎ"] and nj->["ɲ"] (palatal sonorants that distinguish Aromanian from Romanian). xbr (Common Brythonic): ll->["lː"] (not ɬ — Proto-Brythonic had geminate lateral, not voiceless; voiceless ɬ and r̥ are Western Brythonic innovations); rh removed; allophones ɬ/r̥ removed, lː added. 
xtg/xcg (Gaulish/Cisalpine Gaulish): added p->["p"] (P-Celtic *kʷ>p). xga (Galatian): script corrected to Latin. sem (Proto-Semitic): deleted f (reconstructed *p, not *f). Updated test_celtic.py: circumflex-y test updated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): miscellaneous phonology corrections across diverse languages fi (Finnish): removed coda keys from positional_graphemes for k/p/t (consonant gradation is a morphophonological alternation, not a surface positional rule; coda context alone doesn't determine gradation). hu (Hungarian): removed word_final/coda devoicing positional rules (Hungarian does not have systematic final obstruent devoicing). ts (Xitsonga): deleted xi digraph (yields wrong [ʃ] — x alone handles it); added hl->["ɬ"], ndz->["ndz"], n'w->["ŋw"]. sw (Swahili): ng->["ŋɡ"] (not ["ŋ"]); positional ng block deleted (redundant once base is fixed). ny (Chichewa): ng->["ŋɡ"] (default is prenasalized, not bare ŋ). ff (Fula): bh->["ɓ"] and dh->["ɗ"] (implosives, not fricatives β/ð); spurious positional b/dh blocks removed; allophones ð/β removed. tet (Tetum): added apostrophe->["ʔ"] (ASCII and typographic, INL official). csb (Kashubian): added ł->["w"], ż->["ʒ"], ã->["ã"], ò->["wɛ"]. szl (Silesian): added ż->["ʒ"], ã->["ã"], õ->["õ"]. hsb (Upper Sorbian): added ó->["o","ʊ"] and ř->["ʃ"]. tr (Turkish): deleted k.before_o from positional_graphemes (o is a back vowel; k before o is plain [k], not the palatal [c]). ira (Proto-Iranian): added č->["tʃ"] and ǰ->["dʒ"] with allophones. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): eu ts->ts̺, rename mcm->mzs, kha grapheme fixes eu: ts corrected to ["ts̺"] (apical affricate, distinct from ts̻ which is ⟨tz⟩); allophone ts̺ added. Updated test_iberian.py: test_ts_laminal-> test_ts_apical. mcm: renamed to mzs.json — 'mcm' is ISO 639-3 for Mochica (extinct pre-Columbian Peruvian language); the spec describes Macanese Creole whose correct code is mzs. 
Updated pt-MO.json ancestor reference. kha (Khasi): added j->["dʒ"], ñ->["ɲ"], ï->["j"], ph->["pʰ"], th->["tʰ"] (core Khasi Latin-alphabet graphemes); y corrected to ["ə","ʔ"] (presyllable schwa/glottal, not palatal glide); allophones dʒ/ɲ/pʰ/tʰ/ə/ʔ added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): CJK grapheme corrections (zh/ko/ja) zh: renamed "en-GB" key to "en" (was a JSON parse artifact); added iao->["iau"] and uai->["uai"] (standard pinyin finals missing from spec). ko: ㄺ corrected to ["k"] and ㄿ to ["p"] (final-cluster jamo values that contradict standard Korean codaification; ㄺ final clusters simplify to /k/, not /l/). ja: added 33 katakana yōon digraphs (キャ/キュ/キョ…リョ) mirroring the existing hiragana yōon entries; katakana is standard Japanese orthography and was entirely absent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): stage xsb deletion and pt-MO ancestor reference update Delete xsb.json (superseded by gem-x-suebi.json); update pt-MO.json ancestor reference from mcm to mzs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): decouple modern spanish from the old spanish spec and restore its period sibilants Materialise the fully-resolved modern tables into es-ES.json (graphemes, allophones, positional_graphemes) and remove its three *_base references to es-ES-x-medieval, cutting the data-inheritance link while keeping the parent/ancestors lineage metadata intact. es-ES now stands alone; the ~31 dialect descendants and extension varieties that base on es-ES are unaffected. Restore period-accurate values to es-ES-x-medieval (iso639_3 osp, 1200–1500): six-sibilant system per Penny (2002) §§3.1–3.3 — c/ç→/ts/, z→/dz/, x→/ʃ/ (dixo), j/g+e,i→/ʒ/, ss→/s/ (intervocalic voiceless vs -s-→/z/ voiced), ll→/ʎ/ (pre-yeísmo), v→/v~β/ (pre-betacism). Removes anachronistic /θ/ and /x/ (not attested before c.1600). Adds ç and ss graphemes. 
Updates notes with the six-sibilant synopsis and Penny page references. Update two test assertions in TestMedievalSpanish that pinned the wrong modern values (c→θ, z→θ); replace with correct period values (c→ts, z→dz). All other medieval and modern-variety tests unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
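The data fixes above lean on two behaviours the commit messages name: a child spec inheriting a table through a `*_base` reference, and a "child-overwrites-parent shallow merge" in which a child block replaces the parent block for the same grapheme wholesale. Here is a minimal sketch of that merge, assuming a dict-of-dicts layout. The function name `resolve_positional`, the `PARENT` table and its values are hypothetical illustrations, not the package's actual loader or data.

```python
from typing import Dict, List, Optional

# grapheme -> position name -> candidate phonemes
Positional = Dict[str, Dict[str, List[str]]]

# Hypothetical, heavily trimmed parent table (illustrative values only).
PARENT: Positional = {
    "c": {"before_e": ["θ"], "before_i": ["θ"]},
    "g": {"before_e": ["x"], "before_i": ["x"], "intervocalic": ["ɣ"]},
}


def resolve_positional(
    base: Positional,
    overrides: Dict[str, Optional[Dict[str, List[str]]]],
) -> Positional:
    """Shallow merge: a child block replaces the parent's block for that grapheme.

    Assumption: a None (JSON null) override drops the inherited block entirely.
    """
    merged: Positional = {g: dict(block) for g, block in base.items()}
    for grapheme, block in overrides.items():
        if block is None:
            merged.pop(grapheme, None)      # null override blocks inheritance
        else:
            merged[grapheme] = dict(block)  # whole block replaced, not deep-merged
    return merged


# Seseo override: only "c" changes, "g" is inherited untouched.
seseo = resolve_positional(PARENT, {"c": {"before_e": ["s"], "before_i": ["s"]}})

# A partial "g" override loses the parent's intervocalic rule,
# which is why a child must restate it under a shallow merge.
partial = resolve_positional(PARENT, {"g": {"before_e": ["ʒ"], "before_i": ["ʒ"]}})
```

Under this reading, `seseo["g"]` still carries the inherited `intervocalic` entry, while `partial["g"]` does not. That matches the note that ast-x-leon had to restate intervocalic ɣ alongside its new before_e/i values.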
* fix: py3.9 annotation compatibility, plugin-failure logging, public exports
  - lm.py gains the future-annotations import its PEP 585 hints require under the declared python_requires >=3.9
  - types.py union hints use Optional[...] like the rest of the codebase
  - plugin discovery logs a warning identifying the entry point when a G2P plugin fails to load instead of silently skipping it
  - get_plugin, G2PPlugin, WordContext and SandhiEngine join the public API exports
  - package description matches the shipped language-code count
  - new test module guards annotation compatibility, failure logging and the export surface
* feat(registry): resolve bare language tags and nearest-match fallbacks
  - bare primary tags resolve to a curated reference variety (pt -> pt-PT, en -> en-GB, ...), matching the ISO 639-3 alias convention rather than langcodes' population-based default
  - unregistered regional tags fall back to the nearest registered code by language distance via ovos-spec-tools (en-NZ -> en-GB)
  - public resolve() exposes the resolution without loading a spec
  - _resolve_code is cached; a data-driven test asserts every registered code's bare primary subtag stays loadable
* fix(data): occitan phonology, explicit quality tiers, verified wikipedia links
  - oc gains a full Lengadocian-norm inventory (56 graphemes, 29 allophones, positional lenition/final rules) sourced from Alibert and Wheeler; it was the only living language resolving to an empty spec
  - every spec now declares an explicit quality tier; the extinct metadata-only placeholders got, mxi and xsb are tiered stub
  - the 21 dead wikipedia links kept for review are replaced with MediaWiki-API-verified live articles and the audit tables updated
  - new guard tests: explicit tier on every spec, and non-stub specs must resolve to non-empty grapheme and allophone inventories
* feat(plugin): word context fields, normalize/post_process hooks, priority dispatch
  - WordContext gains defaulted fields: orthographic prev/next word,
    sentence-final flag, word index/count and the resolved language code
  - G2PPlugin gains non-abstract hooks: normalize (pre-G2P text preparation), post_process (context-aware IPA adjustment) and a priority property used as dispatch tie-break
  - plugin discovery keeps the highest-priority plugin when several claim the same language code, regardless of registration order
* feat(schema): declarative stress rules with detection and IPA marking
  - StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep
  - new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ
  - pt-PT and pt-BR seeded with Acordo Ortográfico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress
  - consumers with a real syllabifier pass their own syllable list
* feat(plugin): per-language syllabifier entry-point group
  - SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify entry-point group, priority-resolved like G2P plugins
  - detect_stress accepts lang= and uses the registered syllabifier for that language (silabificador for Portuguese, pycotovia for Galician) before falling back to the naive vowel-group splitter
  - plugin output is trusted only when its syllables rebuild the word
  - get_syllabifier exported from the package root
* fix(data): accented vowel graphemes for portuguese and spanish trema
  - pt-PT and pt-BR map the acute and circumflex vowels (á/é/í/ó/ú + â/ê/ô) that modern orthography requires; words like 'olá' and 'café' previously dropped the accented vowel entirely
  - es-ES maps the diaeresis ü of güe/güi sequences
* feat: top-level G2P engine with greedy and beam search
  - G2P class composes the package pipeline: normalize hook -> tokenizer word split with pausal flags ->
    per-word greedy/beam candidate search -> stress marking -> word-context pass with plugin post_process -> sandhi -> dialect transform
  - module-level transcribe(text, lang) one-call API; per-word beaming avoids whole-sentence combinatorial growth; greedy == beam(1)
  - registered G2P plugins take over their languages automatically (normalize/transcribe_word/post_process driven by full WordContext); use_plugins=False forces the data-driven path
  - CLI transcribe rides the engine: --search greedy|beam, --beam-width, --dialect-profile, --no-plugins; README quick-start leads with transcribe()
  - invariant test: specs declaring stress rules must map their marked vowels as graphemes
* feat(plugin): sentence-level plugin dispatch
  - G2PPlugin.sentence_level property (default False): when True the engine hands the full normalized text to plugin.transcribe instead of driving transcribe_word per word, for plugins whose quality depends on sentence-wide state (POS tagging, clitic joining)
  - the plugin then owns context effects and sandhi; the engine still applies the dialect transform and aligns per-word IPA best-effort
* Revert "feat(plugin): sentence-level plugin dispatch"
  This reverts commit 676ce5f.
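The stress-detection order stated in the schema commit (marked vowels > oxytone endings > penult endings > default position) can be sketched roughly as below. This is an illustration under stated assumptions, not the package's `detect_stress`: the names `naive_syllables` and `detect_stress_sketch`, the vowel sets and the ending lists are hypothetical, and the endings are only the Portuguese ones quoted in the commit text. The result is counted from the end of the word (1 = final syllable, 2 = penultimate), as one way to reflect the end-anchored alignment the commit mentions.

```python
import re
from typing import List

# Assumed character sets for the sketch (not the shipped spec data).
VOWELS = "aeiouáéíóúâêôãõàü"
MARKED = set("áéíóúâêô")                              # written accent marks stress
OXYTONE_ENDINGS = ("im", "om", "um", "ns", "i", "u")  # attract final stress
PENULT_ENDINGS = ("em", "am")                         # unmarked, stay paroxytone

_SYLLABLE = re.compile(f"[^{VOWELS}]*[{VOWELS}]+")


def naive_syllables(word: str) -> List[str]:
    """Naive vowel-group split: leading consonants + one vowel group per syllable."""
    word = word.lower()
    parts = _SYLLABLE.findall(word)
    if not parts:
        return [word]
    # Trailing consonants after the last vowel group join the final syllable.
    parts[-1] += word[len("".join(parts)):]
    return parts


def detect_stress_sketch(word: str) -> int:
    """Return the stressed syllable counted from the end (1 = last, 2 = penult)."""
    word = word.lower()
    syllables = naive_syllables(word)
    # 1. A marked vowel wins outright.
    for index, syllable in enumerate(syllables):
        if any(ch in MARKED for ch in syllable):
            return len(syllables) - index
    # 2. Oxytone endings pull stress to the final syllable.
    if word.endswith(OXYTONE_ENDINGS):
        return 1
    # 3. Penult endings and 4. the default both give penultimate stress here.
    return min(2, len(syllables))


for example in ("homem", "falam", "jardim", "café", "casa"):
    print(example, naive_syllables(example), detect_stress_sketch(example))
```

The explicit `PENULT_ENDINGS` tuple is kept only to mirror the four-step order. In this simplified form steps 3 and 4 give the same result, so the tuple is never consulted. A real syllabifier would also split consonant clusters more carefully than the naive splitter does ("jardim" comes out as `["ja", "rdim"]`).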
* refactor(g2p): self-contained data-driven engine
  - the engine never loads external G2P implementations: downstream engines (arbtok, tugaphone) consume this library and own their own pipelines
  - drop plugin dispatch, use_plugins and the per-word context pass; normalize is a caller-supplied callable
  - WordTranscription loses the source field; CLI drops --no-plugins
* refactor!: remove external G2P loading and the bundled arabic plugin
  - orthography2ipa never loads full G2P engines: the orthography2ipa.g2p entry-point group, plugin discovery and get_plugin are removed, and PhonetokTokenizer no longer takes a plugin to delegate to
  - the bundled arabic_g2p plugin and the tashkeel stub are removed; Arabic transcribes through the data-driven engine, and arbtok is the downstream Arabic engine built on this library
  - G2PPlugin and WordContext remain exported as the base types downstream engines implement; component plugins that slot into the engine's own logic keep their dedicated groups (orthography2ipa.syllabify)
* docs: describe the consumer-engine architecture in the README
* feat(data): stress rules for gl, mwl and barranquenho; galician positional phonology
  - gl: add stress block (Cotovia/GTM rules: written accent wins; vowel/n/s-final → penultimate; consonant-final r/l/z/x/d → oxytone); add Cotovia source entry; expand positional_graphemes with word_initial plosive realisations for b/d/g and word_final ŋ for n (Galician velarisation)
  - mwl: add stress block (western Iberian paroxytone default; r/l/z/ç/im/um/ns/ão endings → oxytone; tilde vowels as accent-bearers); fix j→ʒ and ç→s̻ overriding the erroneous ast-inherited values ʝ/t͡s
  - ext-PT-x-barrancos: add stress block derived from g2p_barranquenho _stressed_index() logic (accent override → paroxytone; vowel/vowel+s-final → paroxytone; other consonant-final → oxytone)
  - tests: extend test_stress.py with gold cases for gl (10), mwl (8 incl.
divergence checks), and barranquenho (8) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): ground mwl and barranquenho stress rules in the orthographic conventions - mwl: final nasal endings are written -n in Mirandese (Asturleonese trait) — use in/un/on (camin, naçon < Lat. -ōnem) instead of the Portuguese-only im/um; keep word-final ç (rapaç, lhuç) as an oxytone ending; document the six-sibilant system (apical s̺/z̺, laminal s̻/z̻, postalveolar ʃ/ʒ) motivating j→ʒ and ç→s̻; add Vasconcelos 1900 and Convenção Ortográfica da Língua Mirandesa 1999 sources - ext-PT-x-barrancos: notes grounded in the Portuguese accentuation norms adopted by the Convenção Ortográfica do Barranquenho; unmarked -em/-am stay paroxytone - tests: mwl gold cases for rapaç/camin/naçon and -n vs -m ending assertions; barranquenho homem/falam paroxytone cases Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): seseo varieties no longer inherit castilian c->theta positional rule 13 Latin American and regional Spanish specs (es-419, es-AR, es-BO, es-CL, es-CO, es-CO-x-costa, es-CO-x-paisa, es-CR, es-CU, es-DO, es-EC, es-ES-x-canarias) add positional_graphemes.c = {before_e:[s], before_i:[s]} to override the c->θ rule inherited via positional_graphemes_base=es-ES (itself from es-ES-x-medieval). Western Andalusian (es-ES-x-andalusia-w) gets ["s","θ"] to reflect its seseo/ceceo variation documented in its notes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove orphaned palatal-lateral allophones in yeista varieties es-BO and es-EC had allophones["ʎ"] entries but ll->["ʝ"] only, making ʎ unreachable. Both now map ll->["ʝ","ʎ"] so the allophone table is reachable. es-EC additionally adds the characteristic Ecuadorian retroflex [ʒ] to the ʎ allophone set. es-ES-x-cantabria gets a graphemes block with ll->["ʎ","ʝ"] since Cantabrian preserves partial distinction. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): modern Spanish acute marks are stress-only, é->e / ó->o in es-ES Fixes es-419 and all 31 es-* descendants: é was resolving to [ɛ] and ó to [ɔ] (Medieval Spanish vowel-quality values), but modern Spanish has a strict 5-vowel system; the acute accent is a stress/disambiguation diacritic only. Added é->["e"] and ó->["o"] to es-ES.json (keeping ü->["w"]); the medieval spec is untouched because [ts]/[dz] sibilant values belong to a separate fix. Updated test_iberian.py: test_accented_e/test_accented_o now assert the corrected values. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Galician (gl) distinción, nh->ŋ, remove nasal-vowel digraphs RAG-norm Galician uses distinción (c/z→θ), not seseo. Previous spec encoded c→[k,s] / z→[z,s] which was wrong. Now: c→["k","θ"], z→["θ"]; positional c before_e/i → ["θ"]; positional z block deleted (uniform θ from base grapheme). nh now maps to ["ŋ"] as in RAG standard (previously null). Nasal-vowel digraph graphemes (an/en/in/on/un/ã/ão/ãi) and their allophone keys deleted — not in RAG orthography. gl-ES gets nh->["ŋ","ɲ"] (covers both RAG and reintegrationist readers). Updated test_iberian.py: all gl seseo/nasal tests updated to correct distinción values; test_distincion_vs_seseo moves gl to the distinción set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove EP spirantization from Brazilian varieties, fix pt-PT/pt-BR-x-* Seven pt-BR-x-* dialect specs (bahia, brasilia, caipira, ce, fluminense, mg, norte) had allophones b->["b","β"] and ɡ->["ɡ","ɣ"] copied from European Portuguese — intervocalic spirantization is an EP feature absent from Brazilian Portuguese. Allophones corrected to b->["b"] and ɡ->["ɡ"]. pt-BR-x-bahia and pt-BR-x-norte: Salvador/Belém speech does palatalise /t,d/ before /i/; notes claiming 'no palatalisation' were wrong. Added t->["t","tʃ"] and d->["d","dʒ"] allophones. 
pt-BR-x-recife: resolved internal contradiction (t/d graphemes listed ["t","tʃ"]/["d","dʒ"] but allophones did the palatalisation — graphemes simplified to ["t"]/["d"]). pt-BR: coda r now ["ʁ","ɾ"] (not sole deletion "∅"); word_final r adds "∅" as variant since coda r is audible in careful/formal speech. pt-PT: coda s -> ["ʃ","ʒ"] (ʃ first as canonical EP value); pretonic e -> ["ɨ"] (correct central vowel, not ["i"]). pt-PT-x-madeira: replaced speculative stop-aspiration allophones with the two attested features: l->["l","ʎ"] and i->["i","ɐj"]. pt-PT-x-minho/trasosmontes: added betacism v->["v","b","β"]; Transmontano also gets graphemes["ch"]=["t͡ʃ","ʃ"] for preserved affricate. Updated test_romance_extended2.py: b_has_beta/g_has_gamma/t_no_palatal/ d_no_palatal tests inverted to correct assertions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): missing core graphemes — accented vowels (it-IT/co/pap/is/fo), fy û/ú swap it-IT and its seven regional sub-specs (abruzzo, calabria, marche, puglia, roma, toscana, umbria): accented vowels à/è/é/ì/ò/ó/ù added to graphemes (they inherit independently and none overrides locally). co (Corsican): added grave-accented vowels à/è/ì/ò/ù; added trigraphs chj->["c"] and ghj->["ɟ"] with matching allophones (the defining palatal graphemes of standard Corsican orthography). pap (Papiamento): added è->ɛ, ò->ɔ, ù->ʏ, ü->y with allophones ʏ/y. is (Icelandic): added missing æ->["ai"]; ó corrected from ["oː"] to ["ou"] (Icelandic ó is a diphthong per Árnason 2011); allophone oː removed, ou added. fo (Faroese): ð corrected to ["∅","j","v","w"] (not a consonant — always deleted or glides); þ removed entirely (not in Faroese alphabet); allophone θ removed; ei->["aɪ"] and oy->["ɔɪ"] fixed. fy (West Frisian): û and ú had swapped values; now û->["uː"] and ú->["yː"]. Updated test_germanic.py (o_acute_is_long_mid->diphthong, u_circumflex, g_includes_velar_fricative) and test_celtic.py (circumflex_y_gives_long_schwa). 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): wrong inherited vowel values in gallo-italic varieties (pms/lij o->u) pms (Piedmontese) and lij (Ligurian/Genoese) inherited "o"->["o","ɔ"] from la-x-galloitalic, but both languages raise this vowel to /u/ in standard orthography. Both specs now override o->["u"], adding ò->["ɔ"] as the explicit open-mid variant. pms also gets eu->["ø"], nasal digraphs changed to ŋ sequences (an->["aŋ"], en->["ɛŋ"], on->["uŋ"], in->["iŋ"]), and nasal-vowel allophones removed. lij gets eu/êu->["ø"], æ->["ɛː"], ç->["s"], and adds allophone ɛː. roa-x-galaicopt: deleted spurious positional_graphemes blocks for a, e, o that were forcing wrong vowel qualities (coda ɐ, stressed-only ɛ/ɔ). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): missing core graphemes in Romance/Italic varieties ca: added tz->["dz"] and ts->["ts"] digraphs (dotze, setze, organitzar). sc (Sardinian): x corrected to ["ʒ"] (not ʃ); tz->["ts"] (was malformed "tː s"); j->["j","dʒ"] (glide first as standard value). vec: added ł->["ɰ","l","∅"] (L tajà/evanescent l, a Venetian signature). rm: tg corrected to ["tɕ"] (not dʒ); gh->["ɡ"] added. fur: ç->["tʃ"], gh->["ɡ"], ss->["s"] (official Grafie ufficiale graphemes). lld (Ladin): j corrected to ["ʒ"] (not ["dʒ"]); allophone ʒ added. lad (Judeo-Spanish): v restored to ["v"] (preserves labiodental); sh->["ʃ"], ny->["ɲ"] added (AY official orthography graphemes). oc-x-aranes: ó corrected to ["u"] (classical Occitan ó reads [u]). ext: g before_e/i corrected from ["s"] to ["h","x"] (Extremaduran velar fricative, not the nonexistent sibilant); added intervocalic ["ɣ"]. an-x-oriental: added positional_graphemes for g (before_e/i->["dʒ"]) and c (before_e/i->["s"]) — the spec's defining Aragonese sibilant features. osc: í->["eː","ɪ"] and ú->["oː","ʊ"] (native Oscan vowel quality). la: added ae->["aj"] and oe->["oj"] diphthongs. 
la-x-gallia: deleted ce/ci entries (positional rule already handles ce/ci->ts). pcd (Picard): ch corrected to ["ʃ"]; tch->["tʃ"] added. wa (Walloon): å->["ɔː"] and tch->["tʃ"] added. Updated test_iberian_extended.py: ext g-before-e test corrected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Asturian/Leonese dialect feature misattributions and missing rules ast-x-occidental: deleted misattributed f-word-initial aspiration rule and h-phonemic override (F-aspiration belongs to Eastern Asturian; Western Asturian conserves Latin F-); deleted allophones block. Base ast handling restored. ast-x-oriental: added positional_graphemes f.word_initial->["h","f"] (F-aspiration is the Eastern dialect's defining feature, not Western). ast-x-leon: added j->["ʒ","d͡ʒ"]; positional g before_e/i->["ʒ"] with intervocalic ɣ restated (child-overwrites-parent shallow merge); allophones ʒ/d͡ʒ added. ast-x-sanabria: added distinción graphemes (c/z/ç->θ, positional c->θ); null overrides for positional s/z to block inherited seseo-like rules. mwl (Mirandese): added c/z graphemes with seseo values (s̻/z̻) and positional rules; added g/j->ʒ and positional g before_e/i->ʒ with intervocalic ɣ. Updated test_iberian_extended.py: occidental h-phonemic and f-word-initial tests corrected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Germanic phonology corrections (af/de-DE/enm/nl/sv/nds/lb) af: g->["x","ɡ"] (Afrikaans g is voiceless velar/uvular fricative, not ɣ). de-DE: ch positional corrected — after_vowel->["x","ç"], word_initial-> ["k","ç"], default->["ç"]; replaces the incorrect after_front_vowel/ after_back_vowel distinctions which don't exist in GraphemePosition enum. enm (Middle English): removed IPA-symbol consonant keys (tʃ/dʒ/θ/ð/ʃ/ʒ/ ŋ/x were not graphemes); added orthographic entries th/þ/ȝ/gh/sch/sh/c/ch/ qu/wh; e corrected from ["ə"] to ["ɛ","e","ə"] (the most frequent ME vowel). 
nl/nl-NL: added sch positional block — word_final->["s"] (so -isch/-sch words get [s], not [sx]); default remains ["sx"]. sv: deleted k.before_o (Swedish k does not soften before back vowel o); g.word_final corrected from ["j"] to ["ɡ","j","∅"] (majority value first). nds: sp->["sp","ʃp"] and st->["st","ʃt"] (Northern Low Saxon has both). lb (Luxembourgish): added long vowels aa/ee/oo/ii/uu and diphthongs éi/ou/ue/ie/äi; fixed ei->["ɑɪ"], added ai->["ɑɪ"]; removed äu. Updated test_germanic.py accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): rename xsb -> gem-x-suebi (xsb is ISO 639-3 for Sambal, Philippines) The code 'xsb' is assigned to Sambal (Austronesian, Philippines) in ISO 639-3. Suebi/Suevi has no ISO 639-3 code; correct private-use code is gem-x-suebi matching the gem (Germanic) parent and the X-private namespace convention used by other reconstructed/proto specs. Updated all ancestor references (gl.json, roa-x-galaicopt.json, etc.). Updated test_language_integrity.py exclusion list: gem-x-suebi replaces xsb (stub with empty graphemes). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Arabic/Iranian/Indic script and grapheme corrections arb: added haraka-first vowel digraph keys (correct Unicode order for long vowels اَ→اَ); added hamza carriers ؤ/ئ->["ʔ"]. ar-TD: reclassified parent/bases to ar-x-mashriqi (Chadian Arabic is Sudanic/Eastern Arabic, not Gulf). fa: added hamza letters ء/أ/ؤ/ئ->["ʔ"]. fa-AF (Dari): ی and و now include mater lectionis readings ["j","iː","eː"] / ["v","uː","oː"]; allophones eː/oː added. fa-x-early: corrected script->Latin/script_type->alphabet (spec covers scholarly transliteration, not Perso-Arabic script). ps (Pashto): ي->["j","i"], ی->["j","i","ai"] (five-ye distinction); ه->["h","a","ə"], ۀ->["ə"] added. peo (Old Persian): added ç->["ç"] and j->["dʒ"]; allophone ʒ added. 
ur: ے corrected to ["eː","ɛː"] (baṛī ye is word-final /eː/, not /j/); و/ی gain mater lectionis readings; allophones eː/ɛː/oː/ɔː added. pa: added tone_inventory dict (Punjabi is phonemically tonal). pa-PK: removed graphemes_base='pa' (was inheriting Gurmukhi table into Shahmukhi/Arabic-script spec; Shahmukhi table deferred to a separate PR). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Slavic/Armenian/Baltic/Yiddish grapheme and allophone corrections ru: deleted г.before_e/before_i->["v"] positional rules (г before е/и is not /v/; that reading only applies to historical г-on-the-page = v in specific words; the default [ɡ] already handles it). cs: added Czech soft-reading digraphs dě/tě/ně/mě/di/ti/ni/dí/tí/ní. sk: added ä->["ɛ","æ"] and Slovak soft-reading digraphs de/te/ne/le/ di/ti/ni/li/dí/tí/ní/lí with palatal-first ordering. be: deleted дь/ть/рь graphemes (these digraphs don't exist in Belarusian orthography; soft sign is written differently); allophone rʲ removed. uk: removed spurious Russian-pattern final devoicing allophones b/d/dʲ/ɡ/z/zʲ/ʒ (Ukrainian does not have final devoicing). cu (Old Church Slavonic): renamed "льь"→"ль" and "рьь"→"рь" (doubled soft-sign typos); є corrected to ["e"] (OCS value); ѥ->["je"] added. rue: г corrected to ["ɦ","ɣ"] (Rusyn г is the fricative, like Ukrainian, not the stop; ґ remains ["ɡ"]). hy: added օ->["o"] (Armenian letter, distinct from latin o). lt: added ch->["x"]; h->["ɣ","x"]. yi (Yiddish): א corrected to ["","a"] (silent primary); י->["j","i"]; added YIVO graphemes ײַ/טש/דזש/זש/יִ/ײ/װ. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): South/Southeast Asian script metadata and grapheme corrections ks (Kashmiri): script corrected to Latin (spec covers romanization); voiced aspirates remapped to plain (Kashmiri deaspiration reflex); added ts/tsh dental affricates. sd (Sindhi): script corrected to Latin; added missing nasals n/ṇ/ñ/ṅ. 
bho (Bhojpuri): inherent_vowel corrected from "ə" to "a" (अ->["a"]). as (Assamese): added Assamese-specific letters ৰ->["r"] and ৎ->["t"]. bn (Bengali): added khanda ta ৎ->["t"] (word-final /t̪/). mr (Marathi): च/ज/झ now each have two candidates: palatal first (tɕ/dʑ/dʑʱ), dental second (ts/dz/dzʱ); dental allophones added. or (Odia): added ଢ଼->["ɽʱ"] (the missing aspirated retroflex flap). sa-x-vedic: deleted ॾ->["ɖ"] (U+097E is not a standard Devanagari letter). ml (Malayalam): ഴ corrected from ["z"] to ["ɻ"] (retroflex approximant, not a sibilant); chillu letters ൺ/ൻ/ർ/ൽ/ൾ/ൿ added. kn (Kannada): ಱ corrected to ["r"]; archaic ೞ->["ɻ"] added. te (Telugu): ఱ corrected from ["r̝"] to ["r"]. tcy (Tulu)/brx (Bodo)/mni (Meitei): script corrected to Latin, script_type to alphabet, inherent_vowel removed (specs cover Latin romanization, not native scripts). sat (Santali): script corrected to Latin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Celtic/Celtic-area/reconstructed language grapheme corrections br (Breton): added eu->["œ","ø"] (core vowel grapheme); nasal vowels añ/eñ/iñ/oñ/uñ added with corresponding allophones. cy (Welsh): ŷ corrected from ["əː"] to ["ɨː","iː"] (long clear y, North Welsh); allophone əː removed; ngh->["ŋ̊"] added (nasal mutation of c). ga (Irish): added vowel digraphs ao/eo/ia/ua/ae/aoi/eoi; added eclipsis (urú) digraphs mb/gc/nd/bp/dt/bhf/ts. gd (Scottish Gaelic): added ao/aoi->["ɯː"]; allophone ɯː added. gv (Manx): added çh->["tʃ"] (the distinctive Manx digraph); allophone tʃ added. se (Northern Sami): added c->["ts"] and z->["dz"] (two of the 29 standard letters were missing). rup (Aromanian): added lj->["ʎ"] and nj->["ɲ"] (palatal sonorants that distinguish Aromanian from Romanian). xbr (Common Brythonic): ll->["lː"] (not ɬ — Proto-Brythonic had geminate lateral, not voiceless; voiceless ɬ and r̥ are Western Brythonic innovations); rh removed; allophones ɬ/r̥ removed, lː added. 
xtg/xcg (Gaulish/Cisalpine Gaulish): added p->["p"] (P-Celtic *kʷ>p). xga (Galatian): script corrected to Latin. sem (Proto-Semitic): deleted f (reconstructed *p, not *f). Updated test_celtic.py: circumflex-y test updated. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): miscellaneous phonology corrections across diverse languages fi (Finnish): removed coda keys from positional_graphemes for k/p/t (consonant gradation is a morphophonological alternation, not a surface positional rule; coda context alone doesn't determine gradation). hu (Hungarian): removed word_final/coda devoicing positional rules (Hungarian does not have systematic final obstruent devoicing). ts (Xitsonga): deleted xi digraph (yields wrong [ʃ] — x alone handles it); added hl->["ɬ"], ndz->["ndz"], n'w->["ŋw"]. sw (Swahili): ng->["ŋɡ"] (not ["ŋ"]); positional ng block deleted (redundant once base is fixed). ny (Chichewa): ng->["ŋɡ"] (default is prenasalized, not bare ŋ). ff (Fula): bh->["ɓ"] and dh->["ɗ"] (implosives, not fricatives β/ð); spurious positional b/dh blocks removed; allophones ð/β removed. tet (Tetum): added apostrophe->["ʔ"] (ASCII and typographic, INL official). csb (Kashubian): added ł->["w"], ż->["ʒ"], ã->["ã"], ò->["wɛ"]. szl (Silesian): added ż->["ʒ"], ã->["ã"], õ->["õ"]. hsb (Upper Sorbian): added ó->["o","ʊ"] and ř->["ʃ"]. tr (Turkish): deleted k.before_o from positional_graphemes (o is a back vowel; k before o is plain [k], not the palatal [c]). ira (Proto-Iranian): added č->["tʃ"] and ǰ->["dʒ"] with allophones. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): eu ts->ts̺, rename mcm->mzs, kha grapheme fixes eu: ts corrected to ["ts̺"] (apical affricate, distinct from ts̻ which is ⟨tz⟩); allophone ts̺ added. Updated test_iberian.py: test_ts_laminal-> test_ts_apical. mcm: renamed to mzs.json — 'mcm' is ISO 639-3 for Mochica (extinct pre-Columbian Peruvian language); the spec describes Macanese Creole whose correct code is mzs. 
Updated pt-MO.json ancestor reference. kha (Khasi): added j->["dʒ"], ñ->["ɲ"], ï->["j"], ph->["pʰ"], th->["tʰ"] (core Khasi Latin-alphabet graphemes); y corrected to ["ə","ʔ"] (presyllable schwa/glottal, not palatal glide); allophones dʒ/ɲ/pʰ/tʰ/ə/ʔ added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): CJK grapheme corrections (zh/ko/ja) zh: renamed "en-GB" key to "en" (was a JSON parse artifact); added iao->["iau"] and uai->["uai"] (standard pinyin finals missing from spec). ko: ㄺ corrected to ["k"] and ㄿ to ["p"] (final-cluster jamo values that contradict standard Korean codaification; ㄺ final clusters simplify to /k/, not /l/). ja: added 33 katakana yōon digraphs (キャ/キュ/キョ…リョ) mirroring the existing hiragana yōon entries; katakana is standard Japanese orthography and was entirely absent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): stage xsb deletion and pt-MO ancestor reference update Delete xsb.json (superseded by gem-x-suebi.json); update pt-MO.json ancestor reference from mcm to mzs. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): minor triage — Romance family - roa-x-galaicopt: fix ã→ã (medieval value), add s̺/z̺ allophone keys, remove unreachable ʒ allophone - sc/scn/vec: add grave-accented vowels (à è ì ò ù) for stressed and oxytone forms; scn/vec use dialect-specific mid vowel values - fur: fix h=∅ to empty string (dataset convention) - co: same ∅ fix; add sg→ʒ digraph with ʒ allophone - fr-FR: add j to y, add il reading to ill (ville, yeux coverage) - gl-ES: fix ç to [θ,s] in reintegrationist distinción variety - oc: fix parent to la-x-gallia (matches ancestors entry) - oc-x-aranes: fix iu→iw diphthong, v→[b,β] Gascon betacism, fix glottolog_code aran1237→aran1260 - ext: fix parent to es-ES (matches ancestors and *_base) - es-MX-x-costa: add x→[h,x] allophone (documented coastal feature) - es-PE: add ʎ to ll candidates (Andean lleísta register) - es-ES-x-murcia: remove dangling ʎ allophone (unreachable) - es-CO/es-CR: override s to laminal [s] (not Castilian apico-alveolar) - es-CO-x-costa: remove Puerto Rican ʁ/x from Colombian coastal r - ext-PT-x-barrancos: add aspiration/elision to coda/final s - ca-x-valencia: remove narrow s̺/z̺ from phonemic grapheme layer - egl: restore u as candidate for grapheme u (alongside y) - nrf: add tch/dg/th digraphs for Insular Norman (Jèrriais/Dgèrnésiais) - ast: fix yy to [kʲ] (pre/palatal stop, not literal "ky") - ast-x-cantabrian: add ḥ→h (aspirated F- feature) - an: expand family to full path, correct seseo/distinción note - mxi: fix wikipedia to Mozarabic language article - Update pinned test values for ast yy, an family, es-ES-x-murcia ll, gl-ES ç to match corrected data Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): minor triage — Germanic family - is: add nː to nn candidates (pre-stopping only post-accent); add hv→[kv,xv] digraph (hvaðh, hvernig etc.) 
- nl: fix y to [i,j,ei̞] (loanword coverage); fix ieu→[jøː] (milieu), add ieuw→[iːw] (nieuw) - nl-NL: add eeuw→[eːw] glide-final sequence (leeuw, sneeuw) - sv: add kj→ɕ digraph (kjol, parallel to tj set) - osx: remove anachronistic ʃ from sk (sk→ʃ is Middle Low German, not Old Saxon 800–1150 CE); remove now-unreachable ʃ allophone - non: mark æ→æː and œ→œː as inherently long per Gordon/Noreen; update allophone keys accordingly - da: fix parent gem→non (matches ancestors entry; cf. Faroese) - de-AT: rename allophone key r→ʁ (de-DE base keys rhotic as ʁ; old r key was dead); clear wrong glottolog_code aust1239 (Australian English code, not Austrian German) - de-CH: rename r→ʁ; remove inert keys ʔ/pː/tː/kː - de-x-bavarian: rename r→ʁ allophone key - gem-x-north: set script IPA-reconstruction/reconstruction (attested in Elder Futhark, consistent with parent gem) - got: set script to Gothic (Codex Argenteus; Latin was transliteration) - goh: add tsː and iɑ allophone keys (produced by zz/ia graphemes but missing from allophone table) - en-AU: fix GOAT allophone to [əʉ,əʊ]; add MOUTH aʊ→[æɔ,aʊ] - en-GB-x-scotland: add x to ch candidates to reach documented /x/ - af: fix ei→[ɛi] (matches y→[ɛi]); fix sch→[sk] (Afrikaans value, not Dutch [sx]) - se: fix á to [aː] (marks long vowel in Northern Sami, not same as a) - Update test_rhotic assertions for de-AT/de-x-bavarian to check ʁ key Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): minor triage — Slavic and Armenian - cs/uk/pl: fix parent from ine to sla (10/13 Slavic specs use sla; these three were skipping the Proto-Slavic intermediate) - be: add non-iotated post-consonantal candidates е→[jɛ,ɛ] and ё→[jɔ,ɔ] (ru spec models this; without it every post-consonantal е/ё gets a spurious [j]) - sl: remove bogus ə grapheme (Slovenian has no letter ə; schwa is written with e); add ə to e candidates [ɛ,e,ə] - hsb: fix ě→[e] (Upper Sorbian ě is a mid vowel, not [jɛ]) - dsb: same ě fix; add ŕ→[rʲ] (Lower Sorbian 
palatalized r letter); set parent→null (Lower Sorbian is sister, not descendant, of Upper) - hy: deduplicate ո to single ["o","vo"] entry (duplicate JSON key; parser silently dropped one) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): minor triage — Celtic family - cel-x-goidelic: add kʷ grapheme (Q-Celtic is defined by retaining *kʷ, not merging it; that merger is Old Irish-internal); fix script to IPA-reconstruction/reconstruction (matches sibling cel.json) - xtg: fix wikipedia link to Gaulish language article (was Gaul region) - xcg: fix wikipedia link to Cisalpine Gaulish article (was Galatian) - xga: add p grapheme (Galatian P-Celtic; attested petro- names); fix wikipedia link to Galatian language article (was Gaulish) - xlp: add p grapheme (kʷ→p is the classificatory hallmark of Lepontic; needed to transcribe attested pala, pliale etc.) - kw: add SWF vowel digraphs ou→[uː], eu→[œː,øː], oo→[oː] with matching allophone keys (front rounded vowels absent before this fix) - br: add y→[j] grapheme (yezh, yaouank; /j/ allophone existed but was unreachable from any grapheme) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): minor triage — Indic family - pa: fix addak ੱ from silent "" to gemination marker ː - as: remove dangling ʃ allophone (no grapheme produces ʃ; copied from bn.json); add s as allophone of x (cluster position) - sa-x-vedic: add silent graphemes for udātta ॑ and anudātta ॒ accent marks (ubiquitous in Rigveda text; were causing unknown-grapheme) - si: fix අ to [a,ə] (consistent with inherent_vowel="a"; ə is unstressed reduction only); add plain counterparts to mahaprana letters (aspiration is etymological in modern Sinhala) - ur: add hamza forms ء ئ ؤ ۂ (everyday Urdu spellings for hamza seat on alef/ye/waw; ۂ he-with-hamza) - iir: remove kʷ (no labiovelar series in PII; PIE *kʷ merged with k); add PII affricate series č/ǰ/ǰʰ; fix ∅ to "" in h allophones - tg: fix ӯ to [ɵ,uː] (standard Tajik ӯ is mid-central ɵ, 
reflex of Classical majhul ō; Perry 2005) - peo: remove ž (Old Persian had no /ʒ/; belongs to Avestan) - gu: align vocalic-ṛ matra ૃ and independent ઋ to same value r̩ - brx: add w→[ɯ] grapheme (sixth Bodo vowel); add tone_inventory (level/falling); add missing bʱ/dʱ/ɡʱ allophone keys - mni: add tone_inventory (falling/level) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): minor triage — Dravidian, Semitic, Italic, Berber, Niger-Congo Dravidian: - ta: fix āytam ஃ to [x,h,""] (velar/glottal fricative, not k) - kn: rename ɕ allophone key to ʃ (no grapheme produces ɕ; was a copy-paste from another spec) Semitic/Arabic: - arb: fix sun-letter sandhi regex to include l (ل is a sun letter) and ɮ; remove duplicate s - ar-NG: remove tone-marked short vowels from long vowel allophone sets (conflated incipient tone with vowel length) - ar-IQ-x-qeltu: fix ar-x-mashriqi ancestor role adstrate→ancestor (it is a lineal ancestor, not a contact variety) - xaa: add Andalusi Arabic dialect allophones (aː→[eː], θ→[t,s], ɮˤ/ðˤ merger); add missing script_type abjad; fix wikipedia link to Andalusi Arabic (was Nabataean, an unrelated language) - xpa: fix wikipedia to Old Arabic (was Italic languages) Italic/Ancient: - osc: add Oscan diphthong graphemes ai/ei/oi/au/ou (Buck 1904; all attested, all absent before this fix) - la-x-late: replace ASCII g (U+0067) with IPA ɡ (U+0261) in grapheme candidates (cross-spec consistency with la-x-archaic) - xda: fix iso639_3 xda→xdc (Dacian); add v→w and y→j graphemes (required for attested onomastic corpus) - xlg: fix wikipedia from Ancient Ligurian to Lusitanian language; add v→w grapheme (REVE, NAVIAE attested) - etr: note that input graphemes are Latin transliteration of the Old Italic (Etruscan) alphabet Berber/Afroasiatic: - ber: fix script Latin/alphabet (graphemes are romanization, not Tifinagh) - kea: add ALUPEC letter n̈→ŋ (velar nasal, standard ALUPEC but absent; only ng digraph was present) Niger-Congo/Creole: - ff: fix 
kk→[kː] (Fula geminates are plain long stops, not ʔk); add ŋ and glottal stop ' graphemes - ny: add ŵ→[β] and tch→[tʃʰ] graphemes - aoa: add tone_inventory H/L (Angolar has lexical tone per Maurer 1995) - pre: add tone_inventory H/L (Principense has lexical tone per Maurer 2009) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): minor triage — Portuguese and Italian dialects - pt-PT: add mid-close e/o to nucleus_stressed positional (e/o are both /ɛ~e/ and /ɔ~o/ in stressed position) - pt-PT-x-madeira/acores/caipira/AO: set glottolog_code null (codes were fabricated placeholders or wrong languoids) - pt-PT-x-porto: add v→[v,b] betacism allophone - pt-PT-x-viana: add v betacism; add ch→[tʃ,ʃ] affricate variant - pt-ST: add positional overrides for coda/final s→[s] and z→[s] (Sãotomense does not palatalise coda sibilants like European PT) - pt-CV/pt-GW: add coda s and unstressed e overrides (alveolar coda s, no ɨ reduction; inherited pt-PT rules were wrong) - pt-MZ: add coda s and word-initial r overrides - pt-MO: fix Cantonese ancestor code zh→zh with note about yue (yue.json not yet in registry; annotated for future migration) - pt-BR-x-sul: add alveolar trill r to r/rr candidates (Gaúcho feature noted in spec but absent from grapheme candidates) - it-IT-x-roma/abruzzo/toscana: set glottolog_code null (roma1327=Romanian, abru1238=retired, tosc1239=404) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(data): author pa-PK Shahmukhi grapheme table Full Shahmukhi (Perso-Arabic/Nastaliq) grapheme table for Western Punjabi (iso639-3: pnb), quality=research. 
Coverage:
- Core Urdu-inherited consonants: ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن ھ
- Punjabi-specific retroflexes: ٹ (ʈ) ڈ (ɖ) ڑ (ɽ, retroflex flap)
- Aspiration digraphs with do-chashmi-he ھ: کھ گھ چھ ٹھ تھ پھ بھ دھ ڈھ جھ رھ ڑھ لھ مھ نھ (all 15 standard Shahmukhi aspirates)
- Vowel matras and independent forms: ا آ و ی ے plus harakat diacritics
- Nasalisation: nun-ghunna ں → ̃ (nasalises preceding vowel)

Tone inventory documented (H/L/M three-way system):
- Historic voiced aspirates (بھ جھ ڈھ دھ گھ) yield segmental plain stop + low/falling tone (not murmur); tonal split fully described in notes
- Historic voiceless aspirates (پھ چھ ٹھ تھ کھ) yield aspirate + high/rising tone
- Level/mid tone (M) is the default

Sources: Shackle (1972, 2003 in Cardona & Jain), Bhatia (1993), Masica (1991). Full-alphabet coverage achieved from two primary sources.
Tier: research (full-alphabet, two sourced references).
Stacks on removal of broken Gurmukhi graphemes_base from earlier commit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(data): hu remove aspirated allophones (p/t/k unaspirated in Hungarian)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
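Several data fixes in this batch restate a complete positional block in the child spec (ast-x-leon restates intervocalic ɣ next to its new before_e/i rule) because of the child-overwrites-parent shallow merge. The sketch below illustrates that behaviour under the assumption that positional rules are plain dicts keyed by grapheme and then by position. `merge_positional` and the sample values are hypothetical and are not taken from the library's loader.

```python
from typing import Dict, List

# Assumed shape: grapheme -> position name -> candidate phonemes.
Positional = Dict[str, Dict[str, List[str]]]


def merge_positional(parent: Positional, child: Positional) -> Positional:
    """Shallow merge: a child's block for a grapheme replaces the parent's whole block."""
    merged = dict(parent)  # copy the top level so the parent is left untouched
    merged.update(child)   # per-grapheme replacement, no per-position merging
    return merged


# Hypothetical data, for illustration only.
parent = {"g": {"before_e": ["x"], "before_i": ["x"], "intervocalic": ["ɣ"]}}
child_partial = {"g": {"before_e": ["ʒ"], "before_i": ["ʒ"]}}
child_full = {"g": {"before_e": ["ʒ"], "before_i": ["ʒ"], "intervocalic": ["ɣ"]}}

lost = merge_positional(parent, child_partial)
kept = merge_positional(parent, child_full)

print("intervocalic" in lost["g"])  # the inherited rule is gone
print(kept["g"]["intervocalic"])    # the restated rule survives
```

With a shallow merge, a child that only wants to change one position still has to repeat every inherited position it wants to keep, which is why the commit restates the intervocalic rule.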
* fix: py3.9 annotation compatibility, plugin-failure logging, public exports - lm.py gains the future-annotations import its PEP 585 hints require under the declared python_requires >=3.9 - types.py union hints use Optional[...] like the rest of the codebase - plugin discovery logs a warning identifying the entry point when a G2P plugin fails to load instead of silently skipping it - get_plugin, G2PPlugin, WordContext and SandhiEngine join the public API exports - package description matches the shipped language-code count - new test module guards annotation compatibility, failure logging and the export surface * feat(registry): resolve bare language tags and nearest-match fallbacks - bare primary tags resolve to a curated reference variety (pt -> pt-PT, en -> en-GB, ...), matching the ISO 639-3 alias convention rather than langcodes' population-based default - unregistered regional tags fall back to the nearest registered code by language distance via ovos-spec-tools (en-NZ -> en-GB) - public resolve() exposes the resolution without loading a spec - _resolve_code is cached; a data-driven test asserts every registered code's bare primary subtag stays loadable * fix(data): occitan phonology, explicit quality tiers, verified wikipedia links - oc gains a full Lengadocian-norm inventory (56 graphemes, 29 allophones, positional lenition/final rules) sourced from Alibert and Wheeler; it was the only living language resolving to an empty spec - every spec now declares an explicit quality tier; the extinct metadata-only placeholders got, mxi and xsb are tiered stub - the 21 dead wikipedia links kept for review are replaced with MediaWiki-API-verified live articles and the audit tables updated - new guard tests: explicit tier on every spec, and non-stub specs must resolve to non-empty grapheme and allophone inventories * feat(plugin): word context fields, normalize/post_process hooks, priority dispatch - WordContext gains defaulted fields: orthographic prev/next word, 
sentence-final flag, word index/count and the resolved language code - G2PPlugin gains non-abstract hooks: normalize (pre-G2P text preparation), post_process (context-aware IPA adjustment) and a priority property used as dispatch tie-break - plugin discovery keeps the highest-priority plugin when several claim the same language code, regardless of registration order * feat(schema): declarative stress rules with detection and IPA marking - StressRules frozen dataclass on LanguageSpec (optional, own-file only), validated by a strict pydantic model in lockstep - new stress module: naive vowel-group syllabifier, detect_stress (marked vowels > oxytone endings > penult endings > default position) and apply_stress_mark with end-anchored alignment so orthographic and IPA syllable counts may differ - pt-PT and pt-BR seeded with Acordo Ortográfico 1990 rules; unmarked -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns and final -i/-u attract stress - consumers with a real syllabifier pass their own syllable list * feat(plugin): per-language syllabifier entry-point group - SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify entry-point group, priority-resolved like G2P plugins - detect_stress accepts lang= and uses the registered syllabifier for that language (silabificador for Portuguese, pycotovia for Galician) before falling back to the naive vowel-group splitter - plugin output is trusted only when its syllables rebuild the word - get_syllabifier exported from the package root * fix(data): accented vowel graphemes for portuguese and spanish trema - pt-PT and pt-BR map the acute and circumflex vowels (á/é/í/ó/ú + â/ê/ô) that modern orthography requires; words like 'olá' and 'café' previously dropped the accented vowel entirely - es-ES maps the diaeresis ü of güe/güi sequences * feat: top-level G2P engine with greedy and beam search - G2P class composes the package pipeline: normalize hook -> tokenizer word split with pausal flags -> 
per-word greedy/beam candidate search -> stress marking -> word-context pass with plugin post_process -> sandhi -> dialect transform - module-level transcribe(text, lang) one-call API; per-word beaming avoids whole-sentence combinatorial growth; greedy == beam(1) - registered G2P plugins take over their languages automatically (normalize/transcribe_word/post_process driven by full WordContext); use_plugins=False forces the data-driven path - CLI transcribe rides the engine: --search greedy|beam, --beam-width, --dialect-profile, --no-plugins; README quick-start leads with transcribe() - invariant test: specs declaring stress rules must map their marked vowels as graphemes * feat(plugin): sentence-level plugin dispatch - G2PPlugin.sentence_level property (default False): when True the engine hands the full normalized text to plugin.transcribe instead of driving transcribe_word per word, for plugins whose quality depends on sentence-wide state (POS tagging, clitic joining) - the plugin then owns context effects and sandhi; the engine still applies the dialect transform and aligns per-word IPA best-effort * Revert "feat(plugin): sentence-level plugin dispatch" This reverts commit 676ce5f. 
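The stress-marking step of this pipeline follows the priority order given in the schema commit: marked vowels > oxytone endings > penult endings > default position. The sketch below is one way to read that order for Portuguese, using only the endings named in the commit text. The function names are borrowed from the commit message for readability, but the signatures, the vowel sets, the penultimate default and the from-the-end alignment arithmetic are all assumptions of this sketch, not the package's implementation.

```python
import re
from typing import List, Optional, Sequence

VOWELS = "aeiouáéíóúâêôãõ"                            # assumed vowel set
MARKED = set("áéíóúâêô")                              # assumed accent-bearing vowels
OXYTONE_ENDINGS = ("im", "om", "um", "ns", "i", "u")  # endings named in the commit
PENULT_ENDINGS = ("em", "am")                         # unmarked -em/-am stay paroxytone


def naive_syllables(word: str) -> List[str]:
    """One chunk per vowel group; boundaries are rough, only the count matters."""
    chunks = re.findall(f"[^{VOWELS}]*[{VOWELS}]+", word)
    tail = re.search(f"[^{VOWELS}]+$", word)
    if chunks and tail:
        chunks[-1] += tail.group()  # attach trailing consonants to the last chunk
    return chunks or [word]


def detect_stress(word: str, syllables: Optional[Sequence[str]] = None) -> int:
    """Return the index of the stressed syllable, counted from the start."""
    word = word.lower()
    sylls = [s.lower() for s in syllables] if syllables is not None else naive_syllables(word)
    last = len(sylls) - 1
    penult = max(last - 1, 0)
    for i, syll in enumerate(sylls):  # 1. a written accent wins
        if any(ch in MARKED for ch in syll):
            return i
    if word.endswith(OXYTONE_ENDINGS):  # 2. oxytone endings
        return last
    if word.endswith(PENULT_ENDINGS):   # 3. penult endings
        return penult
    return penult                       # 4. default position (assumed penultimate)


def apply_stress_mark(ipa_syllables: Sequence[str], orth_count: int, stressed: int) -> str:
    """End-anchored alignment: count the stressed syllable from the end of the word."""
    from_end = orth_count - 1 - stressed
    idx = max(len(ipa_syllables) - 1 - from_end, 0)
    out = list(ipa_syllables)
    out[idx] = "ˈ" + out[idx]
    return "".join(out)


for w in ("café", "homem", "falam", "jardim"):
    print(w, detect_stress(w))
```

Anchoring from the end means the mark still lands on the intended syllable when the IPA string has a different number of syllables than the spelling, which is the case the commit message calls out. A caller with a real syllabifier can pass its own syllable list through the `syllables` argument.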
* refactor(g2p): self-contained data-driven engine - the engine never loads external G2P implementations: downstream engines (arbtok, tugaphone) consume this library and own their own pipelines - drop plugin dispatch, use_plugins and the per-word context pass; normalize is a caller-supplied callable - WordTranscription loses the source field; CLI drops --no-plugins * refactor!: remove external G2P loading and the bundled arabic plugin - orthography2ipa never loads full G2P engines: the orthography2ipa.g2p entry-point group, plugin discovery and get_plugin are removed, and PhonetokTokenizer no longer takes a plugin to delegate to - the bundled arabic_g2p plugin and the tashkeel stub are removed; Arabic transcribes through the data-driven engine, and arbtok is the downstream Arabic engine built on this library - G2PPlugin and WordContext remain exported as the base types downstream engines implement; component plugins that slot into the engine's own logic keep their dedicated groups (orthography2ipa.syllabify) * docs: describe the consumer-engine architecture in the README * feat(data): stress rules for gl, mwl and barranquenho; galician positional phonology - gl: add stress block (Cotovia/GTM rules — written accent wins; vowel/n/s-final → penultimate; consonant-final r/l/z/x/d → oxytone); add Cotovia source entry; expand positional_graphemes with word_initial plosive realisations for b/d/g and word_final ŋ for n (Galician velarisation) - mwl: add stress block (western Iberian paroxytone default; r/l/z/ç/im/um/ns/ão endings → oxytone; tilde vowels as accent-bearers); fix j→ʒ and ç→s̻ overriding the erroneous ast-inherited values ʝ/t͡s - ext-PT-x-barrancos: add stress block derived from g2p_barranquenho _stressed_index() logic (accent override → paroxytone; vowel/vowel+s-final → paroxytone; other consonant-final → oxytone) - tests: extend test_stress.py with gold cases for gl (10), mwl (8 incl. 
divergence checks), and barranquenho (8) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): ground mwl and barranquenho stress rules in the orthographic conventions - mwl: final nasal endings are written -n in Mirandese (Asturleonese trait) — use in/un/on (camin, naçon < Lat. -ōnem) instead of the Portuguese-only im/um; keep word-final ç (rapaç, lhuç) as an oxytone ending; document the six-sibilant system (apical s̺/z̺, laminal s̻/z̻, postalveolar ʃ/ʒ) motivating j→ʒ and ç→s̻; add Vasconcelos 1900 and Convenção Ortográfica da Língua Mirandesa 1999 sources - ext-PT-x-barrancos: notes grounded in the Portuguese accentuation norms adopted by the Convenção Ortográfica do Barranquenho; unmarked -em/-am stay paroxytone - tests: mwl gold cases for rapaç/camin/naçon and -n vs -m ending assertions; barranquenho homem/falam paroxytone cases Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): seseo varieties no longer inherit castilian c->theta positional rule 13 Latin American and regional Spanish specs (es-419, es-AR, es-BO, es-CL, es-CO, es-CO-x-costa, es-CO-x-paisa, es-CR, es-CU, es-DO, es-EC, es-ES-x-canarias) add positional_graphemes.c = {before_e:[s], before_i:[s]} to override the c->θ rule inherited via positional_graphemes_base=es-ES (itself from es-ES-x-medieval). Western Andalusian (es-ES-x-andalusia-w) gets ["s","θ"] to reflect its seseo/ceceo variation documented in its notes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove orphaned palatal-lateral allophones in yeista varieties es-BO and es-EC had allophones["ʎ"] entries but ll->["ʝ"] only, making ʎ unreachable. Both now map ll->["ʝ","ʎ"] so the allophone table is reachable. es-EC additionally adds the characteristic Ecuadorian retroflex [ʒ] to the ʎ allophone set. es-ES-x-cantabria gets a graphemes block with ll->["ʎ","ʝ"] since Cantabrian preserves partial distinction. 
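The orphaned-allophone fix above (es-BO and es-EC listed ʎ in the allophone table while ll produced only ʝ) suggests a simple consistency check. The sketch below assumes specs expose a grapheme-to-candidates dict and an allophone dict keyed by phoneme. `unreachable_allophones` and the sample realisations are hypothetical, and candidates are treated as atomic strings, so multi-segment values such as "aŋ" are not split.

```python
from typing import Dict, List, Optional, Set

Positional = Dict[str, Dict[str, List[str]]]


def unreachable_allophones(
    graphemes: Dict[str, List[str]],
    allophones: Dict[str, List[str]],
    positional: Optional[Positional] = None,
) -> Set[str]:
    """Allophone keys that no grapheme candidate (base or positional) can produce."""
    produced = {p for cands in graphemes.values() for p in cands}
    for rules in (positional or {}).values():
        for cands in rules.values():
            produced.update(cands)
    return set(allophones) - produced


# Hypothetical realisation lists, for illustration only.
allophones = {"ʝ": ["ʝ", "ɟʝ"], "ʎ": ["ʎ", "ʝ"]}

before = unreachable_allophones({"ll": ["ʝ"]}, allophones)
after = unreachable_allophones({"ll": ["ʝ", "ʎ"]}, allophones)

print(before)  # ʎ has an allophone entry but nothing produces it
print(after)   # empty once ll lists ʎ as a candidate
```

Run over every spec, a check like this would flag the dangling keys that several triage commits in this PR remove by hand.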
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): modern Spanish acute marks are stress-only, é->e / ó->o in es-ES Fixes es-419 and all 31 es-* descendants: é was resolving to [ɛ] and ó to [ɔ] (Medieval Spanish vowel-quality values), but modern Spanish has a strict 5-vowel system; the acute accent is a stress/disambiguation diacritic only. Added é->["e"] and ó->["o"] to es-ES.json (keeping ü->["w"]); the medieval spec is untouched because [ts]/[dz] sibilant values belong to a separate fix. Updated test_iberian.py: test_accented_e/test_accented_o now assert the corrected values. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): Galician (gl) distinción, nh->ŋ, remove nasal-vowel digraphs RAG-norm Galician uses distinción (c/z→θ), not seseo. Previous spec encoded c→[k,s] / z→[z,s] which was wrong. Now: c→["k","θ"], z→["θ"]; positional c before_e/i → ["θ"]; positional z block deleted (uniform θ from base grapheme). nh now maps to ["ŋ"] as in RAG standard (previously null). Nasal-vowel digraph graphemes (an/en/in/on/un/ã/ão/ãi) and their allophone keys deleted — not in RAG orthography. gl-ES gets nh->["ŋ","ɲ"] (covers both RAG and reintegrationist readers). Updated test_iberian.py: all gl seseo/nasal tests updated to correct distinción values; test_distincion_vs_seseo moves gl to the distinción set. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): remove EP spirantization from Brazilian varieties, fix pt-PT/pt-BR-x-* Seven pt-BR-x-* dialect specs (bahia, brasilia, caipira, ce, fluminense, mg, norte) had allophones b->["b","β"] and ɡ->["ɡ","ɣ"] copied from European Portuguese — intervocalic spirantization is an EP feature absent from Brazilian Portuguese. Allophones corrected to b->["b"] and ɡ->["ɡ"]. pt-BR-x-bahia and pt-BR-x-norte: Salvador/Belém speech does palatalise /t,d/ before /i/; notes claiming 'no palatalisation' were wrong. Added t->["t","tʃ"] and d->["d","dʒ"] allophones. 
pt-BR-x-recife: resolved internal contradiction (t/d graphemes listed ["t","tʃ"]/["d","dʒ"] but allophones did the palatalisation — graphemes simplified to ["t"]/["d"]). pt-BR: coda r now ["ʁ","ɾ"] (not sole deletion "∅"); word_final r adds "∅" as variant since coda r is audible in careful/formal speech. pt-PT: coda s -> ["ʃ","ʒ"] (ʃ first as canonical EP value); pretonic e -> ["ɨ"] (correct central vowel, not ["i"]). pt-PT-x-madeira: replaced speculative stop-aspiration allophones with the two attested features: l->["l","ʎ"] and i->["i","ɐj"]. pt-PT-x-minho/trasosmontes: added betacism v->["v","b","β"]; Transmontano also gets graphemes["ch"]=["t͡ʃ","ʃ"] for preserved affricate. Updated test_romance_extended2.py: b_has_beta/g_has_gamma/t_no_palatal/ d_no_palatal tests inverted to correct assertions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(data): missing core graphemes — accented vowels (it-IT/co/pap/is/fo), fy û/ú swap it-IT and its seven regional sub-specs (abruzzo, calabria, marche, puglia, roma, toscana, umbria): accented vowels à/è/é/ì/ò/ó/ù added to graphemes (they inherit independently and none overrides locally). co (Corsican): added grave-accented vowels à/è/ì/ò/ù; added trigraphs chj->["c"] and ghj->["ɟ"] with matching allophones (the defining palatal graphemes of standard Corsican orthography). pap (Papiamento): added è->ɛ, ò->ɔ, ù->ʏ, ü->y with allophones ʏ/y. is (Icelandic): added missing æ->["ai"]; ó corrected from ["oː"] to ["ou"] (Icelandic ó is a diphthong per Árnason 2011); allophone oː removed, ou added. fo (Faroese): ð corrected to ["∅","j","v","w"] (not a consonant — always deleted or glides); þ removed entirely (not in Faroese alphabet); allophone θ removed; ei->["aɪ"] and oy->["ɔɪ"] fixed. fy (West Frisian): û and ú had swapped values; now û->["uː"] and ú->["yː"]. Updated test_germanic.py (o_acute_is_long_mid->diphthong, u_circumflex, g_includes_velar_fricative) and test_celtic.py (circumflex_y_gives_long_schwa). 

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): wrong inherited vowel values in gallo-italic varieties (pms/lij o->u)

pms (Piedmontese) and lij (Ligurian/Genoese) inherited "o"->["o","ɔ"] from la-x-galloitalic, but both languages raise this vowel to /u/ in standard orthography. Both specs now override o->["u"], adding ò->["ɔ"] as the explicit open-mid variant.
pms also gets eu->["ø"], nasal digraphs changed to ŋ sequences (an->["aŋ"], en->["ɛŋ"], on->["uŋ"], in->["iŋ"]), and nasal-vowel allophones removed.
lij gets eu/êu->["ø"], æ->["ɛː"], ç->["s"], and adds allophone ɛː.
roa-x-galaicopt: deleted spurious positional_graphemes blocks for a, e, o that were forcing wrong vowel qualities (coda ɐ, stressed-only ɛ/ɔ).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): missing core graphemes in Romance/Italic varieties

ca: added tz->["dz"] and ts->["ts"] digraphs (dotze, setze, organitzar).
sc (Sardinian): x corrected to ["ʒ"] (not ʃ); tz->["ts"] (was malformed "tː s"); j->["j","dʒ"] (glide first as standard value).
vec: added ł->["ɰ","l","∅"] (L tajà/evanescent l, a Venetian signature).
rm: tg corrected to ["tɕ"] (not dʒ); gh->["ɡ"] added.
fur: ç->["tʃ"], gh->["ɡ"], ss->["s"] (official Grafie ufficiale graphemes).
lld (Ladin): j corrected to ["ʒ"] (not ["dʒ"]); allophone ʒ added.
lad (Judeo-Spanish): v restored to ["v"] (preserves labiodental); sh->["ʃ"], ny->["ɲ"] added (AY official orthography graphemes).
oc-x-aranes: ó corrected to ["u"] (classical Occitan ó reads [u]).
ext: g before_e/i corrected from ["s"] to ["h","x"] (Extremaduran velar fricative, not the nonexistent sibilant); added intervocalic ["ɣ"].
an-x-oriental: added positional_graphemes for g (before_e/i->["dʒ"]) and c (before_e/i->["s"]) — the spec's defining Aragonese sibilant features.
osc: í->["eː","ɪ"] and ú->["oː","ʊ"] (native Oscan vowel quality).
la: added ae->["aj"] and oe->["oj"] diphthongs.
la-x-gallia: deleted ce/ci entries (positional rule already handles ce/ci->ts).
pcd (Picard): ch corrected to ["ʃ"]; tch->["tʃ"] added.
wa (Walloon): å->["ɔː"] and tch->["tʃ"] added.
Updated test_iberian_extended.py: ext g-before-e test corrected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): Asturian/Leonese dialect feature misattributions and missing rules

ast-x-occidental: deleted misattributed f-word-initial aspiration rule and h-phonemic override (F-aspiration belongs to Eastern Asturian; Western Asturian conserves Latin F-); deleted allophones block. Base ast handling restored.
ast-x-oriental: added positional_graphemes f.word_initial->["h","f"] (F-aspiration is the Eastern dialect's defining feature, not Western).
ast-x-leon: added j->["ʒ","d͡ʒ"]; positional g before_e/i->["ʒ"] with intervocalic ɣ restated (child-overwrites-parent shallow merge); allophones ʒ/d͡ʒ added.
ast-x-sanabria: added distinción graphemes (c/z/ç->θ, positional c->θ); null overrides for positional s/z to block inherited seseo-like rules.
mwl (Mirandese): added c/z graphemes with seseo values (s̻/z̻) and positional rules; added g/j->ʒ and positional g before_e/i->ʒ with intervocalic ɣ.
Updated test_iberian_extended.py: occidental h-phonemic and f-word-initial tests corrected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): Germanic phonology corrections (af/de-DE/enm/nl/sv/nds/lb)

af: g->["x","ɡ"] (Afrikaans g is voiceless velar/uvular fricative, not ɣ).
de-DE: ch positional corrected — after_vowel->["x","ç"], word_initial->["k","ç"], default->["ç"]; replaces the incorrect after_front_vowel/after_back_vowel distinctions which don't exist in GraphemePosition enum.
enm (Middle English): removed IPA-symbol consonant keys (tʃ/dʒ/θ/ð/ʃ/ʒ/ŋ/x were not graphemes); added orthographic entries th/þ/ȝ/gh/sch/sh/c/ch/qu/wh; e corrected from ["ə"] to ["ɛ","e","ə"] (the most frequent ME vowel).
nl/nl-NL: added sch positional block — word_final->["s"] (so -isch/-sch words get [s], not [sx]); default remains ["sx"].
sv: deleted k.before_o (Swedish k does not soften before back vowel o); g.word_final corrected from ["j"] to ["ɡ","j","∅"] (majority value first).
nds: sp->["sp","ʃp"] and st->["st","ʃt"] (Northern Low Saxon has both).
lb (Luxembourgish): added long vowels aa/ee/oo/ii/uu and diphthongs éi/ou/ue/ie/äi; fixed ei->["ɑɪ"], added ai->["ɑɪ"]; removed äu.
Updated test_germanic.py accordingly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): rename xsb -> gem-x-suebi (xsb is ISO 639-3 for Sambal, Philippines)

The code 'xsb' is assigned to Sambal (Austronesian, Philippines) in ISO 639-3. Suebi/Suevi has no ISO 639-3 code; correct private-use code is gem-x-suebi matching the gem (Germanic) parent and the X-private namespace convention used by other reconstructed/proto specs.
Updated all ancestor references (gl.json, roa-x-galaicopt.json, etc.).
Updated test_language_integrity.py exclusion list: gem-x-suebi replaces xsb (stub with empty graphemes).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): Arabic/Iranian/Indic script and grapheme corrections

arb: added haraka-first vowel digraph keys (correct Unicode order for long vowels اَ→اَ); added hamza carriers ؤ/ئ->["ʔ"].
ar-TD: reclassified parent/bases to ar-x-mashriqi (Chadian Arabic is Sudanic/Eastern Arabic, not Gulf).
fa: added hamza letters ء/أ/ؤ/ئ->["ʔ"].
fa-AF (Dari): ی and و now include mater lectionis readings ["j","iː","eː"] / ["v","uː","oː"]; allophones eː/oː added.
fa-x-early: corrected script->Latin/script_type->alphabet (spec covers scholarly transliteration, not Perso-Arabic script).
ps (Pashto): ي->["j","i"], ی->["j","i","ai"] (five-ye distinction); ه->["h","a","ə"], ۀ->["ə"] added.
peo (Old Persian): added ç->["ç"] and j->["dʒ"]; allophone ʒ added.
ur: ے corrected to ["eː","ɛː"] (baṛī ye is word-final /eː/, not /j/); و/ی gain mater lectionis readings; allophones eː/ɛː/oː/ɔː added.
pa: added tone_inventory dict (Punjabi is phonemically tonal).
pa-PK: removed graphemes_base='pa' (was inheriting Gurmukhi table into Shahmukhi/Arabic-script spec; Shahmukhi table deferred to a separate PR).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): Slavic/Armenian/Baltic/Yiddish grapheme and allophone corrections

ru: deleted г.before_e/before_i->["v"] positional rules (г before е/и is not /v/; that reading only applies to historical г-on-the-page = v in specific words; the default [ɡ] already handles it).
cs: added Czech soft-reading digraphs dě/tě/ně/mě/di/ti/ni/dí/tí/ní.
sk: added ä->["ɛ","æ"] and Slovak soft-reading digraphs de/te/ne/le/di/ti/ni/li/dí/tí/ní/lí with palatal-first ordering.
be: deleted дь/ть/рь graphemes (these digraphs don't exist in Belarusian orthography; soft sign is written differently); allophone rʲ removed.
uk: removed spurious Russian-pattern final devoicing allophones b/d/dʲ/ɡ/z/zʲ/ʒ (Ukrainian does not have final devoicing).
cu (Old Church Slavonic): renamed "льь"→"ль" and "рьь"→"рь" (doubled soft-sign typos); є corrected to ["e"] (OCS value); ѥ->["je"] added.
rue: г corrected to ["ɦ","ɣ"] (Rusyn г is the fricative, like Ukrainian, not the stop; ґ remains ["ɡ"]).
hy: added օ->["o"] (Armenian letter, distinct from latin o).
lt: added ch->["x"]; h->["ɣ","x"].
yi (Yiddish): א corrected to ["","a"] (silent primary); י->["j","i"]; added YIVO graphemes ײַ/טש/דזש/זש/יִ/ײ/װ.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): South/Southeast Asian script metadata and grapheme corrections

ks (Kashmiri): script corrected to Latin (spec covers romanization); voiced aspirates remapped to plain (Kashmiri deaspiration reflex); added ts/tsh dental affricates.
sd (Sindhi): script corrected to Latin; added missing nasals n/ṇ/ñ/ṅ.
bho (Bhojpuri): inherent_vowel corrected from "ə" to "a" (अ->["a"]).
as (Assamese): added Assamese-specific letters ৰ->["r"] and ৎ->["t"].
bn (Bengali): added khanda ta ৎ->["t"] (word-final /t̪/).
mr (Marathi): च/ज/झ now each have two candidates: palatal first (tɕ/dʑ/dʑʱ), dental second (ts/dz/dzʱ); dental allophones added.
or (Odia): added ଢ଼->["ɽʱ"] (the missing aspirated retroflex flap).
sa-x-vedic: deleted ॾ->["ɖ"] (U+097E is not a standard Devanagari letter).
ml (Malayalam): ഴ corrected from ["z"] to ["ɻ"] (retroflex approximant, not a sibilant); chillu letters ൺ/ൻ/ർ/ൽ/ൾ/ൿ added.
kn (Kannada): ಱ corrected to ["r"]; archaic ೞ->["ɻ"] added.
te (Telugu): ఱ corrected from ["r̝"] to ["r"].
tcy (Tulu)/brx (Bodo)/mni (Meitei): script corrected to Latin, script_type to alphabet, inherent_vowel removed (specs cover Latin romanization, not native scripts).
sat (Santali): script corrected to Latin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): Celtic/Celtic-area/reconstructed language grapheme corrections

br (Breton): added eu->["œ","ø"] (core vowel grapheme); nasal vowels añ/eñ/iñ/oñ/uñ added with corresponding allophones.
cy (Welsh): ŷ corrected from ["əː"] to ["ɨː","iː"] (long clear y, North Welsh); allophone əː removed; ngh->["ŋ̊"] added (nasal mutation of c).
ga (Irish): added vowel digraphs ao/eo/ia/ua/ae/aoi/eoi; added eclipsis (urú) digraphs mb/gc/nd/bp/dt/bhf/ts.
gd (Scottish Gaelic): added ao/aoi->["ɯː"]; allophone ɯː added.
gv (Manx): added çh->["tʃ"] (the distinctive Manx digraph); allophone tʃ added.
se (Northern Sami): added c->["ts"] and z->["dz"] (two of the 29 standard letters were missing).
rup (Aromanian): added lj->["ʎ"] and nj->["ɲ"] (palatal sonorants that distinguish Aromanian from Romanian).
xbr (Common Brythonic): ll->["lː"] (not ɬ — Proto-Brythonic had geminate lateral, not voiceless; voiceless ɬ and r̥ are Western Brythonic innovations); rh removed; allophones ɬ/r̥ removed, lː added.
xtg/xcg (Gaulish/Cisalpine Gaulish): added p->["p"] (P-Celtic *kʷ>p).
xga (Galatian): script corrected to Latin.
sem (Proto-Semitic): deleted f (reconstructed *p, not *f).
Updated test_celtic.py: circumflex-y test updated.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): miscellaneous phonology corrections across diverse languages

fi (Finnish): removed coda keys from positional_graphemes for k/p/t (consonant gradation is a morphophonological alternation, not a surface positional rule; coda context alone doesn't determine gradation).
hu (Hungarian): removed word_final/coda devoicing positional rules (Hungarian does not have systematic final obstruent devoicing).
ts (Xitsonga): deleted xi digraph (yields wrong [ʃ] — x alone handles it); added hl->["ɬ"], ndz->["ndz"], n'w->["ŋw"].
sw (Swahili): ng->["ŋɡ"] (not ["ŋ"]); positional ng block deleted (redundant once base is fixed).
ny (Chichewa): ng->["ŋɡ"] (default is prenasalized, not bare ŋ).
ff (Fula): bh->["ɓ"] and dh->["ɗ"] (implosives, not fricatives β/ð); spurious positional b/dh blocks removed; allophones ð/β removed.
tet (Tetum): added apostrophe->["ʔ"] (ASCII and typographic, INL official).
csb (Kashubian): added ł->["w"], ż->["ʒ"], ã->["ã"], ò->["wɛ"].
szl (Silesian): added ż->["ʒ"], ã->["ã"], õ->["õ"].
hsb (Upper Sorbian): added ó->["o","ʊ"] and ř->["ʃ"].
tr (Turkish): deleted k.before_o from positional_graphemes (o is a back vowel; k before o is plain [k], not the palatal [c]).
ira (Proto-Iranian): added č->["tʃ"] and ǰ->["dʒ"] with allophones.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): eu ts->ts̺, rename mcm->mzs, kha grapheme fixes

eu: ts corrected to ["ts̺"] (apical affricate, distinct from ts̻ which is ⟨tz⟩); allophone ts̺ added. Updated test_iberian.py: test_ts_laminal->test_ts_apical.
mcm: renamed to mzs.json — 'mcm' is ISO 639-3 for Mochica (extinct pre-Columbian Peruvian language); the spec describes Macanese Creole whose correct code is mzs.
Updated pt-MO.json ancestor reference.
kha (Khasi): added j->["dʒ"], ñ->["ɲ"], ï->["j"], ph->["pʰ"], th->["tʰ"] (core Khasi Latin-alphabet graphemes); y corrected to ["ə","ʔ"] (presyllable schwa/glottal, not palatal glide); allophones dʒ/ɲ/pʰ/tʰ/ə/ʔ added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): CJK grapheme corrections (zh/ko/ja)

zh: renamed "en-GB" key to "en" (was a JSON parse artifact); added iao->["iau"] and uai->["uai"] (standard pinyin finals missing from spec).
ko: ㄺ corrected to ["k"] and ㄿ to ["p"] (final-cluster jamo values that contradict standard Korean codaification; ㄺ final clusters simplify to /k/, not /l/).
ja: added 33 katakana yōon digraphs (キャ/キュ/キョ…リョ) mirroring the existing hiragana yōon entries; katakana is standard Japanese orthography and was entirely absent.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): stage xsb deletion and pt-MO ancestor reference update

Delete xsb.json (superseded by gem-x-suebi.json); update pt-MO.json ancestor reference from mcm to mzs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(g2p): positional and stress-conditioned candidate selection

Teach G2P._transcribe_word to resolve each grapheme's IPA candidates through the spec's positional_graphemes table before beam search, using a priority-ordered context list: before_X > word_initial/word_final > intervocalic > nucleus_stressed/nucleus_unstressed > after/before vowel > DEFAULT. Vowel positions are grounded in detect_stress output so unstressed nuclei use the reduction rules already present in the data.

For greedy search the positional winner ranks first (score 0); for beam search positional candidates are prepended to the base-table alternatives so the full beam space is preserved. Languages without positional_graphemes fall through to the unchanged flat-table path, keeping all other languages unaffected.

Data (pt-PT): add explicit 'a' positional block with nucleus_unstressed→ɐ (unstressed /a/ reduction), word_final/coda/pretonic/posttonic→ɐ, and nucleus_stressed→a; fix nucleus_unstressed for 'o' to map to 'u' (EP raising); suppress the inherited b/d/g intervocalic lenition rules since the phonetic-lexicon gold uses phonemic (non-lenited) forms — allophonic lenition is already captured in the allophones map.

pt-PT PER (portuguese_phonetic_lexicon_pt_PT, --strip-stress --broad):
  before: 0.4976
  after: 0.1658 (generalises: 0.1778 at 600 words)
gl PER (wikipron_gl, --fold-allophones): 0.2735 → 0.2735 (no regression)

Tests: extend test_g2p_engine.py with TestPositionalSelection covering positional candidate use (c-before-i, final-a reduction, unstressed e/o, intervocalic s), beam-still-exposes-alternatives, greedy==beam(1) invariant, and no-op for languages without positional data.
Update test_iberian.py: rephrase three pt-PT b/d/g tests to assert the lenition rules are suppressed at the phonemic transcription level.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(data): mirandese unstressed reduction and nasal-digraph demunching

- a/e/o gain nucleus and word-final reduction positions mirroring the western-Iberian pattern (a->ɐ, e->ɨ, o->u unstressed)
- the an/en/in/on/un nasal digraphs only close when the n is in coda: before a vowel they resolve to the oral vowel plus onset n (abandono -> abɐ̃donu, not abɐ̃dõu)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
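
For readers of the feat(g2p) commit message, the priority-ordered lookup it describes might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: `context_keys`, `candidates_at`, `transcribe_greedy`, the `BASE`/`POSITIONAL` toy tables, and the single-character grapheme handling are all hypothetical, and the stress position is passed in by hand rather than taken from detect_stress.

```python
from typing import Dict, List

VOWELS = set("aeiou")  # a set, so the empty string at word edges is never a "vowel"

# Toy tables for illustration only; these are NOT the project's pt-PT data.
BASE: Dict[str, List[str]] = {
    "c": ["k", "s"], "i": ["i"], "m": ["m"], "a": ["a"], "s": ["s"],
}
POSITIONAL: Dict[str, Dict[str, List[str]]] = {
    "c": {"before_e": ["s"], "before_i": ["s"]},
    "a": {"nucleus_stressed": ["a"], "nucleus_unstressed": ["ɐ"]},
    "s": {"intervocalic": ["z"]},
}


def context_keys(word: str, i: int, stressed_index: int) -> List[str]:
    """Context keys for the letter at index i, highest priority first."""
    prev_ch = word[i - 1] if i > 0 else ""
    next_ch = word[i + 1] if i + 1 < len(word) else ""
    keys: List[str] = []
    if next_ch:
        keys.append(f"before_{next_ch}")      # before_X
    if i == 0:
        keys.append("word_initial")
    if i == len(word) - 1:
        keys.append("word_final")
    if prev_ch in VOWELS and next_ch in VOWELS:
        keys.append("intervocalic")
    if word[i] in VOWELS:                      # stress-conditioned nucleus
        keys.append("nucleus_stressed" if i == stressed_index
                    else "nucleus_unstressed")
    if prev_ch in VOWELS:
        keys.append("after_vowel")
    if next_ch in VOWELS:
        keys.append("before_vowel")
    return keys


def candidates_at(word: str, i: int, stressed_index: int) -> List[str]:
    """Positional candidates first, then the remaining base-table alternatives."""
    grapheme = word[i]
    base = BASE.get(grapheme, [grapheme])
    rules = POSITIONAL.get(grapheme, {})
    for key in context_keys(word, i, stressed_index):
        if key in rules:
            hit = rules[key]
            return hit + [c for c in base if c not in hit]
    return list(base)  # no positional rule matched: flat-table path


def transcribe_greedy(word: str, stressed_index: int) -> str:
    """Greedy decoding: take the top-ranked candidate at every position."""
    return "".join(candidates_at(word, i, stressed_index)[0]
                   for i in range(len(word)))


print(transcribe_greedy("casa", stressed_index=1))
print(candidates_at("cima", 0, stressed_index=1))
```

Because the positional hit is prepended rather than substituted, a beam search over `candidates_at` would still see the base-table alternatives, which is the property the commit message calls out.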
Human review requested!