Release 1.13.0a1 by github-actions[bot] · Pull Request #73 · TigreGotico/orthography2ipa

github-actions · 2026-06-12T16:48:21Z

Human review requested!

* feat: positional graphemes
* feat: positional graphemes
* feat: positional graphemes

* reconstruct latin graphemes
* fix(pt-PT): model 4-way sibilant distinction
* feat: add pt-BR
* feat: more positional contexts
* refactor: drop redundant "default" from positional mappings
* allow trema in pt-BR graphemes

* feat: asturian
* feat: galician
* feat: galician

…fix type hints and metadata

- Create PLAN.md: architecture overview, planned phases, data roadmap
- Create TODO.md: prioritised task list (blocking → low)
- Create QUICK_FACTS.md: package identity, key classes, quick usage examples
- Create AUDIT.md: known issues with file:line citations, CI gaps, tech debt
- Create SUGGESTIONS.md: 10 proposals for refactors and enhancements
- feats.py:39: add Dict, Optional to typing imports
- feats.py:56: annotate phone_features: Dict[str, List[Optional[bool]]]
- json_loader.py:115: clarify self-reference cycle comment (was TODO - error log, illegal)
- pyproject.toml:8: update description from "20+ languages" to "308+ language codes"

All 7375 tests pass. No behavioral changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds tests/test_iberian.py with extensive per-language test classes for all Iberian Peninsula languages, plus a per-language coverage reporter in conftest.py.

Languages covered (15 classes, 485 tests):

- es-ES (86 tests): graphemes, allophones, positional rules, isoglosses
- pt-PT (62 tests): null graphemes, sandhi, /v/ preservation, schwa
- ca (71 tests): ela geminada, vowel reduction, digraphs, diphthongs
- gl (57 tests): seseo, nasal vowels, null lh/nh/ç, apical s
- eu (47 tests): sibilant contrast (s̺/s̻), affricates, phonemic h
- ast (50 tests): distinción, x→ʃ isogloss, ll→ʎ, aspirated h notation
- an (59 tests): /v/ preservation, seseo, affricates ts/dz, ix→ʃ
- mwl (13 tests): inheritance from ast-PT-x-medieval, ancestry
- dialects (40 tests): es-AR, es-ES-x-andalusia-e, ca-x-valencia, ca-x-balear, pt-BR, gl-x-occidental

Cross-language isogloss tests (10): distinción vs seseo, /v/ preservation, ll realisation, ch realisation, rr uvular vs alveolar, h silent vs phonemic, apical vs predorsal s, ast x→ʃ vs es x→ks, phonological distance clustering, Basque isolation

Coverage reporter (conftest.py): pytest_terminal_summary prints a per-language pass/fail/total/% table at the end of any run touching test_iberian.py.

Run with: pytest tests/test_iberian.py -v --tb=short

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
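The coverage reporter described above could look roughly like the following conftest.py sketch. It is not the project's actual implementation: `summarise` and `group_of` are hypothetical helpers, and grouping by test class name (rather than by a real language code) is an assumption made for illustration. Only the `pytest_terminal_summary` hook and the `terminalreporter.stats`, `write_sep` and `write_line` members are standard pytest API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, int, int, float]  # group, passed, failed, total, percent


def summarise(results: Iterable[Tuple[str, bool]]) -> List[Row]:
    """Aggregate (group, passed) pairs into sorted pass/fail/total/% rows."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [passed, failed]
    for group, passed in results:
        counts[group][0 if passed else 1] += 1
    rows: List[Row] = []
    for group in sorted(counts):
        ok, bad = counts[group]
        total = ok + bad
        rows.append((group, ok, bad, total, 100.0 * ok / total))
    return rows


def group_of(nodeid: str) -> str:
    """Hypothetical grouping: use the test class name found in the node id."""
    parts = nodeid.split("::")
    return parts[1] if len(parts) > 2 else "(module level)"


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    """Standard pytest hook: print the table after the normal summary."""
    results = []
    for outcome in ("passed", "failed"):
        for report in terminalreporter.stats.get(outcome, []):
            if "test_iberian.py" in report.nodeid:
                results.append((group_of(report.nodeid), outcome == "passed"))
    if not results:
        return  # the run did not touch test_iberian.py
    terminalreporter.write_sep("-", "per-language coverage")
    for group, ok, bad, total, pct in summarise(results):
        terminalreporter.write_line(
            f"{group:<24} {ok:>4} {bad:>4} {total:>5} {pct:6.1f}%"
        )
```

Keeping the aggregation in a plain function makes it testable without running pytest itself.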

…sources

- Add LinguisticSource frozen dataclass to types.py
- Add sources: Tuple[LinguisticSource, ...] to LanguageSpec
- Update json_loader.py to parse sources array
- Update SCHEMA.md to document sources field
- Add sources arrays to 33 Germanic language JSON files: en-GB, en-US, en-AU, en-CA, en-IE, en-ZA, en-GB-x-scotland, de-DE, de-AT, de-CH, nl, nl-NL, nl-BE, sv, sv-x-rikssvenska, nb, nn, no, da, da-x-copenhagen, is, fo, af, nds, enm, ang, non, osx, goh, gem, gem-x-ingvaeonic, gem-x-north, gem-x-northwest
- Add tests/test_sources.py (marked @pytest.mark.linguistic)
- Create docs/bibliography.md with Phase 1 sources
- Update PLAN.md and TODO.md with audit phase tracking
- Update MAINTENANCE_REPORT.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
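As a rough illustration of the pattern this commit describes (a frozen dataclass parsed from a JSON `sources` array into a tuple), here is a minimal sketch. The field names `citation`, `year` and `url` and the `parse_sources` helper are assumptions for the example; the real fields are documented in SCHEMA.md and may differ.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LinguisticSource:
    # Field names are hypothetical; see SCHEMA.md for the real ones.
    citation: str
    year: Optional[int] = None
    url: Optional[str] = None


def parse_sources(raw: List[Dict[str, Any]]) -> Tuple[LinguisticSource, ...]:
    """Turn a decoded JSON array into an immutable tuple of sources."""
    return tuple(
        LinguisticSource(
            citation=entry["citation"],
            year=entry.get("year"),
            url=entry.get("url"),
        )
        for entry in raw
    )


sources = parse_sources([{"citation": "Example Grammar", "year": 1999}])
try:
    sources[0].year = 2000  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    pass
```

A tuple of frozen instances keeps the spec hashable and safe to share between cached lookups.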

…dDistance, positional_divergence

Part A (bug fixes and hardening):
- segment_distance(strict=True) raises ValueError for unknown IPA segments
- _build_ancestor_graph() detects circular ancestry and raises ValueError
- _get_ancestry_weights_by_code() cached with lru_cache(maxsize=256) via thin wrapper

Part B (new metrics):
- phoneme_coverage(spec_native, spec_target) -> float (asymmetric L2 transfer estimate)
- WeightedDistance frozen dataclass added to types.py
- weighted_full_distance() single entry-point with configurable w_inventory/grapheme/allophone/ancestry
- positional_divergence() measures positional-override divergence between two specs

Part C (tests & docs):
- 13 new tests in tests/test_distance.py (all pass, no regressions)
- docs/distance.md extended with sections for all new functions and weight-tuning guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
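Two of the ideas in this commit, rejecting circular ancestry and combining component distances with configurable weights, can be sketched as below. This is an illustration under stated assumptions, not the library's code: `ancestry_chain` and `combine_distances` are hypothetical names, each language is assumed to have at most one parent, and the weighted mean is only one plausible way to combine components.

```python
from typing import Dict, List, Optional


def ancestry_chain(code: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Walk parent links from `code`, raising ValueError on a cycle.

    Assumes a single-parent mapping {child: parent or None}.
    """
    chain: List[str] = []
    seen = set()
    current: Optional[str] = code
    while current is not None:
        if current in seen:
            raise ValueError(f"circular ancestry detected at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parents.get(current)
    return chain


def combine_distances(components: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted mean of component distances (hypothetical combination rule)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * components[name] for name in weights) / total


print(ancestry_chain("mwl", {"mwl": "ast", "ast": "la", "la": None}))
print(combine_distances({"inventory": 0.2, "grapheme": 0.4},
                        {"inventory": 1.0, "grapheme": 1.0}))
```

Normalising by the weight sum keeps the result in the same range as the components, so weights can be tuned without rescaling thresholds.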

New test files covering 9 language families (956 tests total):

- test_germanic.py: de-DE, de-AT, Bavarian, nl-NL, nl-BE, af, sv, da, nb, is
- test_celtic.py: cy, ga, gd, br, gv, kw
- test_slavic.py: ru, pl, cs, bg, sk, uk, be, hr/sl/sr/mk
- test_romance_extended2.py: Italian dialects, Romanian, Sardinian, Aranese, Caribbean Spanish, Medieval Spanish, Brazilian/Portuguese dialects
- test_indo_iranian.py: hi, sa, fa, fa-x-tehran, fa-AF, tr
- test_arabic.py: arb, ar-x-mashriqi, ar-x-maghrebi, ar-MA, ar-x-gulf, ar-IQ
- test_other_languages.py

Also add germanic/celtic/slavic pytest markers to conftest.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AUDIT.md: mark resolved items (feats.py type annotations, json_loader.py comment, pyproject.toml description, en-GB.json, LinguisticSource); restructure open issues; update date to 2026-03-17.

MAINTENANCE_REPORT.md: add transparency report for multi-family language test suites session (Germanic/Celtic/Slavic/Romance/Indo-Iranian/Arabic, +956 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Removes stale en/es/fr/pt-BR sections that exercised the old dict-based grapheme API; replaces with a minimal pt-PT demo compatible with the current list-based grapheme structure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- 02_distance_metrics.py: segment features, inventory/grapheme/allophone distances, ancestry similarity, phoneme_coverage, weighted_full_distance, pairwise matrix
- 03_tokenizer.py: PhonetokTokenizer: maximal-munch segmentation, TokenKind, ipa_beam with allophone expansion, multi-language comparison
- 04_dialect_transforms.py: DIALECT_PROFILES inspection, apply_transform, debias_lisbon, cross-dialect word comparison for Portuguese
- 05_script_distance.py: SCRIPT_REGISTRY, ScriptFeatures, pairwise script distance matrix, closest/farthest pairs, feature analysis
- 06_sandhi.py: SandhiEngine, French liaison rules, obligatory_only mode, custom Sanskrit sandhi rules, languages-with-sandhi survey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…an (6 files) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…52)

Part 1 – stress blocks:
- tr/az/tk/uz: word-final default (-1); Latin Turkic, no marked_vowels
- kk/ky/tt/ba: word-final default (-1); Cyrillic Turkic, no marked_vowels
- id/ms: penultimate default (-2); Austronesian, notes honest about schwa-penult variation
- he: milra/final default (-1); notes document milel exceptions not modeled

Part 2 – deferred data gaps:
- cel: remove ē (absent from Proto-Celtic per Matasović 2009); add diphthongs ai/ei/oi/au/ou
- si: add unaspirated tʃ/dʒ candidates for ඡ/ඣ (aspiration orthographic only in modern Sinhala)
- mni: ancestor note clarified: Meiteic is sister branch of Kuki-Chin, not daughter (VanBik 2009)
- mni-x-proto-kuki-chin: remove Meitei/Manipuri from descendant list; add sister-branch note
- nds: add positional_graphemes g (word-initial [ɡ], intervocalic [ɣ], coda/final [x])
- tcy: add ĕ→ɛ and ŭ→ɨ (distinctive Tulu vowels beyond Sanskrit/Kannada grid)

Tests: 91 new in tests/test_stress_other.py (stress gold cases + deferred assertions)

Suite: 15662 passed, 5 skipped; 387 valid

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
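The stress defaults above (-1 for word-final, -2 for penultimate) read naturally as negative indices into a syllable list, with accent-marked vowels taking priority where a spec defines them. The sketch below illustrates that reading only; `stressed_syllable` is a hypothetical helper, the syllabifications are hand-written, and the engine's real stress logic may differ.

```python
from typing import AbstractSet, Sequence


def stressed_syllable(syllables: Sequence[str],
                      default: int,
                      marked_vowels: AbstractSet[str] = frozenset()) -> int:
    """Return the index of the stressed syllable.

    `default` is a negative offset from the end (-1 final, -2 penultimate).
    A syllable containing a marked vowel overrides the default.
    """
    if not syllables:
        raise ValueError("word has no syllables")
    for index, syllable in enumerate(syllables):
        if any(ch in marked_vowels for ch in syllable):
            return index
    # Clamp so a monosyllable with a penultimate default still resolves to 0.
    return max(0, len(syllables) + default)


print(stressed_syllable(["ki", "tap"], -1))          # word-final default
print(stressed_syllable(["ma", "ka", "nan"], -2))    # penultimate default
print(stressed_syllable(["ca", "fé"], -2, {"é"}))    # marked vowel wins
```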

…ment (#58)

* feat(benchmarks): espeak-ng agreement bench for TTS front-end replacement

- scripts/espeak_agreement.py measures symbol-level compatibility with espeak-ng output on shared word lists: exact and stress-blind exact match, segmental similarity, and the out-of-inventory symbol rate that decides whether a TTS model trained on espeak phonemes can accept this engine as a drop-in front-end
- docs/benchmarks.md documents the methodology and reference table, explicitly framed as agreement, not accuracy
- fix(data): the deletion marker in candidate lists is the empty string per the schema; 59 specs carried a null-sign symbol instead, which leaked as a literal character into engine output

* test: deletion-candidate assertions use the empty-string marker

* fix(data): align deletion-marker prose and normalize ascii g in IPA values

- ext notes describe coda-s deletion without the retired null sign
- five specs carried ascii g inside IPA candidate values (palatalized and labialized clusters); normalized to the IPA script g the inventories use
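Three of the agreement metrics named above (exact match, stress-blind exact match, and the out-of-inventory symbol rate) can be sketched as follows. This is not scripts/espeak_agreement.py: the `agreement` function is hypothetical, symbols are approximated as single code points (real IPA segments can span several), segmental similarity is omitted, and the sample strings are made up.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

STRESS_MARKS = {ord("ˈ"): None, ord("ˌ"): None}  # primary and secondary stress


def strip_stress(ipa: str) -> str:
    """Remove IPA stress marks so comparisons ignore stress placement."""
    return ipa.translate(STRESS_MARKS)


def agreement(pairs: Sequence[Tuple[str, str]],
              espeak_inventory: Set[str]) -> Dict[str, float]:
    """Compare (ours, espeak) transcriptions word by word."""
    if not pairs:
        raise ValueError("need at least one word pair")
    exact = stress_blind = symbols = out_of_inventory = 0
    for ours, theirs in pairs:
        exact += ours == theirs
        stress_blind += strip_stress(ours) == strip_stress(theirs)
        for symbol in strip_stress(ours):  # simplification: one code point each
            symbols += 1
            out_of_inventory += symbol not in espeak_inventory
    return {
        "exact": exact / len(pairs),
        "stress_blind_exact": stress_blind / len(pairs),
        "out_of_inventory_rate": out_of_inventory / symbols if symbols else 0.0,
    }


sample = [("ˈkazɐ", "ˈkazɐ"), ("kaˈza", "ˈkaza"), ("ʃa", "sa")]
print(agreement(sample, set("kazɐs")))
```

The out-of-inventory rate is computed on this engine's output only, since that is what a model trained on espeak phonemes would have to accept.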

… research (#63)

* feat(data): european portuguese regional dialects grounded in dialect research

Ground seven EP regional dialect specs in DIALECT_PATTERNS.md feature matrix and whitepaper5 IPA transform inventory (TigreGotico internal dialect research). Add 250-sentence gold fixture (ep_dialect_sentences.csv), benchmark loader with dialect-code mapping, and 25 signature-word tests.

Dialect features encoded (categorical rules only; sporadic/lexical items in notes):
- pt-PT-x-lisbon: ei→ɐj (Lisbon diphthong lowering), ou→o monophthong, unstressed a/o positional reductions
- pt-PT-x-porto: v→b betacism (categorical merger), ou→ow diphthong preservation, ei→ej preservation (not lowered to ɐj); apicoalveolar sibilant allophones retained
- pt-PT: ou→o as primary candidate (conservative standard default)
- pt-PT-x-alentejo: intervocalic d→∅ deletion (positional_graphemes), ei→e monophthong, ou→o, meu digraph simplification
- pt-PT-x-algarve: ei→e monophthong, ou→o, word-final/coda s→ʒ sibilant voicing (distinct from Lisbon ʃ)
- pt-PT-x-madeira: ões→õns, ães→ɐ̃ns nasal-diphthong→nasal+n (digraph graphemes)
- pt-PT-x-acores: u/ú→y (São Miguel fronted-u, categorical grapheme override), ou→ow preservation, ões/ães→nasal+n (shared with Madeira)

Notes-only (schema limit or lexical/sporadic): l-palatalization (quilo→quilho, Filipa→fʎipɐ), Madeira r-diphthongization, Açores boi→bô / e→i extreme reduction, Porto-style /e/→[je] diphthongization.

All sources updated with TigreGotico internal dialect research entry. Schema validation: 390/390 passed. Full suite: 15847 passed, 0 regressions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(benchmark): csv-module parsing and typed annotations in the ep loader

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
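A positional override such as the Alentejo intervocalic d deletion can be pictured as a lookup that prefers a position-specific mapping over the base grapheme mapping, with the empty string as the deletion marker. The sketch below is a toy model under assumptions: the dictionaries are not the real JSON schema, only single-letter graphemes are handled, and `position_of` and `transcribe` are hypothetical helpers.

```python
from typing import Dict

VOWELS = set("aeiou")  # toy vowel set, enough for the example


def position_of(word: str, index: int) -> str:
    """Classify a letter's position using position names that appear in the specs."""
    if index == 0:
        return "word_initial"
    if index == len(word) - 1:
        return "word_final"
    if word[index - 1] in VOWELS and word[index + 1] in VOWELS:
        return "intervocalic"
    return "other"


def transcribe(word: str,
               graphemes: Dict[str, str],
               positional: Dict[str, Dict[str, str]]) -> str:
    """Apply positional overrides first, then fall back to the base mapping."""
    output = []
    for index, letter in enumerate(word):
        base = graphemes.get(letter, letter)
        override = positional.get(letter, {})
        output.append(override.get(position_of(word, index), base))
    return "".join(output)


base_map = {"c": "k", "a": "ɐ", "d": "d"}          # illustrative values only
overrides = {"d": {"intervocalic": ""}}            # "" marks deletion
print(transcribe("cada", base_map, overrides))
```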

…ources (#65)

* fix(data): asturleonese family and barranquenho grounded in primary sources

Ground all Asturleonese family specs and Barranquenho in published primary sources per extraction notes from Morala & Egido 2009, Propuesta de Norma Ortográfica, Frías-Conde El Habla de Sanabria, Macias 2003 Dialecto rionorês, and the Convenção Ortográfica do Barranquenho 2025 + Gramática Básica 2025.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): tie-bar affricate notation and barranquenho nasal em/en encoding

- ast-x-occidental che vaqueira uses the tie-bar affricate matching its parent inventory
- barranquenho em/en carry positional nasal overrides (word-final, coda, pre-consonant → ẽ per Convenção p. 26: tempu, quen) with the oral default preserved intervocalically; the misnamed test now asserts the em/en behaviour it documents

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

Extend the benchmark harness with 28 new WikiPron language entries (it, fr, de, nl, pl, fi, ro, ast, oc, sv, da, nb, is, cy, ga, gd, el, hy, sk, hr, sq, tr, eu, tl, eo, hi, ta, ml) and a new ipadict dataset loader for Icelandic (~60k entries, Hjal project, CC BY 3.0).

All added WikiPron TSVs are from CUNY-CL/wikipron data/scrape/tsv/, community-curated by Wiktionary editors (CC-BY-SA). The ipadict loader for is (Icelandic) strips the /slashes/ from the source format and then goes through the same normalise+evaluate path as the other datasets.

Rejected/excluded: ipa-dict fi/es/ar/fa/fr-QC/vi (tool-generated per README), Lexique (custom notation not covered by scriptconv), NST (no stable download URL), CELEX2/GlobalPhone (proprietary licenses).

Reference numbers for all 40 dataset×lang rows added to docs/benchmarks.md.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
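The slash-stripping step might look like the sketch below. The exact file layout is an assumption (one tab-separated `word<TAB>/ipa/` entry per line, with optional comma-separated variants), `parse_ipadict_line` is a hypothetical name, and the sample line is illustrative rather than taken from the dataset.

```python
from typing import List, Tuple


def parse_ipadict_line(line: str) -> Tuple[str, List[str]]:
    """Split one entry into (word, [pronunciations]) and drop the /slashes/."""
    word, _, rest = line.rstrip("\n").partition("\t")
    pronunciations = [
        item.strip().strip("/")   # remove surrounding whitespace, then slashes
        for item in rest.split(",")
        if item.strip()
    ]
    return word, pronunciations


print(parse_ipadict_line("hestur\t/ˈhɛstʏr/\n"))
```

After this step the bare IPA strings can be handed to whatever normalisation the other loaders already share.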

* feat(data): portuguese dialect stress and reduction systems

Add stress blocks (Acordo Ortográfico 1990) to all 32 Portuguese variant specs that lacked them: pt-AO, pt-MZ, pt-TL, pt-CV, pt-GW, pt-ST, pt-MO, 12 pt-PT-x-* regional varieties, and 12 pt-BR-x-* regional varieties.

Also add positional-vowel and consonant overrides for the three African/Timorese varieties with benchmark gold sets, and fix four notes-vs-data contradictions in PT-PT regional specs.

PER trajectories (portuguese_lexicon, n=300):
- pt-AO 0.398 → 0.180 (−55%)
- pt-MZ 0.288 → 0.240 (−17%)
- pt-TL 0.487 → 0.222 (−54%)
- pt-PT 0.167 → 0.167 (unchanged)
- pt-BR 0.290 → 0.290 (unchanged)

Key data changes:
- pt-AO: full five-vowel system (pretonic a→a not ɐ); r→ɾ positional (alveolar trill/flap throughout); word-final o→ʊ; alveolar sibilants.
- pt-MZ: e nucleus_unstressed→e (not ɨ; notes: "unstressed /e/→[e] not [ɨ]"); r positional alveolar ɾ; l coda→w (notes: "l in coda often vocalised to [w]"); alveolar sibilants; a reduction kept EP-like (notes: "follows European norms more closely than Angola").
- pt-TL: ə as the key unstressed vowel in all positions (notes: "FIVE-VOWEL SYSTEM: /ɨ/ absent; full unstressed vowels" + Tetum open-syllable contact); ɨ→ə allophone added; r→ɾ/r; alveolar sibilants; l coda→l (no velarisation).
- pt-CV/pt-GW/pt-ST/pt-MO: stress blocks only (no golds; no positional speculation beyond what notes document).

Notes-vs-data contradictions fixed:
1. pt-PT-x-acores: notes "5. /a/ in unstressed syllables realised as [a] rather than mainland [ɐ]"; added positional_graphemes a:pretonic→a (was inheriting pt-PT's ɐ).
2. pt-PT-x-algarve: notes "1. minimal unstressed vowel reduction — /e/ and /o/ are often preserved as full vowels in unstressed syllables"; added positional_graphemes e/o: nucleus_unstressed→e/o (was inheriting full EP reduction).
3. pt-PT-x-trasosmontes: notes "1. most systematic retroflex sibilants… /s/ and /z/ realised as [ʂ] and [ʐ] in nearly all environments including onset, coda, and intervocalic"; added positional_graphemes s/z overriding to ʂ/ʐ (spec had only allophones, no positional rule).
4. pt-PT-x-viana: notes "Retroflex sibilants [ʂ, ʐ] in final position"; added positional_graphemes s:coda/word_final→ʂ (spec had allophones but no positional rule; coda sibilants were falling through to parent pt-PT's ʃ).
5. pt-PT-x-medieval: stress marked_vowels set to [à] only; modern à is the only accent documented in medieval orthography; modern acutes/circumflexes were not grapheme-mapped in this spec (failed TestStressIntegration.test_marked_vowels).

Add tests/test_stress_pt_dialects.py: 80 new tests covering stress placement for all seven core African/Timorese/Macanese varieties plus parametrised pt-PT-x-* and pt-BR-x-* regional stress + the four notes-fix cases.

387 valid specs; 15902 passed, 5 skipped.

* fix(benchmark): measure the local checkout, not the installed package
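The percentages in the PER trajectories above are relative changes, and PER itself is conventionally total edit distance divided by total reference length. The sketch below shows both calculations; it assumes that conventional definition (the benchmark script may normalise or segment differently), and `edit_distance`, `phoneme_error_rate` and `relative_change` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance over phoneme sequences (one-row dynamic programme)."""
    row = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        previous_diagonal, row[0] = row[0], i
        for j, r in enumerate(ref, start=1):
            cost = min(row[j] + 1,                    # deletion
                       row[j - 1] + 1,                # insertion
                       previous_diagonal + (h != r))  # substitution or match
            previous_diagonal, row[j] = row[j], cost
    return row[-1]


def phoneme_error_rate(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Total edits divided by total reference length over (hyp, ref) pairs."""
    edits = sum(edit_distance(hyp, ref) for hyp, ref in pairs)
    length = sum(len(ref) for _, ref in pairs)
    return edits / length


def relative_change(before: float, after: float) -> float:
    """Signed relative change, e.g. -0.55 for a 55% reduction."""
    return (after - before) / before


print(round(100 * relative_change(0.398, 0.180)))  # pt-AO, rounds to -55
print(round(100 * relative_change(0.288, 0.240)))  # pt-MZ, rounds to -17
print(round(100 * relative_change(0.487, 0.222)))  # pt-TL, rounds to -54
```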

Add major missing Arabic dialects and a new Levantine intermediate node; correct the Hejazi qaf reflex; add a per-dialect gold transcription suite.

New specs:
- ar-x-levantine (intermediate under ar-x-mashriqi): qaf->ʔ urban, jim->ʒ, interdentals variable
- ar-EG (Cairene): jim->ɡ, qaf->ʔ, interdental merger, čeh loan grapheme->ʒ
- ar-SY, ar-LB, ar-JO, ar-PS (Levantine children, country signatures)
- ar-SD (Sudanese): qaf->ɡ, jim->ɟ palatal, interdental merger

Fix:
- ar-SA-x-hejaz qaf ʔ->ɡ (chain shift q->g->dʒ; ʔ was a misattributed Levantine/Egyptian feature)

Tests:
- tests/test_arabic_dialects.py (59 items): word-level consonant reflexes via the engine + grapheme-level variant/merger checks

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* feat(data): align pt-BR base positional rules with JIPA São Paulo variety

* feat(data): brazilian portuguese dialect expansion

Wire all 12 pt-BR-x-* dialect specs to inherit the JIPA-aligned pt-BR base (graphemes/allophones/positional_graphemes) and add own stress blocks, plus defensible per-dialect positional overrides:
- sp: JIPA São Paulo reference (Barbosa & Albano 2004), base unchanged
- rj/fluminense: coda /S/ chiado [ʃ,ʒ] + dorsal onset /r/ [x~χ~h]
- caipira: retroflex coda /r/ [ɻ]
- mg: pretonic raising /e,o/ -> [i,u]; plain coda /S/
- sul/pr: conservative non-palatalisation default + final /e/ -> [e]; sul alveolar-trill onset /r/
- recife: NE open pretonic + NO palatalisation + coda /S/ chiado
- norte: NE open pretonic + coda /S/ chiado, palatalisation retained (ALiB)
- bahia: open pretonic, palatalisation before /i/ retained, plain coda /S/
- ce: open pretonic, palatalisation retained (ALiB), coda /S/ aspiration [h]
- brasilia: levelling koiné, base unchanged

Add tests/test_br_dialects.py with per-spec signature assertions and the JIPA North-Wind reductions as pt-BR gold. Update the pt-BR reduction guards in test_stress_romance.py to the JIPA narrow values [ɪ,ʊ].

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

)

Ground mwl and mwl-x-sendim against the Convenção Ortográfica da Língua Mirandesa (1999, Ferreira & Raposo) and the Primeiro Aditamento (February 2000).

Changes per document:
- mwl: add iê/ie → /jɛ/ and uô → /wɔ/ (Convenção § Ditongos crescentes; open-mid quality in stressed position confirmed by gold benchmark)
- mwl: add positional_graphemes b/d with intervocalic candidates ranked after stops (Convenção § B/D; gold dataset drives stop-first ranking)
- mwl: update notes to document nasal-digraph scope and initial-l convention
- mwl: add Convenção URL and Primeiro Aditamento as sources entries
- mwl-x-sendim: add positional_graphemes l with word_initial → /l/ (not /ʎ/), grounded in Aditamento (2000) explicit provision for Sendinese
- mwl-x-sendim: replace generic sources with convencao1999 + aditamento2000
- mwl-x-sendim: update notes with Aditamento provenance
- tests/test_mirandese_convention.py: 36 tests covering § L, § Sibilantes, § Ditongos, § Nasalidade, § B/D, § Acento, and Aditamento Sendinese features

PER trajectory (gold: TigreGotico/mirandese_g2p):
- mwl: 0.2299 → 0.2244 (-0.0055)
- mwl-x-sendim: 0.5126 → 0.4899 (-0.0227)

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

JarbasAl and others added 30 commits February 21, 2026 16:34

feat: positional graphemes (#4)

6438b2d

* feat: positional graphemes
* feat: positional graphemes
* feat: positional graphemes

Increment Version to

94959b7

Update Changelog

bbee58b

refactor to json

a7ab6da

Increment Version to

44ba8a6

Update Changelog

270c78d

Delete dump/tests directory

f42d5aa

Increment Version to

c751f91

Update Changelog

14cf27d

feat: ast+gl (#8)

51d7b0e

* feat: asturian
* feat: galician
* feat: galician

Increment Version to

f4f0ad0

Update Changelog

81ebbc1

feat: castilian + aragonese

051bc62

fix: extremaduran / cantabrian

2899f13

fix: leonese

609be4e

chore: add langcodes to requirements.txt; add uv.lock

55f05cc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Afroasiatic (5 files)

db30134

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Aragonese (1 file)

ad4fbd9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Asturleonese (12 files)

95890bc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Austroasiatic + Austronesi…

af894fc

…an (6 files) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(sources): add bibliographic sources — Celtic (15 files)

fcd7ee9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JarbasAl and others added 30 commits June 11, 2026 13:55

Update Changelog

5a0180f

Increment Version to 1.6.0a1

472dc0b

docs: credit the mirandese gold set's native-speaker provenance (#55)

a059661

Increment Version to 1.6.0a2

bc510ee

Update Changelog

387b3cf

Increment Version to 1.7.0a1

78eccf5

Update Changelog

5cbc035

Increment Version to 1.8.0a1

d4c8859

Update Changelog

93b9ef3

Increment Version to 1.8.1a1

5387d17

Update Changelog

10c9bd2

Increment Version to 1.9.0a1

5bc7758

Update Changelog

f634ac8

Increment Version to 1.10.0a1

ababe12

Update Changelog

2e3bf77

Increment Version to 1.11.0a1

e773ddf

Update Changelog

da53b78

Increment Version to 1.12.0a1

a50e29a

Update Changelog

3bc3c20

Increment Version to 1.13.0a1

008ea3e

Update Changelog

0541988


github-actions[bot] wants to merge 178 commits into master from release-1.13.0a1

github-actions Bot commented Jun 12, 2026


2 participants
