
Release 1.10.0a1#69

Open
github-actions[bot] wants to merge 170 commits into master from release-1.10.0a1

Conversation

@github-actions


Human review requested!

JarbasAl and others added 30 commits February 21, 2026 16:34
* feat: positional graphemes

* feat: positional graphemes

* feat: positional graphemes
* reconstruct latin graphemes

* fix(pt-PT): model 4-way sibilant distinction

* feat: add pt-BR

* feat: more positional contexts

* refactor: drop redundant "default" from positional mappings

* allow trema in pt-BR graphemes
* feat: asturian

* feat: galician

* feat: galician
…fix type hints and metadata

- Create PLAN.md: architecture overview, planned phases, data roadmap
- Create TODO.md: prioritised task list (blocking → low)
- Create QUICK_FACTS.md: package identity, key classes, quick usage examples
- Create AUDIT.md: known issues with file:line citations, CI gaps, tech debt
- Create SUGGESTIONS.md: 10 proposals for refactors and enhancements
- feats.py:39 — add Dict, Optional to typing imports
- feats.py:56 — annotate phone_features: Dict[str, List[Optional[bool]]]
- json_loader.py:115 — clarify self-reference cycle comment (was TODO - error log, illegal)
- pyproject.toml:8 — update description from "20+ languages" to "308+ language codes"

All 7375 tests pass. No behavioral changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/test_iberian.py with extensive per-language test classes for
all Iberian Peninsula languages, plus a per-language coverage reporter
in conftest.py.

Languages covered (15 classes, 485 tests):
  es-ES     86 tests  — graphemes, allophones, positional rules, isoglosses
  pt-PT     62 tests  — null graphemes, sandhi, /v/ preservation, schwa
  ca        71 tests  — ela geminada, vowel reduction, digraphs, diphthongs
  gl        57 tests  — seseo, nasal vowels, null lh/nh/ç, apical s
  eu        47 tests  — sibilant contrast (s̺/s̻), affricates, phonemic h
  ast       50 tests  — distinción, x→ʃ isogloss, ll→ʎ, aspirated h notation
  an        59 tests  — /v/ preservation, seseo, affricates ts/dz, ix→ʃ
  mwl       13 tests  — inheritance from ast-PT-x-medieval, ancestry
  dialects  40 tests  — es-AR, es-ES-x-andalusia-e, ca-x-valencia, ca-x-balear,
                        pt-BR, gl-x-occidental

Cross-language isogloss tests (10):
  distinción vs seseo, /v/ preservation, ll realisation, ch realisation,
  rr uvular vs alveolar, h silent vs phonemic, apical vs predorsal s,
  ast x→ʃ vs es x→ks, phonological distance clustering, Basque isolation

Coverage reporter (conftest.py):
  pytest_terminal_summary prints a per-language pass/fail/total/% table
  at the end of any run touching test_iberian.py.
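
A minimal sketch of how such a reporter could be written, not the code in this PR. `pytest_terminal_summary`, `terminalreporter.stats`, `write_sep` and `write_line` are real pytest APIs; the helper names `language_of` and `summarise`, and the choice to key the table on the test class name, are assumptions made for illustration.

```python
from collections import defaultdict


def language_of(nodeid: str) -> str:
    """Hypothetical grouping key: the test class segment of a pytest node id."""
    parts = nodeid.split("::")
    return parts[1] if len(parts) > 2 else "(module-level)"


def summarise(stats: dict) -> dict:
    """Count passed/failed reports per class for tests in test_iberian.py."""
    table = defaultdict(lambda: {"passed": 0, "failed": 0})
    for outcome in ("passed", "failed"):
        for report in stats.get(outcome, []):
            nodeid = getattr(report, "nodeid", "")
            if "test_iberian.py" not in nodeid:
                continue  # only report on the Iberian suite
            table[language_of(nodeid)][outcome] += 1
    return dict(table)


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    """pytest hook: print a pass/fail/total/% table at the end of the run."""
    table = summarise(terminalreporter.stats)
    if not table:
        return  # the run did not touch test_iberian.py
    terminalreporter.write_sep("-", "per-language coverage")
    for name, counts in sorted(table.items()):
        total = counts["passed"] + counts["failed"]
        pct = 100.0 * counts["passed"] / total
        terminalreporter.write_line(
            f"{name:<20} {counts['passed']:>4} {counts['failed']:>4} {total:>5} {pct:6.1f}%"
        )
```

Keeping the aggregation in a plain function means it can be unit-tested without starting a pytest session.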

Run with: pytest tests/test_iberian.py -v --tb=short

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sources

- Add LinguisticSource frozen dataclass to types.py
- Add sources: Tuple[LinguisticSource, ...] to LanguageSpec
- Update json_loader.py to parse sources array
- Update SCHEMA.md to document sources field
- Add sources arrays to 33 Germanic language JSON files:
  en-GB, en-US, en-AU, en-CA, en-IE, en-ZA, en-GB-x-scotland,
  de-DE, de-AT, de-CH, nl, nl-NL, nl-BE, sv, sv-x-rikssvenska,
  nb, nn, no, da, da-x-copenhagen, is, fo, af, nds, enm, ang,
  non, osx, goh, gem, gem-x-ingvaeonic, gem-x-north, gem-x-northwest
- Add tests/test_sources.py (marked @pytest.mark.linguistic)
- Create docs/bibliography.md with Phase 1 sources
- Update PLAN.md and TODO.md with audit phase tracking
- Update MAINTENANCE_REPORT.md
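
For orientation, a frozen dataclass plus tuple-typed field of the kind described above might look like the following. This is a sketch only: the field names (`citation`, `year`, `url`) and the `parse_sources` helper are hypothetical, and the real schema is whatever SCHEMA.md documents.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LinguisticSource:
    # Hypothetical fields, chosen only to illustrate the pattern.
    citation: str
    year: Optional[int] = None
    url: Optional[str] = None


def parse_sources(raw: dict) -> Tuple[LinguisticSource, ...]:
    """Turn an optional JSON 'sources' array into an immutable tuple."""
    return tuple(LinguisticSource(**entry) for entry in raw.get("sources", []))
```

A frozen dataclass inside a tuple keeps the whole `sources` value hashable and immutable, which suits spec objects that are cached or compared.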

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dDistance, positional_divergence

Part A — bug fixes and hardening:
- segment_distance(strict=True) raises ValueError for unknown IPA segments
- _build_ancestor_graph() detects circular ancestry and raises ValueError
- _get_ancestry_weights_by_code() cached with lru_cache(maxsize=256) via thin wrapper
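
Circular-ancestry detection can be sketched as a depth-first search that remembers which nodes are still on the current path. The function below is illustrative and is not `_build_ancestor_graph()` itself; the `parents` mapping (language code to list of parent codes) is an assumed input shape.

```python
from typing import Dict, List


def check_acyclic(parents: Dict[str, List[str]]) -> None:
    """Raise ValueError if any language is its own (transitive) ancestor."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current path / fully explored
    colour: Dict[str, int] = {}

    def visit(code: str, path: List[str]) -> None:
        state = colour.get(code, WHITE)
        if state == GREY:
            # We reached a node that is still on the current path: a cycle.
            cycle = path[path.index(code):] + [code]
            raise ValueError("circular ancestry: " + " -> ".join(cycle))
        if state == BLACK:
            return
        colour[code] = GREY
        for parent in parents.get(code, []):
            visit(parent, path + [code])
        colour[code] = BLACK

    for code in parents:
        visit(code, [])
```

Raising at load time turns what would otherwise be unbounded recursion during ancestry lookups into an immediate, readable error.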

Part B — new metrics:
- phoneme_coverage(spec_native, spec_target) -> float  (asymmetric L2 transfer estimate)
- WeightedDistance frozen dataclass added to types.py
- weighted_full_distance() single entry-point with configurable w_inventory/grapheme/allophone/ancestry
- positional_divergence() measures positional-override divergence between two specs
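
To show why a coverage-style metric is asymmetric, here is one plausible definition over bare phoneme sets: the share of the target inventory that the native inventory already contains. This is an assumption for illustration; the actual `phoneme_coverage` takes two specs and may weight or match phonemes differently.

```python
def coverage_of_inventories(native: set, target: set) -> float:
    """Fraction of target phonemes already present in the native inventory."""
    if not target:
        return 1.0  # nothing to acquire
    return len(native & target) / len(target)


# A learner with a small inventory facing a larger one, and the reverse.
small = {"p", "t", "k"}
large = {"p", "t", "k", "θ", "ð"}
print(coverage_of_inventories(small, large))  # 3 of 5 target phonemes known
print(coverage_of_inventories(large, small))  # every target phoneme known
```

Because the denominator is the target inventory, swapping the arguments changes the result whenever the inventories differ in size.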

Part C — tests & docs:
- 13 new tests in tests/test_distance.py (all pass, no regressions)
- docs/distance.md extended with sections for all new functions and weight-tuning guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New test files covering 9 language families (956 tests total):
- test_germanic.py: de-DE, de-AT, Bavarian, nl-NL, nl-BE, af, sv, da, nb, is
- test_celtic.py: cy, ga, gd, br, gv, kw
- test_slavic.py: ru, pl, cs, bg, sk, uk, be, hr/sl/sr/mk
- test_romance_extended2.py: Italian dialects, Romanian, Sardinian, Aranese,
  Caribbean Spanish, Medieval Spanish, Brazilian/Portuguese dialects
- test_indo_iranian.py: hi, sa, fa, fa-x-tehran, fa-AF, tr
- test_arabic.py: arb, ar-x-mashriqi, ar-x-maghrebi, ar-MA, ar-x-gulf, ar-IQ
- test_other_languages.py

Also add germanic/celtic/slavic pytest markers to conftest.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AUDIT.md: mark resolved items (feats.py type annotations, json_loader.py
comment, pyproject.toml description, en-GB.json, LinguisticSource); restructure
open issues; update date to 2026-03-17.

MAINTENANCE_REPORT.md: add transparency report for multi-family language test
suites session (Germanic/Celtic/Slavic/Romance/Indo-Iranian/Arabic, +956 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes stale en/es/fr/pt-BR sections that exercised the old dict-based
grapheme API; replaces with a minimal pt-PT demo compatible with the current
list-based grapheme structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
02_distance_metrics.py — segment features, inventory/grapheme/allophone
  distances, ancestry similarity, phoneme_coverage, weighted_full_distance,
  pairwise matrix
03_tokenizer.py — PhonetokTokenizer: maximal-munch segmentation, TokenKind,
  ipa_beam with allophone expansion, multi-language comparison
04_dialect_transforms.py — DIALECT_PROFILES inspection, apply_transform,
  debias_lisbon, cross-dialect word comparison for Portuguese
05_script_distance.py — SCRIPT_REGISTRY, ScriptFeatures, pairwise script
  distance matrix, closest/farthest pairs, feature analysis
06_sandhi.py — SandhiEngine, French liaison rules, obligatory_only mode,
  custom Sanskrit sandhi rules, languages-with-sandhi survey

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…an (6 files)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JarbasAl and others added 30 commits June 11, 2026 14:46
- scripts/benchmark.py evaluates the engine against the gold sets
  (Portal da Lingua Portuguesa lexicon via tugalex, WikiPron, CMUdict
  via scriptconv, the Mirandese gold set) with multi-reference
  segmentation-free PER/WER
- docs/benchmarks.md documents each dataset's source and provenance
  tier, the methodology, and the reproducible reference table
- README links the benchmarks page
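
As a rough illustration of a multi-reference, segmentation-free error rate: treat each transcription as a plain sequence of Unicode symbols, compute the edit distance to every accepted reference, and keep the best score. This is a sketch under those assumptions, not the logic of scripts/benchmark.py; in particular, how the script normalises symbols and aggregates over a corpus is not shown in the commit message.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over symbols, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def multi_ref_per(hyp: str, refs: list) -> float:
    """Lowest per-symbol error rate of hyp against any of the references."""
    return min(edit_distance(hyp, ref) / max(len(ref), 1) for ref in refs)
```

Taking the minimum over references avoids penalising the engine for choosing one of several pronunciations the gold set accepts.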
…cks (#50)

* feat(data): from-start stress anchoring and fixed-stress language blocks

Extend StressRules.default_position to accept positive values 1..2
(from-start anchoring; 1=first syllable, 2=second syllable) in addition
to the existing -4..-1 end-anchored range; 0 is explicitly rejected.
Update the pydantic validator, StressRules docstring, detect_stress
(positive → min(pos-1, n-1) index from start), and extend _VOWELS in
syllabify to cover Greek vowel characters.
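
The index arithmetic described above can be sketched as follows. The positive branch follows the stated `min(pos-1, n-1)` rule; the clamping in the end-anchored branch is an assumption, and `stress_index` is a hypothetical name rather than the real `detect_stress` signature.

```python
def stress_index(default_position: int, n_syllables: int) -> int:
    """Map a default_position to a 0-based syllable index."""
    if n_syllables < 1:
        raise ValueError("word has no syllables")
    if default_position == 0 or not -4 <= default_position <= 2:
        raise ValueError("default_position must be in -4..-1 or 1..2")
    if default_position > 0:
        # From-start anchoring: 1 = first syllable, 2 = second syllable,
        # clamped so short words still get a valid index.
        return min(default_position - 1, n_syllables - 1)
    # End-anchored: -1 = last, -2 = penultimate. Clamping to the first
    # syllable for short words is an assumption of this sketch.
    return max(n_syllables + default_position, 0)
```

For example, `stress_index(2, 1)` returns 0 because a monosyllable has no second syllable to stress.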

Seed stress blocks for 15 fixed/predictable-stress languages:
- Initial stress (default_position 1): cs, sk, fi, et, hu, lv, is, hsb, dsb
- Penultimate stress (default_position -2): eo, pl, sw, szl, csb
- Written-accent stress (marked_vowels): el (ά έ ή ί ό ύ ώ ΐ ΰ)

Add dialytika-tonos graphemes ΐ ΰ to el to satisfy the
marked_vowels ⊆ graphemes invariant. Add 130+ new gold stress test
cases across all seeded languages and 8 schema-level unit tests for
the positive-position extension.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(stress): cover extended-Latin vowels in the naive syllabifier

- polish nasal hooks, hungarian double acutes, czech/slovak hacek and
  ring vowels, baltic macrons/ogoneks and dotless i now count as
  nuclei (dziekuje syllabifies dzie-ku-je)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…51)

Add declarative stress blocks to 13 Romance specs that lacked them:
- es-ES, es-419, es-MX, es-AR: Iberian accentuation (default paroxytone;
  r/l/d/z/j/x oxytone attractors; n/s penult; written accents win)
- it-IT: default paroxytone; marked vowels à è é ì ò ó ù override;
  honest note that unaccented proparoxytones (tavola) are lexical
- ca: default paroxytone; consonant-final (r l n t m p k …) oxytone;
  s-final paroxytone; marked vowels à è é í ï ò ó ú ü win
- an (Aragonese), ast (Asturian): Iberian pattern analogous to Spanish
- oc, oc-x-aranes: classical Occitan norm — default oxytone, vowel/s
  final → paroxytone, infinitive -ar/-er/-ir penult; accents win
- mwl-x-sendim, mwl-x-ifanes: copy mwl block (no stress divergence)
- ro-RO: SKIPPED — Romanian stress is fully lexical, no orthographic cues
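
The Spanish-style block listed first can be read as a three-step decision: a written accent wins, then the listed final consonants attract final stress, otherwise stress is penultimate. A toy version of that decision, not the engine's declarative rule format, with hypothetical names throughout:

```python
import unicodedata

ACCENTED = set("áéíóú")          # written accents that override the default
OXYTONE_FINALS = set("rldzjx")   # final consonants listed as oxytone attractors


def spanish_default_stress(word: str):
    """Return ("marked", char_index) or ("position", -1 / -2)."""
    w = unicodedata.normalize("NFC", word.lower())
    for i, ch in enumerate(w):
        if ch in ACCENTED:
            return ("marked", i)      # written accent wins
    if w and w[-1] in OXYTONE_FINALS:
        return ("position", -1)       # oxytone: final syllable
    return ("position", -2)           # vowel, n or s final: paroxytone
```

With this sketch, `cantar` resolves to final stress while `casa` and `joven` resolve to penultimate stress. Monosyllables and final consonants outside the listed set are not handled specially here.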

Add Brazilian Portuguese positional vowel reduction (Part 2):
- word_final: e→i, o→u, a→ɐ (categorical)
- posttonic non-final: e→i, o→u, a→ɐ
- pretonic (nucleus_unstressed): o stays o (weaker than EP, not u-first)
pt-BR PER: 0.42 → 0.29 (target <0.30 reached); pt-PT PER unchanged at 0.17

Add tests/test_stress_romance.py: 57 gold stress cases + pt-BR
reduction invariant checks.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…52)

Part 1 – stress blocks:
- tr/az/tk/uz: word-final default (-1); Latin Turkic, no marked_vowels
- kk/ky/tt/ba: word-final default (-1); Cyrillic Turkic, no marked_vowels
- id/ms: penultimate default (-2); Austronesian, notes honest about schwa-penult variation
- he: milra/final default (-1); notes document milel exceptions not modeled

Part 2 – deferred data gaps:
- cel: remove ē (absent from Proto-Celtic per Matasović 2009); add diphthongs ai/ei/oi/au/ou
- si: add unaspirated tʃ/dʒ candidates for ඡ/ඣ (aspiration orthographic only in modern Sinhala)
- mni: ancestor note clarified — Meiteic is sister branch of Kuki-Chin, not daughter (VanBik 2009)
- mni-x-proto-kuki-chin: remove Meitei/Manipuri from descendant list; add sister-branch note
- nds: add positional_graphemes g (word-initial [ɡ], intervocalic [ɣ], coda/final [x])
- tcy: add ĕ→ɛ and ŭ→ɨ (distinctive Tulu vowels beyond Sanskrit/Kannada grid)
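
The positional treatment of nds g can be illustrated with a small lookup by context. This is a simplified sketch: the vowel set, the crude coda test (next character is not a vowel) and the fallback value are all assumptions, and the real spec expresses these contexts through `positional_graphemes` rather than code.

```python
VOWELS = set("aeiouäöü")  # assumed vowel letters, for illustration only
IPA_G = "\u0261"          # IPA script g, as used in the inventories
IPA_GAMMA = "\u0263"      # voiced velar fricative


def realise_g(word: str, i: int) -> str:
    """Pick a realisation for the letter g at index i from its position."""
    if word[i] != "g":
        raise ValueError("index does not point at g")
    if i == 0:
        return IPA_G                      # word-initial
    is_last = i == len(word) - 1
    if is_last or word[i + 1] not in VOWELS:
        return "x"                        # coda or word-final
    if word[i - 1] in VOWELS:
        return IPA_GAMMA                  # intervocalic
    return IPA_G                          # fallback, an assumption
```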

Tests: 91 new in tests/test_stress_other.py (stress gold cases + deferred assertions)
Suite: 15662 passed, 5 skipped; 387 valid

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…ment (#58)

* feat(benchmarks): espeak-ng agreement bench for TTS front-end replacement

- scripts/espeak_agreement.py measures symbol-level compatibility with
  espeak-ng output on shared word lists: exact and stress-blind exact
  match, segmental similarity, and the out-of-inventory symbol rate
  that decides whether a TTS model trained on espeak phonemes can
  accept this engine as a drop-in front-end
- docs/benchmarks.md documents the methodology and reference table,
  explicitly framed as agreement, not accuracy
- fix(data): the deletion marker in candidate lists is the empty
  string per the schema; 59 specs carried a null-sign symbol instead,
  which leaked as a literal character into engine output
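
Three of the agreement measures named above can be sketched in a few lines (segmental similarity is left out). The function and its inputs are hypothetical: `pairs` holds (engine output, espeak-ng output) strings and `inventory` is the set of symbols the downstream TTS model was trained on. It is not the code in scripts/espeak_agreement.py.

```python
STRESS_MARKS = "ˈˌ"  # primary and secondary stress


def strip_stress(ipa: str) -> str:
    return "".join(ch for ch in ipa if ch not in STRESS_MARKS)


def agreement(pairs: list, inventory: set) -> dict:
    """Exact match, stress-blind match, and out-of-inventory symbol rate."""
    if not pairs:
        raise ValueError("need at least one word pair")
    n = len(pairs)
    exact = sum(ours == ref for ours, ref in pairs)
    blind = sum(strip_stress(ours) == strip_stress(ref) for ours, ref in pairs)
    # Symbols the engine emits that the reference inventory does not contain.
    symbols = [ch for ours, _ in pairs for ch in strip_stress(ours)]
    oov = sum(ch not in inventory for ch in symbols)
    return {
        "exact": exact / n,
        "stress_blind": blind / n,
        "oov_rate": oov / len(symbols) if symbols else 0.0,
    }
```

The out-of-inventory rate is the figure that matters for drop-in use: a model trained on espeak phonemes has no embedding for a symbol it has never seen, however accurate that symbol may be.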

* test: deletion-candidate assertions use the empty-string marker

* fix(data): align deletion-marker prose and normalize ascii g in IPA values

- ext notes describe coda-s deletion without the retired null sign
- five specs carried ascii g inside IPA candidate values (palatalized
  and labialized clusters); normalized to the IPA script g the
  inventories use
… research (#63)

* feat(data): european portuguese regional dialects grounded in dialect research

Ground seven EP regional dialect specs in DIALECT_PATTERNS.md feature matrix
and whitepaper5 IPA transform inventory (TigreGotico internal dialect research).
Add 250-sentence gold fixture (ep_dialect_sentences.csv), benchmark loader with
dialect-code mapping, and 25 signature-word tests.

Dialect features encoded (categorical rules only; sporadic/lexical items in
notes):

- pt-PT-x-lisbon: ei→ɐj (Lisbon diphthong lowering), ou→o monophthong,
  unstressed a/o positional reductions
- pt-PT-x-porto: v→b betacism (categorical merger), ou→ow diphthong
  preservation, ei→ej preservation (not lowered to ɐj); apicoalveolar
  sibilant allophones retained
- pt-PT: ou→o as primary candidate (conservative standard default)
- pt-PT-x-alentejo: intervocalic d→∅ deletion (positional_graphemes),
  ei→e monophthong, ou→o, meu digraph simplification
- pt-PT-x-algarve: ei→e monophthong, ou→o, word-final/coda s→ʒ sibilant
  voicing (distinct from Lisbon ʃ)
- pt-PT-x-madeira: ões→õns, ães→ɐ̃ns nasal-diphthong→nasal+n (digraph
  graphemes)
- pt-PT-x-acores: u/ú→y (São Miguel fronted-u, categorical grapheme
  override), ou→ow preservation, ões/ães→nasal+n (shared with Madeira)

Notes-only (schema limit or lexical/sporadic): l-palatalization
(quilo→quilho, Filipa→fʎipɐ), Madeira r-diphthongization, Açores
boi→bô / e→i extreme reduction, Porto-style /e/→[je] diphthongization.

All sources updated with TigreGotico internal dialect research entry.
Schema validation: 390/390 passed. Full suite: 15847 passed, 0 regressions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(benchmark): csv-module parsing and typed annotations in the ep loader

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…ources (#65)

* fix(data): asturleonese family and barranquenho grounded in primary sources

Ground all Asturleonese family specs and Barranquenho in published primary
sources per extraction notes from Morala & Egido 2009, Propuesta de Norma
Ortográfica, Frías-Conde El Habla de Sanabria, Macias 2003 Dialecto rionorês,
and the Convenção Ortográfica do Barranquenho 2025 + Gramática Básica 2025.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(data): tie-bar affricate notation and barranquenho nasal em/en encoding

- ast-x-occidental che vaqueira uses the tie-bar affricate matching its
  parent inventory
- barranquenho em/en carry positional nasal overrides (word-final, coda,
  pre-consonant → ẽ per Convenção p. 26: tempu, quen) with the oral
  default preserved intervocalically; the misnamed test now asserts the
  em/en behaviour it documents

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Extend the benchmark harness with 28 new WikiPron language entries
(it, fr, de, nl, pl, fi, ro, ast, oc, sv, da, nb, is, cy, ga, gd,
el, hy, sk, hr, sq, tr, eu, tl, eo, hi, ta, ml) and a new ipadict
dataset loader for Icelandic (~60k entries, Hjal project, CC BY 3.0).

All added WikiPron TSVs are from CUNY-CL/wikipron data/scrape/tsv/,
community-curated by Wiktionary editors (CC-BY-SA). The ipadict loader for
Icelandic (is) strips the /slashes/ from the source format and falls through
the same normalise+evaluate path.
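
A minimal sketch of the slash-stripping step, assuming each source line has the form `word<TAB>/ipa/` with optional comma-separated variants. That layout is an assumption about the upstream file, and `parse_ipadict_line` is a hypothetical name, not the loader in this PR.

```python
def parse_ipadict_line(line: str):
    """Split 'word<TAB>/ipa/, /ipa2/' into (word, [ipa, ipa2])."""
    word, _, prons = line.rstrip("\n").partition("\t")
    # Each variant is wrapped in slashes; drop them and any padding.
    refs = [p.strip().strip("/") for p in prons.split(",")]
    return word, [r for r in refs if r]
```

Returning every variant as a list lets the entry feed directly into a multi-reference evaluation.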

Rejected/excluded: ipa-dict fi/es/ar/fa/fr-QC/vi (tool-generated per
README), Lexique (custom notation not covered by scriptconv), NST
(no stable download URL), CELEX2/GlobalPhone (proprietary licenses).
Reference numbers for all 40 dataset×lang rows added to
docs/benchmarks.md.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* feat(data): portuguese dialect stress and reduction systems

Add stress blocks (Acordo Ortográfico 1990) to all 32 Portuguese variant
specs that lacked them: pt-AO, pt-MZ, pt-TL, pt-CV, pt-GW, pt-ST, pt-MO,
12 pt-PT-x-* regional varieties, and 12 pt-BR-x-* regional varieties.
Also add positional-vowel and consonant overrides for the three African/
Timorese varieties with benchmark gold sets, and fix four notes-vs-data
contradictions in PT-PT regional specs.

PER trajectories (portuguese_lexicon, n=300):
  pt-AO  0.398 → 0.180  (−55%)
  pt-MZ  0.288 → 0.240  (−17%)
  pt-TL  0.487 → 0.222  (−54%)
  pt-PT  0.167 → 0.167  (unchanged)
  pt-BR  0.290 → 0.290  (unchanged)
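
The percentages in the table are relative changes, (after − before) / before, which can be checked directly from the listed PER values:

```python
def relative_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the error rate dropped."""
    return 100.0 * (after - before) / before


for lang, before, after in [
    ("pt-AO", 0.398, 0.180),
    ("pt-MZ", 0.288, 0.240),
    ("pt-TL", 0.487, 0.222),
]:
    print(f"{lang}  {before:.3f} -> {after:.3f}  ({relative_change(before, after):+.0f}%)")
```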

Key data changes:
- pt-AO: full five-vowel system (pretonic a→a not ɐ); r→ɾ positional
  (alveolar trill/flap throughout); word-final o→ʊ; alveolar sibilants.
- pt-MZ: e nucleus_unstressed→e (not ɨ; notes: "unstressed /e/→[e] not [ɨ]");
  r positional alveolar ɾ; l coda→w (notes: "l in coda often vocalised to [w]");
  alveolar sibilants; a reduction kept EP-like (notes: "follows European norms
  more closely than Angola").
- pt-TL: ə as the key unstressed vowel in all positions (notes: "FIVE-VOWEL
  SYSTEM: /ɨ/ absent; full unstressed vowels" + Tetum open-syllable contact);
  ɨ→ə allophone added; r→ɾ/r; alveolar sibilants; l coda→l (no velarisation).
- pt-CV/pt-GW/pt-ST/pt-MO: stress blocks only (no golds — no positional
  speculation beyond what notes document).

Notes-vs-data contradictions fixed:
1. pt-PT-x-acores: notes "5. /a/ in unstressed syllables realised as [a] rather
   than mainland [ɐ]" — added positional_graphemes a:pretonic→a (was inheriting
   pt-PT's ɐ).
2. pt-PT-x-algarve: notes "1. minimal unstressed vowel reduction — /e/ and /o/
   are often preserved as full vowels in unstressed syllables" — added
   positional_graphemes e/o: nucleus_unstressed→e/o (was inheriting full EP
   reduction).
3. pt-PT-x-trasosmontes: notes "1. most systematic retroflex sibilants… /s/ and
   /z/ realised as [ʂ] and [ʐ] in nearly all environments including onset, coda,
   and intervocalic" — added positional_graphemes s/z overriding to ʂ/ʐ (spec
   had only allophones, no positional rule).
4. pt-PT-x-viana: notes "Retroflex sibilants [ʂ, ʐ] in final position" — added
   positional_graphemes s:coda/word_final→ʂ (spec had allophones but no
   positional rule; coda sibilants were falling through to parent pt-PT's ʃ).
5. pt-PT-x-medieval: stress marked_vowels set to [à] only — modern à is the only
   accent documented in medieval orthography; modern acutes/circumflexes were not
   grapheme-mapped in this spec (failed TestStressIntegration.test_marked_vowels).

Add tests/test_stress_pt_dialects.py: 80 new tests covering stress placement
for all seven core African/Timorese/Macanese varieties plus parametrised
pt-PT-x-* and pt-BR-x-* regional stress + the four notes-fix cases.

387 valid specs; 15902 passed, 5 skipped.

* fix(benchmark): measure the local checkout, not the installed package
Add major missing Arabic dialects and a new Levantine intermediate node;
correct the Hejazi qaf reflex; add a per-dialect gold transcription suite.

New specs:
- ar-x-levantine (intermediate under ar-x-mashriqi): qaf->ʔ urban, jim->ʒ,
  interdentals variable
- ar-EG (Cairene): jim->ɡ, qaf->ʔ, interdental merger, čeh loan grapheme->ʒ
- ar-SY, ar-LB, ar-JO, ar-PS (Levantine children, country signatures)
- ar-SD (Sudanese): qaf->ɡ, jim->ɟ palatal, interdental merger

Fix:
- ar-SA-x-hejaz qaf ʔ->ɡ (chain shift q->g->dʒ; ʔ was a misattributed
  Levantine/Egyptian feature)

Tests:
- tests/test_arabic_dialects.py (59 items): word-level consonant reflexes
  via the engine + grapheme-level variant/merger checks

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

2 participants