Release 0.6.0a1#38

Open
github-actions[bot] wants to merge 113 commits into master from release-0.6.0a1

Human review requested!

JarbasAl and others added 30 commits February 21, 2026 16:34
* feat: positional graphemes

* feat: positional graphemes

* feat: positional graphemes
* reconstruct latin graphemes

* fix(pt-PT): model 4-way sibilant distinction

* feat: add pt-BR

* feat: more positional contexts

* refactor: drop redundant "default" from positional mappings

* allow trema in pt-BR graphemes
* feat: asturian

* feat: galician

* feat: galician
…fix type hints and metadata

- Create PLAN.md: architecture overview, planned phases, data roadmap
- Create TODO.md: prioritised task list (blocking → low)
- Create QUICK_FACTS.md: package identity, key classes, quick usage examples
- Create AUDIT.md: known issues with file:line citations, CI gaps, tech debt
- Create SUGGESTIONS.md: 10 proposals for refactors and enhancements
- feats.py:39 — add Dict, Optional to typing imports
- feats.py:56 — annotate phone_features: Dict[str, List[Optional[bool]]]
- json_loader.py:115 — clarify self-reference cycle comment (was TODO - error log, illegal)
- pyproject.toml:8 — update description from "20+ languages" to "308+ language codes"

All 7375 tests pass. No behavioral changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/test_iberian.py with extensive per-language test classes for
all Iberian Peninsula languages, plus a per-language coverage reporter
in conftest.py.

Languages covered (15 classes, 485 tests):
  es-ES     86 tests  — graphemes, allophones, positional rules, isoglosses
  pt-PT     62 tests  — null graphemes, sandhi, /v/ preservation, schwa
  ca        71 tests  — ela geminada, vowel reduction, digraphs, diphthongs
  gl        57 tests  — seseo, nasal vowels, null lh/nh/ç, apical s
  eu        47 tests  — sibilant contrast (s̺/s̻), affricates, phonemic h
  ast       50 tests  — distinción, x→ʃ isogloss, ll→ʎ, aspirated h notation
  an        59 tests  — /v/ preservation, seseo, affricates ts/dz, ix→ʃ
  mwl       13 tests  — inheritance from ast-PT-x-medieval, ancestry
  dialects  40 tests  — es-AR, es-ES-x-andalusia-e, ca-x-valencia, ca-x-balear,
                        pt-BR, gl-x-occidental

Cross-language isogloss tests (10):
  distinción vs seseo, /v/ preservation, ll realisation, ch realisation,
  rr uvular vs alveolar, h silent vs phonemic, apical vs predorsal s,
  ast x→ʃ vs es x→ks, phonological distance clustering, Basque isolation

Coverage reporter (conftest.py):
  pytest_terminal_summary prints a per-language pass/fail/total/% table
  at the end of any run touching test_iberian.py.

Run with: pytest tests/test_iberian.py -v --tb=short
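
A minimal sketch of how a reporter like this could be wired into conftest.py. It assumes each language has its own test class, so the class name in the pytest node id identifies the language. The helper name `summarise` and the class names in the usage note are hypothetical; only the `pytest_terminal_summary` hook and the terminal reporter's `stats`, `write_sep` and `write_line` members are pytest API.

```python
from collections import defaultdict


def summarise(results):
    """Aggregate (nodeid, passed) pairs into {class: (passed, failed, total, pct)}."""
    counts = defaultdict(lambda: [0, 0])  # class name -> [passed, failed]
    for nodeid, passed in results:
        parts = nodeid.split("::")
        # "file.py::TestClass::test_name" -> "TestClass"; module-level tests are grouped.
        cls = parts[1] if len(parts) > 2 else "(module)"
        counts[cls][0 if passed else 1] += 1
    return {
        cls: (ok, bad, ok + bad, 100.0 * ok / (ok + bad))
        for cls, (ok, bad) in counts.items()
    }


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    """Print a per-class table for any run that touched test_iberian.py."""
    results = []
    for outcome in ("passed", "failed"):
        for report in terminalreporter.stats.get(outcome, []):
            if "test_iberian.py" in report.nodeid:
                results.append((report.nodeid, outcome == "passed"))
    if not results:
        return  # nothing from test_iberian.py ran, so stay silent
    terminalreporter.write_sep("-", "per-language coverage")
    for cls, (ok, bad, total, pct) in sorted(summarise(results).items()):
        terminalreporter.write_line(f"{cls:<28} {ok:>4} {bad:>4} {total:>5} {pct:6.1f}%")
```

Keeping the aggregation in a plain function means the table logic can be unit-tested without running pytest itself, for example `summarise([("tests/test_iberian.py::TestGalician::test_seseo", True)])`.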

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sources

- Add LinguisticSource frozen dataclass to types.py
- Add sources: Tuple[LinguisticSource, ...] to LanguageSpec
- Update json_loader.py to parse sources array
- Update SCHEMA.md to document sources field
- Add sources arrays to 33 Germanic language JSON files:
  en-GB, en-US, en-AU, en-CA, en-IE, en-ZA, en-GB-x-scotland,
  de-DE, de-AT, de-CH, nl, nl-NL, nl-BE, sv, sv-x-rikssvenska,
  nb, nn, no, da, da-x-copenhagen, is, fo, af, nds, enm, ang,
  non, osx, goh, gem, gem-x-ingvaeonic, gem-x-north, gem-x-northwest
- Add tests/test_sources.py (marked @pytest.mark.linguistic)
- Create docs/bibliography.md with Phase 1 sources
- Update PLAN.md and TODO.md with audit phase tracking
- Update MAINTENANCE_REPORT.md
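
For orientation, a frozen source record and its parser might look roughly like the sketch below. The field names (`title`, `author`, `year`, `url`) and the `parse_sources` helper are assumptions made for illustration; the commit only states that `LinguisticSource` is a frozen dataclass and that `LanguageSpec.sources` is a tuple of them.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Mapping, Optional, Tuple


@dataclass(frozen=True)
class LinguisticSource:
    # Field names are hypothetical; the real dataclass may differ.
    title: str
    author: Optional[str] = None
    year: Optional[int] = None
    url: Optional[str] = None


def parse_sources(
    raw: Optional[Iterable[Mapping[str, Any]]],
) -> Tuple[LinguisticSource, ...]:
    """Turn a JSON "sources" array into an immutable tuple (empty if absent)."""
    return tuple(
        LinguisticSource(
            title=entry["title"],
            author=entry.get("author"),
            year=entry.get("year"),
            url=entry.get("url"),
        )
        for entry in (raw or ())
    )
```

A frozen dataclass inside a tuple keeps the whole field hashable and immutable, which suits a spec object that is loaded once and shared.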

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dDistance, positional_divergence

Part A — bug fixes and hardening:
- segment_distance(strict=True) raises ValueError for unknown IPA segments
- _build_ancestor_graph() detects circular ancestry and raises ValueError
- _get_ancestry_weights_by_code() cached with lru_cache(maxsize=256) via thin wrapper
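
The circular-ancestry check can be illustrated with a standard depth-first search that tracks the nodes on the current path. This is a sketch under assumptions, not the code in `_build_ancestor_graph()`: the function name `check_acyclic` and the plain dict-of-lists graph shape are hypothetical.

```python
from typing import Dict, List, Sequence

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def check_acyclic(parents: Dict[str, Sequence[str]]) -> None:
    """Raise ValueError if following ancestor links ever returns to a code on the path."""
    state: Dict[str, int] = {}

    def visit(code: str, path: List[str]) -> None:
        state[code] = GREY
        for parent in parents.get(code, ()):
            seen = state.get(parent, WHITE)
            if seen == GREY:
                chain = " -> ".join(path + [code, parent])
                raise ValueError(f"circular ancestry: {chain}")
            if seen == WHITE:
                visit(parent, path + [code])
        state[code] = BLACK

    for code in parents:
        if state.get(code, WHITE) == WHITE:
            visit(code, [])
```

Calling `check_acyclic({"a": ["b"], "b": ["a"]})` raises `ValueError`, while a chain that ends in a root passes silently.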

Part B — new metrics:
- phoneme_coverage(spec_native, spec_target) -> float  (asymmetric L2 transfer estimate)
- WeightedDistance frozen dataclass added to types.py
- weighted_full_distance() single entry-point with configurable w_inventory/grapheme/allophone/ancestry
- positional_divergence() measures positional-override divergence between two specs
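
To make the asymmetry concrete, here is one plausible reading of the two ideas, written over plain phoneme sets and dicts rather than the library's spec objects. Both function names, the set-overlap definition and the weighted-mean formula are assumptions; the commit does not spell out the actual formulas.

```python
from typing import Dict, Set


def coverage_sketch(native: Set[str], target: Set[str]) -> float:
    """Share of the target inventory that the native inventory already contains."""
    if not target:
        return 1.0  # nothing to learn, so coverage is trivially complete
    return len(native & target) / len(target)


def weighted_mean_sketch(distances: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine component distances into one score, normalised by total weight."""
    total = sum(weights.get(name, 0.0) for name in distances)
    if total == 0:
        raise ValueError("at least one component needs a non-zero weight")
    return sum(d * weights.get(name, 0.0) for name, d in distances.items()) / total
```

With a native inventory of `{"p", "t", "k"}` and a target of `{"p", "t", "k", "θ", "ð"}`, coverage is 0.6 in one direction and 1.0 in the other, which is the sense in which the measure is asymmetric.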

Part C — tests & docs:
- 13 new tests in tests/test_distance.py (all pass, no regressions)
- docs/distance.md extended with sections for all new functions and weight-tuning guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New test files covering 9 language families (956 tests total):
- test_germanic.py: de-DE, de-AT, Bavarian, nl-NL, nl-BE, af, sv, da, nb, is
- test_celtic.py: cy, ga, gd, br, gv, kw
- test_slavic.py: ru, pl, cs, bg, sk, uk, be, hr/sl/sr/mk
- test_romance_extended2.py: Italian dialects, Romanian, Sardinian, Aranese,
  Caribbean Spanish, Medieval Spanish, Brazilian/Portuguese dialects
- test_indo_iranian.py: hi, sa, fa, fa-x-tehran, fa-AF, tr
- test_arabic.py: arb, ar-x-mashriqi, ar-x-maghrebi, ar-MA, ar-x-gulf, ar-IQ
- test_other_languages.py

Also add germanic/celtic/slavic pytest markers to conftest.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AUDIT.md: mark resolved items (feats.py type annotations, json_loader.py
comment, pyproject.toml description, en-GB.json, LinguisticSource); restructure
open issues; update date to 2026-03-17.

MAINTENANCE_REPORT.md: add transparency report for multi-family language test
suites session (Germanic/Celtic/Slavic/Romance/Indo-Iranian/Arabic, +956 tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes stale en/es/fr/pt-BR sections that exercised the old dict-based
grapheme API; replaces with a minimal pt-PT demo compatible with the current
list-based grapheme structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
02_distance_metrics.py — segment features, inventory/grapheme/allophone
  distances, ancestry similarity, phoneme_coverage, weighted_full_distance,
  pairwise matrix
03_tokenizer.py — PhonetokTokenizer: maximal-munch segmentation, TokenKind,
  ipa_beam with allophone expansion, multi-language comparison
04_dialect_transforms.py — DIALECT_PROFILES inspection, apply_transform,
  debias_lisbon, cross-dialect word comparison for Portuguese
05_script_distance.py — SCRIPT_REGISTRY, ScriptFeatures, pairwise script
  distance matrix, closest/farthest pairs, feature analysis
06_sandhi.py — SandhiEngine, French liaison rules, obligatory_only mode,
  custom Sanskrit sandhi rules, languages-with-sandhi survey
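
Maximal-munch segmentation, as named for 03_tokenizer.py, means always taking the longest grapheme that matches at the current position. The sketch below shows the technique on its own; the function name `munch` and the pass-through of unknown characters are assumptions and do not describe `PhonetokTokenizer`'s actual behaviour.

```python
from typing import Iterable, List


def munch(word: str, graphemes: Iterable[str]) -> List[str]:
    """Greedy longest-match split of word into known graphemes."""
    known = set(graphemes)
    longest = max((len(g) for g in known), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first so digraphs win over single letters.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in known:
                tokens.append(chunk)
                i += size
                break
        else:
            # Assumption: unknown characters pass through as one-character tokens.
            tokens.append(word[i])
            i += 1
    return tokens
```

For example, `munch("chuva", ["ch", "c", "h", "u", "v", "a"])` keeps `ch` together instead of splitting it into `c` and `h`.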

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…an (6 files)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JarbasAl and others added 30 commits March 25, 2026 21:17
Languages added: lt (Lithuanian), lv (Latvian), et (Estonian), mt (Maltese),
lb (Luxembourgish), sq (Albanian Tosk), hy (Armenian Eastern), ka (Georgian),
se (Northern Sami), hsb (Upper Sorbian), dsb (Lower Sorbian), csb (Kashubian),
szl (Silesian), rue (Rusyn), wa (Walloon), pcd (Picard), rup (Aromanian),
rom (Romani Vlax), yi (Yiddish), nrf (Norman), pnt (Pontic Greek), el-CY (Cypriot Greek).

All files follow the existing schema: grapheme arrays, allophones, positional_graphemes,
ancestors with role/weight, timespan, ISO 639-3, Glottolog codes, and Wikipedia sources.
Digraphs and trigraphs are listed before single-letter entries per processing order.
Non-Latin scripts covered: Armenian (hy), Georgian (ka), Cyrillic (rue), Hebrew (yi), Greek (pnt, el-CY).

AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Extend European language coverage for orthography2ipa data corpus
- Impact: 22 new research-quality language files covering Baltic, Uralic, Kartvelian, Semitic, Armenian, Slavic minority, Romance minority, Germanic minority, and Greek dialect varieties
- Verified via: python3 JSON validation — all 22 files parse correctly, grapheme counts 27–41 per language
Switch publish/release reusable workflows to OpenVoiceOS/gh-automations
@dev and build from pyproject (drop nonexistent setup.py steps). Add
build-tests, coverage, and license_check reusable workflows; replace the
custom unit_tests workflow with build-tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the parent_dialect, proto_language, ancestor, and related ancestry
roles present in the data so all specs load. Derive an identity allophone
map for specs that declare graphemes but no explicit allophones. Point
distance-test fixtures at the populated en-GB/fr-FR specs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite the README as a showcase (grapheme vs allophone maps, tokenizer,
distance metrics, CLI) with verified examples. Add a test optional-extra
for CI. Document the optional ONNX Arabic diacritizer stub and graceful
fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add pydantic v2 models mirroring the spec JSON schema with extra='forbid',
covering graphemes, allophones, positional overrides, ancestors, sandhi
rules, tone inventory, sources, timespan and inheritance keys. Field
validators enforce non-empty identifiers, ancestor weights in [0, 1] and
plausible publication years.

Expose validation via a new `orthography2ipa validate [code] [--json]`
CLI subcommand and a parametrized pytest over every data/*.json spec.
Document the previously undocumented glottolog_code, wikipedia, urls,
timespan, lexicon_csv and source wikipedia_url fields in SCHEMA.md, plus
the full GraphemePosition key set.

Add pydantic>=2 as a validation optional-extra and to the test extra so CI
runs the schema test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fold the duplicate ISO_639-3 key into the canonical iso639_3 field
(filling it where empty; es-ES-x-medieval keeps the Old Spanish code osp).
Drop the non-canonical IETF_code extension subtag from the two Asturian
specs and strip empty-string url entries. These keys had no consumer and
caused unknown-key validation failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Update phonetic representations of graphemes in an.json

* Fix JSON formatting for 'ü' entry

* Refactor phonetic representations in an.json

Removed several phonetic representations from the JSON data, including 'ɛ', 'ɔ', 'dz', 'dʒ', and 'n'. Added new entries for 'ʎ' and 'θ' in various contexts.

* Add new references to an.json for Aragonese language

* Add new IPA symbols for 'y' and adjust formatting

* Fix formatting in an.json for vowel entries

* Rename test_z_affricate_default to test_z_fricative_default

* Refactor tests for Aragonese grapheme and allophone

* Reorder phonetic symbols in an.json

* Fix JSON formatting in an.json, adding allophone

* Fix JSON formatting in an.json

* Update phonetic entries in an.json

Removed 'v' and 'β' from the phonetic representation.

* Comment out test for isogloss without theta

* Fix JSON formatting in an.json

* Update an.json
Audit all URLs in language-spec JSONs for existence. Wikipedia links are
checked via the MediaWiki API; other links via HTTP GET. Dead links are
cross-checked against the Wayback Machine. Remove 21 unambiguously dead URLs
(MediaWiki "missing" or clean 404/410) from structured fields, keeping
surrounding bibliographic text. Links returning 403/429/timeout/5xx, dead
links that are the sole Wikipedia entry of a non-stub language, dead links in
prose notes, and an explicitly protected URL are retained and flagged in
docs/link-audit.md for human review.
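
The triage rules above reduce to a small decision function. The sketch below encodes only what the message states; the function name, the flag names and the use of `None` for a timeout are hypothetical, and no network access is performed.

```python
from typing import Optional


def triage(
    status: Optional[int],
    *,
    protected: bool = False,
    sole_wikipedia_entry: bool = False,
    in_prose_note: bool = False,
) -> str:
    """Return "live", "remove" or "review" for one audited link.

    status is the HTTP status code, or None when the request timed out.
    """
    if status is not None and 200 <= status < 400:
        return "live"
    cleanly_dead = status in (404, 410)
    if cleanly_dead and not (protected or sole_wikipedia_entry or in_prose_note):
        return "remove"
    # 403/429/5xx/timeouts and every guarded dead link go to a human.
    return "review"
```

Only an unambiguous 404 or 410 with no guard flag set is removed automatically; everything else that is not live is flagged for review.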

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reconcile the Aragonese test class with the expert-revised an.json
grapheme and positional data: open vowels written è/ò, betacism on v,
ʝ-initial y, x/tʃ for g before front vowels, and dental fricative
realisations without seseo. Drop assertions for graphemes and
positional rules no longer present in the spec.
Replace the dead العربية_الجزائرية URL with the verified live
لهجة_جزائرية article and record the correction in the link audit.
…xports (#17)

- lm.py gains the future-annotations import its PEP 585 hints require
  under the declared python_requires >=3.9
- types.py union hints use Optional[...] like the rest of the codebase
- plugin discovery logs a warning identifying the entry point when a
  G2P plugin fails to load instead of silently skipping it
- get_plugin, G2PPlugin, WordContext and SandhiEngine join the public
  API exports
- package description matches the shipped language-code count
- new test module guards annotation compatibility, failure logging and
  the export surface
…dia links (#20)

- oc gains a full Lengadocian-norm inventory (56 graphemes, 29
  allophones, positional lenition/final rules) sourced from Alibert
  and Wheeler; it was the only living language resolving to an empty
  spec
- every spec now declares an explicit quality tier; the extinct
  metadata-only placeholders got, mxi and xsb are tiered stub
- the 21 dead wikipedia links kept for review are replaced with
  MediaWiki-API-verified live articles and the audit tables updated
- new guard tests: explicit tier on every spec, and non-stub specs
  must resolve to non-empty grapheme and allophone inventories
#18)

* fix: py3.9 annotation compatibility, plugin-failure logging, public exports

- lm.py gains the future-annotations import its PEP 585 hints require
  under the declared python_requires >=3.9
- types.py union hints use Optional[...] like the rest of the codebase
- plugin discovery logs a warning identifying the entry point when a
  G2P plugin fails to load instead of silently skipping it
- get_plugin, G2PPlugin, WordContext and SandhiEngine join the public
  API exports
- package description matches the shipped language-code count
- new test module guards annotation compatibility, failure logging and
  the export surface

* feat(registry): resolve bare language tags and nearest-match fallbacks

- bare primary tags resolve to a curated reference variety
  (pt -> pt-PT, en -> en-GB, ...), matching the ISO 639-3 alias
  convention rather than langcodes' population-based default
- unregistered regional tags fall back to the nearest registered code
  by language distance via ovos-spec-tools (en-NZ -> en-GB)
- public resolve() exposes the resolution without loading a spec
- _resolve_code is cached; a data-driven test asserts every registered
  code's bare primary subtag stays loadable
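
A reduced sketch of the resolution order: exact match first, then the curated reference variety for the primary subtag. The registry contents and the function name `resolve_sketch` are hypothetical, and the nearest-match step is replaced here by a plain fallback to the reference variety, because the distance lookup through ovos-spec-tools is not reproduced.

```python
from functools import lru_cache

# Hypothetical registry contents for illustration only.
REGISTERED = frozenset({"pt-PT", "pt-BR", "en-GB", "en-US"})
REFERENCE_VARIETY = {"pt": "pt-PT", "en": "en-GB"}


@lru_cache(maxsize=None)
def resolve_sketch(tag: str) -> str:
    """Map a requested tag to a registered code without loading any spec."""
    if tag in REGISTERED:
        return tag
    primary = tag.split("-")[0].lower()
    # Stand-in for the nearest-by-distance lookup described in the commit.
    if primary in REFERENCE_VARIETY:
        return REFERENCE_VARIETY[primary]
    raise KeyError(f"no registered spec for {tag!r}")
```

Under these assumptions `resolve_sketch("pt")` gives `pt-PT` and `resolve_sketch("en-NZ")` gives `en-GB`, matching the examples in the commit message.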

* ci: fix coverage install_extras and exclude self from license check

install_extras:'test' was passed as a bare package name instead of
'.[test]'; the gh-automations coverage workflow treats it as a pip
install target. The license check now excludes orthography2ipa itself
(pilosus reports Error for packages installed from source builds).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…rity dispatch (#21)

- WordContext gains defaulted fields: orthographic prev/next word,
  sentence-final flag, word index/count and the resolved language code
- G2PPlugin gains non-abstract hooks: normalize (pre-G2P text
  preparation), post_process (context-aware IPA adjustment) and a
  priority property used as dispatch tie-break
- plugin discovery keeps the highest-priority plugin when several
  claim the same language code, regardless of registration order
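
The order-independent dispatch rule can be sketched as a single pass that keeps the highest priority seen per language code. The `PluginStub` class and `select_plugins` function are hypothetical stand-ins, not the `G2PPlugin` API, and the tie rule (first seen wins on equal priority) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class PluginStub:
    # Hypothetical stand-in for a discovered G2P plugin.
    name: str
    languages: Tuple[str, ...]
    priority: int = 50


def select_plugins(plugins: Iterable[PluginStub]) -> Dict[str, PluginStub]:
    """Pick one plugin per language code, highest priority first."""
    chosen: Dict[str, PluginStub] = {}
    for plugin in plugins:
        for code in plugin.languages:
            current = chosen.get(code)
            # Strict ">" keeps the earlier plugin on a tie (assumed tie rule).
            if current is None or plugin.priority > current.priority:
                chosen[code] = plugin
    return chosen
```

Because the comparison uses only the priority value, registering the plugins in the opposite order selects the same winner whenever priorities differ.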
…#23)

* feat(schema): declarative stress rules with detection and IPA marking

- StressRules frozen dataclass on LanguageSpec (optional, own-file
  only), validated by a strict pydantic model in lockstep
- new stress module: naive vowel-group syllabifier, detect_stress
  (marked vowels > oxytone endings > penult endings > default
  position) and apply_stress_mark with end-anchored alignment so
  orthographic and IPA syllable counts may differ
- pt-PT and pt-BR seeded with Acordo Ortográfico 1990 rules; unmarked
  -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns
  and final -i/-u attract stress
- consumers with a real syllabifier pass their own syllable list
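
The precedence order (marked vowels, then oxytone endings, then penult endings, then the default position) can be sketched with a naive vowel-group count. Everything below is an illustrative assumption: the constants cover only the endings named in the commit message, the function returns the stressed syllable counted from the end of the word, and it is not the library's `detect_stress`.

```python
import re
import unicodedata
from typing import Optional

VOWELS = "aeiouáéíóúâêôãõàü"
MARKED = set("áéíóúâêô")  # accents that mark stress directly (simplified)
OXYTONE_ENDINGS = ("im", "om", "um", "ns", "i", "u")
PENULT_ENDINGS = ("em", "am")
DEFAULT_FROM_END = 2  # paroxytone by default


def detect_stress_sketch(word: str) -> Optional[int]:
    """Return the stressed syllable counted from the end (1 = final syllable)."""
    text = unicodedata.normalize("NFC", word.lower())
    # Naive syllabifier: every run of vowels counts as one syllable nucleus.
    groups = [m.group() for m in re.finditer(f"[{VOWELS}]+", text)]
    count = len(groups)
    if count == 0:
        return None
    for index, group in enumerate(groups):
        if any(ch in MARKED for ch in group):
            return count - index  # 1. an accent mark wins outright
    if text.endswith(OXYTONE_ENDINGS):
        return 1  # 2. endings that attract final stress
    if text.endswith(PENULT_ENDINGS):
        return min(2, count)  # 3. endings that keep penultimate stress
    return min(DEFAULT_FROM_END, count)  # 4. default position
```

Counting from the end mirrors the end-anchored alignment mentioned above: the result still points at the right syllable when the orthographic and IPA syllable counts differ at the start of the word.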

* ci: fix coverage install_extras and exclude self from license check

install_extras:'test' was passed as a bare package name instead of
'.[test]'; the gh-automations coverage workflow treats it as a pip
install target. The license check now excludes orthography2ipa itself.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* feat(schema): declarative stress rules with detection and IPA marking

- StressRules frozen dataclass on LanguageSpec (optional, own-file
  only), validated by a strict pydantic model in lockstep
- new stress module: naive vowel-group syllabifier, detect_stress
  (marked vowels > oxytone endings > penult endings > default
  position) and apply_stress_mark with end-anchored alignment so
  orthographic and IPA syllable counts may differ
- pt-PT and pt-BR seeded with Acordo Ortográfico 1990 rules; unmarked
  -em/-am endings stay paroxytone (homem, falam) while -im/-om/-um/-ns
  and final -i/-u attract stress
- consumers with a real syllabifier pass their own syllable list

* feat(plugin): per-language syllabifier entry-point group

- SyllabifierPlugin ABC discovered via the orthography2ipa.syllabify
  entry-point group, priority-resolved like G2P plugins
- detect_stress accepts lang= and uses the registered syllabifier for
  that language (silabificador for Portuguese, pycotovia for Galician)
  before falling back to the naive vowel-group splitter
- plugin output is trusted only when its syllables rebuild the word
- get_syllabifier exported from the package root
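
The trust rule for plugin output is easy to state in code: accept the plugin's syllables only if joining them reproduces the input word, otherwise fall back. In the sketch below the function name and the callable-based interface are hypothetical and stand in for `SyllabifierPlugin` and the naive splitter.

```python
from typing import Callable, List, Optional, Sequence

Syllabifier = Callable[[str], Sequence[str]]


def syllabify_with_fallback(
    word: str,
    plugin: Optional[Syllabifier],
    naive: Syllabifier,
) -> List[str]:
    """Use the plugin's syllables only when they rebuild the word exactly."""
    if plugin is not None:
        try:
            syllables = list(plugin(word))
        except Exception:
            syllables = []  # a failing plugin is treated like an untrusted one
        if syllables and "".join(syllables) == word:
            return syllables
    return list(naive(word))
```

The round-trip check guards against plugins that normalise, drop or rewrite characters, which would otherwise misalign the stress mark with the original spelling.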

* ci: fix coverage install_extras and exclude self from license check

install_extras:'test' was passed as a bare package name instead of
'.[test]'; the gh-automations coverage workflow treats it as a pip
install target. The license check now excludes orthography2ipa itself.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>