20 changes: 6 additions & 14 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,28 +1,20 @@
# Changelog

## [0.2.0a2](https://github.com/TigreGotico/tugaphone/tree/0.2.0a2) (2026-02-06)
## [0.2.2a2](https://github.com/TigreGotico/tugaphone/tree/0.2.2a2) (2026-05-29)

[Full Changelog](https://github.com/TigreGotico/tugaphone/compare/0.2.0a1...0.2.0a2)
[Full Changelog](https://github.com/TigreGotico/tugaphone/compare/0.2.2a1...0.2.2a2)

**Merged pull requests:**

- Configure Renovate [\#3](https://github.com/TigreGotico/tugaphone/pull/3) ([renovate[bot]](https://github.com/apps/renovate))
- docs: add docs/ and examples/ [\#22](https://github.com/TigreGotico/tugaphone/pull/22) ([JarbasAl](https://github.com/JarbasAl))

## [0.2.0a1](https://github.com/TigreGotico/tugaphone/tree/0.2.0a1) (2026-02-06)
## [0.2.2a1](https://github.com/TigreGotico/tugaphone/tree/0.2.2a1) (2026-02-25)

[Full Changelog](https://github.com/TigreGotico/tugaphone/compare/0.1.0a1...0.2.0a1)
[Full Changelog](https://github.com/TigreGotico/tugaphone/compare/0.2.1...0.2.2a1)

**Merged pull requests:**

- feat: regional accent transformations [\#1](https://github.com/TigreGotico/tugaphone/pull/1) ([JarbasAl](https://github.com/JarbasAl))

## [0.1.0a1](https://github.com/TigreGotico/tugaphone/tree/0.1.0a1) (2026-02-06)

[Full Changelog](https://github.com/TigreGotico/tugaphone/compare/0.0.2...0.1.0a1)

**Merged pull requests:**

- feat: new phonemizer + postag backends [\#4](https://github.com/TigreGotico/tugaphone/pull/4) ([JarbasAl](https://github.com/JarbasAl))
- refactor: upstream some logic to dependencies [\#16](https://github.com/TigreGotico/tugaphone/pull/16) ([JarbasAl](https://github.com/JarbasAl))



26 changes: 8 additions & 18 deletions README.md
@@ -37,6 +37,14 @@ pip install tugaphone

## 🧰 Usage

### Companion libraries

The following libraries are dependencies of tugaphone and may be useful on their own:

- [Tugalex](https://github.com/TigreGotico/tugalex) - Lexicon of words and exceptions
- [TugaTagger](https://github.com/TigreGotico/tugatagger) - Portuguese text POS tagger
- [silabificador](https://github.com/TigreGotico/silabificador) - Portuguese text syllabification

### Basic Phonemization

```python
@@ -100,24 +108,6 @@ print(normalize_numbers("897654356789098", "pt-PT")) # long-scale (biliões)
print(normalize_numbers("897654356789098", "pt-BR")) # short-scale (trilhões)
```

### Syllabification

```python
from tugaphone.syl import syllabify

words = ["casa", "Brasil", "extraordinário", "português"]

for word in words:
    syllables = syllabify(word)
    print(f"{word} → {'.'.join(syllables)}")

# Output:
# casa → ca.sa
# Brasil → bra.sil
# extraordinário → ex.tra.or.di.ná.rio
# português → por.tu.guês
```

### Advanced: Tokenization and Features

```python
121 changes: 121 additions & 0 deletions docs/advanced.md
@@ -0,0 +1,121 @@
# Advanced recipes

Once the basic `phonemize_sentence` loop is clear, these are the knobs worth
knowing.

## POS engines and homographs

Portuguese homographs change pronunciation by part of speech. `tugaphone` tags
the sentence first and feeds the tags into transcription, so the engine you pick
affects accuracy:

```python
from tugaphone import TugaPhonemizer

ph = TugaPhonemizer(postag_engine="spacy") # most accurate, needs pt_core_news_lg
ph.phonemize_sentence("Vou para casa.") # 'para' as preposition
ph.phonemize_sentence("Ele para o carro.") # 'para' as verb
```

Engine options, from heaviest to lightest:

| Engine | Needs | Notes |
|--------|-------|-------|
| `spacy` | `spacy` + `pt_core_news_lg` | Most accurate. |
| `brill` | `brill-postaggers` | Lighter, faster; installed via `tugatagger[brill]`. |
| `lexicon` | nothing extra | Built-in lookup, limited coverage. |
| `dummy` | nothing | Rule-based fallback, no dependencies. |
| `auto` | n/a | Default. Falls back through whichever of the engines above are installed. |

If you only need deterministic output with no optional dependencies, construct
with `postag_engine="dummy"`.
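
To make the `auto` row concrete, here is a minimal sketch of how a fall-through selection could work. It is not tugatagger's actual logic: `pick_engine`, the preference order, and the probed module names (`spacy`, `brill_postaggers`) are assumptions made for illustration.

```python
from importlib.util import find_spec

# Hypothetical preference order and import names; the real tagger may differ.
# None means "needs nothing extra", so that entry is always available.
CANDIDATES = [
    ("spacy", "spacy"),
    ("brill", "brill_postaggers"),
    ("lexicon", None),
]


def _has_module(name):
    """Return True if a top-level module is importable, without importing it."""
    return find_spec(name) is not None


def pick_engine(is_installed=_has_module):
    """Return the first engine whose requirement is satisfied."""
    for engine, module in CANDIDATES:
        if module is None or is_installed(module):
            return engine
    return "dummy"  # not reached with the list above; kept as a last resort


print(pick_engine())
```

Injecting `is_installed` keeps the selection testable on a machine where none of the optional packages are present.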

## Regional accents

On top of the five dialect codes, `tugaphone.regional` ships sub-regional accent
presets as `RegionalTransforms`. Pass one through `regional_dialect`; it is
applied on top of the `lang` dialect:

```python
from tugaphone import TugaPhonemizer
from tugaphone.regional import (PortoDialect, MinhoDialect, BragaDialect,
                                TrasMontanoDialect, FafeDialect)

ph = TugaPhonemizer()
sentence = "a gente sente o que sabe"
for name, accent in [("porto", PortoDialect), ("minho", MinhoDialect),
                     ("braga", BragaDialect), ("trasmontano", TrasMontanoDialect),
                     ("fafe", FafeDialect)]:
    print(name, "→", ph.phonemize_sentence(sentence, "pt-PT", regional_dialect=accent))
```

| Preset | Signature features |
|--------|--------------------|
| `CoimbraDialect` | Diphthong retention only (neutral baseline). |
| `MinhoDialect` | Vowel-centralization resistance, open vowels, alveolar rhotic. |
| `BragaDialect` | Palatal epenthesis (`abelha` → `abeilha`) on top of northern rules. |
| `FamalicaoDialect` | Conservative `o`-nasal retention (`Famalicão` → `Famalicoum`). |
| `TrasMontanoDialect` | `ch` affrication, s-voicing, final nasal denasalization. |
| `PortoDialect` | Rising `o` diphthong (`Porto` → `Puorto`). |
| `FafeDialect` | Nasal diphthongization of `e` (`gente` → `geinte`). |

These are explicitly experimental — real variation is messier than any rule set.

### Serializing an accent

`RegionalTransforms` round-trips through a plain dict, so an accent config can
live in JSON or YAML:

```python
from tugaphone.regional import PortoDialect, RegionalTransforms

cfg = PortoDialect.as_dict
# {'morpheme_rules': [], 'ipa_rules': ['rising_diphthong_o', ...]}

clone = RegionalTransforms.from_dict(cfg)
[r.__name__ for r in clone.ipa_rules]
```

`from_dict` raises `ValueError` on an unknown IPA rule name. Only rules listed in
`tugaphone.regional.RULE_MAP` survive the round-trip; accents that use other rule
functions serialize a subset of their behaviour.
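
Since the serialized form is a plain dict, persisting it needs nothing beyond the standard library. The sketch below uses a stand-in dict rather than importing `tugaphone`; the file name is hypothetical, and only the one rule name shown in the example above is included.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for PortoDialect.as_dict. The documented example elides the rest of
# the rule list, so only the first name is reproduced here.
cfg = {"morpheme_rules": [], "ipa_rules": ["rising_diphthong_o"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "porto.json"  # hypothetical file name
    path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored == cfg)
```

`restored` has the same shape that `RegionalTransforms.from_dict` is documented to accept.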

## Number normalization

`normalize_numbers` spells digits out before transcription and is independently
useful for any TTS front-end:

```python
from tugaphone.number_utils import normalize_numbers

normalize_numbers("vou comprar 1 casa") # 'vou comprar uma casa' (feminine)
normalize_numbers("vou adotar 1 cão") # 'vou adotar um cão' (masculine)
normalize_numbers("897654356789098", "pt-PT") # long scale (biliões)
normalize_numbers("897654356789098", "pt-BR") # short scale (trilhões)
```

Gender is inferred from preceding articles (`a`, `as`, `da`, `das`) and from the
shape of the following noun (`-a`, `-dade`, `-agem` endings lean feminine). Pass
`strict=False` to leave unparseable tokens in place instead of raising.
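
The gender heuristic described above can be approximated in a few lines. This is a simplified sketch, not the library's implementation: `guess_gender` is a hypothetical name, and the real logic presumably handles plurals and exceptions that this toy version ignores.

```python
# Cues taken from the description above; everything else is an assumption.
FEMININE_ARTICLES = {"a", "as", "da", "das"}
FEMININE_ENDINGS = ("a", "dade", "agem")


def guess_gender(prev_word=None, next_word=None):
    """Return 'feminine' or 'masculine' from the neighbouring words."""
    if prev_word and prev_word.lower() in FEMININE_ARTICLES:
        return "feminine"
    if next_word and next_word.lower().endswith(FEMININE_ENDINGS):
        return "feminine"
    return "masculine"  # default when no feminine cue is found


print(guess_gender(next_word="casa"))    # feminine
print(guess_gender(next_word="cão"))     # masculine
print(guess_gender(next_word="viagem"))  # feminine
```

A suffix check like this misfires on masculine nouns ending in `-a` (for example `dia`), which is one reason such a heuristic can only "lean" one way.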

## Integration with sibling libraries

`tugaphone` composes three TigreGotico Portuguese NLP libraries; each is usable
on its own:

- [`tugalex`](https://github.com/TigreGotico/tugalex) — the phonetic lexicon
(`LEXICON` in `tugaphone.dialects`). `LEXICON.get_ipa_map(region=...)` returns
the per-region exception table.
- [`tugatagger`](https://github.com/TigreGotico/tugatagger) — the POS tagger
behind `postag_engine`.
- [`silabificador`](https://github.com/TigreGotico/silabificador) — the
syllabifier behind `WordToken.syllables`.

A TTS front-end typically wires `tugaphone` as the G2P stage: normalize text,
phonemize per target dialect, hand the IPA string to the acoustic model.
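
The wiring just described might look like the sketch below. `g2p_stage` and the two stub functions are hypothetical; in a real pipeline the callables would be `normalize_numbers` and a `TugaPhonemizer` instance's `phonemize_sentence`, both of which take the text followed by the dialect code.

```python
from typing import Callable, Dict, Iterable


def g2p_stage(text: str,
              langs: Iterable[str],
              normalize: Callable[[str, str], str],
              phonemize: Callable[[str, str], str]) -> Dict[str, str]:
    """Normalize, then phonemize, once per target dialect; returns {lang: ipa}."""
    return {lang: phonemize(normalize(text, lang), lang) for lang in langs}


# Stubs so the sketch runs without tugaphone or any model installed.
def fake_normalize(text, lang):
    return text.replace("1", "um")


def fake_phonemize(sentence, lang):
    return f"<{lang}> {sentence}"


result = g2p_stage("tenho 1 gato", ["pt-PT", "pt-BR"],
                   fake_normalize, fake_phonemize)
print(result)
```

Passing the stages in as callables keeps the acoustic model side independent of which normalizer or phonemizer is used.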

## Where next

- [api.md](api.md) — full signatures
- [tokenizer.md](tokenizer.md) — inspect syllables, stress and graphemes directly
- [quickstart.md](quickstart.md) — the basics
164 changes: 164 additions & 0 deletions docs/api.md
@@ -0,0 +1,164 @@
# API Reference

Every public symbol, with the signatures and return shapes as they exist in the
source.

## `tugaphone.TugaPhonemizer`

The phonemizer entry class.

```python
TugaPhonemizer(postag_engine="auto", postag_model="pt_core_news_lg")
```

| Argument | Meaning |
|----------|---------|
| `postag_engine` | POS tagging backend passed to `TugaTagger`: `"auto"`, `"spacy"`, `"brill"`, `"lexicon"`, `"dummy"`. |
| `postag_model` | Model identifier for engines that take one (e.g. the spaCy model name). |

Construction builds the `TugaTagger` and warms the lexicon so the first
transcription is fast.

### `phonemize_sentence`

```python
phonemize_sentence(sentence: str,
                   lang: str = "pt-PT",
                   regional_dialect: Optional[RegionalTransforms] = None) -> str
```

Transcribes `sentence` to IPA for the target dialect. Returns a space-separated
phoneme string — one token per word, with `ˈ` for primary stress and `·` for
syllable boundaries; punctuation tokens are preserved.
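
Because the return value is a flat string, downstream code often needs to split it back into words and syllables. The helper below is a hypothetical sketch (`split_ipa` is not part of `tugaphone`) that relies only on the two markers described above.

```python
STRESS = "ˈ"    # primary stress marker
BOUNDARY = "·"  # syllable boundary marker


def split_ipa(ipa):
    """Split a phoneme string into per-word syllable lists plus stress index."""
    words = []
    for token in ipa.split():
        syllables = token.split(BOUNDARY)
        # Index of the first syllable carrying the stress mark, if any.
        stressed = next((i for i, s in enumerate(syllables) if STRESS in s), None)
        words.append({
            "syllables": [s.replace(STRESS, "") for s in syllables],
            "stressed": stressed,
        })
    return words


parsed = split_ipa("ˈu gˈa·tʊ ˈdɔh·me")
print(parsed[1])  # {'syllables': ['ga', 'tʊ'], 'stressed': 0}
```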

`lang` is one of `pt-PT`, `pt-BR`, `pt-AO`, `pt-MZ`, `pt-TL`; any other value
falls back to European Portuguese.

When `regional_dialect` is given, each word is first run through the preset's
morpheme rules, then transcribed, and finally run through its IPA rules. See
[`RegionalTransforms`](#tugaphoneregionalregionaltransforms).

```python
ph = TugaPhonemizer()
ph.phonemize_sentence("O gato dorme.", "pt-BR") # 'ˈu gˈa·tʊ ˈdɔh·me'
```
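
The three-stage order used with `regional_dialect` (morpheme rules, transcription, IPA rules) can be illustrated with toy functions. Everything in this sketch is a stand-in: `toy_g2p` just passes letters through, and the two rules are invented for the example, loosely modelled on the documented `abelha` → `abeilha` epenthesis.

```python
def toy_g2p(word):
    """Stand-in for the real transcriber: letters pass through unchanged."""
    return word


def palatal_epenthesis(word):
    """Toy morpheme rule: rewrites the spelling before transcription."""
    return word.replace("elha", "eilha")


def mark_palatal_lateral(word, phonemes):
    """Toy IPA rule: rewrites the phoneme string after transcription."""
    return phonemes.replace("lh", "ʎ")


def transcribe(word, morpheme_rules, ipa_rules):
    for rule in morpheme_rules:   # stage 1: spelling-level rewrites
        word = rule(word)
    phonemes = toy_g2p(word)      # stage 2: grapheme-to-phoneme
    for rule in ipa_rules:        # stage 3: phoneme-level rewrites
        phonemes = rule(word, phonemes)
    return phonemes


result = transcribe("abelha", [palatal_epenthesis], [mark_palatal_lateral])
print(result)  # abeiʎa
```

The ordering matters: a morpheme rule changes what the transcriber sees, while an IPA rule only sees its output.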

### `get_dialect_inventory` (staticmethod)

```python
TugaPhonemizer.get_dialect_inventory(lang: str = "pt-PT") -> DialectInventory
```

Maps a dialect code to its `DialectInventory` instance (`EuropeanPortuguese`,
`BrazilianPortuguese`, `AngolanPortuguese`, `MozambicanPortuguese`,
`TimoresePortuguese`).
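
The fallback behaviour documented for `lang` amounts to a dictionary lookup with a default. The sketch below uses class names as plain strings so that it runs without `tugaphone`; the code-to-class pairing is inferred from the order in which the two lists appear in this document.

```python
# Pairing inferred from the documented dialect codes and inventory names.
INVENTORY_NAMES = {
    "pt-PT": "EuropeanPortuguese",
    "pt-BR": "BrazilianPortuguese",
    "pt-AO": "AngolanPortuguese",
    "pt-MZ": "MozambicanPortuguese",
    "pt-TL": "TimoresePortuguese",
}


def inventory_name(lang="pt-PT"):
    """Return the inventory name for a code, defaulting to European Portuguese."""
    return INVENTORY_NAMES.get(lang, INVENTORY_NAMES["pt-PT"])


print(inventory_name("pt-BR"))  # BrazilianPortuguese
print(inventory_name("pt-XX"))  # EuropeanPortuguese
```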

## `tugaphone.number_utils`

### `normalize_numbers`

```python
normalize_numbers(text: str, lang: str = "pt-PT", strict: bool = True) -> str
```

Replaces numeric tokens in a sentence with their Portuguese written form,
inferring gender and ordinality from the surrounding words. `pt-PT` uses the
long scale (biliões), `pt-BR` the short scale (trilhões). With `strict=False`,
tokens that fail to format are left untouched instead of raising.

```python
from tugaphone.number_utils import normalize_numbers
normalize_numbers("vou comprar 1 casa") # 'vou comprar uma casa'
normalize_numbers("vou adotar 2 cães") # 'vou adotar dois cães'
normalize_numbers("1.5e10") # 'um vírgula cinco vezes dez elevado a dez'
```
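
The long-scale versus short-scale difference is easy to see on the 15-digit number used elsewhere in these docs. The sketch below is independent of `tugaphone`; `leading_group` and the two tables are illustrative only and cover just a few named powers of ten.

```python
# Named powers of ten. The long scale (pt-PT) names every sixth power,
# the short scale (pt-BR) every third.
LONG_SCALE = {6: "milhão", 12: "bilião", 18: "trilião"}
SHORT_SCALE = {6: "milhão", 9: "bilhão", 12: "trilhão"}


def leading_group(n, scale):
    """Return (count, name) for the largest named power of ten not above n.

    Assumes n >= 10**6; smaller values have no named group in these tables.
    """
    exponent = max(e for e in scale if 10 ** e <= n)
    return n // 10 ** exponent, scale[exponent]


n = 897654356789098
print(leading_group(n, LONG_SCALE))   # (897, 'bilião')
print(leading_group(n, SHORT_SCALE))  # (897, 'trilhão')
```

The same 10^12 group is a "bilião" in European Portuguese and a "trilhão" in Brazilian Portuguese, which is why the `lang` argument changes the output.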

### `NumberParser`

A classmethod-based helper underneath `normalize_numbers`. Useful when you need
finer control or want to interrogate a single token.

| Method | Returns |
|--------|---------|
| `pronounce_number_word(word, prev_word=None, next_word=None, gender=None, as_ordinal=None, is_brazilian=False)` | Spelled-out form of one numeric token. |
| `to_int(word)` / `is_int(word)` | Integer value (ordinal markers stripped) / membership test. |
| `to_float(word)` / `is_float(word)` | Float value / membership test. |
| `is_scientific_notation(word)` | `True` for forms like `"1.5e10"`. |
| `pronounce_scientific(word, is_brazilian=False)` | Spoken form of scientific notation. |
| `is_ordinal(word, next_word=None)` | Detects `º`/`ª` markers, attached or separate. |
| `get_number_gender(word, prev_word=None, next_word=None)` | `"feminine"` or `"masculine"`. |

```python
from tugaphone.number_utils import NumberParser
NumberParser.pronounce_number_word("19", is_brazilian=True) # 'dezenove'
NumberParser.pronounce_number_word("19", is_brazilian=False) # 'dezanove'
NumberParser.pronounce_number_word("1", next_word="º") # 'primeiro' (ordinal)
NumberParser.get_number_gender("1", next_word="casa") # 'feminine'
```
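
For readers curious what a check like `is_scientific_notation` involves, a regular expression is one plausible approach. The pattern and the `split_scientific` helper below are assumptions for illustration; the forms `NumberParser` actually accepts may differ.

```python
import re

# Hypothetical pattern: optional sign, digits, optional fraction,
# then e/E and a signed integer exponent.
SCI_RE = re.compile(r"^([+-]?\d+(?:\.\d+)?)[eE]([+-]?\d+)$")


def split_scientific(word):
    """Return (mantissa, exponent) as strings, or None if not scientific."""
    match = SCI_RE.match(word)
    return match.groups() if match else None


print(split_scientific("1.5e10"))  # ('1.5', '10')
print(split_scientific("150"))     # None
```

Splitting into mantissa and exponent mirrors the spoken form, which reads the mantissa, then "vezes dez elevado a", then the exponent.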

## `tugaphone.regional.RegionalTransforms`

A serializable dataclass holding the rules for a sub-regional accent.

```python
@dataclass
class RegionalTransforms:
    morpheme_rules: List[MorphemeTransform] = []  # applied to the word before G2P
    ipa_rules: List[IPATransform] = []            # applied to the IPA after G2P
```

| Member | Behaviour |
|--------|-----------|
| `apply_morpheme(word, postag="NOUN")` | Runs every morpheme rule in order, returns the rewritten word. |
| `apply_ipa(word, phonemes, postag="NOUN")` | Runs every IPA rule in order, returns the rewritten phoneme string. |
| `as_dict` (property) | Serializes the rule lists to rule-name strings. |
| `from_dict(data)` (staticmethod) | Rebuilds an instance from `{"ipa_rules": [...], "morpheme_rules": [...]}`; raises `ValueError` on an unknown IPA rule name. |

```python
from tugaphone.regional import PortoDialect, RegionalTransforms

cfg = PortoDialect.as_dict
clone = RegionalTransforms.from_dict(cfg)
[r.__name__ for r in clone.ipa_rules] # ['rising_diphthong_o', ...]
```

### Preset accents

Importable from `tugaphone.regional`: `CoimbraDialect`, `MinhoDialect`,
`BragaDialect`, `FamalicaoDialect`, `TrasMontanoDialect`, `PortoDialect`,
`FafeDialect`. Each is a ready-built `RegionalTransforms`. Pass any of them to
`phonemize_sentence(..., regional_dialect=...)`.

Only the IPA rules listed in `RULE_MAP` round-trip through `as_dict`/`from_dict`;
accents built from other rule functions serialize a subset.
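
Why unregistered rules drop out of the round-trip is easiest to see with a name-based registry. The sketch below is a toy model, not the library's code: `ToyTransforms`, `TOY_RULE_MAP`, and both rule bodies are hypothetical, and only the `as_dict`/`from_dict` behaviour described above is imitated.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def rising_diphthong_o(word, phonemes):
    return phonemes.replace("o", "wo")  # toy body, for illustration only


def unregistered_rule(word, phonemes):
    return phonemes  # a custom rule that is not in the registry


# Toy stand-in for a registry like RULE_MAP: rule name -> function.
TOY_RULE_MAP = {"rising_diphthong_o": rising_diphthong_o}


@dataclass
class ToyTransforms:
    ipa_rules: List[Callable[[str, str], str]] = field(default_factory=list)

    @property
    def as_dict(self):
        # Only rules found in the registry can be written out by name.
        known = set(TOY_RULE_MAP.values())
        return {"ipa_rules": [r.__name__ for r in self.ipa_rules if r in known]}

    @staticmethod
    def from_dict(data):
        rules = []
        for name in data.get("ipa_rules", []):
            if name not in TOY_RULE_MAP:
                raise ValueError(f"unknown IPA rule: {name}")
            rules.append(TOY_RULE_MAP[name])
        return ToyTransforms(ipa_rules=rules)


accent = ToyTransforms([rising_diphthong_o, unregistered_rule])
clone = ToyTransforms.from_dict(accent.as_dict)
print([r.__name__ for r in clone.ipa_rules])  # ['rising_diphthong_o']
```

The clone silently loses `unregistered_rule`, which is the "subset" behaviour to watch for when shipping a custom accent as configuration.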

## `tugaphone.dialects`

| Symbol | Role |
|--------|------|
| `DialectInventory` | Base class: phoneme maps, character sets, stress/punctuation tokens. `dialect_code` attribute carries the tag. |
| `EuropeanPortuguese`, `BrazilianPortuguese`, `AngolanPortuguese`, `MozambicanPortuguese`, `TimoresePortuguese` | The five dialect inventories. |
| `LisbonPortuguese`, `RioJaneiroPortuguese`, `SaoPauloPortuguese` | City-specific inventories layered on the base dialects. |
| `LEXICON` | Module-level `TugaLexicon()` instance; `LEXICON.get_ipa_map(region=...)` returns the per-region exception map. |

You rarely instantiate these directly, since `TugaPhonemizer` does it for you,
but an instance of one is what the tokenizer accepts as its `dialect` argument.

## `tugaphone.tokenizer`

The hierarchical model. See [tokenizer.md](tokenizer.md) for the full walkthrough;
the public surface is:

| Symbol | Role |
|--------|------|
| `Sentence(surface, words=[], dialect=EuropeanPortuguese())` | Top-level container; `.ipa`, `.words`, `.n_words`, `.features`. |
| `Sentence.from_postagged(surface, tags, dialect=None)` | Build from `(token, pos)` pairs (the path `TugaPhonemizer` uses). |
| `WordToken` | `.surface`, `.syllables`, `.graphemes`, `.stressed_syllable_idx`, `.ipa`, `.features`. |
| `GraphemeToken` | `.surface`, `.ipa`, `.is_diphthong`, `.is_nasal`, `.is_digraph`, `.features`, and more predicates. |
| `CharToken` | character-level predicates (`.is_vowel`, `.is_consonant`, `.ipa`, ...). |

## Where next

- [quickstart.md](quickstart.md) — install and first call
- [advanced.md](advanced.md) — recipes for accents, POS engines, numbers
- [tokenizer.md](tokenizer.md) — the token tree and feature extraction