Add Kannada (kn-IN) G2P support for TTS #15582
jasro23 wants to merge 2 commits into NVIDIA-NeMo:main
@jasro23 Can you also add support for the Telugu language?
@annagirimokshith I am not too familiar with Telugu.
Pull request overview
Adds Kannada (kn-IN) grapheme-to-phoneme (G2P) support for NeMo TTS, including a new Kannada IPA G2P implementation, locale character sets/punctuation, a pronunciation dictionary, and unit tests.
Changes:
- Introduce `KannadaG2p` with hybrid dictionary + rule-based IPA conversion.
- Add `kn-IN` grapheme and IPA character sets plus locale punctuation handling.
- Add a Kannada pronunciation lexicon and basic unit tests validating G2P outputs.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| nemo/collections/tts/g2p/models/kn_in_ipa.py | New Kannada G2P implementation (dictionary + rule-based). |
| nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py | Adds kn-IN locale support, including grapheme/IPA sets and punctuation. |
| scripts/tts_dataset_files/kn_IN/kn_IN_nv260318.dict | New Kannada pronunciation dictionary (~4.3K entries). |
| tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py | Adds unit tests for Kannada G2P behavior. |
    # Handle digits (pass through or convert)
    if char.isdigit():
        phonemes.append(char)
        i += 1
        continue

    # Handle Kannada digits
    kannada_digits = '೦೧೨೩೪೫೬೭೮೯'
    if char in kannada_digits:
        # Convert to Arabic numeral
        phonemes.append(str(kannada_digits.index(char)))
        i += 1
        continue
Kannada digits (೦-೯) will never reach the Kannada-digit conversion block because str.isdigit() is true for Kannada digits, so the earlier if char.isdigit(): phonemes.append(char) branch consumes them. This causes Kannada digits to be returned unchanged instead of being mapped to Arabic numerals as intended. Consider checking for Kannada digits before isdigit(), or restricting the isdigit() branch to ASCII digits only (e.g., char.isascii() and char.isdigit()).
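The ordering problem described in this comment can be reproduced outside NeMo. The following is a minimal standalone sketch, not the PR's code: `normalize_digit` and `KANNADA_DIGITS` are hypothetical names introduced here for illustration. It checks Kannada digits first and restricts the pass-through branch to ASCII digits, which is one of the two fixes the comment proposes.

```python
# Hypothetical helper for illustration; not part of the PR or of NeMo.
KANNADA_DIGITS = "೦೧೨೩೪೫೬೭೮೯"


def normalize_digit(char):
    """Return the ASCII digit for a Kannada or ASCII digit, else None."""
    # Check Kannada digits first: str.isdigit() is also True for them,
    # so a plain isdigit() test placed earlier would shadow this branch.
    if char in KANNADA_DIGITS:
        return str(KANNADA_DIGITS.index(char))
    # Only pass through ASCII digits unchanged.
    if char.isascii() and char.isdigit():
        return char
    return None


# str.isdigit() accepts every Kannada digit, which is the root of the bug.
print(all(ch.isdigit() for ch in KANNADA_DIGITS))
print(normalize_digit(KANNADA_DIGITS[3]), normalize_digit("7"))
```

Digits from other scripts fall through to `None` in this sketch; whether they should be mapped or passed through is a separate design decision for the PR.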
    phoneme_dict = (
        self._parse_phoneme_dict(phoneme_dict, phoneme_prefix)
        if isinstance(phoneme_dict, (str, pathlib.Path))
        else phoneme_dict
    )
When phoneme_dict is passed as a Python dict, entries like {word: ["namaskaːɾa"]} are kept as whole-string tokens, but the rule-based path emits per-character IPA tokens. This makes outputs inconsistent across dictionary vs OOV words and can break downstream tokenizers that expect single-symbol IPA tokens (e.g., 'ː' separate from 'a'). Consider normalizing dict-provided pronunciations by splitting each pronunciation string into a list of IPA symbols/characters (and applying phoneme_prefix) to match _parse_phoneme_dict behavior.
Suggested change:

    - phoneme_dict = (
    -     self._parse_phoneme_dict(phoneme_dict, phoneme_prefix)
    -     if isinstance(phoneme_dict, (str, pathlib.Path))
    -     else phoneme_dict
    - )
    + if isinstance(phoneme_dict, (str, pathlib.Path)):
    +     phoneme_dict = self._parse_phoneme_dict(phoneme_dict, phoneme_prefix)
    + else:
    +     normalized_phoneme_dict = {}
    +     for word, prons in phoneme_dict.items():
    +         normalized_prons = []
    +         for pron in prons:
    +             if isinstance(pron, str):
    +                 normalized_prons.extend([phoneme_prefix + symbol for symbol in pron])
    +             else:
    +                 normalized_prons.extend(pron)
    +         normalized_phoneme_dict[word] = normalized_prons
    +     phoneme_dict = normalized_phoneme_dict
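The suggestion above flattens every pronunciation of a word into one list. If each word is meant to map to a list of alternative pronunciations (an assumption; the expected shape is not shown in this thread), a variant that keeps the alternatives separate might look like the sketch below. `normalize_phoneme_dict` is a hypothetical standalone function, not a NeMo API.

```python
# Hypothetical standalone sketch; not part of the PR or of NeMo.
def normalize_phoneme_dict(phoneme_dict, phoneme_prefix=""):
    """Split string pronunciations into per-symbol lists, one list per pronunciation."""
    normalized = {}
    for word, prons in phoneme_dict.items():
        normalized[word] = [
            # A string pronunciation is split per code point and prefixed;
            # an already-tokenized pronunciation is copied unchanged.
            [phoneme_prefix + symbol for symbol in pron] if isinstance(pron, str) else list(pron)
            for pron in prons
        ]
    return normalized


example = normalize_phoneme_dict({"word": ["kaː", ["k", "a"]]})
print(example)
```

Splitting per code point separates the length mark 'ː' from the preceding vowel, which matches the single-symbol tokens the comment describes. It would also separate combining diacritics from their base symbol, so pronunciations that use them would need a different tokenization.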
    import re
    import unicodedata
    from collections import defaultdict
    from typing import Dict, List, Optional, Union

    from nemo.collections.common.tokenizers.text_to_speech.ipa_lexicon import (
        GRAPHEME_CHARACTER_SETS,
re and GRAPHEME_CHARACTER_SETS are imported but not used in this module. Removing unused imports will avoid lint failures and keep the file clean.
Suggested change:

    - import re
    - import unicodedata
    - from collections import defaultdict
    - from typing import Dict, List, Optional, Union
    - from nemo.collections.common.tokenizers.text_to_speech.ipa_lexicon import (
    -     GRAPHEME_CHARACTER_SETS,
    + import unicodedata
    + from collections import defaultdict
    + from typing import Dict, List, Optional, Union
    + from nemo.collections.common.tokenizers.text_to_speech.ipa_lexicon import (
- Add KannadaG2p class with hybrid dictionary + rule-based IPA conversion
- Add Kannada grapheme and IPA character sets to ipa_lexicon.py
- Add kn-IN locale support with punctuation handling
- Include lexicon with 4264 Kannada words
- Add test script with assertions for validation

The G2P module handles:
- All Kannada vowels, consonants, matras (dependent vowels)
- Virama (halant), anusvara, visarga
- Anusvara place assimilation based on following consonant

Signed-off-by: Jason Roche <jas.tech23@gmail.com>
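Anusvara place assimilation, as named in the commit message, means the anusvara (ಂ) is realized as the nasal that shares its place of articulation with the following consonant. The sketch below illustrates the general idea only and is not taken from the PR: the function name, the table, and the fallback to "m" when no assimilating consonant follows are all assumptions made here.

```python
# Hypothetical illustration of anusvara place assimilation; not the PR's code.
# Each IPA nasal is paired with the Kannada consonant row that triggers it,
# given as Unicode code point ranges in the Kannada block.
_PLACE_RANGES = {
    "ŋ": range(0x0C95, 0x0C9A),  # velar row: ka .. nga
    "ɲ": range(0x0C9A, 0x0C9F),  # palatal row: ca .. nya
    "ɳ": range(0x0C9F, 0x0CA4),  # retroflex row: tta .. nna
    "n": range(0x0CA4, 0x0CA9),  # dental row: ta .. na
    "m": range(0x0CAA, 0x0CAF),  # labial row: pa .. ma
}

NASAL_FOR_CONSONANT = {
    chr(code_point): nasal
    for nasal, code_points in _PLACE_RANGES.items()
    for code_point in code_points
}


def anusvara_to_ipa(next_char, default="m"):
    """Pick the nasal for an anusvara from the consonant that follows it.

    `default` covers word-final position and non-assimilating followers;
    using "m" here is an assumption of this sketch.
    """
    return NASAL_FOR_CONSONANT.get(next_char, default)


# Anusvara before a velar, a dental, and at the end of a word.
print(anusvara_to_ipa("\u0c97"), anusvara_to_ipa("\u0ca4"), anusvara_to_ipa(None))
```

A table keyed on the following consonant keeps the rule a single lookup; the PR's actual rule set may differ, for example in how it treats sibilants or semivowels after the anusvara.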
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
Add Kannada (kn-IN) G2P support for TTS
Collection: [TTS]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information