-

Mainly because English words are broken into characters (letters) rather than going through a G2P (grapheme-to-phoneme) conversion first.
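The difference between the two input pipelines can be sketched in a few lines. The phoneme dictionary below is a tiny hand-made stand-in for a real G2P system (such as `g2p_en`); the entries are illustrative ARPAbet, not pulled from an actual lexicon.

```python
# Two tokenization strategies for TTS text input.
# PHONEME_DICT is a toy stand-in for a real G2P lexicon (illustrative only).

PHONEME_DICT = {
    "tortoise": ["T", "AO1", "R", "T", "AH0", "S"],
    "hare": ["HH", "EH1", "R"],
}

def char_tokenize(word: str) -> list[str]:
    """Character-level input: the model must learn pronunciation on its own."""
    return list(word.lower())

def g2p_tokenize(word: str) -> list[str]:
    """Phoneme-level input: pronunciation is resolved before the model sees it.
    Falls back to characters for out-of-vocabulary words."""
    return PHONEME_DICT.get(word.lower(), char_tokenize(word))

print(char_tokenize("Tortoise"))  # ['t', 'o', 'r', 't', 'o', 'i', 's', 'e']
print(g2p_tokenize("Tortoise"))   # ['T', 'AO1', 'R', 'T', 'AH0', 'S']
```

With character input, nothing tells the model that "oise" in "tortoise" is pronounced unlike "oise" in "noise"; with phoneme input, that ambiguity is resolved before training.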
-

@SWivid You mention this is likely a G2P problem. So Chinese, which already uses pinyin, should be OK? I don't quite understand why more data wouldn't solve this: since the model is deep enough, shouldn't it be able to construct a G2P-like representation internally? I worry that by using an external G2P system, the problem is just moved from the model to the G2P system (for example, if the G2P system cannot distinguish between sake, the alcoholic drink, and sake as in "for goodness' sake").

One more question =) I'm also thinking of Japanese and Korean. If I just use their 'alphabets' (for example, Hiragana and Hangul), I will likely run into the same problems that English faces, right? So is it better to go through G2P for everything? Or does the long-term fix involve something else?
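The homograph worry raised above can be made concrete: a plain word-to-phoneme dictionary has one entry per spelling, so both senses of "sake" collapse to a single pronunciation. The sketch below uses illustrative ARPAbet strings and a hypothetical context heuristic, not any real G2P library's behavior.

```python
# A naive lexicon maps each spelling to exactly one pronunciation.
LEXICON = {"sake": ["S", "EY1", "K"]}  # the "for goodness' sake" reading

def naive_g2p(word):
    """Context-free lookup: every occurrence of a spelling sounds the same."""
    return LEXICON[word.lower()]

def contextual_g2p(word, context):
    """Hypothetical disambiguation: peek at neighboring words before lookup.
    Real systems use POS tagging or learned sense disambiguation instead."""
    if word.lower() == "sake" and any(w in context for w in ("cup", "drink", "rice")):
        return ["S", "AA1", "K", "IY0"]  # the beverage, roughly /sah-kee/
    return naive_g2p(word)

# The drink reading is lost without context...
print(naive_g2p("sake"))                           # ['S', 'EY1', 'K']
# ...but recoverable once the lookup sees surrounding words:
print(contextual_g2p("sake", ["a", "cup", "of"]))  # ['S', 'AA1', 'K', 'IY0']
```

This is the trade-off in the comment: an external G2P removes the burden from the acoustic model but inherits the disambiguation problem, which then has to be solved in the text front-end.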
-
I'm using something along the lines of the suggested workflow in inference-cli.py, as shown here. I've trained it on my own voice clip and then generated speech from the text of "The Hare and the Tortoise" fable. "Tortoise" is consistently pronounced wrong, as you can hear in this audio clip. How would I go about figuring out the cause and improving the model? It would be surprising if the training data didn't include that word. Obviously, I'm not going to fix individual pronunciations one by one; I'm trying to understand the underlying causes and how to approach them.
Maybe this is just a matter of learning to use the fine-tuning CLI? If so, I would appreciate a succinct step-by-step example. Is there an easy way to use datasets other than Emilia, maybe one trained on LibriVox? It's not clear to me how you would go about importing a different dataset and generating from it.