Ignored punctuation: comma, dash, semi-colon, ellipsis #469
Replies: 1 comment
Hi @ThePowerTool! I'm Dosu and I'm helping the OpenVoice team. OpenVoice doesn't currently support explicit pause or cadence control via punctuation or custom tags. Punctuation marks like commas, dashes, semi-colons, and ellipses are recognized as symbols, but the model only interprets them in a nominal way, without fine-grained timing control. There's no built-in syntax for specifying pause durations (like `,+[ms]` or `...-[ms]`), and the text-processing pipeline doesn't re-interpret punctuation for timing or cadence adjustments. Any pause or cadence effects rely on the model's implicit handling of punctuation, which is limited and not configurable by the user. If you need more precise control, you could try breaking your text into shorter phrases or manually post-processing the audio to insert pauses, but there's no direct way to inject custom pause durations in the current system. Supporting true fine-grained pause control would require architectural changes or model retraining to interpret custom tags or timing cues.
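Until something like that exists, the "post-process the audio" workaround can be sketched as follows. This assumes you have already synthesized each short phrase to a NumPy array at a known sample rate; `join_with_pauses` is a hypothetical helper, not part of OpenVoice's API:

```python
import numpy as np

def join_with_pauses(clips, pause_ms, sample_rate=24000):
    """Concatenate synthesized phrase clips, inserting silence between them.

    clips       -- list of 1-D float32 NumPy arrays (one per phrase)
    pause_ms    -- duration of silence inserted between consecutive clips
    sample_rate -- sample rate the clips were synthesized at
    """
    silence = np.zeros(int(sample_rate * pause_ms / 1000), dtype=np.float32)
    parts = []
    for i, clip in enumerate(clips):
        parts.append(clip)
        if i < len(clips) - 1:      # no trailing silence after the last clip
            parts.append(silence)
    return np.concatenate(parts)
```

Varying `pause_ms` per boundary (e.g. longer after an ellipsis than after a comma) would approximate the cadence control requested below, at the cost of splitting the text by hand.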
One key component in voice cloning (and sounding human) is cadence and pauses. Most languages have punctuation providing a modicum of direction to the speaker. In English it's the comma, dash, semi-colon, and ellipsis.
At first glance these cues seem to be ignored entirely, but that's not quite the case: in some instances there is a nominal difference, though not enough to hide the synthetic nature of the voice.
Is there a trick I am missing to provide the necessary cues?
I would think the obvious answer would be a TTS cue-tag for pause:
```
,+[ms]    ,-[ms]
-+[ms]    --[ms]
;+[ms]    ;-[ms]
...+[ms]  ...-[ms]
```
This would extend the capability to reducing pause time as well as increasing it. It would also provide excellent control, not--[100]only (see what I did there) allowing faster creation of more--[100]human-sounding speech, but also eliminating the need to break text into fragments and manually edit the audio to achieve the same, improved result.
There's no way to sound human without appropriate cadence and pauses. It may take a while for a keen ear to notice, but this greatly reduces the humanity of the resulting speech.
I should also mention that the cues described here would allow deliberately "incorrect" pauses to be inserted, producing an even more human-sounding result.