Looking for guidance on speaker stability / steerability in cloned TTS #26

Description

@EgilSandfeld

First off, thank you for open-sourcing TADA.
We’ve been evaluating TADA for a race-engineer-style TTS use case where the main requirement is one stable cloned voice for short operational lines like:

  • Listening.
  • Good afternoon, Egil.
  • Yellow ahead.
  • Yeah, no... that was actually quite good.

Main issue

We were able to improve quality a lot, but we still see intermittent failures like:

  • wrong-voice drift (different male voice, occasionally high-pitched/feminine lead-in)
  • whispery / too-low-volume lead-ins
  • occasional truncation of the last word(s)
  • occasional background/static-like artifacts

The problem is no longer “always bad”; it is now mostly good, but not deterministic enough.
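To make the "whispery / too-low-volume lead-in" symptom measurable rather than judged by ear, a check along the following lines could flag suspect outputs. This is only a minimal sketch: it assumes the audio is available as a list of float samples in [-1, 1] with leading silence already trimmed, and the function names, the 200 ms window, and the 0.3 ratio threshold are all illustrative assumptions, not part of TADA.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def lead_in_ratio(samples, sample_rate, lead_in_ms=200):
    """RMS of the first `lead_in_ms` divided by RMS of the remaining audio.

    Returns 0.0 when the remainder is empty or silent, so that such
    outputs are treated as weak by the caller.
    """
    n = int(sample_rate * lead_in_ms / 1000)
    head, tail = samples[:n], samples[n:]
    tail_rms = rms(tail)
    if tail_rms == 0.0:
        return 0.0
    return rms(head) / tail_rms


def has_weak_lead_in(samples, sample_rate, lead_in_ms=200, min_ratio=0.3):
    """True if the lead-in is much quieter than the rest of the line.

    `min_ratio` is an assumed threshold and would need tuning per voice.
    """
    return lead_in_ratio(samples, sample_rate, lead_in_ms) < min_ratio
```

A line that starts at one tenth of its later level would be flagged, while an evenly loud line would not. Lines that legitimately begin softly would need a lower threshold.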

What helped most

The biggest improvement came from better prompt audio, which mattered more than most inference options.
We tested several neutral prompt samples:

  • a more constrained, technical neutral prompt worked best
  • more conversational prompt samples sometimes sounded nicer on individual lines, but reduced speaker identity stability overall

So for us, the best prompt was the one that constrained the model most, not the one that sounded most conversational in isolation.

Best setup we found so far

On HumeAI/tada-1b, our best result was approximately:

```python
InferenceOptions(
    noise_temperature=0.75,
    acoustic_cfg_scale=2.2,
    duration_cfg_scale=0.8,
    num_flow_matching_steps=8,
    num_acoustic_candidates=2,
    scorer="likelihood",
)
```

We also:

  • pre-encoded prompts at startup
  • unloaded the encoder afterward
  • used torch.bfloat16

This improved both latency and output quality, but did not fully eliminate speaker drift.

What did not clearly solve it

  • increasing model size to tada-3b-ml
  • larger candidate counts
  • more expressive / more conversational prompt samples
  • higher flow matching steps alone

The 3B model was heavier/slower and still showed the same class of failures.
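
Because the failures are intermittent, comparisons like the ones above are easy to misjudge from a handful of listens. One way to make them less anecdotal would be a small harness that repeats each line many times per configuration and counts failures. The sketch below is hypothetical: `synthesize` stands in for whatever function wraps the model call, and `checks` for whatever automatic detectors are available (drift, weak lead-in, truncation). Neither is a TADA API.

```python
def failure_rate(synthesize, checks, lines, runs_per_line=20):
    """Estimate how often a configuration produces a bad generation.

    synthesize: hypothetical callable (line, run_index) -> audio object.
    checks: dict mapping a failure name to a callable(audio) that
            returns True when that failure is detected.
    Returns (fraction of generations with at least one failure,
             dict of per-failure counts).
    """
    total = 0
    failed = 0
    counts = {name: 0 for name in checks}
    for line in lines:
        for run in range(runs_per_line):
            audio = synthesize(line, run)
            any_failure = False
            for name, check in checks.items():
                if check(audio):
                    counts[name] += 1
                    any_failure = True
            total += 1
            if any_failure:
                failed += 1
    rate = failed / total if total else 0.0
    return rate, counts
```

Running the same set of lines through two configurations and comparing the returned rates gives a rough basis for saying whether a setting helped, provided the detectors themselves are trustworthy.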

What we are asking

Do the maintainers or the community have recommendations for improving speaker stability / steerability specifically?

In particular:

  1. Are there best practices for prompt clips that maximize identity consistency?
  2. Should we be leaning harder on spkr_verification, and if so what spkr_verification_weight range is worth trying?
  3. Are whispery / under-energized lead-ins on conversational phrases a known issue?
  4. Is there any known way to reduce run-to-run voice drift further?
  5. Is there anything about the prompt encoding/alignment step that is especially important beyond clean audio + exact transcript?
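
To make questions 2 and 4 concrete, the kind of external gate we have in mind looks roughly like the sketch below: embed each candidate with some speaker-embedding model, compare it to the prompt's embedding, and keep the closest candidate only if it clears a threshold. This is an assumption-laden illustration, not a description of how TADA's own spkr_verification scorer works. The embeddings are plain lists of floats from an unspecified external model, and `pick_candidate` and the 0.7 threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_candidate(reference, candidates, min_similarity=0.7):
    """Choose the candidate embedding closest to the reference speaker.

    Returns (index, similarity). The index is None when there are no
    candidates or when even the best one falls below `min_similarity`,
    which the caller can treat as "regenerate".
    """
    if not candidates:
        return None, 0.0
    scored = [(cosine_similarity(reference, c), i) for i, c in enumerate(candidates)]
    best_similarity, best_index = max(scored)
    if best_similarity < min_similarity:
        return None, best_similarity
    return best_index, best_similarity
```

With a gate like this, a wrong-voice generation would be rejected and retried instead of played, at the cost of extra latency on the retried lines.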

Short summary

TADA can sound excellent, and prompt quality matters a lot, but the remaining blocker for us is consistent voice identity across runs, not just raw fidelity.

Any guidance would be appreciated.
