Looking for guidance on speaker stability / steerability in cloned TTS

First off, thank you for open-sourcing TADA.
We’ve been evaluating TADA for a race-engineer-style TTS use case where the main requirement is **one stable cloned voice** for short operational lines like:
- `Listening.`
- `Good afternoon, Egil.`
- `Yellow ahead.`
- `Yeah, no... that was actually quite good.`
## Main issue
We were able to improve quality a lot, but we still see intermittent failures like:
- wrong-voice drift (different male voice, occasionally high-pitched/feminine lead-in)
- whispery / too-low-volume lead-ins
- occasional truncation of the last word(s)
- occasional background/static-like artifacts
The problem is no longer “always bad”; the output is now **mostly good, but not deterministic enough**.
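
To make the "whispery / too-low-volume lead-in" failure measurable rather than something judged by ear, a check along these lines could be run on each generated clip. This is a minimal sketch, not part of TADA: the function names, the 200 ms window, and the 0.5 ratio threshold are all assumptions and would need tuning, since natural speech often starts somewhat quieter than its average level.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def lead_in_ratio(samples, sample_rate, lead_in_ms=200):
    """Ratio of lead-in RMS to whole-clip RMS (well below 1 suggests a weak start)."""
    n = max(1, int(sample_rate * lead_in_ms / 1000))
    overall = rms(samples)
    if overall == 0.0:
        return 0.0  # silent clip: report the lowest possible ratio
    return rms(samples[:n]) / overall

def is_weak_lead_in(samples, sample_rate, lead_in_ms=200, min_ratio=0.5):
    """Flag clips whose opening window is much quieter than the clip overall.

    Both lead_in_ms and min_ratio are placeholder values (assumptions).
    """
    return lead_in_ratio(samples, sample_rate, lead_in_ms) < min_ratio
```

The same ratio, computed on the last window of the clip instead of the first, might give a rough signal for truncated endings, though we have not validated that.
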
## What helped most
The biggest improvement was **better prompt audio**, more than most inference knobs.
We tested several neutral prompt samples:
- a more constrained, technical neutral prompt worked best
- more conversational prompt samples sometimes sounded nicer on individual lines, but reduced speaker identity stability overall
So for us, the best prompt was the one that constrained the model most, not the one that sounded most conversational in isolation.

## Best setup we found so far
On `HumeAI/tada-1b`, our best result was approximately:

```python
InferenceOptions(
    noise_temperature=0.75,
    acoustic_cfg_scale=2.2,
    duration_cfg_scale=0.8,
    num_flow_matching_steps=8,
    num_acoustic_candidates=2,
    scorer="likelihood",
)
```

We also:

- pre-encoded prompts at startup
- unloaded the encoder afterward
- used `torch.bfloat16`

This improved both latency and output quality, but did not fully eliminate speaker drift.
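
For reference, the pre-encode-then-unload pattern looks roughly like the sketch below. It is deliberately generic: `PromptCache` and the `encoder` callable are hypothetical stand-ins, not the TADA API, and the real prompt encoder call will have its own signature.

```python
class PromptCache:
    """Encode every prompt once at startup and keep only the encoded results."""

    def __init__(self, encoder, prompts):
        # `encoder` is any callable (audio_path, transcript) -> encoded prompt.
        # It stands in for the real prompt encoder (assumption).
        # `prompts` maps a prompt name to an (audio_path, transcript) pair.
        self._encoded = {
            name: encoder(path, text) for name, (path, text) in prompts.items()
        }
        # No reference to `encoder` is stored, so the caller can drop its own
        # reference afterwards and let the memory be reclaimed.

    def get(self, name):
        """Return the pre-encoded prompt; raises KeyError for unknown names."""
        return self._encoded[name]
```

With PyTorch on a GPU, unloading would typically mean deleting the encoder object after building the cache and then calling `torch.cuda.empty_cache()`; that part is omitted here to keep the sketch dependency-free.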

## What did not clearly solve it

- increasing model size to `tada-3b-ml`
- larger candidate counts
- more expressive / more conversational prompt samples
- higher flow matching steps alone

The 3B model was heavier/slower and still showed the same class of failures.

## What we are asking
Do maintainers/community have recommendations for improving speaker stability / steerability specifically?

In particular:

1. Are there best practices for prompt clips that maximize identity consistency?
2. Should we be leaning harder on `spkr_verification`, and if so, what `spkr_verification_weight` range is worth trying?
3. Are whispery / under-energized lead-ins on conversational phrases a known issue?
4. Is there any known way to reduce run-to-run voice drift further?
5. Is there anything about the prompt encoding/alignment step that is especially important beyond clean audio + exact transcript?
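
To make questions 2 and 4 concrete: if the built-in options are not enough, the fallback we are considering is an external rerank that scores each generated candidate against the reference voice and rejects anything that drifts. The sketch below assumes speaker embeddings are already available as plain float vectors from some speaker-verification model (not specified here); the function names and the 0.7 threshold are placeholders, not recommended values.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_most_similar(reference, candidates, min_similarity=0.7):
    """Return (index, score) of the candidate closest to the reference voice.

    The index is None when even the best candidate scores below
    min_similarity, signalling that the line should be regenerated.
    """
    if not candidates:
        raise ValueError("need at least one candidate embedding")
    scores = [cosine_similarity(reference, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < min_similarity:
        return None, scores[best]
    return best, scores[best]
```

Logging the best score per line across many runs would also give a number for run-to-run drift, which is easier to compare between settings than listening tests alone.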

## Short summary
TADA can sound excellent, and prompt quality matters a lot, but the remaining blocker for us is consistent voice identity across runs, not just raw fidelity.

Any guidance would be appreciated.
