Add probabilistic pretrain + GRPO RL pipeline with pluggable rewards and tracking (backward‑compatible) #1246

hcsolakoglu · 2026-01-12T03:48:24Z

This PR integrates the F5R RL workflow into F5-TTS while preserving the default deterministic behavior and checkpoint compatibility. The goal is to enable a two‑stage pipeline (Gaussian NLL warmup followed by GRPO RL fine‑tuning) with a modular reward system and opt‑in robustness improvements, without changing the default training or inference paths.
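To illustrate the second stage, here is a minimal sketch of the group-relative advantage that GRPO is generally built around: rewards for several samples generated from the same prompt are normalized against the group's own mean and standard deviation. The function name, the `eps` parameter, and the example rewards are assumptions made for illustration; the trainer in this PR may compute advantages differently.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Hypothetical helper: shows the general GRPO idea only, not this PR's code.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Three samples for one prompt, with made-up reward values
advantages = group_relative_advantages([0.2, 0.5, 0.8])
# A group with identical rewards carries no learning signal
flat = group_relative_advantages([1.0, 1.0])
```

Samples that score above the group mean get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.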

Key changes:

- Probabilistic output head (proj_out_ln_sig) with gaussian_nll objective and backward‑compatible checkpoint loading.
- GRPO trainer and RL sampling utilities, with optional steps_plus_one and prompt‑length modes.
- Pluggable reward system (RewardProvider, registry, combiner) + built‑in FunASR WER and WeSpeaker similarity providers (optional deps, lazy import, caching).
- Reward logging improvements and optional Trackio support (drop‑in for W&B).
- Optional stability knobs for GRPO (rl.kl_eps, rl.density_eps) while keeping F5R‑parity defaults.
- Dynamic batch sampler optimization to avoid materializing repeated batches in memory.
- Extensive tests covering Gaussian head, checkpoint compatibility, RL training step, reward plugins, device handling, and new opt‑ins.

Notes on compatibility:

- Defaults remain deterministic (output_dist=deterministic, objective=mse), so existing training/inference and checkpoints work unchanged.
- All deviations from F5R behavior are opt‑in and documented in README_RL.md.
- README_RL.md updated with a concise RL runbook, dataset prep, reward model fetch, and recommended opt‑ins.
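The relationship between the default objective and the opt-in one can be sketched in plain Python. This is an assumption-laden illustration, not the PR's implementation: `training_loss` and its arguments are hypothetical, and the real code operates on tensors. It shows the standard per-element Gaussian negative log-likelihood with a predicted log standard deviation `ln_sig`, and that the default `mse` path ignores that head entirely.

```python
import math


def training_loss(mean, target, ln_sig=None, objective="mse"):
    """Hypothetical dispatcher between the default and the opt-in objective."""
    if objective == "mse":
        # Default path: the log-sigma head is not used at all
        return sum((m - t) ** 2 for m, t in zip(mean, target)) / len(target)
    if objective == "gaussian_nll":
        if ln_sig is None:
            raise ValueError("gaussian_nll needs a predicted log-sigma per element")
        total = 0.0
        for m, t, s in zip(mean, target, ln_sig):
            # -log N(t | m, exp(s)^2) = 0.5*(t-m)^2/exp(2s) + s + 0.5*log(2*pi)
            total += 0.5 * (t - m) ** 2 * math.exp(-2.0 * s) + s + 0.5 * math.log(2.0 * math.pi)
        return total / len(target)
    raise ValueError(f"unknown objective: {objective}")


mse = training_loss([0.0, 0.0], [1.0, 3.0])
nll = training_loss([0.0, 0.0], [1.0, 3.0], ln_sig=[0.0, 0.0], objective="gaussian_nll")
```

With `ln_sig` fixed at zero the NLL equals half the MSE plus a constant, so both objectives then pull the predicted mean toward the same target.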

hcsolakoglu · 2026-01-12T03:59:45Z

I have several ideas on how to initialize the probabilistic output head, so I will be implementing and testing multiple approaches. This is still a work in progress, but I have made significant headway. If anyone would like to guide the direction, feel free to run tests and share your feedback. @SWivid

hcsolakoglu added 30 commits January 11, 2026 17:26

Add probabilistic pretrain and GRPO RL training, keep backward compatibility

4c4a7b0

Add probabilistic pretrain and GRPO RL training, keep backward compatibility

6a6dac1

Warn on missing gaussian ln_sig head during soft load

7db2ef9

Update WeSpeaker fetch script to use HF archives

63fa593

Add 8-bit optimizer support for GRPO and ignore checkpoints

e6c0e80

Document RL stages and improve GRPO logging

37c3860

Log GRPO metrics on main process

9d10435

Harden RL deps and training setup

2ccf797

Add tests for trainer and wespeaker guardrails

748b80a

Add opt-in per-sample prompt length for GRPO

6ca8ccd

Keep ref model eval in GRPO forward_rl

f40d3e5

Fix RL resume and pin RL deps

f16681a

Add reward provider device tests

f0636e3

Improve GRPO logging config and cadence

3dd00f5

Default test audio pack to HF dummy dataset

0681b27

Document RL smoke test workflow

439ec1a

Add trackio logging option

a6241d4

Clarify reward metric names in logs

6a73693

Add reward correctness tests

732c392

Document longer GPU run and better dataset

240ed9f

Add colab RL pipeline and char-level WER option

6172de3

Add FunASR ref_source option for audio-based WER

1360365

Document colab RL run notes and improve wandb logging fallback

ac3a777

Stabilize ruff import sorting for wandb

9005100

Add opt-in RL prompt bounds and steps+1

3926e1b

Align reward defaults and add clarity comments

f4a7956

Document latest RL branch changes and config defaults

0c33d73

Fix RL sample logging and config wiring

c4a5104

Fix range prompt handling in GRPO

bd3a3ce

Document recommended RL opt-ins

f7edea1

hcsolakoglu added 3 commits January 12, 2026 06:11

Add GRPO stability opt-ins and sampler generator

9e21ab7

Document sampler memory optimization

9d5c4c7

Add opt-in KL alignment and strict no-ref audio

9405b53

feat(rl): add opt-in legacy length check and max duration config

db9937e

hcsolakoglu force-pushed the rl-integration branch from 81a0560 to db9937e January 12, 2026 14:31

hcsolakoglu added 5 commits January 13, 2026 01:25

Fix GRPO accumulation, skip-grad, and reward ref

9511fb1

Add tests for GRPO skip-grad and refs

89ae389

Init gaussian ln_sig head and add perf test

aebeb10

Add probabilistic pretrain and GRPO RL training, keep backward compatibility

12aa12e

Apply ruff formatting

3951a10
