
I'm learning from your project. I want some help with a custom dataset, vocabulary, and retraining from zero. #1247

@Dungmo

Description


Checks

  • This template is only for usage issues encountered.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

Environment Details

Kaggle notebook: Python 3.12.2
Torch version: 2.8.0+cu126
GPU: P100

Steps to Reproduce

Sorry if I'm being annoying. I want to understand your project, so I'm trying to clone it in Kaggle for English only, since I don't have a local computer or a server for training.
I cloned most of your modules and functions; I only adjusted the dataset module and the vocabulary.
My vocabulary:
symbols = [
    ' ', '!', ',', '-', '.', ';', '?',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
    "'", '“', '”'
]
print(len(symbols))  # 36

vocab_char_map = {char: i for i, char in enumerate(symbols)}
vocab_size = len(symbols)
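As a quick sanity check of the map above, here is a standalone sketch that encodes a sample string (this is just my own check, not the project's tokenizer; the unknown-character handling is my assumption):

```python
# Rebuild the same 36-symbol vocabulary: 7 punctuation marks, a-z, and 3 quote marks
symbols = (
    [' ', '!', ',', '-', '.', ';', '?']
    + [chr(c) for c in range(ord('a'), ord('z') + 1)]
    + ["'", '\u201c', '\u201d']
)
vocab_char_map = {char: i for i, char in enumerate(symbols)}

# Encode a sample string, skipping characters outside the vocabulary (my assumption)
text = "hello, world!"
ids = [vocab_char_map[c] for c in text.lower() if c in vocab_char_map]
print(len(symbols), ids[:5])  # 36 [14, 11, 18, 18, 21]
```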

My config, based on your small config:

mel_spec_kw = dict(
    target_sample_rate=24_000,
    n_mel_channels=100,
    hop_length=256,
    win_length=1024,
    n_fft=1024,
    mel_spec_type='vocos',
)
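For context, here is the frame rate and window size these numbers imply (plain arithmetic, not anything F5-TTS-specific):

```python
# Derived quantities from the mel config above
target_sample_rate = 24_000
hop_length = 256
win_length = 1024

frames_per_second = target_sample_rate / hop_length   # 93.75 mel frames per second of audio
window_ms = 1000 * win_length / target_sample_rate    # ~42.7 ms analysis window
print(frames_per_second, round(window_ms, 1))  # 93.75 42.7
```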

model_arch = dict(
    dim=768,
    depth=18,
    heads=12,
    ff_mult=2,
    text_dim=512,
    text_mask_padding=False,
    conv_layers=4,
    pe_attn_head=1,
    attn_backend='torch',
    attn_mask_enabled=False,
    checkpoint_activations=False,
)

My dataset is LibriSpeech (100h) with normalized text files. I also removed the dataset-preparation functions and the logging.

You can review my notebook here: https://www.kaggle.com/code/nguyenquoctuan12/f5tts-small

I'm still learning. I really appreciate your help.

Sorry for my bad English. Have a nice day!

✔️ Expected Behavior

A good generated wav and checkpoint.

❌ Actual Behavior

I'm currently at 130k steps, and the loss is around 0.6. The generated wav picks up the reference wav's accent and voice, but I can't make out any words in it. I know you said the minimum is 200k+ steps, but I don't have many resources for training. Is there anything I did wrong?

Labels: help wanted