
I'm learning from your project. I want some help with a custom dataset, vocabulary, and retraining from zero. #1247

@Dungmo

Description


Checks

  • This template is only for usage issues encountered.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

Environment Details

Kaggle notebook: Python 3.12.2
Torch version: 2.8.0+cu126
GPU: P100

Steps to Reproduce

Sorry if I'm being annoying. I want to understand your project, so I'm trying to clone it in Kaggle for English only, since I don't have a local computer or a server for training.
I cloned most of your modules and functions; I only adjusted the dataset module and the vocabulary.
My vocabulary:
symbols = [
    ' ', '!', ',', '-', '.', ';', '?',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
    "'", '“', '”'
]
print(len(symbols))  # 36

vocab_char_map = {char: i for i, char in enumerate(symbols)}
vocab_size = len(symbols)
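As a quick sanity check of the map above, here is a standalone sketch that encodes a sample string (this is just my own check, not the project's tokenizer; the unknown-character handling is my assumption):

```python
# Rebuild the same 36-symbol vocabulary: 7 punctuation marks, a-z, and 3 quote marks
symbols = (
    [' ', '!', ',', '-', '.', ';', '?']
    + [chr(c) for c in range(ord('a'), ord('z') + 1)]
    + ["'", '\u201c', '\u201d']
)
vocab_char_map = {char: i for i, char in enumerate(symbols)}

# Encode a sample string, skipping characters outside the vocabulary (my assumption)
text = "hello, world!"
ids = [vocab_char_map[c] for c in text.lower() if c in vocab_char_map]
print(len(symbols), ids[:5])  # 36 [14, 11, 18, 18, 21]
```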

My config, based on your small config:

mel_spec_kw = dict(
    target_sample_rate=24_000,
    n_mel_channels=100,
    hop_length=256,
    win_length=1024,
    n_fft=1024,
    mel_spec_type='vocos',
)
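For context, here is the frame rate and window size these numbers imply (plain arithmetic, not anything F5-TTS-specific):

```python
# Derived quantities from the mel config above
target_sample_rate = 24_000
hop_length = 256
win_length = 1024

frames_per_second = target_sample_rate / hop_length   # 93.75 mel frames per second of audio
window_ms = 1000 * win_length / target_sample_rate    # ~42.7 ms analysis window
print(frames_per_second, round(window_ms, 1))  # 93.75 42.7
```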

model_arch = dict(
    dim=768,
    depth=18,
    heads=12,
    ff_mult=2,
    text_dim=512,
    text_mask_padding=False,
    conv_layers=4,
    pe_attn_head=1,
    attn_backend='torch',
    attn_mask_enabled=False,
    checkpoint_activations=False,
)

My dataset is LibriSpeech (100h) with normalized text files. I also removed the dataset-preparation functions and the logging.

You can review my notebook here: https://www.kaggle.com/code/nguyenquoctuan12/f5tts-small

I'm still learning. I really appreciate your help.

Sorry for my bad English. Have a nice day!

✔️ Expected Behavior

A good generated wav and checkpoint.

❌ Actual Behavior

I'm currently at 130k steps, and the loss is around 0.6. The generated wav picks up the reference wav's accent and voice, but I can't make out any words in it. I know you said the minimum is 200k+ steps, but I don't have many resources for training. Is there anything I did wrong?

Labels: help wanted