How to split the dataset when running pretrain_bert.py

Hi there, I am trying to run pretrain_bert.py on a small Wikipedia corpus of 502 documents. I set the split to 449, 50, 1 for train, valid, and test. With train_iters set to 20, eval_iters set to 1, and eval_interval set to 1, I see the following message in the generated log file:

> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      20
    validation: 21
    test:       1
    
I expected the dataset split not to hang, since my split reaches the target size for all three portions. The mapping was built successfully for the training and validation sets, but the process hung while building the test mapping. When I changed the split to 472, 30, 20, the run succeeded. How should one assess the target sizes (minimum sizes) for the dataset split?

Reference code for the log above:
https://github.com/NVIDIA/Megatron-LM/blob/de4028a9d45bd65c67e1a201d9e0690bd6cb4304/megatron/training.py#L1079

Thanks
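
For reference, the target sizes in the log are consistent with the arithmetic below. This is only a sketch of what the linked function appears to compute, not the library's code: `target_sizes` is a hypothetical helper name, and a global batch size of 1 is an assumption chosen because it reproduces the logged numbers.

```python
def target_sizes(train_iters, eval_interval, eval_iters, global_batch_size):
    """Return (train, validation, test) target sample counts.

    Hypothetical helper; the formulas are an assumption based on
    reading the linked training.py, not a copy of it.
    """
    # One sample per iteration per element of the global batch.
    train = train_iters * global_batch_size
    # Validation runs every eval_interval iterations, plus one final pass.
    valid = (train_iters // eval_interval + 1) * eval_iters * global_batch_size
    # A single test pass of eval_iters iterations.
    test = eval_iters * global_batch_size
    return train, valid, test


# Settings from this issue; global_batch_size=1 is assumed.
sizes = target_sizes(train_iters=20, eval_interval=1, eval_iters=1,
                     global_batch_size=1)
print(sizes)  # (20, 21, 1)
```

If this reading is right, the targets depend on the iteration settings and batch size rather than on the split values, and they count samples rather than documents. It does not by itself explain why the test mapping hangs with a single test document.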
