
How to split the dataset when running pretrain_bert.py #676

@Druva24


Hi there, I am trying to run pretrain_bert.py on a small Wikipedia corpus of 502 documents. I set the split ratio to 449,50,1 for train, valid, and test. With train_iters set to 20, eval_iters set to 1, and eval_interval set to 1, I see the following in the generated log file:

building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 20
validation: 21
test: 1

I expected the dataset split not to hang, since each of the three portions of my split meets its target size. The sample mapping succeeded for the training and validation sets, but it hung while mapping the test set. When I changed the split to 472,30,20, the run succeeded. How should the target sizes (minimum sizes) be assessed when choosing the dataset split?
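To make the two splits concrete, here is a minimal sketch of how the split weights might translate into per-set document counts. It assumes the weights are normalized and applied to the 502 documents with rounding; `split_document_counts` is a hypothetical helper written for illustration, not a Megatron-LM function, and the real code may differ in detail.

```python
def split_document_counts(weights, num_documents):
    """Hypothetical helper: turn split weights into per-set document counts.

    Assumption: weights are normalized to fractions, each fraction is
    rounded to a whole number of documents, and any rounding drift is
    shifted out so the last boundary equals num_documents.
    """
    total = sum(weights)
    fractions = [w / total for w in weights]

    # Cumulative document boundaries: [0, end_train, end_valid, end_test]
    bounds = [0]
    for fraction in fractions:
        bounds.append(bounds[-1] + int(round(fraction * num_documents)))

    # Correct rounding drift so the final boundary is exactly num_documents
    diff = bounds[-1] - num_documents
    bounds = [0] + [b - diff for b in bounds[1:]]

    return [bounds[i + 1] - bounds[i] for i in range(len(weights))]


print(split_document_counts([449, 50, 1], 502))    # [451, 50, 1]
print(split_document_counts([472, 30, 20], 502))   # [454, 29, 19]
```

Under these assumptions, the 449,50,1 split leaves a single document for the test set, while 472,30,20 leaves 19.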
Reference code for the log above:
https://github.com/NVIDIA/Megatron-LM/blob/de4028a9d45bd65c67e1a201d9e0690bd6cb4304/megatron/training.py#L1079
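As far as I can tell, the target sizes in the log follow from the iteration settings rather than from the split ratio. The sketch below reproduces the logged numbers; the formula is my reading of the linked code, `target_sizes` is a hypothetical name, and the global batch size of 1 is an assumption inferred from the log (it is the value that makes the numbers match).

```python
def target_sizes(train_iters, eval_interval, eval_iters, global_batch_size):
    """Hypothetical helper: minimum sample counts for train, validation, test.

    Assumption: validation runs once per eval_interval plus one final time,
    and the test set is evaluated once, each for eval_iters iterations.
    """
    train_samples = train_iters * global_batch_size
    valid_iters = (train_iters // eval_interval + 1) * eval_iters
    test_iters = eval_iters
    return (
        train_samples,
        valid_iters * global_batch_size,
        test_iters * global_batch_size,
    )


# Settings from this issue; global_batch_size=1 is assumed, not stated above
print(target_sizes(train_iters=20, eval_interval=1, eval_iters=1,
                   global_batch_size=1))  # (20, 21, 1)
```

Note that these are sample counts, while the split ratio divides documents, so the two are not directly comparable.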
Thanks

Labels: stale (No activity in 60 days on issue or PR)
