Hi there, I am trying to run pretrain_bert.py on a small Wikipedia corpus of 502 documents. I set the split to 449, 50, 1 for train, valid, test. With train_iters set to 20, eval_iters set to 1, and eval_interval set to 1, I see the following in the generated log file:
building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 20
validation: 21
test: 1
I expected the dataset split not to hang, since my split (449, 50, 1) appears to meet the target sizes for all three portions. The mapping succeeded for the training and validation sets but hung on the test set. When I changed the split to 472, 30, 20, the run succeeded. How should the target sizes (minimum sizes) be assessed when choosing the split?
Reference code for the log above:
https://github.com/NVIDIA/Megatron-LM/blob/de4028a9d45bd65c67e1a201d9e0690bd6cb4304/megatron/training.py#L1079
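For context, here is how I read the arithmetic in the linked function. This is only a sketch of my understanding, not the actual Megatron-LM code: the helper name target_sizes is hypothetical, and global_batch_size = 1 is an assumption I inferred from the logged numbers. As far as I understand, these targets count samples rather than documents, so they may not be directly comparable to the split counts.

```python
# Sketch of how the "target sizes (minimum size)" seem to be derived.
# target_sizes is a hypothetical helper; global_batch_size=1 is an assumption.
def target_sizes(train_iters, eval_interval, eval_iters, global_batch_size):
    # Training needs one global batch per iteration.
    train_samples = train_iters * global_batch_size
    # Validation runs once per eval_interval, plus one extra evaluation.
    valid_iters = (train_iters // eval_interval + 1) * eval_iters
    # Test is evaluated once.
    test_iters = eval_iters
    return (
        train_samples,
        valid_iters * global_batch_size,
        test_iters * global_batch_size,
    )


print(target_sizes(train_iters=20, eval_interval=1, eval_iters=1, global_batch_size=1))
# (20, 21, 1)
```

With my settings this reproduces the 20 / 21 / 1 shown in the log.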
Thanks