Hi team and @philipp-fischer,
I'm testing a workload that uses fully random access (`shuffle_buffer_size == 1` and `max_samples_per_sequence == 1`), and I noticed that storage throughput was quite poor due to repeated OpenFile calls: the file handles are released and recreated for every micro-batch.
Digging into the code a bit, I noticed a few places that seem to provide customization for this behavior, namely:
- `itar_cache_size` (code ref), which controls the size of the LRU cache that stores the file descriptors.
- `parallel_shard_iters` (code ref), which is used to construct the `itar_cache_size` parameter.
Ideally, for fully random access, we'd want all file handles to be kept open until the end of training.
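To illustrate the effect, here is a standalone toy (not Energon's actual implementation; the `HandleCache` class, shard names, and access pattern are made up for illustration) comparing a size-1 LRU handle cache against one large enough to hold a handle per shard under random access:

```python
import random
from collections import OrderedDict

class HandleCache:
    """Toy LRU cache of open 'file handles' (stubbed as strings)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.cache = OrderedDict()
        self.opens = 0  # stands in for OpenFile round-trips to storage

    def get(self, shard):
        if shard in self.cache:             # cache hit: reuse the open handle
            self.cache.move_to_end(shard)
            return self.cache[shard]
        if len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # evict (close) the least-recently-used handle
        self.opens += 1                     # cache miss: (re)open the shard
        self.cache[shard] = f"handle:{shard}"
        return self.cache[shard]

shards = [f"shard_{i:05d}.tar" for i in range(8)]
accesses = [random.choice(shards) for _ in range(1000)]

for size in (1, len(shards)):
    cache = HandleCache(size)
    for shard in accesses:
        cache.get(shard)
    print(f"cache size {size}: {cache.opens} opens for {len(accesses)} random accesses")
```

With the size-1 cache, nearly every access pays for a fresh open; with a cache sized to the shard count, each shard is opened once and its handle stays open, which is the behavior I'm after for fully random access.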
I'm using the basic Megatron-Energon code:
```python
from megatron.energon import get_savable_loader, get_train_dataset

train_dataset = get_train_dataset(
    cfg.model.energon.path,
    batch_size=cfg.model.micro_batch_size,
    task_encoder=task_encoder,
    worker_config=worker_config,
    max_samples_per_sequence=max_samples_per_sequence,
    packing_buffer_size=None,
    shuffle_buffer_size=cfg.model.energon.shuffle_buffer_size,
    split_part='train',
)
data_loader = get_savable_loader(train_dataset, worker_config=worker_config)
```

Therefore, while I'm figuring out how to pass these parameters through, I wanted to reach out to check whether I'm on the right track and whether you can help me configure these parameters quickly. Thanks!
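For concreteness, this is roughly what I'm hoping to be able to write. I have not verified that `get_train_dataset` forwards extra keyword arguments down to the dataset factory, so the `parallel_shard_iters` pass-through (and the hypothetical `num_shards` value) below is an assumption based on the code refs above, not a documented API:

```python
# Hedged sketch: whether this kwarg is actually forwarded is exactly what I'm
# asking about; it is not presented here as the supported loader API.
train_dataset = get_train_dataset(
    cfg.model.energon.path,
    batch_size=cfg.model.micro_batch_size,
    task_encoder=task_encoder,
    worker_config=worker_config,
    max_samples_per_sequence=max_samples_per_sequence,
    packing_buffer_size=None,
    shuffle_buffer_size=cfg.model.energon.shuffle_buffer_size,
    split_part='train',
    parallel_shard_iters=num_shards,  # hypothetical: one iterator (and handle) per shard so nothing gets evicted
)
```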