
File Handles released frequently during fully random access workloads #81

@bernardhan33


Hi team and @philipp-fischer,

I'm testing a workload that enables fully random access with shuffle_buffer_size==1 and max_samples_per_sequence==1, and I noticed that storage throughput was pretty bad due to repeated OpenFile calls: file handles are released and recreated for every micro-batch.
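To illustrate why (this is a toy model of an LRU file-handle cache, not Energon's actual implementation): with a capacity of 1 and fully random access across N shards, nearly every sample lands on a different shard than the one currently cached, so almost every access evicts a handle and reopens a file.

    import os
    import random
    import tempfile
    from collections import OrderedDict

    class FileHandleLRU:
        """Toy LRU cache of open file handles (illustration only)."""

        def __init__(self, capacity: int):
            self.capacity = capacity
            self.handles: OrderedDict = OrderedDict()
            self.opens = 0  # how often a file had to be (re)opened

        def get(self, path: str):
            if path in self.handles:
                self.handles.move_to_end(path)  # mark as most recently used
                return self.handles[path]
            self.opens += 1
            handle = open(path, "rb")
            self.handles[path] = handle
            if len(self.handles) > self.capacity:
                _, evicted = self.handles.popitem(last=False)  # drop least recently used
                evicted.close()
            return handle

    # Eight tiny stand-in "shards" on disk.
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i in range(8):
        path = os.path.join(tmpdir, f"shard_{i}.tar")
        with open(path, "wb") as f:
            f.write(b"x")
        paths.append(path)

    cache = FileHandleLRU(capacity=1)
    for _ in range(1000):
        cache.get(random.choice(paths))
    print(cache.opens)  # ~876: with capacity 1, about 7/8 of random accesses reopen a file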

Digging into the code a bit, I noticed a few places that seem to provide customization for this behavior. Namely,

  1. itar_cache_size (code ref), which controls the size of the LRU cache that stores the file descriptors.
  2. parallel_shard_iters (code ref), which is used to derive itar_cache_size.

Ideally, in the fully random access case, we'd want all file handles to be kept open until the end of training.
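As a minimal sketch of that end state (a generic Python memoization pattern, not an existing Energon API), an unbounded cache keyed by path would keep every shard's handle open for the life of the process; with many shards, the OS file-descriptor limit (ulimit -n) becomes the practical constraint:

    import functools

    # Minimal sketch of the desired behavior: memoize open() so each shard's
    # handle is opened once and never evicted (maxsize=None disables eviction).
    # With many shards, mind the OS open-file limit (ulimit -n).
    @functools.lru_cache(maxsize=None)
    def get_shard_handle(path: str):
        return open(path, "rb")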

I'm using the basic Megatron-Energon code:

from megatron.energon import get_savable_loader, get_train_dataset

# cfg, task_encoder, worker_config, and max_samples_per_sequence are defined elsewhere.
train_dataset = get_train_dataset(
    cfg.model.energon.path,
    batch_size=cfg.model.micro_batch_size,
    task_encoder=task_encoder,
    worker_config=worker_config,
    max_samples_per_sequence=max_samples_per_sequence,
    packing_buffer_size=None,
    shuffle_buffer_size=cfg.model.energon.shuffle_buffer_size,
    split_part='train',
)
data_loader = get_savable_loader(train_dataset, worker_config=worker_config)

Therefore, while I'm figuring out how to pass these parameters through, I'd like to reach out to see whether I'm on the right track and whether you can help me configure them quickly. Thanks!
