Description
Your question
Ask a clear and concise question about Megatron-LM. Tag the @mcore-oncall
to get oncall's attention to this issue.
Environment:
- Container: nvcr.io/nvidia/pytorch:25.06-py3
- Model: Nemotron-5 8B proxy
- GPUs: 8 x B200 (single node)
- Parallelism: TP=1, PP=1, DP=8, distributed optimizer with --overlap-grad-reduce and --overlap-param-gather
- Data: --mock-data with --num-workers 1
- Attention: --attention-backend flash, NVTE_FUSED_ATTN=0
- Platform: Slurm cluster
While running the pretrain_gpt example with a mock dataset (--mock-data), we enabled the PyTorch profiler via the following Megatron CLI flags:
--use-pytorch-profiler
--profile
--profile-step-start 0
--profile-step-end 5
--tensorboard-dir ${RESULT_DIR}/profiler_traces
We then loaded the resulting .pt.trace.json files in TensorBoard's PyTorch Profiler plugin to investigate a performance issue we were seeing (not directly relevant to this issue).
As part of our investigation, we zeroed in on the autograd::engine::evaluate_function:torch::autograd::AccumulateGrad op, whose stack trace looks like this:
autograd::engine::evaluate_function:torch::autograd::AccumulateGrad
--> megatron/core/distributed/distributed_data_parallel.py(448): hook
--> megatron/core/distributed/param_and_grad_buffer.py(472): register_grad_ready
--> megatron/core/distributed/param_and_grad_buffer.py(322): start_grad_sync
--> megatron/core/distributed/param_and_grad_buffer.py(175): check_grads
--> megatron/core/rerun_state_machine.py(436): validate_result
--> megatron/core/rerun_state_machine.py(874): _get_validation_call_info
--> inspect.py(1677): getframeinfo
--> inspect.py(1063): findsource
--> linecache.py(52): checkcache
--> <built-in function stat>
Specifically, despite using the mock dataset and no checkpointing, we see one stat call per AccumulateGrad invocation during the backward pass.
Question:
What triggers a filesystem stat during the backward pass when using --mock-data? We'd expect no filesystem operations during step execution with mock data and would like to understand the purpose of this call (e.g. logging, checkpoint detection, tokenizer, something else).
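The last four frames of the trace are all standard-library code, so that part can be reproduced without Megatron. The sketch below is only an illustration under stated assumptions: `HELPER_SOURCE`, `get_call_info`, and `count_stat_calls` are hypothetical names, not Megatron code. It wraps `os.stat` in a counting spy and calls `inspect.getframeinfo` repeatedly on a frame whose source file exists on disk.

```python
import inspect
import os
import tempfile
from unittest import mock

# Hypothetical stand-in for a helper that looks up file/line info for a frame.
HELPER_SOURCE = '''
import inspect

def get_call_info():
    # Same standard-library call that appears in the trace.
    return inspect.getframeinfo(inspect.currentframe())
'''


def count_stat_calls(num_calls):
    """Return how many times os.stat runs across num_calls frame lookups."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "helper_module.py")
        with open(path, "w") as handle:
            handle.write(HELPER_SOURCE)

        # Compile with the real path so the frame points at a file on disk.
        namespace = {}
        exec(compile(HELPER_SOURCE, path, "exec"), namespace)

        # Wrap os.stat so calls are counted but still reach the filesystem.
        with mock.patch("os.stat", side_effect=os.stat) as stat_spy:
            for _ in range(num_calls):
                namespace["get_call_info"]()
            return stat_spy.call_count


if __name__ == "__main__":
    print("os.stat calls for 5 lookups:", count_stat_calls(5))
```

On CPython the count should be at least one per lookup, since `inspect.getframeinfo` with its default `context=1` resolves source lines through `linecache`, which stats the source file to check whether its cached copy is stale. This only accounts for the bottom of the trace; it says nothing about why the lookup runs on every gradient hook.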