[QUESTION] Why is there a stat call in the backprop while running AccumulateGrad? #3761

@amanshanbhag

Environment:

  • Container: nvcr.io/nvidia/pytorch:25.06-py3
  • Model: Nemotron-5 8B proxy
  • GPUs: 8 x B200 (single node)
  • Parallelism: TP=1, PP=1, DP=8, distributed optimizer with --overlap-grad-reduce and --overlap-param-gather
  • Data: --mock-data with --num-workers 1
  • Attention: --attention-backend flash, NVTE_FUSED_ATTN=0
  • Platform: Slurm cluster

While running the pretrain_gpt example with a mock dataset (--mock-data), we enabled the PyTorch profiler via the following Megatron CLI flags:

--use-pytorch-profiler
--profile
--profile-step-start 0
--profile-step-end 5
--tensorboard-dir ${RESULT_DIR}/profiler_traces

We then loaded the resulting .pt.trace.json files in TensorBoard's PyTorch Profiler plugin to investigate a performance issue we were seeing (not entirely relevant to this GH Issue).
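The same traces can also be inspected offline. The following is a minimal sketch, assuming the .pt.trace.json files follow the Chrome trace format (a top-level "traceEvents" list whose entries carry a "name" field) and that Python frames appear under the names shown in the profiler UI. The helper names and the miniature sample trace are hypothetical and exist only for illustration.

```python
import json
from collections import Counter


def load_trace(path):
    """Load a Chrome-trace-format JSON file from disk."""
    with open(path) as f:
        return json.load(f)


def count_events_by_name(trace, needle):
    """Count trace events whose name contains `needle`, grouped by full name."""
    events = trace.get("traceEvents", [])
    return Counter(
        e["name"] for e in events if needle in e.get("name", "")
    )


# Hypothetical miniature trace standing in for a real .pt.trace.json file.
sample_trace = {
    "traceEvents": [
        {"name": "<built-in function stat>", "ph": "X", "ts": 10, "dur": 3},
        {"name": "linecache.py(52): checkcache", "ph": "X", "ts": 9, "dur": 5},
        {"name": "<built-in function stat>", "ph": "X", "ts": 40, "dur": 2},
        {"name": "aten::add", "ph": "X", "ts": 60, "dur": 1},
    ]
}

stat_events = count_events_by_name(sample_trace, "function stat")
print(stat_events)
```

For a real run, `count_events_by_name(load_trace(path), "function stat")` would be applied to each trace file in the profiler output directory.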

As part of our investigation, we zeroed in on the autograd::engine::evaluate_function:torch::autograd::AccumulateGrad op, whose stack trace looks like this:

autograd::engine::evaluate_function:torch::autograd::AccumulateGrad
  --> megatron/core/distributed/distributed_data_parallel.py(448): hook
  --> megatron/core/distributed/param_and_grad_buffer.py(472): register_grad_ready
  --> megatron/core/distributed/param_and_grad_buffer.py(322): start_grad_sync
  --> megatron/core/distributed/param_and_grad_buffer.py(175): check_grads
  --> megatron/core/rerun_state_machine.py(436): validate_result
  --> megatron/core/rerun_state_machine.py(874): _get_validation_call_info
  --> inspect.py(1677): getframeinfo
  --> inspect.py(1063): findsource
  --> linecache.py(52): checkcache
  --> <built-in function stat>
Specifically, despite using the mock dataset and no checkpointing, we see one stat call per AccumulateGrad invocation during the backward pass.
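The last four frames of the trace are in the Python standard library and can be reproduced outside Megatron. The sketch below counts os.stat calls made while inspect.getframeinfo runs on a frame whose source file exists on disk. The temporary module and the helper names are hypothetical; the sketch only illustrates the stdlib behavior and says nothing about why Megatron calls it on this path.

```python
import inspect
import os
import tempfile
from unittest import mock


def make_frame_from_file():
    """Write a small .py file to disk and return a frame whose code lives in it."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo_module.py")
    source = "import sys\ndef current_frame():\n    return sys._getframe()\n"
    with open(path, "w") as f:
        f.write(source)
    namespace = {}
    # Compile with the real path so the frame's co_filename points at the file.
    exec(compile(source, path, "exec"), namespace)
    return namespace["current_frame"]()


def count_stats(frame):
    """Return how many times os.stat is called during one getframeinfo call."""
    real_stat = os.stat
    seen = []

    def counting_stat(path, *args, **kwargs):
        seen.append(path)
        return real_stat(path, *args, **kwargs)

    with mock.patch.object(os, "stat", counting_stat):
        # getframeinfo looks up the source lines via linecache, which
        # stats the source file to populate or revalidate its cache.
        inspect.getframeinfo(frame, context=1)
    return len(seen)


frame = make_frame_from_file()
stat_counts = [count_stats(frame) for _ in range(3)]
print(stat_counts)
```

Each call, including the ones made after the source is already cached, is expected to hit the filesystem at least once, which would match the one stat per AccumulateGrad pattern in the trace.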

Question:
What triggers a filesystem stat during the backward pass when using --mock-data? We'd expect no filesystem operations during step execution with mock data and would like to understand the purpose of this call (e.g. logging, checkpoint detection, tokenizer, something else).
