[QUESTION] Why is there a `stat` call in the backprop while running `AccumulateGrad`? #3761


Environment:
- Container: nvcr.io/nvidia/pytorch:25.06-py3
- Model: Nemotron-5 8B proxy
- GPUs: 8 x B200 (single node)
- Parallelism: TP=1, PP=1, DP=8, distributed optimizer with `--overlap-grad-reduce` and `--overlap-param-gather`
- Data: `--mock-data` with `--num-workers` 1
- Attention: `--attention-backend` flash, `NVTE_FUSED_ATTN=0`
- Platform: Slurm cluster 

While running the [pretrain_gpt](https://github.com/NVIDIA/Megatron-LM/blob/main/pretrain_gpt.py) example with a mock dataset (`--mock-data`), we enabled the PyTorch profiler with the following Megatron CLI flags:
```
--use-pytorch-profiler
--profile
--profile-step-start 0
--profile-step-end 5
--tensorboard-dir ${RESULT_DIR}/profiler_traces
```

We then loaded the resulting `.pt.trace.json` files in TensorBoard's PyTorch Profiler plugin to investigate a performance issue we were seeing (not directly relevant to this issue).

As part of our investigation, we zeroed in on the `autograd::engine::evaluate_function:torch::autograd::AccumulateGrad` op, whose stack trace looks like this:
```
autograd::engine::evaluate_function:torch::autograd::AccumulateGrad
--> megatron/core/distributed/distributed_data_parallel.py(448): hook
--> megatron/core/distributed/param_and_grad_buffer.py(472): register_grad_ready
--> megatron/core/distributed/param_and_grad_buffer.py(322): start_grad_sync
--> megatron/core/distributed/param_and_grad_buffer.py(175): check_grads
--> megatron/core/rerun_state_machine.py(436): validate_result
--> megatron/core/rerun_state_machine.py(874): _get_validation_call_info
--> inspect.py(1677): getframeinfo
--> inspect.py(1063): findsource
--> linecache.py(52): checkcache
--> <built-in function stat>
```

<img width="1321" height="404" alt="Image" src="https://github.com/user-attachments/assets/69c64f56-0b14-4497-a562-2abf632bc7c2" />

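Specifically, despite using the mock dataset and no checkpointing, we see one `stat` call per `AccumulateGrad` call in the backward pass. According to the trace, it originates in `linecache.checkcache`, reached through `inspect.getframeinfo` from `_get_validation_call_info` in `megatron/core/rerun_state_machine.py`.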
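To check whether the bottom frames of the trace behave the same way outside Megatron, here is a minimal standalone sketch. It does not use any Megatron code: `count_stat_calls`, `lookup`, and `fake_caller.py` are hypothetical names made up for illustration, and the only assumption carried over from the trace is that `inspect.getframeinfo` is called on a frame whose source file exists on disk.

```python
import inspect
import os
import tempfile
from unittest import mock

# Hypothetical stand-in for the Megatron call site: a function that calls
# inspect.getframeinfo, the same call the trace shows under
# _get_validation_call_info.
CALLER_SOURCE = (
    "import inspect\n"
    "def lookup():\n"
    "    return inspect.getframeinfo(inspect.currentframe())\n"
)


def count_stat_calls(n_lookups: int) -> int:
    """Run n_lookups frame lookups and return how many os.stat calls happened."""
    with tempfile.TemporaryDirectory() as tmp:
        # The caller must live in a real file so linecache has something to stat.
        path = os.path.join(tmp, "fake_caller.py")
        with open(path, "w") as f:
            f.write(CALLER_SOURCE)
        namespace = {}
        exec(compile(CALLER_SOURCE, path, "exec"), namespace)

        real_stat = os.stat
        # Wrap os.stat so calls are counted but still hit the filesystem.
        with mock.patch("os.stat", side_effect=real_stat) as counted_stat:
            for _ in range(n_lookups):
                namespace["lookup"]()
            return counted_stat.call_count


if __name__ == "__main__":
    # Expect at least one stat per lookup; the exact number may vary
    # between Python versions.
    print(count_stat_calls(100))
```

If the count grows with the number of lookups, that would suggest the `stat` is the `linecache` freshness check on the source file rather than anything related to the dataset or checkpointing.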

Question:
What triggers a filesystem stat during the backward pass when using `--mock-data`? We'd expect no filesystem operations during step execution with mock data and would like to understand the purpose of this call (e.g. logging, checkpoint detection, tokenizer, something else).
