🐛[BUG]: structural_mechanics / deforming_plate (MeshGraphNet). Training time dramatically increased after Aug 2025 update (commit 1046 - Hybrid MeshGraphNet). Fails after one epoch #1073

@carlos-gen

Description

Version

25.06

On which installation method(s) does this occur?

Docker

Describe the issue

Before Commit 1046

Previously, data was loaded lazily. I completed the following two full training rounds:

  • 25 epochs each, batch_size = 8, using Docker (as recommended)

Machines and times

  • Google Cloud VM (a2-highgpu-2g: 24 vCPUs, 170 GB memory, 2 x NVIDIA A100 40GB): ~22 hrs
  • Local machine (2 x NVIDIA RTX 3090): ~20 hrs

Command sequence

In both cases there was no pre-processing step, and the command was:

torchrun --nproc-per-node=2 train.py

After Commit 1046

Command sequence

python preprocessor.py 
mpirun -np 2 --allow-run-as-root python train.py

Running the pre-processing step on 1 GPU took ~4 hrs.

Note

  • Only managed 1 epoch, batch_size = 1, using Docker (as recommended); larger batch sizes threw an out-of-memory error (see the memory-probe sketch after this list).
  • After the commit above, 1 epoch takes ~65 hrs, and training fails when attempting the 2nd epoch (see log below).
  • It also requires significantly more storage and GPU power.
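A minimal memory-probe sketch for the out-of-memory point above, assuming it is called from the training loop in train.py; the helper below is hypothetical and not part of the example code. It logs the peak GPU memory allocated over a window of iterations, which would show how close batch_size = 1 already runs to the 40 GB (A100) / 24 GB (3090) limits:

# Hypothetical helper: call once per iteration from the training loop to log
# the peak GPU memory allocated since the last reset, then reset the counter.
import torch

def log_peak_memory(step: int, device: int = 0, every: int = 100) -> None:
    if step % every == 0:
        peak_gib = torch.cuda.max_memory_allocated(device) / 1024**3
        print(f"step {step}: peak allocated {peak_gib:.2f} GiB")
        torch.cuda.reset_peak_memory_stats(device)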

Storage issues

The preprocessing step writes out individual samples; the dataset grows roughly 10x on disk, and the step takes ~4 hrs to complete.
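To put a number on that growth, here is a rough sketch for comparing on-disk size before and after preprocessing; the directory names are placeholders, not the example's actual layout:

# Placeholder paths -- point "raw_dataset" and "preprocessed_dataset" at the
# actual raw and preprocessed dataset directories.
from pathlib import Path

def dir_size_gib(root: str) -> float:
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file()) / 1024**3

raw = dir_size_gib("raw_dataset")
pre = dir_size_gib("preprocessed_dataset")
print(f"raw: {raw:.1f} GiB, preprocessed: {pre:.1f} GiB, ratio: {pre / raw:.1f}x")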

One Epoch logs

Same epoch; the output is split to show progress at different points.

Epoch 1/30:   1%|          | 676/99500 [22:53<61:02:25,  2.22s/it, loss=1.071e-01] 
Epoch 1/30:   4%|▎         | 3544/99500 [1:58:57<43:40:05,  1.64s/it, loss=5.382e-02] 
Epoch 1/30:   4%|▍         | 4323/99500 [2:26:46<35:39:54,  1.35s/it, loss=1.019e-01] 
Epoch 1/30:  11%|█▏        | 11382/99500 [6:54:29<34:50:37,  1.42s/it, loss=3.101e-02] 
Epoch 1/30:  49%|████▊     | 48291/99500 [31:27:53<20:01:57,  1.41s/it, loss=1.222e-02]
Epoch 1/30:  52%|█████▏    | 51636/99500 [33:40:08<18:42:41,  1.41s/it, loss=3.094e-02]   
Epoch 1/30:  90%|█████████ | 89603/99500 [59:03:40<3:37:57,  1.32s/it, loss=3.355e-02] 
[2025-08-16 02:06:09,956][main][INFO] - epoch: 1, loss:  3.514e-02, time per epoch:  2.365e+05

When attempting the 2nd epoch, it throws AttributeError: module 'physicsnemo' has no attribute '__version__' while saving the checkpoint (see the traceback in the log below).
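A minimal diagnostic sketch, run inside the 25.06 container, to check which version metadata the installed package actually exposes; importlib.metadata is stdlib and does not depend on the module attribute that the checkpoint save reads. The distribution names tried below are guesses; adjust them to whatever pip list reports:

# Compare the module attribute against the installed package metadata.
import importlib.metadata

import physicsnemo

print("module attribute:", getattr(physicsnemo, "__version__", "<missing>"))

# Distribution names are guesses -- adjust to the installed package name.
for dist_name in ("physicsnemo", "nvidia-physicsnemo"):
    try:
        print(f"{dist_name}:", importlib.metadata.version(dist_name))
    except importlib.metadata.PackageNotFoundError:
        print(f"{dist_name}: not found")

If the distribution metadata resolves while the module attribute is missing, the only code that appears to need the attribute is the checkpoint save path in physicsnemo/models/module.py shown in the traceback below.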

Minimum reproducible example

Attempt the [deforming plate example](https://github.com/NVIDIA/physicsnemo/tree/main/examples/structural_mechanics/deforming_plate).

Relevant log output

root@85090d365d83:/workspace/physicsnemo/examples/structural_mechanics/deforming_plate# mpirun -np 2 --allow-run-as-root python train.py
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/meshgraphnet.py:45: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter

  warnings.warn(
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/meshgraphnet.py:45: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter

  warnings.warn(
/workspace/physicsnemo/physicsnemo/models/gnn_layers/utils.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter

  warnings.warn(
/workspace/physicsnemo/physicsnemo/models/gnn_layers/utils.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter

  warnings.warn(
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/hybrid_meshgraphnet.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter

  warnings.warn(
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/hybrid_meshgraphnet.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter

  warnings.warn(
Found 1000 sample files, 199000 samples in total.
Found 1000 sample files, 199000 samples in total.
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:176: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = GradScaler()
[2025-08-13 08:25:07,133][main][INFO] - Using FusedAdam optimizer
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:176: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = GradScaler()
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
Epoch 1/30:   0%|          | 0/99500 [00:00<?, ?it/s][2025-08-13 08:25:07,136][main][INFO] - Training started...
Epoch 1/30:   0%|          | 0/99500 [00:00<?, ?it/s]/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with autocast(enabled=self.amp):
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with autocast(enabled=self.amp):
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
Epoch 1/30:   0%|          | 0/99500 [00:00<?, ?it/s][2025-08-13 08:25:07,136][main][INFO] - Training started...
Epoch 1/30:   0%|          | 0/99500 [00:00<?, ?it/s]/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with autocast(enabled=self.amp):
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with autocast(enabled=self.amp):
Epoch 1/30:   1%|          | 676/99500 [22:53<61:02:25,  2.22s/it, loss=1.071e-01] 
Epoch 1/30:   4%|▎         | 3544/99500 [1:58:57<43:40:05,  1.64s/it, loss=5.382e-02] 
Epoch 1/30:   4%|▍         | 4323/99500 [2:26:46<35:39:54,  1.35s/it, loss=1.019e-01] 
Epoch 1/30:  11%|█▏        | 11382/99500 [6:54:29<34:50:37,  1.42s/it, loss=3.101e-02] 
Epoch 1/30:  49%|████▊     | 48291/99500 [31:27:53<20:01:57,  1.41s/it, loss=1.222e-02]
Epoch 1/30:  52%|█████▏    | 51636/99500 [33:40:08<18:42:41,  1.41s/it, loss=3.094e-02]   
Epoch 1/30:  90%|█████████ | 89603/99500 [59:03:40<3:37:57,  1.32s/it, loss=3.355e-02]     
[2025-08-16 02:06:09,956][main][INFO] - epoch: 1, loss:  3.514e-02, time per epoch:  2.365e+05
Epoch 2/30:   0%|          | 0/99500 [00:00<?, ?it/s][2025-08-16 02:06:09,990][checkpoint][WARNING] - Output directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, will attempt to create
Error executing job with overrides: []
Traceback (most recent call last):
  File "/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py", line 263, in main
    save_checkpoint(
  File "/workspace/physicsnemo/physicsnemo/launch/utils/checkpoint.py", line 261, in save_checkpoint
    model.save(file_name)
  File "/workspace/physicsnemo/physicsnemo/models/module.py", line 331, in save
    "physicsnemo_version": physicsnemo.__version__,
                           ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'physicsnemo' has no attribute '__version__'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[rank1]:[E816 02:16:16.053330842 ProcessGroupNCCL.cpp:633] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199007, OpType=ALLREDUCE, NumelIn=264836, NumelOut=264836, Timeout(ms)=600000) ran for 600091 milliseconds before timing out.
[rank1]:[E816 02:16:16.063174528 ProcessGroupNCCL.cpp:2269] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 199007 PG status: last enqueued work: 199008, last completed work: 199006
[rank1]:[E816 02:16:16.063213378 ProcessGroupNCCL.cpp:671] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E816 02:16:16.063340598 ProcessGroupNCCL.cpp:2104] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank1]:[E816 02:16:16.592171709 ProcessGroupNCCL.cpp:1744] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 199008, last completed NCCL work: 199006.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. 
[rank0]:[E816 02:16:16.592309001 ProcessGroupNCCL.cpp:1683] [PG ID 0 PG GUID 0(default_pg) Rank 0] Observed flight recorder dump signal from another rank via TCPStore.
[rank1]:[E816 02:16:16.593015118 ProcessGroupNCCL.cpp:1534] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank0]:[E816 02:16:16.593326699 ProcessGroupNCCL.cpp:1744] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from  rank 1 and we will try our best to dump the debug info. Last enqueued NCCL work: 199007, last completed NCCL work: 199006.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. 
[rank0]:[E816 02:16:16.593460313 ProcessGroupNCCL.cpp:1534] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E816 02:16:16.598345676 ProcessGroupNCCL.cpp:685] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E816 02:16:16.598433010 ProcessGroupNCCL.cpp:633] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199008, OpType=ALLREDUCE, NumelIn=3340800, NumelOut=3340800, Timeout(ms)=600000) ran for 600586 milliseconds before timing out.
[rank1]:[E816 02:16:16.598464103 ProcessGroupNCCL.cpp:2269] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 199008 PG status: last enqueued work: 199008, last completed work: 199007
[rank1]:[E816 02:16:16.598482263 ProcessGroupNCCL.cpp:671] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E816 02:16:16.598514606 ProcessGroupNCCL.cpp:685] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E816 02:16:17.898844648 ProcessGroupNCCL.cpp:1807] [PG ID 0 PG GUID 0(default_pg) Rank 1] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[E816 02:16:17.233455775 ProcessGroupNCCL.cpp:633] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199007, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
[rank0]:[E816 02:16:17.233886511 ProcessGroupNCCL.cpp:2269] [PG ID 0 PG GUID 0(default_pg) Rank 0]  failure detected by watchdog at work sequence id: 199007 PG status: last enqueued work: 199007, last completed work: 199006
[rank0]:[E816 02:16:17.233904629 ProcessGroupNCCL.cpp:671] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E816 02:16:17.233942924 ProcessGroupNCCL.cpp:685] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[F816 02:24:16.599411828 ProcessGroupNCCL.cpp:1555] [PG ID 0 PG GUID 0(default_pg) Rank 0] [PG ID 0 PG GUID 0(default_pg) Rank 0] Terminating the process after attempting to dump debug info, due to collective timeout or exception.
[rank1]:[F816 02:24:17.898983473 ProcessGroupNCCL.cpp:1555] [PG ID 0 PG GUID 0(default_pg) Rank 1] [PG ID 0 PG GUID 0(default_pg) Rank 1] Terminating the process after attempting to dump debug info, due to collective timeout or exception.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 85090d365d83 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Environment details

- Using GCP VM with docker container: `docker pull nvcr.io/nvidia/physicsnemo/physicsnemo:25.06`
**Container**

docker run --name pn_container --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
           --runtime nvidia -v ${PWD}:/workspace \
           --gpus all \
           -m 165g --memory-swap=-1 \
           -it --rm nvcr.io/nvidia/physicsnemo/physicsnemo:25.06 bash


_Note: This setup worked before the commit mentioned above._

Labels

? - Needs Triage, bug, external
