Labels: ? - Needs Triage, bug, external
Description
Version
25.06
On which installation method(s) does this occur?
Docker
Describe the issue
Before Commit 1046
Prior to this commit, data was loaded lazily. I completed the following two (full) training runs:
- 25 epochs each, batch_size = 8, using docker (as recommended)

Machines and times
- Google Cloud VM, a2-highgpu-2g (24 vCPUs, 170 GB memory), 2 x NVIDIA A100 40GB: ~22 hrs
- Local machine, 2 x NVIDIA 3090: ~20 hrs
Command sequence
In both cases there was no pre-processing, and the command was:
torchrun --nproc-per-node=2 train.py

After Commit 1046
Command sequence
python preprocessor.py
mpirun -np 2 --allow-run-as-root python train.py
Running the pre-processing step (preprocessor.py) on 1 GPU took ~4 hrs.
Note
- Only managed 1 epoch, batch_size = 1, using docker (as recommended). Larger batch sizes threw an out-of-memory error.
- After the commit above, 1 epoch takes ~65 hrs (the logged time per epoch of 2.365e+05 s ≈ 65.7 hrs), and training fails when attempting the 2nd epoch (see log below).
- It also requires significantly more storage and GPU power.
Storage issues
Preprocessing creates sample files; the dataset grows roughly 10x in size, and the step takes ~4 hrs to complete.
One Epoch logs
Same epoch; the output is split into snapshots to track progress.
Epoch 1/30: 1%| | 676/99500 [22:53<61:02:25, 2.22s/it, loss=1.071e-01]
Epoch 1/30: 4%|▎ | 3544/99500 [1:58:57<43:40:05, 1.64s/it, loss=5.382e-02]
Epoch 1/30: 4%|▍ | 4323/99500 [2:26:46<35:39:54, 1.35s/it, loss=1.019e-01]
Epoch 1/30: 11%|█▏ | 11382/99500 [6:54:29<34:50:37, 1.42s/it, loss=3.101e-02]
Epoch 1/30: 49%|████▊ | 48291/99500 [31:27:53<20:01:57, 1.41s/it, loss=1.222e-02]
Epoch 1/30: 52%|█████▏ | 51636/99500 [33:40:08<18:42:41, 1.41s/it, loss=3.094e-02]
Epoch 1/30: 90%|█████████ | 89603/99500 [59:03:40<3:37:57, 1.32s/it, loss=3.355e-02]
[2025-08-16 02:06:09,956][main][INFO] - epoch: 1, loss: 3.514e-02, time per epoch: 2.365e+05
When attempting a 2nd epoch, it throws `AttributeError: module 'physicsnemo' has no attribute '__version__'`. The NCCL watchdog timeout later in the log appears to be a downstream effect: rank 0 crashes inside `save_checkpoint`, and rank 1 then waits on the next allreduce until the 600 s timeout fires.
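As a temporary workaround, a minimal shim at the top of train.py might avoid the failing attribute lookup. This is only a sketch I have not verified; the distribution name passed to importlib.metadata is an assumption and may not match the wheel shipped in the container.

```python
# Hypothetical shim for the missing physicsnemo.__version__ attribute.
# Assumption: the installed distribution is visible to importlib.metadata
# under the name "physicsnemo"; adjust if the container's wheel uses another name.
import importlib.metadata

import physicsnemo

if not hasattr(physicsnemo, "__version__"):
    try:
        physicsnemo.__version__ = importlib.metadata.version("physicsnemo")
    except importlib.metadata.PackageNotFoundError:
        # Fall back to a placeholder so the checkpoint save can still record a value.
        physicsnemo.__version__ = "unknown"
```

With something like this in place, the checkpoint save after epoch 1 should at least not abort on the attribute lookup, though the underlying packaging/version issue would still need a proper fix.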
Minimum reproducible example
Attempt: [deforming plate example](https://github.com/NVIDIA/physicsnemo/tree/main/examples/structural_mechanics/deforming_plate)

Relevant log output
root@85090d365d83:/workspace/physicsnemo/examples/structural_mechanics/deforming_plate# mpirun -np 2 --allow-run-as-root python train.py
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/meshgraphnet.py:45: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter
warnings.warn(
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/meshgraphnet.py:45: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter
warnings.warn(
/workspace/physicsnemo/physicsnemo/models/gnn_layers/utils.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter
warnings.warn(
/workspace/physicsnemo/physicsnemo/models/gnn_layers/utils.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter
warnings.warn(
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/hybrid_meshgraphnet.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter
warnings.warn(
/workspace/physicsnemo/physicsnemo/models/meshgraphnet/hybrid_meshgraphnet.py:43: UserWarning: MeshGraphNet will soon require PyTorch Geometric and torch_scatter.
Install it from here:
https://github.com/rusty1s/pytorch_scatter
warnings.warn(
Found 1000 sample files, 199000 samples in total.
Found 1000 sample files, 199000 samples in total.
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:176: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = GradScaler()
[2025-08-13 08:25:07,133][main][INFO] - Using FusedAdam optimizer
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:176: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = GradScaler()
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
Epoch 1/30: 0%| | 0/99500 [00:00<?, ?it/s][2025-08-13 08:25:07,136][main][INFO] - Training started...
Epoch 1/30: 0%| | 0/99500 [00:00<?, ?it/s]/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with autocast(enabled=self.amp):
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with autocast(enabled=self.amp):
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
[2025-08-13 08:25:07,135][checkpoint][WARNING] - Provided checkpoint directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, skipping load
Epoch 1/30: 0%| | 0/99500 [00:00<?, ?it/s][2025-08-13 08:25:07,136][main][INFO] - Training started...
Epoch 1/30: 0%| | 0/99500 [00:00<?, ?it/s]/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with autocast(enabled=self.amp):
/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py:204: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with autocast(enabled=self.amp):
Epoch 1/30: 1%| | 676/99500 [22:53<61:02:25, 2.22s/it, loss=1.071e-01]
Epoch 1/30: 4%|▎ | 3544/99500 [1:58:57<43:40:05, 1.64s/it, loss=5.382e-02]
Epoch 1/30: 4%|▍ | 4323/99500 [2:26:46<35:39:54, 1.35s/it, loss=1.019e-01]
Epoch 1/30: 11%|█▏ | 11382/99500 [6:54:29<34:50:37, 1.42s/it, loss=3.101e-02]
Epoch 1/30: 49%|████▊ | 48291/99500 [31:27:53<20:01:57, 1.41s/it, loss=1.222e-02]
Epoch 1/30: 52%|█████▏ | 51636/99500 [33:40:08<18:42:41, 1.41s/it, loss=3.094e-02]
Epoch 1/30: 90%|█████████ | 89603/99500 [59:03:40<3:37:57, 1.32s/it, loss=3.355e-02]
[2025-08-16 02:06:09,956][main][INFO] - epoch: 1, loss: 3.514e-02, time per epoch: 2.365e+05
Epoch 2/30: 0%| | 0/99500 [00:00<?, ?it/s][2025-08-16 02:06:09,990][checkpoint][WARNING] - Output directory /workspace/physicsnemo/examples/structural_mechanics/deforming_plate/checkpoints does not exist, will attempt to create
Error executing job with overrides: []
Traceback (most recent call last):
File "/workspace/physicsnemo/examples/structural_mechanics/deforming_plate/train.py", line 263, in main
save_checkpoint(
File "/workspace/physicsnemo/physicsnemo/launch/utils/checkpoint.py", line 261, in save_checkpoint
model.save(file_name)
File "/workspace/physicsnemo/physicsnemo/models/module.py", line 331, in save
"physicsnemo_version": physicsnemo.__version__,
^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'physicsnemo' has no attribute '__version__'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[rank1]:[E816 02:16:16.053330842 ProcessGroupNCCL.cpp:633] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199007, OpType=ALLREDUCE, NumelIn=264836, NumelOut=264836, Timeout(ms)=600000) ran for 600091 milliseconds before timing out.
[rank1]:[E816 02:16:16.063174528 ProcessGroupNCCL.cpp:2269] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 199007 PG status: last enqueued work: 199008, last completed work: 199006
[rank1]:[E816 02:16:16.063213378 ProcessGroupNCCL.cpp:671] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E816 02:16:16.063340598 ProcessGroupNCCL.cpp:2104] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank1]:[E816 02:16:16.592171709 ProcessGroupNCCL.cpp:1744] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 199008, last completed NCCL work: 199006.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E816 02:16:16.592309001 ProcessGroupNCCL.cpp:1683] [PG ID 0 PG GUID 0(default_pg) Rank 0] Observed flight recorder dump signal from another rank via TCPStore.
[rank1]:[E816 02:16:16.593015118 ProcessGroupNCCL.cpp:1534] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank0]:[E816 02:16:16.593326699 ProcessGroupNCCL.cpp:1744] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from rank 1 and we will try our best to dump the debug info. Last enqueued NCCL work: 199007, last completed NCCL work: 199006.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E816 02:16:16.593460313 ProcessGroupNCCL.cpp:1534] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E816 02:16:16.598345676 ProcessGroupNCCL.cpp:685] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E816 02:16:16.598433010 ProcessGroupNCCL.cpp:633] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199008, OpType=ALLREDUCE, NumelIn=3340800, NumelOut=3340800, Timeout(ms)=600000) ran for 600586 milliseconds before timing out.
[rank1]:[E816 02:16:16.598464103 ProcessGroupNCCL.cpp:2269] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 199008 PG status: last enqueued work: 199008, last completed work: 199007
[rank1]:[E816 02:16:16.598482263 ProcessGroupNCCL.cpp:671] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E816 02:16:16.598514606 ProcessGroupNCCL.cpp:685] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E816 02:16:17.898844648 ProcessGroupNCCL.cpp:1807] [PG ID 0 PG GUID 0(default_pg) Rank 1] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
[rank0]:[E816 02:16:17.233455775 ProcessGroupNCCL.cpp:633] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199007, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600050 milliseconds before timing out.
[rank0]:[E816 02:16:17.233886511 ProcessGroupNCCL.cpp:2269] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 199007 PG status: last enqueued work: 199007, last completed work: 199006
[rank0]:[E816 02:16:17.233904629 ProcessGroupNCCL.cpp:671] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E816 02:16:17.233942924 ProcessGroupNCCL.cpp:685] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[F816 02:24:16.599411828 ProcessGroupNCCL.cpp:1555] [PG ID 0 PG GUID 0(default_pg) Rank 0] [PG ID 0 PG GUID 0(default_pg) Rank 0] Terminating the process after attempting to dump debug info, due to collective timeout or exception.
[rank1]:[F816 02:24:17.898983473 ProcessGroupNCCL.cpp:1555] [PG ID 0 PG GUID 0(default_pg) Rank 1] [PG ID 0 PG GUID 0(default_pg) Rank 1] Terminating the process after attempting to dump debug info, due to collective timeout or exception.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 85090d365d83 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Environment details
- Using GCP VM with docker container: `docker pull nvcr.io/nvidia/physicsnemo/physicsnemo:25.06`
**Container**
docker run --name pn_container --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
--runtime nvidia -v ${PWD}:/workspace \
--gpus all \
-m 165g --memory-swap=-1 \
-it --rm nvcr.io/nvidia/physicsnemo/physicsnemo:25.06 bash
_Note: This setup worked before the commit mentioned above_