Training failed #100

@zska0913

Hi, thanks for your great work!
Stage-1 training fails for me with KeyError: 0, raised in the evaluation hook that runs right after the checkpoint for epoch 48 is saved. How can I solve it?

Run command (VAD tiny stage_1 with the nuScenes mini dataset):
python -m torch.distributed.run --nproc_per_node=8 --master_port=2333 tools/train.py projects/configs/VAD/VAD_tiny_stage_1.py --launcher pytorch --deterministic --work-dir ./data/output/

Output:
....
2025-01-25 09:08:08,961 - mmdet - INFO - Saving checkpoint at 47 epochs
2025-01-25 09:09:10,091 - mmdet - INFO - Saving checkpoint at 48 epochs
[ ] 0/81, elapsed: 0s, ETA:/usr/local/lib/python3.8/dist-packages/torch/tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
...
self.post_center_range = torch.tensor(
/usr/local/lib/python3.8/dist-packages/torch/tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
[>>>> ] 8/81, 1.9 task/s, elapsed: 4s, ETA: 39s/VAD/projects/mmdet3d_plugin/core/bbox/coders/fut_nms_free_coder.py:78: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.post_center_range = torch.tensor(
/VAD/projects/mmdet3d_plugin/core/bbox/coders/map_nms_free_coder.py:82: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.post_center_range = torch.tensor(
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 88/81, 14.5 task/s, elapsed: 6s, ETA: 0s

Traceback (most recent call last):
File "tools/train.py", line 266, in <module>
main()
File "tools/train.py", line 255, in main
custom_train_model(
File "/VAD/projects/mmdet3d_plugin/VAD/apis/train.py", line 21, in custom_train_model
custom_train_detector(
File "/VAD/projects/mmdet3d_plugin/VAD/apis/mmdet_train.py", line 194, in custom_train_detector
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_epoch')
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
self._do_evaluate(runner)
File "/VAD/projects/mmdet3d_plugin/core/evaluation/eval_hooks.py", line 88, in _do_evaluate
key_score = self.evaluate(runner, results)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/evaluation.py", line 361, in evaluate
eval_res = self.dataloader.dataset.evaluate(
File "/VAD/projects/mmdet3d_plugin/datasets/nuscenes_vad_dataset.py", line 1781, in evaluate
all_metric_dict[key] += results[i]['metric_results'][key]
KeyError: 0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6263) of binary: /usr/bin/python
/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
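
A note on reading the traceback: the failing line has three subscripts, results[i], ['metric_results'] and [key]. A KeyError whose message is the integer 0 points at the first one, which would mean results is a dict-like object rather than a list when evaluate() receives it. The sketch below only illustrates that reasoning with made-up data. The helper name, the 'ADE_car' metric key and the 'bbox_results' wrapper key are hypothetical stand-ins, not taken from the VAD code, and the sketch does not identify why results has that shape in this run.

```python
def first_failing_subscript(results, metric_key):
    """Report which subscript of results[0]['metric_results'][metric_key] raises KeyError.

    Hypothetical diagnostic helper; it mirrors the shape of the failing
    expression in the traceback, not the actual VAD implementation.
    """
    try:
        item = results[0]
    except KeyError as err:
        # Only a mapping without the key 0 can fail here with KeyError: 0.
        return f"results[0] -> KeyError: {err}"
    try:
        metrics = item['metric_results']
    except KeyError as err:
        return f"results[0]['metric_results'] -> KeyError: {err}"
    try:
        metrics[metric_key]
    except KeyError as err:
        return f"results[0]['metric_results'][{metric_key!r}] -> KeyError: {err}"
    return "ok"


# Made-up inputs: a list of per-sample dicts, and the same list wrapped in a dict.
as_list = [{'metric_results': {'ADE_car': 0.1}}]
as_dict = {'bbox_results': as_list}  # hypothetical wrapper key

print(first_failing_subscript(as_list, 'ADE_car'))  # ok
print(first_failing_subscript(as_dict, 'ADE_car'))  # results[0] -> KeyError: 0
```

Printing type(results) and, for a dict, list(results.keys()) just before the loop in evaluate() would confirm or rule out this reading.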

Thank you.
