Description
- I have searched related issues but cannot get the expected help. Yes
- The bug has not been fixed in the latest version. Yes
Describe the bug
As the title suggests, when training via dist_train.sh I cannot get past the evaluation stage of training, whereas running the same model on a single GPU works fine. This error pops up:
FileNotFoundError: [Errno 2] No such file or directory: '.dist_test/tmpm3f7a8xg/part_1.pkl'
Traceback (most recent call last):
Here is the full output:
2022-10-26 20:56:24,265 - mmseg - INFO - Iter [190/160000] lr: 1.807e-07, eta: 2 days, 8:58:32, time: 1.229, data_time: 0.008, memory: 15900, decode.loss_cls: 5.1733, decode.loss_mask: 2.3283, decode.loss_dice: 4.2889, decode.d0.loss_cls: 10.3644, decode.d0.loss_mask: 1.8888, decode.d0.loss_dice: 3.6292, decode.d1.loss_cls: 5.3732, decode.d1.loss_mask: 1.9898, decode.d1.loss_dice: 3.6457, decode.d2.loss_cls: 5.0473, decode.d2.loss_mask: 1.9692, decode.d2.loss_dice: 3.7876, decode.d3.loss_cls: 5.1667, decode.d3.loss_mask: 1.9582, decode.d3.loss_dice: 3.9707, decode.d4.loss_cls: 5.1105, decode.d4.loss_mask: 2.0531, decode.d4.loss_dice: 4.0650, decode.d5.loss_cls: 4.9984, decode.d5.loss_mask: 2.2159, decode.d5.loss_dice: 4.0660, decode.d6.loss_cls: 4.9207, decode.d6.loss_mask: 2.2972, decode.d6.loss_dice: 4.1253, decode.d7.loss_cls: 4.8610, decode.d7.loss_mask: 2.2477, decode.d7.loss_dice: 4.1793, decode.d8.loss_cls: 4.9706, decode.d8.loss_mask: 2.3507, decode.d8.loss_dice: 4.2628, loss: 117.3053
2022-10-26 20:56:37,293 - mmseg - INFO - Iter [195/160000] lr: 1.855e-07, eta: 2 days, 8:54:33, time: 1.227, data_time: 0.008, memory: 15900, decode.loss_cls: 4.7057, decode.loss_mask: 2.3921, decode.loss_dice: 4.3334, decode.d0.loss_cls: 10.3254, decode.d0.loss_mask: 1.9261, decode.d0.loss_dice: 3.6466, decode.d1.loss_cls: 4.9439, decode.d1.loss_mask: 1.9303, decode.d1.loss_dice: 3.7556, decode.d2.loss_cls: 4.6303, decode.d2.loss_mask: 1.9778, decode.d2.loss_dice: 3.7929, decode.d3.loss_cls: 4.7008, decode.d3.loss_mask: 2.0036, decode.d3.loss_dice: 3.8613, decode.d4.loss_cls: 4.7067, decode.d4.loss_mask: 2.1614, decode.d4.loss_dice: 3.9415, decode.d5.loss_cls: 4.5411, decode.d5.loss_mask: 2.4803, decode.d5.loss_dice: 4.0049, decode.d6.loss_cls: 4.5642, decode.d6.loss_mask: 2.5168, decode.d6.loss_dice: 4.0441, decode.d7.loss_cls: 4.5295, decode.d7.loss_mask: 2.4627, decode.d7.loss_dice: 4.0911, decode.d8.loss_cls: 4.6387, decode.d8.loss_mask: 2.3198, decode.d8.loss_dice: 4.2091, loss: 114.1377
2022-10-26 20:56:50,328 - mmseg - INFO - Saving checkpoint at 200 iterations
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 10/10, 0.1 task/s, elapsed: 116s, ETA: 0s
Traceback (most recent call last):
File "./train.py", line 224, in
main()
File "./train.py", line 213, in main
train_segmentor(
File "/home/ubuntu/.local/lib/python3.8/site-packages/mmseg/apis/train.py", line 167, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 67, in train
self.call_hook('after_train_iter')
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 309, in call_hook
getattr(hook, fn_name)(self)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/evaluation.py", line 259, in after_train_iter
hook.after_train_iter(runner)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/dist_utils.py", line 129, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/progress_tracking/ViT-Adapter/mmseg_custom/core/hook/wandblogger_hook.py", line 206, in after_train_iter
results = self.test_fn(runner.model, self.eval_hook.dataloader)
File "/home/ubuntu/.local/lib/python3.8/site-packages/mmseg/apis/test.py", line 232, in multi_gpu_test
results = collect_results_cpu(results, len(dataset), tmpdir)
File "/usr/local/lib/python3.8/dist-packages/mmcv/engine/test.py", line 139, in collect_results_cpu
part_result = mmcv.load(part_file)
File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/io.py", line 60, in load
with BytesIO(file_client.get(file)) as f:
File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/file_client.py", line 993, in get
return self.client.get(filepath)
File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/file_client.py", line 518, in get
with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '.dist_test/tmpm3f7a8xg/part_1.pkl'
Traceback (most recent call last):
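For context, here is my rough understanding of the CPU collection step that fails, written as a simplified sketch rather than the exact mmcv source: every rank pickles its partial results into tmpdir/part_<rank>.pkl and rank 0 reads them all back, so the FileNotFoundError above means rank 0 is looking in a tmpdir that rank 1 never wrote to (a different tmpdir per rank, or a directory rank 0 cannot see).

# Simplified sketch of the collection step in mmcv/engine/test.py (not the
# exact source; the function name and the '.dist_test' default are only
# for illustration).
import os
import os.path as osp
import pickle
import tempfile

import torch.distributed as dist


def collect_results_cpu_sketch(result_part, size, tmpdir=None):
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    if tmpdir is None:
        # All ranks must end up with the *same* directory here; if each rank
        # creates its own temp dir, rank 0 will never see part_1.pkl.
        os.makedirs('.dist_test', exist_ok=True)
        tmpdir = tempfile.mkdtemp(dir='.dist_test')

    # Every rank dumps its share of the results.
    with open(osp.join(tmpdir, f'part_{rank}.pkl'), 'wb') as f:
        pickle.dump(result_part, f)
    dist.barrier()

    if rank != 0:
        return None

    # Rank 0 loads every part file; a missing part_1.pkl raises the
    # FileNotFoundError shown in the traceback above.
    results = []
    for i in range(world_size):
        with open(osp.join(tmpdir, f'part_{i}.pkl'), 'rb') as f:
            results.extend(pickle.load(f))
    return results[:size]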
Reproduction
- What command or script did you run?
CUDA_VISIBLE_DEVICES=0,1 PORT=29500 ./dist_train.sh configs/ade20k/mask2former_beit_adapter_large_640_160k_ade20k_ms.py 2
- Did you make any modifications on the code or config? Did you understand what you have modified?
No.
Here is the config I'm using:
_base_ = [
'../_base_/models/mask2former_beit.py',
'../_base_/datasets/ade20k.py',
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_160k.py'
]
crop_size = (640, 640)
pretrained = 'https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_large_patch16_224_pt22k_ft22k.pth'
model = dict(
pretrained=pretrained,
backbone=dict(
type='BEiTAdapter',
img_size=640,
patch_size=16,
embed_dim=1024,
depth=24,
num_heads=16,
mlp_ratio=4,
qkv_bias=True,
use_abs_pos_emb=False,
use_rel_pos_bias=True,
init_values=1e-6,
drop_path_rate=0.3,
conv_inplane=64,
n_points=4,
deform_num_heads=16,
cffn_ratio=0.25,
deform_ratio=0.5,
with_cp=True, # set with_cp=True to save memory
interaction_indexes=[[0, 5], [6, 11], [12, 17], [18, 23]],
),
decode_head=dict(
in_channels=[1024, 1024, 1024, 1024],
feat_channels=1024,
out_channels=1024,
num_queries=100,
pixel_decoder=dict(
type='MSDeformAttnPixelDecoder',
num_outs=3,
norm_cfg=dict(type='GN', num_groups=32),
act_cfg=dict(type='ReLU'),
encoder=dict(
type='DetrTransformerEncoder',
num_layers=6,
transformerlayers=dict(
type='BaseTransformerLayer',
attn_cfgs=dict(
type='MultiScaleDeformableAttention',
embed_dims=1024,
num_heads=32,
num_levels=3,
num_points=4,
im2col_step=64,
dropout=0.0,
batch_first=False,
norm_cfg=None,
init_cfg=None),
ffn_cfgs=dict(
type='FFN',
embed_dims=1024,
feedforward_channels=4096,
num_fcs=2,
ffn_drop=0.0,
with_cp=True, # set with_cp=True to save memory
act_cfg=dict(type='ReLU', inplace=True)),
operation_order=('self_attn', 'norm', 'ffn', 'norm')),
init_cfg=None),
positional_encoding=dict(
type='SinePositionalEncoding', num_feats=512, normalize=True),
init_cfg=None),
positional_encoding=dict(
type='SinePositionalEncoding', num_feats=512, normalize=True),
transformer_decoder=dict(
type='DetrTransformerDecoder',
return_intermediate=True,
num_layers=9,
transformerlayers=dict(
type='DetrTransformerDecoderLayer',
attn_cfgs=dict(
type='MultiheadAttention',
embed_dims=1024,
num_heads=32,
attn_drop=0.0,
proj_drop=0.0,
dropout_layer=None,
batch_first=False),
ffn_cfgs=dict(
embed_dims=1024,
feedforward_channels=4096,
num_fcs=2,
act_cfg=dict(type='ReLU', inplace=True),
ffn_drop=0.0,
dropout_layer=None,
with_cp=True, # set with_cp=True to save memory
add_identity=True),
feedforward_channels=4096,
operation_order=('cross_attn', 'norm', 'self_attn', 'norm',
'ffn', 'norm')),
init_cfg=None)
),
test_cfg=dict(mode='slide', crop_size=crop_size, stride=(426, 426))
)
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', reduce_zero_label=True),
dict(type='Resize', img_scale=(2048, 640), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
dict(type='ToMask'),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg', 'gt_masks', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 640),
img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=True,
transforms=[
dict(type='SETR_Resize', keep_ratio=True,
crop_size=crop_size, setr_multi_scale=True),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
optimizer = dict(_delete_=True, type='AdamW', lr=2e-5, betas=(0.9, 0.999), weight_decay=0.05,
constructor='LayerDecayOptimizerConstructor',
paramwise_cfg=dict(num_layers=24, layer_decay_rate=0.90))
lr_config = dict(_delete_=True,
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0, min_lr=0.0, by_epoch=False)
log_config = dict(
interval=5,
hooks=[
dict(type='MMSegWandbHook',
init_kwargs={
'entity': "nexterarobotics",
'project': "Progress_Tracking_V1",
'name': "mask2former_beit_adapter_large_896_80k_ade20k_ss_V0.1"},
by_epoch=False,
num_eval_images = 2),
dict(type='TextLoggerHook', by_epoch=False),
])
data = dict(samples_per_gpu=1,
train=dict(pipeline=train_pipeline),
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
checkpoint_config = dict(by_epoch=False, interval=100, max_keep_ckpts=1)
evaluation = dict(interval=200, metric='mIoU', save_best='mIoU')
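The crash hits right at iteration 200, which lines up with evaluation interval=200 together with the MMSegWandbHook; per the traceback it is the hook's test_fn (multi_gpu_test) that fails, not the eval hook itself. In case it helps, a hypothetical minimal reproduction of that call with an explicit shared tmpdir (model/dataloader construction omitted; tmpdir is an existing argument of mmseg's multi_gpu_test):

# Hypothetical minimal reproduction of the failing call; 'model' and
# 'val_dataloader' are assumed to be built the same way as in train.py.
from mmseg.apis import multi_gpu_test

# Runs on every rank under torch.distributed.launch. Passing an explicit
# tmpdir on a filesystem all ranks can see rules out per-rank temp dirs.
results = multi_gpu_test(model, val_dataloader, tmpdir='./shared_dist_test')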
- What dataset did you use?
ADE20K
- Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
python3 collect_env.py
sys.platform: linux
Python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A10G
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.9.0+cu111
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.10.0+cu111
OpenCV: 4.6.0
MMCV: 1.5.0
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.6
MMSegmentation: 0.20.2+ad38cbe