Hello, I am trying to train the model with the following script:
CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch \
    --nproc_per_node=2 main_deit.py \
    --model cf_deit_small \
    --batch-size 16 \
    --data-path ImageNet/ \
    --coarse-stage-size 9 \
    --dist-eval \
    --output train_log
(cf-vit) zky_1@4090-03:~/codes/CF-ViT$ bash train.bash
/data/zky_1/.local/lib/python3.9/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank
argument to be set, please
change it to read from os.environ['LOCAL_RANK']
instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
The error is as follows:
main()
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779]
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] *****************************************
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] *****************************************
usage: DeiT training and evaluation script [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS] [--model MODEL] [--coarse-stage-size COARSE_STAGE_SIZE] [--drop PCT] [--drop-path PCT] [--model-ema] [--no-model-ema] [--model-ema-decay MODEL_EMA_DECAY] [--model-ema-force-cpu] [--opt OPTIMIZER]
[--opt-eps EPSILON] [--opt-betas BETA [BETA ...]] [--clip-grad NORM] [--momentum M] [--weight-decay WEIGHT_DECAY] [--sched SCHEDULER] [--lr LR] [--lr-noise pct, pct [pct, pct ...]] [--lr-noise-pct PERCENT] [--lr-noise-std STDDEV] [--warmup-lr LR]
[--min-lr LR] [--decay-epochs N] [--warmup-epochs N] [--cooldown-epochs N] [--patience-epochs N] [--decay-rate RATE] [--color-jitter PCT] [--aa NAME] [--smoothing SMOOTHING] [--train-interpolation TRAIN_INTERPOLATION] [--repeated-aug]
[--no-repeated-aug] [--reprob PCT] [--remode REMODE] [--recount RECOUNT] [--resplit] [--mixup MIXUP] [--cutmix CUTMIX] [--cutmix-minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]] [--mixup-prob MIXUP_PROB] [--mixup-switch-prob MIXUP_SWITCH_PROB]
[--mixup-mode MIXUP_MODE] [--teacher-model MODEL] [--teacher-path TEACHER_PATH] [--distillation-type {none,soft,hard}] [--distillation-alpha DISTILLATION_ALPHA] [--distillation-tau DISTILLATION_TAU] [--finetune FINETUNE] [--data-path DATA_PATH]
[--data-set {CIFAR,IMNET,INAT,INAT19,IMNET10,IMNET100}] [--inat-category {kingdom,phylum,class,order,supercategory,family,genus,name}] [--output_dir OUTPUT_DIR] [--device DEVICE] [--seed SEED] [--resume RESUME] [--start_epoch N] [--eval] [--dist-eval]
[--num_workers NUM_WORKERS] [--pin-mem] [--no-pin-mem] [--world_size WORLD_SIZE] [--dist_url DIST_URL]
DeiT training and evaluation script: error: unrecognized arguments: --local-rank=0
DeiT training and evaluation script: error: unrecognized arguments: --local-rank=1
W0925 03:19:27.336198 139999741781824 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1807982 closing signal SIGTERM
E0925 03:19:27.400585 139999741781824 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 1 (pid: 1807983) of binary: /data/anaconda3/envs/cf-vit/bin/python
Traceback (most recent call last):
File "/data/anaconda3/envs/cf-vit/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/anaconda3/envs/cf-vit/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 208, in <module>
main()
File "/data/.local/lib/python3.9/site-packages/typing_extensions.py", line 2853, in wrapper
return arg(*args, **kwargs)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 204, in main
launch(args)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in launch
run(args)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_deit.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-09-25_03:19:27
host : 4090-03
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 1807983)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Afterwards, I worked around this by adding a line like this to your code: parser.add_argument("--local-rank").
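For reference, a slightly more robust version of that workaround accepts both flag spellings and falls back to the LOCAL_RANK environment variable that torchrun sets. This is a sketch, not the repository's actual code:

```python
import argparse
import os

# torch.distributed.launch passes --local-rank=<n> (or --local_rank=<n>) to
# every worker process, while torchrun omits the flag and instead sets the
# LOCAL_RANK environment variable. Accepting both keeps the training script
# compatible with either launcher.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--local-rank", "--local_rank", dest="local_rank", type=int,
    default=int(os.environ.get("LOCAL_RANK", 0)),
)

# parse_known_args ignores the other training flags this demo doesn't define.
args, _ = parser.parse_known_args([])
print(args.local_rank)  # 0 when neither the flag nor LOCAL_RANK is set
```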
However, I then ran into a new problem: roughly, it says that during training some parameters did not participate in the computation:
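That second error is usually DistributedDataParallel complaining about parameters that receive no gradient in a given forward pass. If so, one common workaround is to construct DDP with find_unused_parameters=True. Below is a minimal single-process sketch under that assumption; it is not the actual code from main_deit.py:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo setup purely for illustration; in a real multi-GPU run
# torchrun provides these environment variables and the rank/world size.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
# find_unused_parameters=True makes DDP tolerate parameters that get no
# gradient in some iterations (e.g. branches skipped in a given pass),
# at the cost of an extra graph traversal per iteration.
ddp_model = DDP(model, find_unused_parameters=True)

out = ddp_model(torch.randn(3, 4))
out.sum().backward()
dist.destroy_process_group()
```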
I would appreciate your help. Thank you.