Hello, I am trying to train the model with the following script:
CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch \
    --nproc_per_node=2 main_deit.py \
    --model cf_deit_small \
    --batch-size 16 \
    --data-path ImageNet/ \
    --coarse-stage-size 9 \
    --dist-eval \
    --output train_log
(cf-vit) zky_1@4090-03:~/codes/CF-ViT$ bash train.bash
/data/zky_1/.local/lib/python3.9/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank
argument to be set, please
change it to read from os.environ['LOCAL_RANK']
instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
The error is as follows:
main()
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779]
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] *****************************************
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] *****************************************
usage: DeiT training and evaluation script [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS] [--model MODEL] [--coarse-stage-size COARSE_STAGE_SIZE] [--drop PCT] [--drop-path PCT] [--model-ema] [--no-model-ema] [--model-ema-decay MODEL_EMA_DECAY] [--model-ema-force-cpu] [--opt OPTIMIZER]
[--opt-eps EPSILON] [--opt-betas BETA [BETA ...]] [--clip-grad NORM] [--momentum M] [--weight-decay WEIGHT_DECAY] [--sched SCHEDULER] [--lr LR] [--lr-noise pct, pct [pct, pct ...]] [--lr-noise-pct PERCENT] [--lr-noise-std STDDEV] [--warmup-lr LR]
[--min-lr LR] [--decay-epochs N] [--warmup-epochs N] [--cooldown-epochs N] [--patience-epochs N] [--decay-rate RATE] [--color-jitter PCT] [--aa NAME] [--smoothing SMOOTHING] [--train-interpolation TRAIN_INTERPOLATION] [--repeated-aug]
[--no-repeated-aug] [--reprob PCT] [--remode REMODE] [--recount RECOUNT] [--resplit] [--mixup MIXUP] [--cutmix CUTMIX] [--cutmix-minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]] [--mixup-prob MIXUP_PROB] [--mixup-switch-prob MIXUP_SWITCH_PROB]
[--mixup-mode MIXUP_MODE] [--teacher-model MODEL] [--teacher-path TEACHER_PATH] [--distillation-type {none,soft,hard}] [--distillation-alpha DISTILLATION_ALPHA] [--distillation-tau DISTILLATION_TAU] [--finetune FINETUNE] [--data-path DATA_PATH]
[--data-set {CIFAR,IMNET,INAT,INAT19,IMNET10,IMNET100}] [--inat-category {kingdom,phylum,class,order,supercategory,family,genus,name}] [--output_dir OUTPUT_DIR] [--device DEVICE] [--seed SEED] [--resume RESUME] [--start_epoch N] [--eval] [--dist-eval]
[--num_workers NUM_WORKERS] [--pin-mem] [--no-pin-mem] [--world_size WORLD_SIZE] [--dist_url DIST_URL]
DeiT training and evaluation script: error: unrecognized arguments: --local-rank=0
DeiT training and evaluation script: error: unrecognized arguments: --local-rank=1
W0925 03:19:27.336198 139999741781824 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1807982 closing signal SIGTERM
E0925 03:19:27.400585 139999741781824 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 1 (pid: 1807983) of binary: /data/anaconda3/envs/cf-vit/bin/python
Traceback (most recent call last):
File "/data/anaconda3/envs/cf-vit/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/anaconda3/envs/cf-vit/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 208, in <module>
main()
File "/data/.local/lib/python3.9/site-packages/typing_extensions.py", line 2853, in wrapper
return arg(*args, **kwargs)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 204, in main
launch(args)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in launch
run(args)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main_deit.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-09-25_03:19:27
host : 4090-03
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 1807983)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Afterwards, I worked around this by adding a line like this to your code: parser.add_argument("--local-rank").
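For reference, a slightly more robust version of that workaround accepts both flag spellings and falls back to the LOCAL_RANK environment variable that torchrun sets. This is a sketch, not the repository's actual code:

```python
import argparse
import os

# torch.distributed.launch passes --local-rank=<n> (or --local_rank=<n>) to
# every worker process, while torchrun omits the flag and instead sets the
# LOCAL_RANK environment variable. Accepting both keeps the training script
# compatible with either launcher.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--local-rank", "--local_rank", dest="local_rank", type=int,
    default=int(os.environ.get("LOCAL_RANK", 0)),
)

# parse_known_args ignores the other training flags this demo doesn't define.
args, _ = parser.parse_known_args([])
print(args.local_rank)  # 0 when neither the flag nor LOCAL_RANK is set
```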
However, I then ran into a new problem: roughly, it says that during training some parameters did not participate in the computation:
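That second error is usually DistributedDataParallel complaining about parameters that receive no gradient in a given forward pass. If so, one common workaround is to construct DDP with find_unused_parameters=True. Below is a minimal single-process sketch under that assumption; it is not the actual code from main_deit.py:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo setup purely for illustration; in a real multi-GPU run
# torchrun provides these environment variables and the rank/world size.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
# find_unused_parameters=True makes DDP tolerate parameters that get no
# gradient in some iterations (e.g. branches skipped in a given pass),
# at the cost of an extra graph traversal per iteration.
ddp_model = DDP(model, find_unused_parameters=True)

out = ddp_model(torch.randn(3, 4))
out.sum().backward()
dist.destroy_process_group()
```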
I would appreciate your help. Thank you.