Description
After 10 steps of SFT training with train_zero_t2i.sh, the validation images come out abnormal. The launch command is:
accelerate launch --config_file ../configs/accelerate_config.yaml train.py \
    "${MODEL_ARGS[@]}" \
    "${OUTPUT_ARGS[@]}" \
    "${DATA_ARGS[@]}" \
    "${TRAIN_ARGS[@]}" \
    "${SYSTEM_ARGS[@]}" \
    "${CHECKPOINT_ARGS[@]}" \
    "${VALIDATION_ARGS[@]}"
compute_environment: LOCAL_MACHINE
gpu_ids: "5,6,7"
num_processes: 3 # should be the same as the number of GPUs
gpu_ids: "0"
num_processes: 1
debug: false
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: /home/CogView4/CogKit/quickstart/configs/zero/zero2.yaml # e.g. need use absolute path
  zero3_init_flag: false
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29500
By contrast, running train.py directly through the VS Code debugger produces normal images, so there appears to be a bug in the distributed training path.

