Abnormal inference results after SFT training #36

@Vincento-Wang

I ran SFT training for 10 steps with train_zero_t2i.sh, and the validation images come out abnormal. The launch command and accelerate config are:

accelerate launch --config_file ../configs/accelerate_config.yaml train.py \
    "${MODEL_ARGS[@]}" \
    "${OUTPUT_ARGS[@]}" \
    "${DATA_ARGS[@]}" \
    "${TRAIN_ARGS[@]}" \
    "${SYSTEM_ARGS[@]}" \
    "${CHECKPOINT_ARGS[@]}" \
    "${VALIDATION_ARGS[@]}"

compute_environment: LOCAL_MACHINE

gpu_ids: "5,6,7"
num_processes: 3 # should be the same as the number of GPUs

gpu_ids: "0"

num_processes: 1

debug: false

distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: /home/CogView4/CogKit/quickstart/configs/zero/zero2.yaml # e.g. need use absolute path
  zero3_init_flag: false

downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
main_process_port: 29500
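
The config as pasted lists `gpu_ids` and `num_processes` twice, and its own comment says `num_processes` should equal the number of GPUs. Below is a minimal sketch of a sanity check for those two points. It is not part of CogKit or accelerate: `check_accelerate_config` is a hypothetical helper, it only reads top-level `key: value` lines instead of parsing full YAML, and it assumes the last occurrence of a repeated key is the one that takes effect (many YAML loaders behave this way, but that is an assumption here). It is not presented as the cause of the abnormal images.

```python
import re


def check_accelerate_config(text):
    """Return warnings for a flat, accelerate-style config (hypothetical helper)."""
    seen = {}
    warnings = []
    for raw in text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = raw.split("#", 1)[0].rstrip()
        # Match only top-level keys (lines with no leading indentation).
        m = re.match(r"^([A-Za-z_]\w*)\s*:\s*(.*)$", line)
        if not m:
            continue
        key = m.group(1)
        value = m.group(2).strip().strip("\"'")
        if key in seen:
            warnings.append(
                f"duplicate key '{key}': '{seen[key]}' then '{value}'"
            )
        # Assumption: the last occurrence wins.
        seen[key] = value

    gpu_ids = seen.get("gpu_ids")
    nproc = seen.get("num_processes")
    if gpu_ids and nproc and nproc.isdigit() and gpu_ids != "all":
        n_gpus = len([g for g in gpu_ids.split(",") if g.strip()])
        if n_gpus != int(nproc):
            warnings.append(
                f"num_processes={nproc} but gpu_ids lists {n_gpus} GPU(s)"
            )
    return warnings


# The duplicated entries from the config above, reduced to the relevant keys.
sample = '''gpu_ids: "5,6,7"
num_processes: 3 # should be the same as the number of GPUs
gpu_ids: "0"
num_processes: 1
'''

for warning in check_accelerate_config(sample):
    print(warning)
```

On the sample, the sketch reports the two repeated keys; the final pair (`gpu_ids: "0"`, `num_processes: 1`) is consistent, so no count mismatch is reported.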

Image

For comparison, running train.py directly with a VS Code debug run produces normal images, so the distributed training path probably has a bug.

Image
