[BUG] janus t2i finetune error #187

@hl0737

Required prerequisites

What version of align-anything are you using?

0.0.1.dev0

System information

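Installed following the official instructions. Training runs on a single node with 8 A800 GPUs and 128 training samples. I first used the pre-tokenizer to convert the data into a .pt file, then put the .pt file path into the SFT training script and launched training.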

MODEL_NAME_OR_PATH="/mnt/data_vlm/models/Janus-Pro-1B"
TRAIN_DATASETS="/mnt/data_vlm/liang.hu/janus_train/sft_imagegen_data/"
TRAIN_DATA_FILE="train.pt"
OUTPUT_DIR="/mnt/data_vlm/liang.hu/janus_train/output/"
JANUS_REPO_PATH="/mnt/data_vlm/liang.hu/Janus"

export PYTHONPATH=$PYTHONPATH:$JANUS_REPO_PATH
export WANDB_API_KEY="xxx"
export WANDB_PROJECT="Janus"
export WANDB_NAME="xxx"
export WANDB_MODE=online

# Source the setup script

source ./setup.sh

# Execute deepspeed command
deepspeed \
    --master_port ${MASTER_PORT} \
    --module align_anything.trainers.janus.sft \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_datasets ${TRAIN_DATASETS} \
    --train_data_files ${TRAIN_DATA_FILE} \
    --train_split train \
    --learning_rate 2e-5 \
    --epochs 1 \
    --weight_decay 0.1 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --lr_scheduler_type constant \
    --output_dir ${OUTPUT_DIR}

Problem description

Training fails with the following error:

[rank1]: Traceback (most recent call last):
[rank1]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]: File "<frozen runpy>", line 88, in _run_code
[rank1]: File "/mnt/data_vlm/liang.hu/align-anything/align_anything/trainers/janus/sft.py", line 118, in <module>
[rank1]: sys.exit(main())
[rank1]: ^^^^^^
[rank1]: File "/mnt/data_vlm/liang.hu/align-anything/align_anything/trainers/janus/sft.py", line 113, in main
[rank1]: trainer.train()
[rank1]: File "/mnt/data_vlm/liang.hu/align-anything/align_anything/trainers/text_to_text/sft.py", line 143, in train
[rank1]: info = self.train_step(batch)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/data_vlm/liang.hu/align-anything/align_anything/trainers/text_to_text/sft.py", line 102, in train_step
[rank1]: loss = self.loss(sft_batch)['loss']
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/data_vlm/liang.hu/align-anything/align_anything/trainers/janus/sft.py", line 82, in loss
[rank1]: outputs = self.model.forward(**sft_batch, task=sft_batch['task'])
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: TypeError: deepspeed.utils.nvtx.instrument_w_nvtx.<locals>.wrapped_fn() got multiple values for keyword argument 'task'
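
The traceback points at the call `self.model.forward(**sft_batch, task=sft_batch['task'])`. Since `sft_batch` already contains a `task` key, unpacking it with `**` and also passing `task=` explicitly supplies the same keyword twice. The sketch below reproduces this in isolation. The `forward` function and the batch contents (including the `"generation"` value) are hypothetical stand-ins, not the actual Janus model or align-anything code, and the two workarounds are only possible ways to avoid the clash, not the project's confirmed fix.

```python
def forward(input_ids=None, task=None, **kwargs):
    """Hypothetical stand-in for the model's forward method."""
    return {"input_ids": input_ids, "task": task}


# Hypothetical batch; the real one comes from the pre-tokenized .pt file.
sft_batch = {"input_ids": [1, 2, 3], "task": "generation"}

# Same call pattern as in the traceback: 'task' arrives once via **sft_batch
# and once via the explicit keyword, so Python raises TypeError.
error_message = None
try:
    forward(**sft_batch, task=sft_batch["task"])
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Possible workaround 1: rely on the unpacked dict alone.
out_a = forward(**sft_batch)

# Possible workaround 2: drop 'task' from the dict before unpacking,
# then pass it explicitly.
batch_without_task = {k: v for k, v in sft_batch.items() if k != "task"}
out_b = forward(**batch_without_task, task=sft_batch["task"])
```

Both workarounds pass the same arguments to `forward`; which one fits depends on whether other code expects `task` to stay in the batch dictionary.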

Reproducible example code

The Python snippets:

Command lines:

Extra dependencies:


Steps to reproduce:

Traceback

Expected behavior

No response

Additional context

No response
