
Trying to use lora_finetune.py directly gives the error `/bin/bash: line 1: export: `=/usr/bin/supervisord': not a valid identifier` #6235

Open
@tanghl01

Description


📚 The doc issue

```shell
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI

# install dependencies
pip install -r requirements/requirements.txt

# install colossalai with AOT-compiled CUDA kernels
BUILD_EXT=1 pip install .

export CUDA_VISIBLE_DEVICES=0
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_INSTALL_DIR=/usr/local/cuda-12.4/
export CUDA_HOME=/usr/local/cuda-12.4/
```

colossalai check -i

Installation Report

------------ Environment ------------
Colossal-AI version: 0.4.8
PyTorch version: 2.5.1
System CUDA version: 12.4
CUDA version required by PyTorch: 12.4

Note:

  1. The table above checks the versions of the libraries/tools in the current environment
  2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
  3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda; you can go to https://pytorch.org/get-started/locally/ to download the correct version.
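Note 3 can be checked directly from the shell (a quick sketch; it only assumes `python` on the PATH):

```shell
# Print the CUDA version PyTorch was built against -- the value the report
# shows as "CUDA version required by PyTorch". A CPU-only build prints None.
python -c "import torch; print(torch.version.cuda)" 2>/dev/null \
  || echo "torch not importable in this environment"
```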

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:

  1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable BUILD_EXT=1 is set
  2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A

Note:

  1. The table above checks the version compatibility of the libraries/tools in the current environment
    • PyTorch version mismatch: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
    • System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
    • System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation

```shell
colossalai run --nproc_per_node 1 lora_finetune.py --pretrained "/root/autodl-tmp/DeepSeeK-R1-7B" --dataset "/root/converted_data.json" --quant 4 --lora_rank 32 --lora_alpha 64 --batch_size 8 --gradient_accumulation 2 --max_length 1024 --lr 1.5e-4 --warmup_steps 50 --num_epochs 3 --save_dir "/root/autodl-tmp/DeepSeeK_lora" --grad_ckpt --dtype bf16
```

/bin/bash: line 1: export: `=/usr/bin/supervisord': not a valid identifier
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 lora_finetune.py --pretrained /root/autodl-tmp/DeepSeeK-R1-7B --dataset /root/converted_data.json --quant 4 --lora_rank 32 --lora_alpha 64 --batch_size 8 --gradient_accumulation 2 --max_length 1024 --lr 1.5e-4 --warmup_steps 50 --num_epochs 3 --save_dir /root/autodl-tmp/DeepSeeK_lora --grad_ckpt --dtype bf16 on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /root/ColossalAI/applications/ColossalChat/examples/training_scripts && export ="/usr/bin/supervisord" SHELL="/bin/bash" NV_LIBCUBLAS_VERSION="12.4.5.8-1" NVIDIA_VISIBLE_DEVICES="GPU-866ac0d7-8995-0dd3-9bc5-6de16452ad15" NV_NVML_DEV_VERSION="12.4.127-1" NV_CUDNN_PACKAGE_NAME="libcudnn9-cuda-12" NV_LIBNCCL_DEV_PACKAGE="libnccl-dev=2.21.5-1+cuda12.4" CONDA_EXE="/root/miniconda3/bin/conda" NV_LIBNCCL_DEV_PACKAGE_VERSION="2.21.5-1" HOSTNAME="autodl-container-493b4c87d3-99a9c3d7" NVIDIA_REQUIRE_CUDA="cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536" NV_LIBCUBLAS_DEV_PACKAGE="libcublas-dev-12-4=12.4.5.8-1" NV_NVTX_VERSION="12.4.127-1" NV_CUDA_CUDART_DEV_VERSION="12.4.127-1" NV_LIBCUSPARSE_VERSION="12.3.1.170-1" NV_LIBNPP_VERSION="12.2.5.30-1" NCCL_VERSION="2.21.5-1" PWD="/root/ColossalAI/applications/ColossalChat/examples/training_scripts" AutoDLContainerUUID="493b4c87d3-99a9c3d7" 
CONDA_PREFIX="/root/miniconda3/envs/sft" NV_CUDNN_PACKAGE="libcudnn9-cuda-12=9.1.0.70-1" NVIDIA_DRIVER_CAPABILITIES="compute,utility,graphics,video" JUPYTER_SERVER_URL="http://autodl-container-493b4c87d3-99a9c3d7:8888/jupyter/" NV_NVPROF_DEV_PACKAGE="cuda-nvprof-12-4=12.4.127-1" NV_LIBNPP_PACKAGE="libnpp-12-4=12.2.5.30-1" NV_LIBNCCL_DEV_PACKAGE_NAME="libnccl-dev" TZ="Asia/Shanghai" NV_LIBCUBLAS_DEV_VERSION="12.4.5.8-1" NVIDIA_PRODUCT_NAME="CUDA" NV_LIBCUBLAS_DEV_PACKAGE_NAME="libcublas-dev-12-4" LINES="45" NV_CUDA_CUDART_VERSION="12.4.127-1" AutoDLServiceURL="https://u502097-87d3-99a9c3d7.nmb1.seetacloud.com:8443" HOME="/root" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.webp=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga
=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:" COLUMNS="176" AutoDLRegion="nm-B1" CUDA_VERSION="12.4.1" AgentHost="172.29.52.64" NV_LIBCUBLAS_PACKAGE="libcublas-12-4=12.4.5.8-1" NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE="cuda-nsight-compute-12-4=12.4.1-1" CONDA_PROMPT_MODIFIER="(sft) " NV_LIBNPP_DEV_PACKAGE="libnpp-dev-12-4=12.2.5.30-1" NV_LIBCUBLAS_PACKAGE_NAME="libcublas-12-4" NV_LIBNPP_DEV_VERSION="12.2.5.30-1" JUPYTER_SERVER_ROOT="/root" TERM="xterm-256color" NV_LIBCUSPARSE_DEV_VERSION="12.3.1.170-1" LIBRARY_PATH="/usr/local/cuda/lib64/stubs" NV_CUDNN_VERSION="9.1.0.70-1" AutodlAutoPanelToken="jupyter-autodl-container-493b4c87d3-99a9c3d7-1f3f70c858d6c46d3975675baf8f3e103263f16190d504cfa848ca726f9077e18" CONDA_SHLVL="2" SHLVL="2" PYXTERM_DIMENSIONS="80x25" CUDA_INSTALL_DIR="/usr/local/cuda-12.4/" NV_CUDA_LIB_VERSION="12.4.1-1" NVARCH="x86_64" NV_CUDNN_PACKAGE_DEV="libcudnn9-dev-cuda-12=9.1.0.70-1" NV_CUDA_COMPAT_PACKAGE="cuda-compat-12-4" CONDA_PYTHON_EXE="/root/miniconda3/bin/python" NV_LIBNCCL_PACKAGE="libnccl2=2.21.5-1+cuda12.4" LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64" LC_CTYPE="C.UTF-8" CONDA_DEFAULT_ENV="sft" NV_CUDA_NSIGHT_COMPUTE_VERSION="12.4.1-1" REQUESTS_CA_BUNDLE="/etc/ssl/certs/ca-certificates.crt" OMP_NUM_THREADS="16" NV_NVPROF_VERSION="12.4.127-1" CUDA_HOME="/usr/local/cuda-12.4/" PATH="/root/miniconda3/envs/sft/bin:/root/miniconda3/condabin:/root/miniconda3/bin:/usr/local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NV_LIBNCCL_PACKAGE_NAME="libnccl2" NV_LIBNCCL_PACKAGE_VERSION="2.21.5-1" MKL_NUM_THREADS="16" CONDA_PREFIX_1="/root/miniconda3" DEBIAN_FRONTEND="noninteractive" OLDPWD="/root/ColossalAI" AutoDLDataCenter="neimengDC3" _="/root/miniconda3/envs/sft/bin/colossalai" CUDA_DEVICE_MAX_CONNECTIONS="1" && torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 lora_finetune.py --pretrained /root/autodl-tmp/DeepSeeK-R1-7B --dataset 
/root/converted_data.json --quant 4 --lora_rank 32 --lora_alpha 64 --batch_size 8 --gradient_accumulation 2 --max_length 1024 --lr 1.5e-4 --warmup_steps 50 --num_epochs 3 --save_dir /root/autodl-tmp/DeepSeeK_lora --grad_ckpt --dtype bf16'
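The cause is visible at the start of the `Command:` string above: the launcher serializes the whole inherited environment into a single `export` line, and that environment contains an entry with an empty name (value `/usr/bin/supervisord`, presumably injected by the container's supervisord setup; that attribution is an assumption). bash rejects an empty variable name. A minimal shell sketch that reproduces the message and scans for the offending entry:

```shell
# 1) Reproduce: bash's `export` rejects an assignment with an empty name.
bash -c 'export ="/usr/bin/supervisord"' 2>&1 || true
# fails with: export: `=/usr/bin/supervisord': not a valid identifier

# 2) Scan the inherited environment for entries whose name bash would
#    reject (valid names match [A-Za-z_][A-Za-z0-9_]*). Values containing
#    embedded newlines (such as LS_COLORS above) can produce false positives.
env | grep -E '^[^A-Za-z_]' || echo "no invalid-name entries found"
```

If the scan finds such an entry, launching from a shell that does not inherit it (for example via `env -i` with only the variables the run needs) should let the generated `export` line succeed; alternatively, running the `torchrun` command shown in the error by hand bypasses the launcher's environment serialization entirely.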

Metadata

Labels: documentation (Improvements or additions to documentation)
