Closed
Hi, I'm running into the following error when attempting to train with the DeepSpeed transformer kernel.
The error occurs during the forward pass of the first training step. The line
!!!! kernel execution error.
is printed 80 times, followed by this traceback:
Traceback (most recent call last):
  File "RunPretrain.py", line 127, in <module>
    deepspeed_train.main(args)
  File "[REDACTED]/deepspeed_train.py", line 751, in main
    run(args, model, optimizer, start_epoch)
  File "[REDACTED]/deepspeed_train.py", line 700, in run
    train(args, index, model, optimizer)
  File "[REDACTED]/deepspeed_train.py", line 329, in train
    loss = model.network(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 689, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 1110, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 1025, in forward
    pooled_output = self.pooler(sequence_output)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 635, in forward
    pooled_output = self.dense_act(first_token_tensor)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 207, in forward
    return bias_tanh(self.bias, F.linear(input, self.weight, None))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
If I disable the DeepSpeed transformer kernel, everything works fine.
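A note on reading the traceback: CUDA kernels are launched asynchronously, so the Python frame that raises (here `F.linear` in the pooler) is not necessarily the operation that failed. Setting `CUDA_LAUNCH_BLOCKING=1` forces synchronous launches, which usually makes the traceback point at the offending call. The following is a minimal sketch of re-running the script that way. The helper names are hypothetical, and the report does not show how the job is launched, so the arguments are placeholders.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks until it finishes,
    # so an error surfaces at the call that caused it instead of a later op.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_synchronously(script_args):
    """Run a Python script (placeholder arguments) with blocking launches.

    Returns the exit code of the child process.
    """
    command = [sys.executable, *script_args]
    return subprocess.run(command, env=blocking_env()).returncode
```

For example, `rerun_synchronously(["RunPretrain.py"])` plus whatever arguments the real run uses. Expect the run to be noticeably slower, so this is only useful for locating the failing kernel.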
I'm using a slightly modified version of the provided Dockerfile:
FROM nvidia/cuda:10.0-devel-ubuntu18.04
##############################################################################
# Installation/Basic Utilities
##############################################################################
RUN apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
openssh-client openssh-server \
pdsh curl sudo net-tools \
vim iputils-ping wget
##############################################################################
# Installation Latest Git
##############################################################################
RUN add-apt-repository ppa:git-core/ppa -y && \
apt-get update && \
apt-get install -y git && \
git --version
##############################################################################
# Python
##############################################################################
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHON_VERSION=3
RUN apt-get install -y python3 python3-dev && \
rm -f /usr/bin/python && \
ln -s /usr/bin/python3 /usr/bin/python && \
curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py && \
pip install --upgrade pip && \
# Print python and pip version
python -V && pip -V
##############################################################################
# MXNet
##############################################################################
ENV MXNET_VERSION=1.5.0
RUN pip install mxnet-cu100==${MXNET_VERSION}
##############################################################################
# TensorFlow
##############################################################################
ENV TENSORFLOW_VERSION=1.15.2
RUN pip install tensorflow-gpu==${TENSORFLOW_VERSION}
##############################################################################
# PyTorch
##############################################################################
ENV PYTORCH_VERSION=1.2.0
ENV TORCHVISION_VERSION=0.4.0
ENV TENSORBOARDX_VERSION=1.8
RUN pip install torch==${PYTORCH_VERSION}
RUN pip install torchvision==${TORCHVISION_VERSION}
RUN pip install tensorboardX==${TENSORBOARDX_VERSION}
##############################################################################
# Temporary Installation Directory
##############################################################################
ENV STAGE_DIR=/tmp
RUN mkdir -p ${STAGE_DIR}
##############################################################################
# Mellanox OFED
##############################################################################
ENV MLNX_OFED_VERSION=4.6-1.0.1.1
RUN apt-get install -y libnuma-dev
RUN cd ${STAGE_DIR} && \
wget -q -O - http://www.mellanox.com/downloads/ofed/MLNX_OFED-${MLNX_OFED_VERSION}/MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64.tgz | tar xzf - && \
cd MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64 && \
./mlnxofedinstall --user-space-only --without-fw-update --all -q && \
cd ${STAGE_DIR} && \
rm -rf ${STAGE_DIR}/MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64*
##############################################################################
# nv_peer_mem
##############################################################################
RUN mkdir -p ${STAGE_DIR} && \
git clone https://github.com/Mellanox/nv_peer_memory.git ${STAGE_DIR}/nv_peer_memory && \
cd ${STAGE_DIR}/nv_peer_memory && \
./build_module.sh && \
cd ${STAGE_DIR} && \
tar xzf ${STAGE_DIR}/nvidia-peer-memory_1.0.orig.tar.gz && \
cd ${STAGE_DIR}/nvidia-peer-memory-1.0 && \
apt-get install -y dkms && \
dpkg-buildpackage -us -uc && \
dpkg -i ${STAGE_DIR}/nvidia-peer-memory_1.0-9_all.deb
##############################################################################
# Install OpenMPI
##############################################################################
RUN mkdir -p ${STAGE_DIR}/openmpi && \
cd ${STAGE_DIR}/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.1.tar.gz && \
tar zxf openmpi-4.0.1.tar.gz && \
cd openmpi-4.0.1 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf ${STAGE_DIR}/openmpi
##############################################################################
# Uncomment and set SSH Daemon port
##############################################################################
RUN mkdir -p /var/run/sshd
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
ENV SSH_PORT=2222
RUN cat /etc/ssh/sshd_config > ${STAGE_DIR}/sshd_config && \
sed "0,/^#Port 22/s//Port ${SSH_PORT}/" ${STAGE_DIR}/sshd_config > /etc/ssh/sshd_config
##############################################################################
# DeepSpeed
##############################################################################
RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed
RUN cd ${STAGE_DIR}/DeepSpeed && \
git checkout . && \
git checkout master && \
./install.sh --allow_sudo --pip_sudo
RUN rm -rf ${STAGE_DIR}/DeepSpeed
RUN python -c "import deepspeed; print(deepspeed.__version__)"
##############################################################################
# Install Additional Python Libs
##############################################################################
RUN pip install future typing
RUN pip install numpy scipy pandas h5py tqdm \
scikit-learn pytest boto3 filelock \
tokenizers requests regex mpi4py dill
##############################################################################
# Install Horovod
##############################################################################
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_ALLREDUCE=NCCL \
HOROVOD_GPU_BROADCAST=NCCL \
HOROVOD_WITH_TENSORFLOW=1 \
HOROVOD_WITH_PYTORCH=1 \
HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig
##############################################################################
# Add-ons
##############################################################################
RUN pip install fastparquet
RUN pip install --no-cache-dir azureml-defaults
SHELL [ "/bin/bash", "-cu" ]
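Since the image pins `torch==1.2.0` on a CUDA 10.0 base, it may help to confirm what is actually installed and visible inside the built container. The sketch below is an illustration rather than part of the original setup. It uses only standard `torch` attributes and degrades gracefully when `torch` is not importable. The function name `version_report` is made up for this example.

```python
import platform


def version_report():
    """Collect version details that are relevant to a cuBLAS failure."""
    report = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # torch is missing in this environment; report that and stop here.
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    # CUDA toolkit version the installed torch binary was built against.
    report["torch_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of GPU 0.
        report["capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in version_report().items():
        print(f"{key}: {value}")
```

Running this inside the container and attaching the output to the report makes it easier to compare against the version combinations discussed in the linked issues.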
My DeepSpeed config looks like this:
{
  "train_batch_size": 16384,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Lamb",
    "params": {
      "lr": 11e-3,
      "weight_decay": 0.01,
      "bias_correction": false,
      "max_coeff": 0.3,
      "min_coeff": 0.01
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  }
}
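For readers checking this config: DeepSpeed requires `train_batch_size` to equal `train_micro_batch_size_per_gpu` times the gradient accumulation steps times the number of GPUs. The GPU count is not stated in this report, so the sketch below uses hypothetical world sizes purely to show the arithmetic. The function name is also made up for illustration.

```python
import json

# Only the two batch-size fields from the config above are needed here.
CONFIG_TEXT = """
{
  "train_batch_size": 16384,
  "train_micro_batch_size_per_gpu": 32
}
"""


def gradient_accumulation_steps(config, world_size):
    """Solve train_batch_size = micro_batch * accumulation_steps * world_size."""
    total = config["train_batch_size"]
    per_step = config["train_micro_batch_size_per_gpu"] * world_size
    if total % per_step != 0:
        raise ValueError(
            f"train_batch_size {total} is not divisible by "
            f"micro batch x world size = {per_step}"
        )
    return total // per_step


config = json.loads(CONFIG_TEXT)
for world_size in (8, 64):  # hypothetical GPU counts, not from the report
    steps = gradient_accumulation_steps(config, world_size)
    print(f"{world_size} GPUs -> {steps} accumulation steps")
```

With 8 GPUs this gives 64 accumulation steps, and with 64 GPUs it gives 8.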
This issue suggests it may be a bug in the CUDA and PyTorch versions being used:
pytorch/pytorch#24018
And this one indicates that it may have to do with fp16 casting:
NVIDIA/apex#580
Any help would be appreciated! Thanks.