CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx when using the DeepSpeed transformer kernel #294

Hi, I'm running into the following error when attempting to train with the DeepSpeed transformer kernel.

This error occurs during the forward pass of the first training step.
`!!!! kernel execution error.` is printed for 80 lines, followed by this traceback:
```
Traceback (most recent call last):
  File "RunPretrain.py", line 127, in <module>
    deepspeed_train.main(args)
  File "[REDACTED]/deepspeed_train.py", line 751, in main
    run(args, model, optimizer, start_epoch)
  File "[REDACTED]/deepspeed_train.py", line 700, in run
    train(args, index, model, optimizer)
  File "[REDACTED]/deepspeed_train.py", line 329, in train
    loss = model.network(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 689, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 1110, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 1025, in forward
    pooled_output = self.pooler(sequence_output)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 635, in forward
    pooled_output = self.dense_act(first_token_tensor)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "[REDACTED]/nvidia/modeling.py", line 207, in forward
    return bias_tanh(self.bias, F.linear(input, self.weight, None))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
```

If I disable the DeepSpeed transformer kernel, everything works fine.
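
Since CUDA reports kernel failures asynchronously, the `F.linear` call named in the traceback may not be the kernel that actually failed; the repeated `!!!! kernel execution error.` lines printed before it suggest something may have gone wrong earlier. A minimal sketch for narrowing this down, assuming a hidden size of 1024 (not stated above) and using a hypothetical helper name `fp16_linear_smoke_test`: it forces synchronous launches and runs one standalone fp16 `F.linear` of roughly the pooler's shape.

```python
import os

# Must be set before CUDA is initialised. It makes every kernel launch
# synchronous, so an error is raised at the call that caused it.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def fp16_linear_smoke_test(batch=32, hidden=1024):
    """Run one fp16 F.linear on the GPU.

    Returns None when torch or a CUDA device is unavailable, otherwise True
    if the output has the expected shape. `hidden` is an assumed value.
    """
    try:
        import torch
        import torch.nn.functional as F
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    x = torch.randn(batch, hidden, device="cuda").half()
    w = torch.randn(hidden, hidden, device="cuda").half()
    y = F.linear(x, w, None)
    torch.cuda.synchronize()  # surface any pending asynchronous CUDA error here
    return tuple(y.shape) == (batch, hidden)


if __name__ == "__main__":
    print("fp16 linear ok:", fp16_linear_smoke_test())
```

If this passes on its own, the plain fp16 GEMM path is probably healthy and the failure more likely originates in an earlier kernel. Rerunning the training script with `CUDA_LAUNCH_BLOCKING=1` set should then give a traceback that points closer to the real source.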

I'm using a slightly modified version of the provided Dockerfile:
```
FROM nvidia/cuda:10.0-devel-ubuntu18.04

##############################################################################
# Installation/Basic Utilities
##############################################################################
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    software-properties-common \
    openssh-client openssh-server \
    pdsh curl sudo net-tools \
    vim iputils-ping wget

##############################################################################
# Installation Latest Git
##############################################################################
RUN add-apt-repository ppa:git-core/ppa -y && \
    apt-get update && \
    apt-get install -y git && \
    git --version

##############################################################################
# Python
##############################################################################
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHON_VERSION=3
RUN apt-get install -y python3 python3-dev && \
    rm -f /usr/bin/python && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
        python get-pip.py && \
        rm get-pip.py && \
    pip install --upgrade pip && \
    # Print python and pip version
    python -V && pip -V

##############################################################################
# MXNet
##############################################################################
ENV MXNET_VERSION=1.5.0
RUN pip install mxnet-cu100==${MXNET_VERSION}

##############################################################################
# TensorFlow
##############################################################################
ENV TENSORFLOW_VERSION=1.15.2
RUN pip install tensorflow-gpu==${TENSORFLOW_VERSION}

##############################################################################
# PyTorch
##############################################################################
ENV PYTORCH_VERSION=1.2.0
ENV TORCHVISION_VERSION=0.4.0
ENV TENSORBOARDX_VERSION=1.8
RUN pip install torch==${PYTORCH_VERSION}
RUN pip install torchvision==${TORCHVISION_VERSION}
RUN pip install tensorboardX==${TENSORBOARDX_VERSION}

##############################################################################
# Temporary Installation Directory
##############################################################################
ENV STAGE_DIR=/tmp
RUN mkdir -p ${STAGE_DIR}

##############################################################################
# Mellanox OFED
##############################################################################
ENV MLNX_OFED_VERSION=4.6-1.0.1.1
RUN apt-get install -y libnuma-dev
RUN cd ${STAGE_DIR} && \
    wget -q -O - http://www.mellanox.com/downloads/ofed/MLNX_OFED-${MLNX_OFED_VERSION}/MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64.tgz | tar xzf - && \
    cd MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64 && \
    ./mlnxofedinstall --user-space-only --without-fw-update --all -q && \
    cd ${STAGE_DIR} && \
    rm -rf ${STAGE_DIR}/MLNX_OFED_LINUX-${MLNX_OFED_VERSION}-ubuntu18.04-x86_64*

##############################################################################
# nv_peer_mem
##############################################################################
RUN mkdir -p ${STAGE_DIR} && \
    git clone https://github.com/Mellanox/nv_peer_memory.git ${STAGE_DIR}/nv_peer_memory && \
    cd ${STAGE_DIR}/nv_peer_memory && \
    ./build_module.sh && \
    cd ${STAGE_DIR} && \
    tar xzf ${STAGE_DIR}/nvidia-peer-memory_1.0.orig.tar.gz && \
    cd ${STAGE_DIR}/nvidia-peer-memory-1.0 && \
    apt-get install -y dkms && \
    dpkg-buildpackage -us -uc && \
    dpkg -i ${STAGE_DIR}/nvidia-peer-memory_1.0-9_all.deb

##############################################################################
# Install OpenMPI
##############################################################################
RUN mkdir -p ${STAGE_DIR}/openmpi && \
    cd ${STAGE_DIR}/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.1.tar.gz && \
    tar zxf openmpi-4.0.1.tar.gz && \
    cd openmpi-4.0.1 && \
    ./configure --enable-orterun-prefix-by-default && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf ${STAGE_DIR}/openmpi

##############################################################################
# Uncomment and set SSH Daemon port
##############################################################################
RUN mkdir -p /var/run/sshd
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

ENV SSH_PORT=2222
RUN cat /etc/ssh/sshd_config > ${STAGE_DIR}/sshd_config && \
    sed "0,/^#Port 22/s//Port ${SSH_PORT}/" ${STAGE_DIR}/sshd_config > /etc/ssh/sshd_config

##############################################################################
# DeepSpeed
##############################################################################
RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed
RUN cd ${STAGE_DIR}/DeepSpeed && \
    git checkout . && \
    git checkout master && \
    ./install.sh --allow_sudo --pip_sudo
RUN rm -rf ${STAGE_DIR}/DeepSpeed
RUN python -c "import deepspeed; print(deepspeed.__version__)"

##############################################################################
# Install Additional Python Libs
##############################################################################
RUN pip install future typing
RUN pip install numpy scipy pandas h5py tqdm \
    scikit-learn pytest boto3 filelock \
    tokenizers requests regex mpi4py dill

##############################################################################
# Install Horovod
##############################################################################
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
    HOROVOD_GPU_ALLREDUCE=NCCL \
    HOROVOD_GPU_BROADCAST=NCCL \
    HOROVOD_WITH_TENSORFLOW=1 \
    HOROVOD_WITH_PYTORCH=1 \
    HOROVOD_WITH_MXNET=1          \
    pip install --no-cache-dir horovod && \
    ldconfig

##############################################################################
# Add-ons
##############################################################################
RUN pip install fastparquet
RUN pip install --no-cache-dir azureml-defaults

SHELL [ "/bin/bash", "-cu" ]
```
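
Because the image combines a CUDA 10.0 base with a pip-installed PyTorch wheel, it may help to record which versions are actually in use inside the container. The following is a small sketch (the helper name `collect_versions` is made up for illustration) that gathers them with standard PyTorch calls and degrades gracefully when torch or a GPU is absent.

```python
import platform


def collect_versions():
    """Collect Python, PyTorch, CUDA, and cuDNN versions into a dict."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # CUDA version the wheel was built with
    info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The GPU model and compute capability are worth including in a report like this one, since the failing call requests `CUBLAS_GEMM_DFALT_TENSOR_OP`.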

My DeepSpeed config looks like this:
```
{
  "train_batch_size": 16384,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Lamb",
    "params": {
      "lr": 11e-3,
      "weight_decay": 0.01,
      "bias_correction": false,
      "max_coeff": 0.3,
      "min_coeff": 0.01
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  }
}
```
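
For reference, DeepSpeed relates the batch settings as `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs`, and an fp16 `loss_scale` of 0 selects dynamic loss scaling. The sketch below works out the implied accumulation steps for this config. The GPU counts are hypothetical, since the number of GPUs is not stated above, and `gradient_accumulation_steps` is an illustrative helper, not a DeepSpeed function.

```python
import json

# Subset of the config above that matters for the batch-size arithmetic.
CONFIG = json.loads("""
{
  "train_batch_size": 16384,
  "train_micro_batch_size_per_gpu": 32,
  "fp16": {"enabled": true, "loss_scale": 0}
}
""")


def gradient_accumulation_steps(config, num_gpus):
    """Solve train_batch_size = micro_batch * accumulation_steps * num_gpus."""
    per_step = config["train_micro_batch_size_per_gpu"] * num_gpus
    steps, remainder = divmod(config["train_batch_size"], per_step)
    if remainder:
        raise ValueError("train_batch_size is not divisible by micro_batch * num_gpus")
    return steps


if __name__ == "__main__":
    for num_gpus in (8, 16, 64):  # hypothetical cluster sizes
        steps = gradient_accumulation_steps(CONFIG, num_gpus)
        print(num_gpus, "GPUs ->", steps, "accumulation steps")
```

The micro-batch of 32 is what each GPU processes per forward pass, so it is the batch dimension the kernels see in the failing step.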

This issue suggests it may be a bug in the versions of CUDA and PyTorch being used:
https://github.com/pytorch/pytorch/issues/24018

And this one indicates that it may have to do with fp16 casting:
https://github.com/NVIDIA/apex/issues/580

Any help would be appreciated! Thanks.
