CUDA Error when run with multiple GPUs #454

Closed
@YeDeming

Thanks for open-sourcing this great code!

I am trying to load a HuggingFace checkpoint and run the BingBertSquad example with the DeepSpeed transformer kernel.

The script:

#!/bin/bash

#1: number of GPUs
#2: Model File Address
#3: BertSquad Data Directory Address
#4: Output Directory Address

NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
LR=${5:-0.00003}
SEED=${6:-12345}
MASTER_PORT=${7:-29500}
DROPOUT=${8:-0.1}
echo "lr is ${LR}"
echo "seed is $SEED"
echo "master port is $MASTER_PORT"
echo "dropout is ${DROPOUT}"

# Force deepspeed to run with only local node
NUM_NODES=1
HOSTFILE=/dev/null

NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=12
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
       GRAD_ACCUM_STEPS=1
else
       GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
JOB_NAME="deepspeed_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
config_json=deepspeed_bsz24_config.json
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
       --master_port=${MASTER_PORT} \
       --hostfile ${HOSTFILE} \
       nvidia_run_squad_deepspeed.py \
       --bert_model ../../bert-base-uncased \
       --do_train \
       --do_lower_case \
       --predict_batch_size 12 \
       --do_predict \
       --train_file $SQUAD_DIR/train-v1.1.json \
       --predict_file $SQUAD_DIR/dev-v1.1.json \
       --train_batch_size $PER_GPU_BATCH_SIZE \
       --learning_rate ${LR} \
       --num_train_epochs 2.0 \
       --max_seq_length 384 \
       --doc_stride 128 \
       --output_dir $OUTPUT_DIR \
       --job_name ${JOB_NAME} \
       --gradient_accumulation_steps ${GRAD_ACCUM_STEPS} \
       --deepspeed \
       --deepspeed_config ${config_json} \
       --dropout ${DROPOUT} \
       --model_file $MODEL_FILE \
       --seed ${SEED} \
       --ckpt_type HF \
       --origin_bert_config_file ../../bert-base-uncased/config.json \
       --deepspeed_transformer_kernel \
       --fp16
       "

echo ${run_cmd}
eval ${run_cmd}
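
For reference, the batch-size arithmetic in the script can be checked on its own. The sketch below repeats the same integer math for a given GPU count; the function name `batch_plan` is hypothetical and is not part of the example script.

```shell
#!/bin/bash
# Hypothetical helper: mirrors the batch-size math of the script above.
# The body runs in a subshell, so its variables do not leak to the caller.
batch_plan() (
    ngpu=$1
    effective=24     # EFFECTIVE_BATCH_SIZE in the script
    max_per_gpu=12   # MAX_GPU_BATCH_SIZE in the script
    per_gpu=$((effective / ngpu))
    if [ "$per_gpu" -lt "$max_per_gpu" ]; then
        accum=1
    else
        accum=$((per_gpu / max_per_gpu))
    fi
    echo "per_gpu=${per_gpu} grad_accum=${accum}"
)

batch_plan 1   # prints: per_gpu=24 grad_accum=2
batch_plan 2   # prints: per_gpu=12 grad_accum=1
```

With 2 GPUs, each rank gets 12 sequences of length 384, i.e. 12 × 384 = 4608 tokens, which appears to match the `n: 4608` dimension in the kernel error messages.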

I ran it in two environments:

(1) 1080ti with the provided docker image
1 GPU with fp32 --> success
1 GPU with fp16 --> NaN
2 GPUs with fp32 --> error
(2) TITAN RTX, installed manually with install.sh
1 GPU with fp16 --> success
2 GPUs with fp16 --> error

The error on the RTX server is shown below (it is similar to the error on the 1080ti server):

!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)                                    
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)                                   
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
Traceback (most recent call last):
  File "nvidia_run_squad_deepspeed.py", line 1147, in <module>
    main()
  File "nvidia_run_squad_deepspeed.py", line 998, in main
    start_positions, end_positions)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 743, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1488, in forward
    output_all_encoded_layers=False)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 937, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 572, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 560, in forward
    self.config)
  File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 213, in forward
    config.gelu_checkpoint)
RuntimeError: CUDA error: an illegal memory access was encountered
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f8d765e71e2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f8d76835f92 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f8d765d59cd in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x540ae2 (0x7f8dc216dae2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x540b86 (0x7f8dc216db86 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: /home/yedeming/.local/bin/python3() [0x54f226]
frame #6: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #7: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #8: /home/yedeming/.local/bin/python3() [0x572df1]
frame #9: /home/yedeming/.local/bin/python3() [0x54f202]
frame #10: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #11: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #12: /home/yedeming/.local/bin/python3() [0x572e67]
frame #13: /home/yedeming/.local/bin/python3() [0x54f202]
frame #14: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #15: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #16: /home/yedeming/.local/bin/python3() [0x572e67]
frame #17: /home/yedeming/.local/bin/python3() [0x54f202]
frame #18: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #19: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #20: /home/yedeming/.local/bin/python3() [0x572e67]
frame #21: /home/yedeming/.local/bin/python3() [0x54f202]
frame #22: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #23: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #24: /home/yedeming/.local/bin/python3() [0x572e67]
frame #25: /home/yedeming/.local/bin/python3() [0x54f202]
frame #26: /home/yedeming/.local/bin/python3() [0x588a98]
frame #27: /home/yedeming/.local/bin/python3() [0x5ad558]
frame #28: /home/yedeming/.local/bin/python3() [0x5ad56e]
frame #29: /home/yedeming/.local/bin/python3() [0x56b636]
frame #30: PyDict_SetItemString + 0x153 (0x570da3 in /home/yedeming/.local/bin/python3)
frame #31: PyImport_Cleanup + 0x76 (0x4f2ee6 in /home/yedeming/.local/bin/python3)
frame #32: Py_FinalizeEx + 0x5e (0x637f7e in /home/yedeming/.local/bin/python3)
frame #33: Py_Main + 0x395 (0x638fe5 in /home/yedeming/.local/bin/python3)
frame #34: main + 0xe0 (0x4b0dc0 in /home/yedeming/.local/bin/python3)
frame #35: __libc_start_main + 0xe7 (0x7f8ddfb0db97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5b26fa in /home/yedeming/.local/bin/python3)
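
The `error: 13` in the kernel messages may be easier to read as a name. Assuming the number is a `cublasStatus_t` value returned by the GEMM call (an assumption; the log itself does not say so), the sketch below maps it to its enum name. The helper `cublas_status_name` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: map a numeric cublasStatus_t value to its enum name.
# Assumes the "error: N" field in the log is a cuBLAS status code.
cublas_status_name() {
    case "$1" in
        0)  echo CUBLAS_STATUS_SUCCESS ;;
        1)  echo CUBLAS_STATUS_NOT_INITIALIZED ;;
        3)  echo CUBLAS_STATUS_ALLOC_FAILED ;;
        7)  echo CUBLAS_STATUS_INVALID_VALUE ;;
        8)  echo CUBLAS_STATUS_ARCH_MISMATCH ;;
        11) echo CUBLAS_STATUS_MAPPING_ERROR ;;
        13) echo CUBLAS_STATUS_EXECUTION_FAILED ;;
        14) echo CUBLAS_STATUS_INTERNAL_ERROR ;;
        15) echo CUBLAS_STATUS_NOT_SUPPORTED ;;
        16) echo CUBLAS_STATUS_LICENSE_ERROR ;;
        *)  echo UNKNOWN ;;
    esac
}

cublas_status_name 13
```

Under that assumption, 13 is an execution failure on the GPU, which would be consistent with the illegal memory access in the traceback. Because CUDA reports such errors asynchronously, rerunning with `CUDA_LAUNCH_BLOCKING=1` set in the environment makes kernel launches synchronous and may help attribute the fault to the first failing call.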

Looking forward to your reply!

Best wishes,
Deming Ye
