Thanks for open-sourcing this great code!
I am trying to load a HuggingFace checkpoint and run the BingBertSquad example with the DeepSpeed transformer kernel.
The script:
#!/bin/bash
#1: number of GPUs
#2: Model File Address
#3: BertSquad Data Directory Address
#4: Output Directory Address
NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
LR=${5:-0.00003}
SEED=${6:-12345}
MASTER_PORT=${7:-29500}
DROPOUT=${8:-0.1}
echo "lr is ${LR}"
echo "seed is $SEED"
echo "master port is $MASTER_PORT"
echo "dropout is ${DROPOUT}"
# Force deepspeed to run with only local node
NUM_NODES=1
HOSTFILE=/dev/null
NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=12
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
    GRAD_ACCUM_STEPS=1
else
    GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
JOB_NAME="deepspeed_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
config_json=deepspeed_bsz24_config.json
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
--master_port=${MASTER_PORT} \
--hostfile ${HOSTFILE} \
nvidia_run_squad_deepspeed.py \
--bert_model ../../bert-base-uncased \
--do_train \
--do_lower_case \
--predict_batch_size 12 \
--do_predict \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size $PER_GPU_BATCH_SIZE \
--learning_rate ${LR} \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir $OUTPUT_DIR \
--job_name ${JOB_NAME} \
--gradient_accumulation_steps ${GRAD_ACCUM_STEPS} \
--deepspeed \
--deepspeed_config ${config_json} \
--dropout ${DROPOUT} \
--model_file $MODEL_FILE \
--seed ${SEED} \
--ckpt_type HF \
--origin_bert_config_file ../../bert-base-uncased/config.json \
--deepspeed_transformer_kernel \
--fp16
"
echo ${run_cmd}
eval ${run_cmd}
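For context, with 2 GPUs the batch-size arithmetic above gives PER_GPU_BATCH_SIZE = 24 / 2 = 12, which is not less than MAX_GPU_BATCH_SIZE = 12, so GRAD_ACCUM_STEPS = 12 / 12 = 1. The deepspeed_bsz24_config.json passed to --deepspeed_config is along the lines of the sketch below; this only illustrates the shape of the config (the keys are standard DeepSpeed config options), it is not an exact copy of the file from my run:
# Sketch of the DeepSpeed config referenced by the script above (illustrative values only).
cat > deepspeed_bsz24_config.json <<'EOF'
{
  "train_batch_size": 24,
  "steps_per_print": 10,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "weight_decay": 0.0
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  }
}
EOF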
I ran it in two environments:
(1) 1080 Ti with the provided Docker image
1 GPU with fp32 --> success
1 GPU with fp16 --> NaN
2 GPUs with fp32 --> error
(2) TITAN RTX, installed manually via install.sh
1 GPU with fp16 --> success
2 GPUs with fp16 --> error
The error on the RTX server is shown below (it is similar to the error on the 1080 Ti server). As far as I can tell, error 13 here is CUBLAS_STATUS_EXECUTION_FAILED:
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
!!!! kernel execution error. (m: 2304, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 384, n: 384, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 384, k: 384, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 4608, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 4608, k: 3072, error: 13)
Traceback (most recent call last):
File "nvidia_run_squad_deepspeed.py", line 1147, in <module>
main()
File "nvidia_run_squad_deepspeed.py", line 998, in main
start_positions, end_positions)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 743, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 1488, in forward
output_all_encoded_layers=False)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 937, in forward
checkpoint_activations=checkpoint_activations)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/private/yedeming/DeepSpeedExamples/BingBertSquad/turing/nvidia_modeling.py", line 572, in forward
hidden_states = layer_module(hidden_states, attention_mask)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 560, in forward
self.config)
File "/home/yedeming/.local/lib/python3.6/site-packages/deepspeed/ops/transformer/transformer.py", line 213, in forward
config.gelu_checkpoint)
RuntimeError: CUDA error: an illegal memory access was encountered
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f8d765e71e2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f8d76835f92 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f8d765d59cd in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x540ae2 (0x7f8dc216dae2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x540b86 (0x7f8dc216db86 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: /home/yedeming/.local/bin/python3() [0x54f226]
frame #6: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #7: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #8: /home/yedeming/.local/bin/python3() [0x572df1]
frame #9: /home/yedeming/.local/bin/python3() [0x54f202]
frame #10: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #11: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #12: /home/yedeming/.local/bin/python3() [0x572e67]
frame #13: /home/yedeming/.local/bin/python3() [0x54f202]
frame #14: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #15: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #16: /home/yedeming/.local/bin/python3() [0x572e67]
frame #17: /home/yedeming/.local/bin/python3() [0x54f202]
frame #18: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #19: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #20: /home/yedeming/.local/bin/python3() [0x572e67]
frame #21: /home/yedeming/.local/bin/python3() [0x54f202]
frame #22: /home/yedeming/.local/bin/python3() [0x572cd0]
frame #23: /home/yedeming/.local/bin/python3() [0x5b5abf]
frame #24: /home/yedeming/.local/bin/python3() [0x572e67]
frame #25: /home/yedeming/.local/bin/python3() [0x54f202]
frame #26: /home/yedeming/.local/bin/python3() [0x588a98]
frame #27: /home/yedeming/.local/bin/python3() [0x5ad558]
frame #28: /home/yedeming/.local/bin/python3() [0x5ad56e]
frame #29: /home/yedeming/.local/bin/python3() [0x56b636]
frame #30: PyDict_SetItemString + 0x153 (0x570da3 in /home/yedeming/.local/bin/python3)
frame #31: PyImport_Cleanup + 0x76 (0x4f2ee6 in /home/yedeming/.local/bin/python3)
frame #32: Py_FinalizeEx + 0x5e (0x637f7e in /home/yedeming/.local/bin/python3)
frame #33: Py_Main + 0x395 (0x638fe5 in /home/yedeming/.local/bin/python3)
frame #34: main + 0xe0 (0x4b0dc0 in /home/yedeming/.local/bin/python3)
frame #35: __libc_start_main + 0xe7 (0x7f8ddfb0db97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5b26fa in /home/yedeming/.local/bin/python3)
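In case it helps with debugging: the illegal memory access is asynchronous, so the Python stack trace above points at a more or less arbitrary op. Rerunning with blocking kernel launches (a standard CUDA/PyTorch environment variable, nothing DeepSpeed-specific) makes the error surface at the kernel that actually failed; assuming the script above is saved as run_squad_deepspeed.sh:
# Rerun the failing 2-GPU case with synchronous kernel launches so the
# illegal-memory-access error is reported at the exact failing kernel.
CUDA_LAUNCH_BLOCKING=1 bash run_squad_deepspeed.sh 2 $MODEL_FILE $SQUAD_DIR $OUTPUT_DIR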
Looking forward to your reply!
Best wishes,
Deming Ye