
what(): CUDA error: an illegal memory access was encountered #592

Open

Description

@zerlinkcn

```
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b1fd134d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1b1fcdd36b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1b1fdafb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1c36b (0x7f1b1fd8036b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2b930 (0x7f1b1fd8f930 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d56d6 (0x7f1b867306d6 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3ee77 (0x7f1b1fcf8e77 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const*, c10::TensorImpl*, bool) + 0x41 (0x7f1b1fcf3391 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const*, c10::TensorImpl*, c10::VariableVersion const&, bool) + 0x14 (0x7f1b1fcf3404 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
```
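One common trigger for this exact error is an input token id that falls outside the embedding table, which makes the GPU gather read out of bounds. A quick CPU-side sanity check (a sketch, not DeepSpeed-Chat code; the 250880 vocab size for bigscience/bloomz-560m is assumed here for illustration) looks like:

```python
# Hypothetical sanity check, not part of the repo: any id >= the embedding's
# vocab size makes the GPU lookup read out of bounds, which typically surfaces
# as "CUDA error: an illegal memory access was encountered".
def check_token_ids(input_ids, vocab_size):
    """Return True iff every token id is a valid row of the embedding matrix."""
    return all(0 <= tok < vocab_size for row in input_ids for tok in row)

# 250880 is the Bloom-family vocab size (assumed here for illustration)
assert check_token_ids([[1, 5, 250679]], 250880)
assert not check_token_ids([[250880]], 250880)  # out-of-range id -> GPU crash
```

Running this on a batch before `.to("cuda")` pinpoints bad ids with a readable Python error instead of an asynchronous CUDA abort.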

Step 1:

```bash
deepspeed main.py \
   --data_path bote/gpt_part_data \
   --data_split 2,4,4 \
   --model_name_or_path FreedomIntelligence/phoenix-inst-chat-7b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
```
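The command references `$ZERO_STAGE` and `$OUTPUT` without showing how they are set; a minimal assumed setup (the values below are illustrative, not from the report) would be:

```shell
# Illustrative values only -- the issue does not show how these were set.
ZERO_STAGE=2                 # DeepSpeed ZeRO optimization stage
OUTPUT=./output_step1        # checkpoint/log directory
mkdir -p "$OUTPUT"           # tee fails if the directory does not exist
echo "ZERO_STAGE=$ZERO_STAGE OUTPUT=$OUTPUT"
```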

Step 2:

```bash
deepspeed main.py \
   --data_path bote/whoareyou \
   --data_split 2,4,4 \
   --model_name_or_path bigscience/bloomz-560m \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 5e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 1 \
   --disable_dropout \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
```

Step 3:

```bash
deepspeed --master_port 12346 main.py \
   --data_path bote/gpt_part_data \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
```
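One quick consistency check on the step-3 flags (a sketch, not repo code): the prompt and answer budgets should fit within the `--max_seq_len` the step-1/step-2 models were trained with, and here 256 + 256 = 512 matches exactly.

```python
# Values taken from the commands above; the check itself is illustrative.
max_prompt_seq_len = 256   # --max_prompt_seq_len (step 3)
max_answer_seq_len = 256   # --max_answer_seq_len (step 3)
step12_max_seq_len = 512   # --max_seq_len (steps 1 and 2)

total = max_prompt_seq_len + max_answer_seq_len
assert total <= step12_max_seq_len, "step-3 sequences exceed the trained length"
print(total)  # -> 512
```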

GPU: 8 × A40 (48 GB)
