Description
Describe the bug
The training script crashes with a CUDA "unknown error" during the perplexity evaluation at epoch 0/1.
Log output
***** Evaluating perplexity, Epoch 0/1 *****
Traceback (most recent call last):
File "main.py", line 345, in <module>
main()
File "main.py", line 306, in main
perplexity = evaluation(model, eval_dataloader)
File "main.py", line 257, in evaluation
outputs = model(**batch)
File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1675, in forward
loss = self.module(*inputs, **kwargs)
File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 950, in forward
logits = self.lm_head(outputs[0]).contiguous()
File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sh0an/anaconda3/envs/Chat/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2023-04-28 09:46:00,441] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2469
[2023-04-28 09:46:00,442] [ERROR] [launch.py:434:sigkill_handler] ['/home/sh0an/anaconda3/envs/Chat/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
To Reproduce
Execute script:
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
PyTorch: 2.0
CUDA: 11.8
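The error message in the log suggests setting CUDA_LAUNCH_BLOCKING=1 so that CUDA errors are reported at the call that caused them. Below is a minimal sketch of rerunning the same command with that variable set. The helper names build_debug_env and run_repro are hypothetical, and it assumes the environment variable is inherited by the subprocesses that train.py launches, which I have not verified for this setup.

```python
import os
import subprocess
import sys

# Same arguments as the reproduction command above.
REPRO_CMD = [
    sys.executable, "train.py",
    "--actor-model", "facebook/opt-1.3b",
    "--reward-model", "facebook/opt-350m",
    "--deployment-type", "single_gpu",
]


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    Hypothetical helper: copies the given mapping (or os.environ) so the
    caller's environment is left unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Makes kernel launches synchronous so the traceback should point
    # at the failing call, as the PyTorch error message recommends.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_repro():
    """Rerun the reproduction command with the debug environment.

    Assumption: child processes started by train.py inherit this environment.
    """
    result = subprocess.run(REPRO_CMD, env=build_debug_env())
    return result.returncode
```

Calling run_repro() from the DeepSpeed-Chat directory should reproduce the failure, ideally with a stack trace that points at the actual failing kernel rather than a later API call.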
System info (please complete the following information):
OS: Linux version 5.10.16.3-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Apr 2 22:23:49 UTC 2021
GPU: 1x RTX 4070 Ti (12 GB)
Python version: 3.8