Description
I moved the model saving code below into the epoch loop, so that a checkpoint is saved after every epoch:
if args.output_dir is not None:
    print_rank_0('saving the model ...', args.global_rank)
    model_out = convert_lora_to_linear_layer(model)
    if args.global_rank == 0:
        save_hf_format(model_out, tokenizer, args)
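For context, the save block sits at the end of each epoch, roughly like this (a simplified sketch; the loop body is paraphrased from my main.py, and names like `train_dataloader` and `batch` are placeholders rather than verbatim code):

```python
for epoch in range(args.num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch, use_cache=False)
        loss = outputs.loss
        model.backward(loss)  # this is the call that fails at the start of epoch 2
        model.step()

    # model saving moved inside the epoch loop
    if args.output_dir is not None:
        print_rank_0('saving the model ...', args.global_rank)
        model_out = convert_lora_to_linear_layer(model)
        if args.global_rank == 0:
            save_hf_format(model_out, tokenizer, args)
```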
The model is saved successfully after the first epoch, but as soon as the second epoch of training starts, the backward pass fails with the following traceback:
File "main.py", line 353, in
main()
File "main.py", line 320, in main
model.backward(loss)
File "/home/cuda11py3.9/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/cuda11py3.9/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1827, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/cuda11py3.9/.local/lib/python3.8/site-packages/deepspeed/runtime/fp16/fused_optimizer.py", line 353, in backward
scaled_loss.backward(create_graph=create_graph, retain_graph=retain_graph)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
How can I solve this?
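My current guess is that `convert_lora_to_linear_layer` fuses the LoRA weights into the frozen base `Linear` weights, so after the epoch-1 save the forward pass only touches parameters with `requires_grad=False` and the loss ends up with no `grad_fn`. If that is right, a possible workaround would be to unfuse the LoRA weights again after saving. Below is a minimal sketch, assuming the `LinearLayer_LoRA` class in `utils/module/lora.py` exposes `unfuse_lora_weight()` as the counterpart of the `fuse_lora_weight()` used during conversion; `unfuse_lora` itself is a hypothetical helper, not a function from the repo:

```python
from utils.module.lora import LinearLayer_LoRA  # DeepSpeed-Chat LoRA module

def unfuse_lora(model):
    # Hypothetical inverse of convert_lora_to_linear_layer: subtract the fused
    # LoRA contribution from the frozen base weights so that the trainable
    # LoRA path is taken again in forward() during the next epoch.
    for module in model.modules():
        if isinstance(module, LinearLayer_LoRA):
            # unfuse_lora_weight() is assumed to exist as the counterpart of
            # fuse_lora_weight(); under ZeRO stage 3 the weight would also
            # need to be gathered first (deepspeed.zero.GatheredParameters),
            # as convert_lora_to_linear_layer itself does.
            module.unfuse_lora_weight()
    return model
```

That is, I would call `model = unfuse_lora(model)` right after `save_hf_format(...)` inside the loop. Is this the intended way to save per-epoch checkpoints with LoRA enabled, or is there an official helper for this?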