How to save the model after each epoch

I moved the model-saving code below into the epoch loop so that the model is saved at the end of every epoch:


    if args.output_dir is not None:
        print_rank_0('saving the model ...', args.global_rank)
        model_out = convert_lora_to_linear_layer(model)

        if args.global_rank == 0:
            save_hf_format(model_out, tokenizer, args)


Saving after the first epoch succeeds, but as soon as the second epoch of training starts, it fails with the following error:

    File "main.py", line 353, in <module>
      main()
    File "main.py", line 320, in main
      model.backward(loss)
    File "/home/cuda11py3.9/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
      ret_val = func(*args, **kwargs)
    File "/home/cuda11py3.9/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1827, in backward
      self.optimizer.backward(loss, retain_graph=retain_graph)
    File "/home/cuda11py3.9/.local/lib/python3.8/site-packages/deepspeed/runtime/fp16/fused_optimizer.py", line 353, in backward
      scaled_loss.backward(create_graph=create_graph, retain_graph=retain_graph)
    File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 363, in backward
      torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 173, in backward
      Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
    RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

How can I save the model after each epoch without breaking the next epoch's backward pass?
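One possible explanation, sketched below under assumptions: if `convert_lora_to_linear_layer` fuses the LoRA weights into the frozen base weights in place and leaves the layers on the fused forward path, then the next forward pass no longer touches any trainable parameter, and `backward` fails with exactly this `RuntimeError`. The toy layer here (`ToyLoRALinear`, with its `fuse` and `unfuse` methods) is hypothetical and is not the repository's implementation. It only reproduces the symptom and shows that undoing the fusion after saving restores training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyLoRALinear(nn.Module):
    """Hypothetical stand-in for a LoRA-wrapped linear layer (not the repo's class)."""

    def __init__(self, in_dim, out_dim, rank=2):
        super().__init__()
        # Base weight is frozen; only the two low-rank factors are trainable.
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
        self.lora_right = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.lora_left = nn.Parameter(torch.zeros(rank, out_dim))
        self.fused = False

    def fuse(self):
        # Fold the low-rank update into the base weight for export.
        with torch.no_grad():
            self.weight += (self.lora_right @ self.lora_left).t()
        self.fused = True

    def unfuse(self):
        # Undo the fold so the trainable LoRA path is used again.
        with torch.no_grad():
            self.weight -= (self.lora_right @ self.lora_left).t()
        self.fused = False

    def forward(self, x):
        if self.fused:
            # Only the frozen weight is involved, so the output has no grad_fn.
            return F.linear(x, self.weight)
        return F.linear(x, self.weight) + x @ self.lora_right @ self.lora_left


torch.manual_seed(0)
layer = ToyLoRALinear(4, 3)
x = torch.randn(5, 4)

# "Epoch 1": the LoRA path is active, so backward works.
layer(x).sum().backward()

# "Save" step: fuse for export and (mistakenly) leave the layer fused.
layer.fuse()
fused_loss = layer(x).sum()
try:
    fused_loss.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)

# Undo the fusion after saving; the next training step works again.
layer.unfuse()
unfused_loss = layer(x).sum()
unfused_loss.backward()
```

If this assumption holds for the real helper, the corresponding workaround would be to restore the unfused LoRA state on the training model right after `save_hf_format` returns, or to run the conversion on a separate copy of the model so the training model is never modified.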




How to save the model after each epoch #510
