Description & Motivation
I have been a long-time Lightning user, but the DeepSpeed integration has made it unusable for me, and since DeepSpeed is used for all of my model training this is a big problem.
I propose simply porting over the HF Trainer DeepSpeed checkpoint-saving logic (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2352), as I have found that the Lightning logic doesn't work.
The HF Trainer logic is more battle-tested and works for every ZeRO stage with various model sizes, whereas in my experience the Lightning logic quite often fails, making the whole training run useless. HF Trainer also always saves a pytorch_model.bin with the checkpoint, plus a global_step folder containing the DeepSpeed optimizer states. This makes a lot more sense: you don't have to faff about converting optimizer states if you just want to use the PyTorch model, which is often around 10% of the checkpoint size anyway, so saving it each time is negligible. A sketch of what I mean is below.
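To illustrate, here is a minimal sketch of the kind of saving behaviour I mean, assuming direct access to the underlying `deepspeed.DeepSpeedEngine` (the wrapper function itself is hypothetical; how it would hook into Lightning's checkpoint IO is exactly what would need designing):

```python
import os


def save_deepspeed_checkpoint(engine, output_dir: str) -> None:
    """Sketch: save both a consolidated model file and the full DeepSpeed
    checkpoint, similar to what HF Trainer produces.

    `engine` is assumed to be a deepspeed.DeepSpeedEngine.
    """
    os.makedirs(output_dir, exist_ok=True)

    # Consolidated 16-bit weights -> pytorch_model.bin, usable without DeepSpeed.
    # For ZeRO stage 3 this requires stage3_gather_16bit_weights_on_model_save
    # to be enabled in the DeepSpeed config.
    engine.save_16bit_model(output_dir, "pytorch_model.bin")

    # Full DeepSpeed checkpoint (sharded optimizer / ZeRO partitions),
    # written under a global_step* tag directory inside output_dir.
    engine.save_checkpoint(output_dir)
```

With this layout, resuming training uses the global_step folder, while inference or export only needs the self-contained pytorch_model.bin.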
I would also like to be able to define the optimizer and scheduler in the DeepSpeed config without breaking the Lightning logic; there should be a default that invalidates `configure_optimizers` when these are defined in the config. Most people training models like to use their own DS config with the optimizer and scheduler defined, and don't want to have to faff about with `configure_optimizers` when it can be handled by DeepSpeed. A sketch of the usage I have in mind follows.
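For example, something like the following ought to work without also defining `configure_optimizers` on the LightningModule (the config values here are purely illustrative; whether Lightning currently respects the in-config optimizer/scheduler is the point of this request):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DeepSpeedStrategy

# Illustrative DeepSpeed config with the optimizer and scheduler defined in-config,
# so DeepSpeed (not configure_optimizers) is responsible for building them.
ds_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2e-5, "weight_decay": 0.01},
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {"warmup_num_steps": 500, "total_num_steps": 10000},
    },
    "train_micro_batch_size_per_gpu": 8,
}

trainer = Trainer(
    accelerator="gpu",
    devices=4,
    precision=16,
    strategy=DeepSpeedStrategy(config=ds_config),
)
```

The ask is that when `optimizer`/`scheduler` keys are present in the config passed to `DeepSpeedStrategy`, Lightning defers to DeepSpeed for them by default rather than requiring (or conflicting with) `configure_optimizers`.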
Pitch
No response
Alternatives
No response
Additional context
No response
cc @Borda @awaelchli