--hf_deepspeed_save flag to use Hugging Face DeepSpeed saving logic, and skip configure_optimizers if optimizer/scheduler are defined in the config #17673

Open
@jamesharrisivi

Description & Motivation

I have been a long-time Lightning user, but the DeepSpeed integration has made it unusable for me, and since we use DeepSpeed for all of our model training, this is a major problem.

I propose porting over the HF Trainer DeepSpeed saving logic (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2352) for checkpoints, because I have found that the Lightning logic doesn't work.

The HF Trainer logic is more battle-tested and works for every ZeRO stage with various model sizes. When I use Lightning, checkpoint saving quite often fails, which makes the whole training run useless. HF Trainer also always saves a pytorch_model.bin alongside the checkpoint, plus a global_step folder containing the DeepSpeed optimizer states. This makes a lot more sense: you don't have to faff about converting optimizer states when you just want the PyTorch model, which is often around 10% of the checkpoint size anyway, so saving it every time is negligible.
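
For context on the current friction: to get a plain pytorch_model.bin out of a Lightning DeepSpeed checkpoint today, you have to run an extra offline conversion step. Below is a minimal sketch of that step, assuming a recent Lightning release where the DeepSpeed zero_to_fp32 helper is re-exported under `lightning.pytorch.utilities.deepspeed`; the checkpoint path is a placeholder.

```python
# Minimal sketch of the conversion step Lightning currently requires
# (helper re-exported from DeepSpeed's zero_to_fp32 utilities; the
# checkpoint path below is a placeholder).
from lightning.pytorch.utilities.deepspeed import (
    convert_zero_checkpoint_to_fp32_state_dict,
)

# "last.ckpt" is a directory when saved by the DeepSpeed strategy:
# it contains the sharded ZeRO model/optimizer states per rank.
checkpoint_dir = "lightning_logs/version_0/checkpoints/last.ckpt"

# Consolidate the shards into a single fp32 state dict on disk.
convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir,
    "pytorch_model.bin",
)
```

With the HF Trainer-style behaviour, this file would already sit next to the global_step folder and no conversion would be needed.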

I would also like to be able to define the optimizer and scheduler in the DeepSpeed config without breaking the Lightning logic; there should be a default that invalidates configure_optimizers when these are defined in the config. Most people training models like to use their own DeepSpeed config with the optimizer and scheduler defined, and don't want to faff about with configure_optimizers when DeepSpeed can handle it. A sketch of the intended usage follows.
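
To make the request concrete, here is a rough sketch of the intended usage. The model, hyperparameters, and device settings are illustrative only, and skipping configure_optimizers when the DeepSpeed config owns the optimizer/scheduler is the proposed behaviour, not what Lightning does today.

```python
# Sketch of the proposed behaviour: the optimizer and scheduler are owned
# by the DeepSpeed config, and Lightning would skip configure_optimizers().
import torch
import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

ds_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-4, "weight_decay": 0.01},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": 0.0, "warmup_max_lr": 3e-4, "warmup_num_steps": 1000},
    },
    "train_micro_batch_size_per_gpu": 8,
}


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    # No configure_optimizers(): under this proposal, Lightning would take
    # the optimizer and scheduler from ds_config instead of requiring one here.


trainer = pl.Trainer(
    strategy=DeepSpeedStrategy(config=ds_config),
    accelerator="gpu",
    devices=8,
    precision="16-mixed",
)
```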

Pitch

No response

Alternatives

No response

Additional context

No response

cc @Borda @awaelchli
