📚 Documentation
In your documentation, you refer to the function `deepspeed.checkpointing.checkpoint`, but it looks like it does not exist (anymore?). Can you update that section?
While we're at it, could you provide a more common use case as an example? The guide warns against wrapping an entire model, but using a pretrained language model from `transformers` is probably the most common use case. What if someone is just using GPT-2 or T5 with no additional custom layers on top? What should get wrapped then? A sketch of what I mean is below.
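For concreteness, here is a rough sketch of the kind of example I'd hope the docs could show. I'm using plain `torch.utils.checkpoint.checkpoint` as a stand-in (since I'm unsure what the current DeepSpeed equivalent is), and I'm assuming the internal attribute names of `GPT2Model` from `transformers` (`wte`, `wpe`, `drop`, `h`, `ln_f`) — so treat this as an illustration, not a verified recipe:

```python
import torch
from torch.utils.checkpoint import checkpoint
from transformers import GPT2Model


class CheckpointedGPT2(torch.nn.Module):
    """Sketch: checkpoint each transformer block rather than the whole model."""

    def __init__(self):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")

    def forward(self, input_ids):
        # Embed tokens + positions (mirrors what GPT2Model's own forward does).
        positions = torch.arange(input_ids.size(-1), device=input_ids.device)
        h = self.gpt2.wte(input_ids) + self.gpt2.wpe(positions)
        h = self.gpt2.drop(h)
        # Wrap each block individually, so only one block's activations
        # need to be recomputed at a time during the backward pass.
        for block in self.gpt2.h:
            h = checkpoint(block, h)[0]  # GPT2Block returns a tuple
        return self.gpt2.ln_f(h)
```

(Recent `transformers` versions also expose `model.gradient_checkpointing_enable()`, so it would be useful if the docs said whether to prefer that over wrapping blocks manually when running under the DeepSpeed plugin.)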
cc @Borda @awaelchli