Description
The current knowledge distillation recipes do not support activation offloading or opt_in_bwd.
The implementation should be similar to the one in other recipes, such as full_finetune_distributed.
After enabling these features in the recipes, they should also be enabled in the KD-related configs (a minimal sketch of both mechanisms follows the reference links below).
PRs with reference implementation:
- activation offloading: #1847
- opt_in_bwd implementation: #1833

KD recipes:
- https://github.com/pytorch/torchtune/blob/main/recipes/knowledge_distillation_single_device.py
- https://github.com/pytorch/torchtune/blob/main/recipes/knowledge_distillation_distributed.py
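For orientation only, here is a minimal sketch of the two mechanisms using stock PyTorch APIs (`torch.autograd.graph.save_on_cpu` and `Tensor.register_post_accumulate_grad_hook`). This is not the torchtune implementation — the recipes should follow the reference PRs above and torchtune's own utilities; the flag name, model, and optimizer here are just placeholders:

```python
import contextlib

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()

# opt_in_bwd: one optimizer per parameter, stepped from a hook that fires as
# soon as that parameter's gradient has been accumulated, so gradients are
# freed immediately instead of being held until a global optimizer.step().
optim_dict = {p: torch.optim.AdamW([p], lr=1e-4) for p in model.parameters()}

def _step_in_backward(param: torch.Tensor) -> None:
    optim_dict[param].step()
    optim_dict[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(_step_in_backward)

# Activation offloading: tensors saved for backward live in pinned CPU memory
# and are copied back to GPU when backward needs them.
enable_activation_offloading = True  # placeholder for the config flag
act_offload_ctx = (
    torch.autograd.graph.save_on_cpu(pin_memory=True)
    if enable_activation_offloading
    else contextlib.nullcontext()
)

# One training step: forward under the offloading context; backward both
# computes gradients and runs the per-parameter optimizer steps.
tokens = torch.randn(8, 512, device="cuda")
with act_offload_ctx:
    loss = model(tokens).sum()
loss.backward()  # no separate optimizer.step() when opt_in_bwd is enabled
```

Note that fusing the optimizer into backward typically conflicts with gradient accumulation and global grad-norm clipping, so the KD recipes will likely need the same guards as the reference implementations.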
After implementing it, run the recipes with the flags on and off and plot loss, memory, and words per second for each run. The easiest way is to add the wandb logger to the config.
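In the recipes themselves the metric logger set in the config should handle this, but as a rough sketch of what those plots are built from, per-step logging of loss, peak memory, and tokens per second with raw torch/wandb calls could look like the following (the project name and the dummy model/loop are purely illustrative):

```python
import time

import torch
import wandb  # assumes wandb is installed and you are logged in

wandb.init(project="kd-offloading-ablation")  # hypothetical project name

# Dummy model/data standing in for a KD training loop.
model = torch.nn.Linear(512, 512).cuda()
batches = [torch.randn(8, 128, 512, device="cuda") for _ in range(10)]

for step, batch in enumerate(batches):
    model.zero_grad(set_to_none=True)
    torch.cuda.reset_peak_memory_stats()
    t0 = time.perf_counter()

    loss = model(batch).sum()
    loss.backward()

    torch.cuda.synchronize()  # make the timing cover the whole step
    step_time = time.perf_counter() - t0
    num_tokens = batch.shape[0] * batch.shape[1]

    wandb.log(
        {
            "loss": loss.item(),
            "peak_memory_alloc_gb": torch.cuda.max_memory_allocated() / 1e9,
            "tokens_per_second": num_tokens / step_time,
        },
        step=step,
    )
```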
To update the configs in bulk, you can use the script here: #1954
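That script is the intended tool; purely to illustrate the kind of bulk edit involved, a naive version could look like the sketch below (the flag names/defaults and the glob are assumptions — adjust both to whatever the reference PRs and the actual KD config filenames use):

```python
from pathlib import Path

# Hypothetical flag names/defaults; use whatever the reference PRs settle on.
NEW_FLAGS = {
    "enable_activation_offloading": "False",
    "optimizer_in_bwd": "False",
}

# Adjust the glob to however the KD configs are actually named.
for cfg_path in Path("recipes/configs").rglob("*knowledge_distillation*.yaml"):
    text = cfg_path.read_text()
    missing = [f"{key}: {value}" for key, value in NEW_FLAGS.items() if f"{key}:" not in text]
    if missing:
        # Append rather than parse/dump YAML so comments and key order survive.
        cfg_path.write_text(text.rstrip("\n") + "\n\n" + "\n".join(missing) + "\n")
        print(f"updated {cfg_path}")
```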