Description
The current knowledge distillation recipes do not support activation offloading or opt_in_bwd.
The implementation should be similar to the one in other recipes, such as full_finetune_distributed.
After enabling these features in the recipes, they should also be enabled in the KD-related configs (a minimal sketch of both mechanisms follows the reference links below).
PRs with reference implementation:
- activation offloading: #1847
- opt_in_bwd implementation: #1833

KD recipes:
- https://github.com/pytorch/torchtune/blob/main/recipes/knowledge_distillation_single_device.py
- https://github.com/pytorch/torchtune/blob/main/recipes/knowledge_distillation_distributed.py
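For orientation only, here is a minimal sketch of the two mechanisms using stock PyTorch APIs (`torch.autograd.graph.save_on_cpu` and `Tensor.register_post_accumulate_grad_hook`). This is not the torchtune implementation — the recipes should follow the reference PRs above and torchtune's own utilities; the flag name, model, and optimizer here are just placeholders:

```python
import contextlib

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()

# opt_in_bwd: one optimizer per parameter, stepped from a hook that fires as
# soon as that parameter's gradient has been accumulated, so gradients are
# freed immediately instead of being held until a global optimizer.step().
optim_dict = {p: torch.optim.AdamW([p], lr=1e-4) for p in model.parameters()}

def _step_in_backward(param: torch.Tensor) -> None:
    optim_dict[param].step()
    optim_dict[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(_step_in_backward)

# Activation offloading: tensors saved for backward live in pinned CPU memory
# and are copied back to GPU when backward needs them.
enable_activation_offloading = True  # placeholder for the config flag
act_offload_ctx = (
    torch.autograd.graph.save_on_cpu(pin_memory=True)
    if enable_activation_offloading
    else contextlib.nullcontext()
)

# One training step: forward under the offloading context; backward both
# computes gradients and runs the per-parameter optimizer steps.
tokens = torch.randn(8, 512, device="cuda")
with act_offload_ctx:
    loss = model(tokens).sum()
loss.backward()  # no separate optimizer.step() when opt_in_bwd is enabled
```

Note that fusing the optimizer into backward typically conflicts with gradient accumulation and global grad-norm clipping, so the KD recipes will likely need the same guards as the reference implementations.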
After implementing it, run the recipes with the flags on and off and plot loss, memory, and words per second for each run. The easiest way is to add the wandb logger to the config.
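In the recipes themselves the metric logger set in the config should handle this, but as a rough sketch of what those plots are built from, per-step logging of loss, peak memory, and tokens per second with raw torch/wandb calls could look like the following (the project name and the dummy model/loop are purely illustrative):

```python
import time

import torch
import wandb  # assumes wandb is installed and you are logged in

wandb.init(project="kd-offloading-ablation")  # hypothetical project name

# Dummy model/data standing in for a KD training loop.
model = torch.nn.Linear(512, 512).cuda()
batches = [torch.randn(8, 128, 512, device="cuda") for _ in range(10)]

for step, batch in enumerate(batches):
    model.zero_grad(set_to_none=True)
    torch.cuda.reset_peak_memory_stats()
    t0 = time.perf_counter()

    loss = model(batch).sum()
    loss.backward()

    torch.cuda.synchronize()  # make the timing cover the whole step
    step_time = time.perf_counter() - t0
    num_tokens = batch.shape[0] * batch.shape[1]

    wandb.log(
        {
            "loss": loss.item(),
            "peak_memory_alloc_gb": torch.cuda.max_memory_allocated() / 1e9,
            "tokens_per_second": num_tokens / step_time,
        },
        step=step,
    )
```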
To update the configs in bulk, you can use the script here: #1954
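That script is the intended tool; purely to illustrate the kind of bulk edit involved, a naive version could look like the sketch below (the flag names/defaults and the glob are assumptions — adjust both to whatever the reference PRs and the actual KD config filenames use):

```python
from pathlib import Path

# Hypothetical flag names/defaults; use whatever the reference PRs settle on.
NEW_FLAGS = {
    "enable_activation_offloading": "False",
    "optimizer_in_bwd": "False",
}

# Adjust the glob to however the KD configs are actually named.
for cfg_path in Path("recipes/configs").rglob("*knowledge_distillation*.yaml"):
    text = cfg_path.read_text()
    missing = [f"{key}: {value}" for key, value in NEW_FLAGS.items() if f"{key}:" not in text]
    if missing:
        # Append rather than parse/dump YAML so comments and key order survive.
        cfg_path.write_text(text.rstrip("\n") + "\n\n" + "\n".join(missing) + "\n")
        print(f"updated {cfg_path}")
```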