TL;DR: when compiling, max-autotune should improve runtime performance, at the cost of a longer compile. For a long training job, the one-time compile overhead is definitely worth it.
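For reference, the difference is just the mode argument passed to torch.compile. The snippet below is a minimal sketch on a toy module (not torchtune code) showing where the extra compile time comes from:

```python
# Minimal sketch of the tradeoff on a toy model (not torchtune code).
# "max-autotune" makes Inductor benchmark more kernel configurations,
# so the first call compiles more slowly but steady-state perf is usually better.
import copy

import torch
import torch.nn as nn

base = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

compiled_default = torch.compile(copy.deepcopy(base))                        # default mode
compiled_autotune = torch.compile(copy.deepcopy(base), mode="max-autotune")  # slower to compile, usually faster to run

x = torch.randn(8, 1024)
_ = compiled_default(x)   # compilation is triggered on the first call
_ = compiled_autotune(x)  # same, plus the autotuning/benchmarking overhead
```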
TODO:
- Define the flag in just one config location (see the sketch after this list)
- Find all places where we call compile (probably: flex attention utils, model compilation, loss compilation)
- Run a test without max-autotune and log with metric_logger=torchtune.training.metric_logging.WandBLogger
- Run a test with max-autotune
- Share results in a PR
- If accepted, implement it for every config/recipe (ideally at the utility level, not the recipe level)
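As a rough illustration of the first item, a single utility-level helper could be the only place that reads the flag. This is a hedged sketch: compile_module and compile_mode are hypothetical names, not existing torchtune APIs.

```python
# Hedged sketch of routing a single config flag to every compile site.
# compile_module and compile_mode are hypothetical names, not existing torchtune APIs.
from typing import Optional

import torch
import torch.nn as nn


def compile_module(module: nn.Module, mode: Optional[str] = None) -> nn.Module:
    """Single utility-level entry point; recipes only read one config value."""
    # mode=None keeps torch.compile's default behavior.
    return torch.compile(module, mode=mode)


# Usage: the same config-driven value reaches every call site
# (model, loss, flex attention utils, ...).
compile_mode = "max-autotune"  # would come from the recipe config

model = compile_module(nn.Linear(512, 512), mode=compile_mode)
loss_fn = compile_module(nn.CrossEntropyLoss(), mode=compile_mode)
```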
Example test command:
tune run full_finetune_single_device --config llama3_2/1B_full_single_device dataset.packed=True tokenizer.max_seq_len=4096 dataset.split=train[:5%] metric_logger=torchtune.training.metric_logging.WandBLogger
