TL;DR: when compiling, max-autotune should improve runtime performance, at the cost of a longer compile. For a long training job, the one-time compile overhead is definitely worth it.
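For reference, the difference is just the mode argument passed to torch.compile. The snippet below is a minimal sketch on a toy module (not torchtune code) showing where the extra compile time comes from:

```python
# Minimal sketch of the tradeoff on a toy model (not torchtune code).
# "max-autotune" makes Inductor benchmark more kernel configurations,
# so the first call compiles more slowly but steady-state perf is usually better.
import copy

import torch
import torch.nn as nn

base = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

compiled_default = torch.compile(copy.deepcopy(base))                        # default mode
compiled_autotune = torch.compile(copy.deepcopy(base), mode="max-autotune")  # slower to compile, usually faster to run

x = torch.randn(8, 1024)
_ = compiled_default(x)   # compilation is triggered on the first call
_ = compiled_autotune(x)  # same, plus the autotuning/benchmarking overhead
```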
TODO:
- Define the flag in just one config location (see the sketch after this list)
- Find all places where we call compile (probably: flex attention utils, model compilation, loss compilation)
- Run a test without max-autotune and log with metric_logger=torchtune.training.metric_logging.WandBLogger
- Run a test with max-autotune
- Share results in a PR
- If accepted, implement it for every config/recipe (ideally at the utility level, not the recipe level)
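As a rough illustration of the first item, a single utility-level helper could be the only place that reads the flag. This is a hedged sketch: compile_module and compile_mode are hypothetical names, not existing torchtune APIs.

```python
# Hedged sketch of routing a single config flag to every compile site.
# compile_module and compile_mode are hypothetical names, not existing torchtune APIs.
from typing import Optional

import torch
import torch.nn as nn


def compile_module(module: nn.Module, mode: Optional[str] = None) -> nn.Module:
    """Single utility-level entry point; recipes only read one config value."""
    # mode=None keeps torch.compile's default behavior.
    return torch.compile(module, mode=mode)


# Usage: the same config-driven value reaches every call site
# (model, loss, flex attention utils, ...).
compile_mode = "max-autotune"  # would come from the recipe config

model = compile_module(nn.Linear(512, 512), mode=compile_mode)
loss_fn = compile_module(nn.CrossEntropyLoss(), mode=compile_mode)
```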
Example test command:
tune run full_finetune_single_device --config llama3_2/1B_full_single_device dataset.packed=True tokenizer.max_seq_len=4096 dataset.split=train[:5%] metric_logger=torchtune.training.metric_logging.WandBLogger
