Conversation

@aditvenk aditvenk commented Dec 13, 2025

Being able to compile fw/bw graphs using compile_fx_inner could help with establishing perf rooflines.

Full Inductor compilation is achieved using compile_fx_inner; however, it requires the graph to have already been decomposed with Inductor's default decomposition table. We apply this decomposition as a pass on the joint graph, taking care to suitably unwrap the primals/tangents before running it.
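
For illustration, a minimal sketch of the idea (the helper name is hypothetical, this is not the pass implementation in this PR, and it omits the primals/tangents unwrapping mentioned above): retrace the graph through Inductor's default decomposition table from select_decomp_table(), then hand the decomposed graph to compile_fx_inner.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx
from torch._inductor.compile_fx import compile_fx_inner
from torch._inductor.decomposition import select_decomp_table


def decompose_and_compile(gm: torch.fx.GraphModule, example_inputs):
    # Retrace so every op is rewritten via Inductor's default decomposition
    # table; compile_fx_inner expects graphs already in this decomposed form.
    decomposed = make_fx(
        gm, decomposition_table=select_decomp_table()
    )(*example_inputs)
    # The returned callable follows Inductor's boxed calling convention,
    # i.e. it takes a flat list of inputs.
    return compile_fx_inner(decomposed, example_inputs)
```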

Manual testing:

NGPU=4 \
CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml \
TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train \
./run_train.sh \
  --model.name $MODEL_NAME \
  --parallelism.data_parallel_shard_degree=2 \
  --parallelism.tensor_parallel_degree=2 \
  --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config \
  --compile.joint_passes inductor_decomposition \
  --compile.passes full_inductor_compilation

@SherlockNoMad SherlockNoMad left a comment

lgtm with comments.

@aditvenk aditvenk force-pushed the ps/rr/_compiler_toolkit__add_option_for_full_inductor_ branch from bd66bbd to 82834ba on December 23, 2025 05:52
@aditvenk aditvenk (Author) commented:

aot_eager for llama3:

[rank0]:[titan] 2025-12-22 21:58:57,041 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-12-22 21:58:57,057 - root - INFO - Model compiler_toolkit.llama3 debugmodel size: 6,163,712 total parameters
[rank0]:[titan] 2025-12-22 21:58:57,077 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:[titan] 2025-12-22 21:58:57,077 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-12-22 21:58:57,097 - root - INFO - Applied Data Parallel (simple_fsdp) (dp mode=fully_shard) to the model
[rank0]:[titan] 2025-12-22 21:58:57,279 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-12-22 21:58:57,279 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)
[rank0]:[titan] 2025-12-22 21:58:57,437 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-12-22 21:58:57,437 - root - INFO - Trainer is initialized with local batch size 8, global batch size 16, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[titan] 2025-12-22 21:58:57,437 - root - INFO - Training starts at step 1
[rank0]:/data/users/avenkataraman/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py:2494: UserWarning: Your compiler for AOTAutograd is returning a function that doesn't take boxed arguments. Please wrap it with functorch.compile.make_boxed_func or handle the boxed arguments yourself. See https://github.com/pytorch/pytorch/pull/83137#issuecomment-1211320670 for rationale.
[rank0]:  out = call_func_at_runtime_with_args(
[rank0]:[titan] 2025-12-22 21:59:03,495 - root - INFO - step:  1  loss:  7.9925  grad_norm:  1.4785  memory:  0.65GiB(0.69%)  tps: 1,272  tflops: 0.09  mfu: 0.01%
[rank0]:[titan] 2025-12-22 21:59:03,495 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/home/avenkataraman/torchtitan/torchtitan/distributed/utils.py:396: UserWarning: Set timeout is now only supported for either nccl or gloo.
[rank0]:  torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)
[rank0]:[titan] 2025-12-22 21:59:03,599 - root - INFO - step:  2  loss:  7.6803  grad_norm:  1.5720  memory:  0.68GiB(0.72%)  tps: 79,257  tflops: 5.67  mfu: 0.57%
[rank0]:[titan] 2025-12-22 21:59:03,655 - root - INFO - step:  3  loss:  6.9339  grad_norm:  1.9887  memory:  0.68GiB(0.72%)  tps: 147,082  tflops: 10.53  mfu: 1.06%
[rank0]:[titan] 2025-12-22 21:59:03,712 - root - INFO - step:  4  loss:  6.0866  grad_norm:  2.2987  memory:  0.68GiB(0.72%)  tps: 143,666  tflops: 10.28  mfu: 1.04%
[rank0]:[titan] 2025-12-22 21:59:03,768 - root - INFO - step:  5  loss:  5.2493  grad_norm:  2.4151  memory:  0.68GiB(0.72%)  tps: 148,569  tflops: 10.64  mfu: 1.08%
[rank0]:[titan] 2025-12-22 21:59:03,829 - root - INFO - step:  6  loss:  4.7912  grad_norm:  2.6229  memory:  0.68GiB(0.72%)  tps: 134,967  tflops: 9.66  mfu: 0.98%
[rank0]:[titan] 2025-12-22 21:59:03,884 - root - INFO - step:  7  loss:  4.4615  grad_norm:  2.3111  memory:  0.68GiB(0.72%)  tps: 148,651  tflops: 10.64  mfu: 1.08%
[rank0]:[titan] 2025-12-22 21:59:03,943 - root - INFO - step:  8  loss:  4.2301  grad_norm:  1.9856  memory:  0.68GiB(0.72%)  tps: 140,177  tflops: 10.03  mfu: 1.01%
[rank0]:[titan] 2025-12-22 21:59:04,005 - root - INFO - step:  9  loss:  4.4596  grad_norm:  1.7412  memory:  0.68GiB(0.72%)  tps: 132,916  tflops: 9.51  mfu: 0.96%
[rank0]:[titan] 2025-12-22 21:59:04,069 - root - INFO - step: 10  loss:  4.0634  grad_norm:  1.9408  memory:  0.68GiB(0.72%)  tps: 127,978  tflops: 9.16  mfu: 0.93%

full inductor:

[rank0]:[titan] 2025-12-22 21:59:34,941 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-12-22 21:59:34,956 - root - INFO - Model compiler_toolkit.llama3 debugmodel size: 6,163,712 total parameters
[rank0]:[titan] 2025-12-22 21:59:34,974 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:[titan] 2025-12-22 21:59:34,975 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-12-22 21:59:34,994 - root - INFO - Applied Data Parallel (simple_fsdp) (dp mode=fully_shard) to the model
[rank0]:[titan] 2025-12-22 21:59:35,019 - root - INFO - Using joint passes from config: ['inductor_decomposition']
[rank0]:[titan] 2025-12-22 21:59:35,019 - root - WARNING - Full Inductor compilation is enabled. Note that Inductor may change numerics and does not guarantee bitwise equivalent results compared to eager mode.
[rank0]:[titan] 2025-12-22 21:59:35,019 - root - INFO - Using compiler passes from config: ['full_inductor_compilation']
[rank0]:[titan] 2025-12-22 21:59:35,159 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-12-22 21:59:35,160 - root - INFO - CUDA memory usage for model: 0.01GiB(0.01%)
[rank0]:[titan] 2025-12-22 21:59:35,331 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-12-22 21:59:35,331 - root - INFO - Trainer is initialized with local batch size 8, global batch size 16, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[titan] 2025-12-22 21:59:35,331 - root - INFO - Training starts at step 1
[rank0]:[titan] 2025-12-22 21:59:40,181 - root - INFO - Applying decompositions to joint graph
[rank0]:[titan] 2025-12-22 21:59:41,411 - root - INFO - Decompositions applied successfully to joint graph
[rank0]:[titan] 2025-12-22 21:59:41,921 - root - INFO - Applying pass: full_inductor_compilation_pass
[rank0]:[titan] 2025-12-22 21:59:42,538 - root - INFO - Applying pass: full_inductor_compilation_pass
[rank0]:[titan] 2025-12-22 21:59:43,240 - root - INFO - step:  1  loss:  8.2073  grad_norm:  1.3710  memory:  0.63GiB(0.67%)  tps: 989  tflops: 0.07  mfu: 0.01%
[rank0]:[titan] 2025-12-22 21:59:43,240 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:/home/avenkataraman/torchtitan/torchtitan/distributed/utils.py:396: UserWarning: Set timeout is now only supported for either nccl or gloo.
[rank0]:  torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)
[rank0]:[titan] 2025-12-22 21:59:43,343 - root - INFO - step:  2  loss:  7.9428  grad_norm:  1.4296  memory:  0.64GiB(0.67%)  tps: 79,314  tflops: 5.68  mfu: 0.57%
[rank0]:[titan] 2025-12-22 21:59:43,399 - root - INFO - step:  3  loss:  7.2457  grad_norm:  1.8050  memory:  0.64GiB(0.67%)  tps: 147,889  tflops: 10.59  mfu: 1.07%
[rank0]:[titan] 2025-12-22 21:59:43,458 - root - INFO - step:  4  loss:  6.4076  grad_norm:  2.2389  memory:  0.64GiB(0.67%)  tps: 140,255  tflops: 10.04  mfu: 1.02%
[rank0]:[titan] 2025-12-22 21:59:43,510 - root - INFO - step:  5  loss:  5.4848  grad_norm:  2.4729  memory:  0.64GiB(0.67%)  tps: 156,290  tflops: 11.19  mfu: 1.13%
[rank0]:[titan] 2025-12-22 21:59:43,600 - root - INFO - step:  6  loss:  4.9371  grad_norm:  2.3799  memory:  0.64GiB(0.67%)  tps: 91,446  tflops: 6.55  mfu: 0.66%
[rank0]:[titan] 2025-12-22 21:59:43,652 - root - INFO - step:  7  loss:  4.6137  grad_norm:  2.3870  memory:  0.64GiB(0.67%)  tps: 158,346  tflops: 11.34  mfu: 1.15%
[rank0]:[titan] 2025-12-22 21:59:43,711 - root - INFO - step:  8  loss:  4.4112  grad_norm:  2.2359  memory:  0.64GiB(0.67%)  tps: 141,054  tflops: 10.10  mfu: 1.02%
[rank0]:[titan] 2025-12-22 21:59:43,766 - root - INFO - step:  9  loss:  4.5883  grad_norm:  1.9379  memory:  0.64GiB(0.67%)  tps: 148,920  tflops: 10.66  mfu: 1.08%
[rank0]:[titan] 2025-12-22 21:59:43,828 - root - INFO - step: 10  loss:  4.2006  grad_norm:  2.0740  memory:  0.64GiB(0.67%)  tps: 132,314  tflops: 9.47  mfu: 0.96%
