
Conversation


@fmassa fmassa commented Jan 23, 2026

Previously, when calling AutoParallel with compile=False, none of the comms / compute overlap passes were applied to the model.

This effectively meant that compile=True was required to get a performant autoparallelized model.

For now, I've decided to call into all the post_grad passes, but it is also possible to call into only the comms / compute overlap passes, to keep graph modifications to a minimum.

Edit: I'm now calling into only the comms / compute reordering pass, even when compile=False.

@fmassa fmassa requested review from ezyang, wconstab and xmfan January 23, 2026 15:39
@meta-cla meta-cla bot added the CLA Signed label Jan 23, 2026
with V.set_fake_mode(fake_mode):
    cuda_context = get_cuda_device_context(fx_g)
    with cuda_context:
        _recursive_post_grad_passes(fx_g, is_inference=False)
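
The helpers referenced in this snippet are inductor internals; a hedged guess at the corresponding imports (module paths as found in recent PyTorch — verify against the actual diff):

# Hedged: the module paths below are where these helpers live in recent
# PyTorch; the PR's actual imports may differ.
from torch._inductor.compile_fx import _recursive_post_grad_passes
from torch._inductor.utils import get_cuda_device_context
from torch._inductor.virtualized import V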
Member
Some of the post_grad passes are bad for perf unless lowered, e.g. view_to_reshape, which materializes all views.
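
(For illustration only, not part of the review thread: a minimal standalone eager-mode example of why materializing views is costly — reshape falls back to a copy where view would alias.)

import torch

x = torch.randn(4, 4).t()            # non-contiguous (transposed) tensor
# x.view(16) would raise here: view requires a compatible memory layout
y = x.reshape(16)                    # reshape silently falls back to a copy
print(y.data_ptr() == x.data_ptr())  # False: the view was materialized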

Contributor Author

I've changed it to call into only the comms / compute reordering pass, to keep graph changes to a minimum.
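
(For reference, a hedged sketch of the inductor config knobs that gate these reordering passes; the attribute names below come from torch._inductor.config in recent PyTorch, but how this PR invokes the pass in the compile=False path may differ.)

import torch._inductor.config as inductor_config

# Assumption: these knobs exist in torch._inductor.config; the default
# pass list has varied across PyTorch versions.
inductor_config.reorder_for_compute_comm_overlap = True
inductor_config.reorder_for_compute_comm_overlap_passes = [
    "sink_waits",
    "raise_comms",
    "reorder_compute_for_overlap",
]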

@fmassa fmassa changed the title from "Call post_grad passes when compile=False" to "Call comms / compute overlap passes when compile=False" Jan 26, 2026
Contributor

@wconstab wconstab left a comment

Seems OK to me. I will say that it's not super clear to me what the best formulation is; it's a little arbitrary which compiler passes to put 'inside' vs 'outside'.

From a use-case perspective, it seems nice to always have the distributed passes run, even if codegen isn't important. On the other hand, other things like cudagraphs might also be preferred, even without codegen. For debugging, the unmodified original graphmodule might be nice to get out (though you can see it in its various states of transformation using tlparse).
