🚀 The feature, motivation and pitch
As the title says, gather_logits should be enabled by default. We also need to work out how to include the final GEMM (and possibly several ops that precede it) in the piecewise CUDA graph.
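To make the request concrete, here is a minimal sketch of what gathering before the final GEMM could look like. It assumes gather_logits means selecting only the token positions that will be sampled from before the final projection. That is my reading of the option, not something confirmed by the code. All function and variable names below are hypothetical, and the pure-Python `matmul` stands in for the real GEMM kernel.

```python
# Illustrative sketch only: a pure-Python stand-in for the final projection (GEMM).
# All names here are hypothetical and are not taken from the project.

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def logits_without_gather(hidden, lm_head, last_token_idx):
    """Project every token, then keep only the rows that are sampled from."""
    all_logits = matmul(hidden, lm_head)            # (num_tokens x vocab)
    return [all_logits[i] for i in last_token_idx]


def logits_with_gather(hidden, lm_head, last_token_idx):
    """Gather the needed rows first, so the GEMM sees fewer rows."""
    gathered = [hidden[i] for i in last_token_idx]  # (num_seqs x hidden_dim)
    return matmul(gathered, lm_head)                # (num_seqs x vocab)


# Two sequences packed into 5 tokens; tokens 2 and 4 are the last token of each.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
lm_head = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # (hidden_dim x vocab)
last_token_idx = [2, 4]

full = logits_without_gather(hidden, lm_head, last_token_idx)
gathered = logits_with_gather(hidden, lm_head, last_token_idx)
```

Both paths produce the same logits, but the gathered path runs the GEMM on 2 rows instead of 5. If the gather and the ops before the GEMM are to be captured in the piecewise CUDA graph, the number of gathered rows would presumably need to match one of the captured shapes, which is part of what needs checking.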
Alternatives
No response
Additional context
No response