Change training step to a scalar tensor so it works with CUDA graphs #842

jasooney23 · 2025-04-08T02:45:53Z

I was experimenting with the custom aggregator in the Turbulent Channel example and wanted to enable CUDA graphs for faster execution. However, step is currently passed as a plain Python int from Trainer._cuda_graph_training_step, so its value is baked into the CUDA graph at capture time, and every replay of the graph uses the step at which it was captured.

For example, if my aggregator's forward takes step as an argument and the CUDA graph is captured at step = 20, the aggregator keeps executing with step = 20 on every later iteration.
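To illustrate the behavior described above, here is a minimal standalone sketch using plain PyTorch CUDA graphs. It does not use the actual Trainer code; the names `run_demo`, `step_buf`, `out_int`, and `out_tensor` are made up for illustration, and the import guard and CUDA check are only there so the snippet also loads on machines without a GPU.

```python
try:
    import torch
except ImportError:  # lets the snippet load where PyTorch is not installed
    torch = None


def run_demo():
    """Capture a graph at step 20, replay it at step 21, and return both results."""
    if torch is None or not torch.cuda.is_available():
        return None  # CUDA graphs need a CUDA device

    device = torch.device("cuda")
    x = torch.ones(4, device=device)
    out_int = torch.empty(4, device=device)      # result computed from the Python int
    out_tensor = torch.empty(4, device=device)   # result computed from the device tensor
    # Persistent 0-dim device tensor that holds the current step (hypothetical name).
    step_buf = torch.zeros((), dtype=torch.int64, device=device)

    step = 20
    step_buf.fill_(step)

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        out_int.copy_(x * step)
        out_tensor.copy_(x * step_buf)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        out_int.copy_(x * step)         # Python int: recorded as the constant 20
        out_tensor.copy_(x * step_buf)  # tensor: read from device memory on each replay

    # Advance the step outside the graph, then replay.
    step = 21
    step_buf.fill_(step)
    graph.replay()
    torch.cuda.synchronize()

    return out_int[0].item(), out_tensor[0].item()


if __name__ == "__main__":
    print(run_demo())
```

On a CUDA machine the first value stays at the captured step while the second follows the updated tensor, which is the difference between passing step as an int and passing it as a Tensor.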

My simple fix is to pass step as a Tensor, but I'm not sure whether I should submit the change myself or let someone bundle it into a bigger revision. (Sorry, it's my first time participating in open source!)

Thanks 😎

coreyjadams · 2025-10-31T15:35:24Z

Maintainer

Hi @jasooney23 - sorry for the delayed reply, I didn't realize there were discussions being opened!

We would welcome this as a PR. In fact, small, targeted PRs like this are easier to review and easier to get accepted. There is no minimum revision size :).

CUDA graphs are a good idea, and yes, using a torch.Tensor for step can work. I am not familiar enough with the Turbulent Channel example to predict every issue you might hit, but watch out for two things. First, if step drives dynamic control flow in the non-graph code, you may have trouble capturing that code in a graph. Second, since step is currently a Python integer, watch out for host-to-device transfers, which will cause errors in graph execution.
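One possible way to handle both cautions, sketched here under assumptions rather than taken from the Turbulent Channel example: keep a single step tensor on the device and increment it in place instead of building a new tensor from the Python int each iteration, and express step-dependent branching with tensor ops such as `torch.where` instead of a Python `if`. The function `aggregator_weight` and its warmup schedule are hypothetical.

```python
try:
    import torch
except ImportError:  # lets the snippet load where PyTorch is not installed
    torch = None


def aggregator_weight(step_t, warmup_steps=100):
    """Hypothetical step-dependent weight: linear ramp to 1.0 over warmup_steps.

    A Python `if step < warmup_steps:` would be evaluated once at capture time.
    torch.where keeps the decision on the device, so it is re-evaluated from
    the current contents of step_t each time the ops run.
    """
    ramp = step_t.to(torch.float32) / warmup_steps
    return torch.where(step_t < warmup_steps, ramp, torch.ones_like(ramp))


def demo():
    if torch is None:
        return None
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Allocate the step tensor once, on the device the graph would run on.
    step_t = torch.zeros((), dtype=torch.int64, device=device)

    weights = []
    for _ in range(3):
        # .item() synchronizes with the host; it is used here only to inspect
        # the values and would not belong inside a captured region.
        weights.append(aggregator_weight(step_t, warmup_steps=2).item())
        # In-place update of the existing tensor; no new tensor is created
        # from a Python int on each iteration.
        step_t.add_(1)
    return weights


if __name__ == "__main__":
    print(demo())
```

Whether this pattern fits depends on how step is used elsewhere in the training loop; any code that needs the step as a Python int (logging, checkpoint intervals) can keep its own host-side counter outside the graph.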

Feel free to open a PR, even in draft form, if/when you're ready. Tag me, I'd be happy to take a look.

Thanks!
