Cannot capture CUDA graph after loading a checkpoint #3871

@new-TonyWang

Describe the bug

cudagraph_error.log

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

The CUDA graph is captured successfully and runs after the checkpoint is loaded.

Additional context

gpu: B300, CUDA 13
megatron-lm: 0.16.0rc0
transformer-engine: 2.11.0.dev0
pytorch: 2.9.1+cu130