Commit 82761e7
[training, perf] fix: only use cu_seqlens THD FLOPS path for CP==1
L0_Launch_training's test_sft_example_runs_with_cp_and_packing (CP=2) hangs
deterministically on this PR (4/4 runs; main at the merged commit is green). The
test exercises the LLM gpt_step under context parallelism + packing, where this PR
newly runs the cu_seqlens-driven THD Σᵢ sᵢ² accounting in the per-microbatch
forward path. That path is wired and validated only for CP==1; under CP>1 the
batch and its cu_seqlens are partitioned per rank, so the per-rank computation is
not yet correct (the follow-up is tracked in #4161) and was destabilizing the run.
Forward cu_seqlens to accumulate_flops_metadata only when CP==1; under CP>1, fall
back to the BSHD term, which is the exact behavior this test passed with before the
THD change. CP==1 (the verified configuration) is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
1 parent d443d38 commit 82761e7
1 file changed
Lines changed: 12 additions & 4 deletions
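The gating described in the commit message can be illustrated with a minimal sketch. It is not the code from this commit: the helper names and signatures below are hypothetical, the real `accumulate_flops_metadata` signature is not shown here, and the BSHD term is assumed to be the dense `B * S**2` attention count. The sketch only shows the idea of forwarding `cu_seqlens` when CP==1 and dropping it otherwise.

```python
from typing import Optional, Sequence


def thd_attention_term(cu_seqlens: Sequence[int]) -> int:
    """Sum of squared per-sequence lengths (the packed THD accounting).

    cu_seqlens holds cumulative boundaries, e.g. [0, 3, 8] describes two
    packed sequences of lengths 3 and 5.
    """
    return sum((end - start) ** 2 for start, end in zip(cu_seqlens, cu_seqlens[1:]))


def bshd_attention_term(batch_size: int, seq_len: int) -> int:
    """Dense fallback: every sample is treated as one full-length sequence (assumed form)."""
    return batch_size * seq_len ** 2


def cu_seqlens_for_flops(
    cu_seqlens: Optional[Sequence[int]], context_parallel_size: int
) -> Optional[Sequence[int]]:
    """Hypothetical gate: forward cu_seqlens only when CP == 1.

    Under CP > 1 each rank sees only a partition of the batch and of its
    cu_seqlens, so the packed accounting is skipped.
    """
    return cu_seqlens if context_parallel_size == 1 else None


def attention_term(
    batch_size: int,
    seq_len: int,
    cu_seqlens: Optional[Sequence[int]],
    context_parallel_size: int,
) -> int:
    """Pick the THD term when cu_seqlens is forwarded, else the BSHD term."""
    forwarded = cu_seqlens_for_flops(cu_seqlens, context_parallel_size)
    if forwarded is not None:
        return thd_attention_term(forwarded)
    return bshd_attention_term(batch_size, seq_len)
```

With `cu_seqlens = [0, 3, 8]`, a packed length of 8, and a batch size of 1, the CP==1 call uses the packed term (3² + 5²), while the CP==2 call ignores `cu_seqlens` and uses the dense term (8²).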