
update is_comm_kernel check to work with newer nccl versions (and thus generate expected comm/compute overlap numbers)#109

Open
lessw2020 wants to merge 2 commits into facebookresearch:main from lessw2020:comm-overlap-fix

Conversation

@lessw2020

What does this PR do?

This PR fixes a common error hit with newer versions of NCCL.
Most traces currently break when computing the comm/compute overlap numbers because the check does not account for kernel names being prefixed with 'ncclDev' instead of only 'ncclKernel'.

The error is:

RuntimeWarning: invalid value encountered in scalar divide return (shifted_overlap["time_y"] - shifted_overlap["time_x"].sum())
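For context, this warning is what NumPy emits when the overlap computation reduces to 0.0 / 0.0, which happens when no kernels match the old prefix and all the comm-time sums are zero. A minimal reproduction (not HTA code, just the underlying NumPy behavior):

```python
import warnings

import numpy as np

# If no kernels are classified as communication kernels, the overlap
# sums are zero and the overlap percentage becomes 0.0 / 0.0, which
# emits this RuntimeWarning and yields NaN instead of a real number.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.float64(0.0) / np.float64(0.0)

print(result)                       # nan
print(caught[0].category.__name__)  # RuntimeWarning
```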

The fix is simple: check for both prefixes in the HTA utility function `is_comm_kernel` so that the proper comm/compute overlap numbers are produced.
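A minimal sketch of the idea (the actual signature and location of `is_comm_kernel` in HTA may differ; this is an illustration, not the exact patch):

```python
def is_comm_kernel(name: str) -> bool:
    """Return True if a GPU kernel name looks like an NCCL communication kernel.

    Older NCCL versions name kernels 'ncclKernel_*', while newer versions
    use 'ncclDevKernel_*', so both prefixes must be accepted.
    """
    return name.startswith(("ncclKernel", "ncclDevKernel"))
```

With this check, kernels such as `ncclDevKernel_AllReduce_Sum_f32_RING_LL` are classified as communication kernels again, so the overlap denominator is no longer zero.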

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you update the changelog?
    • N/A

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 5, 2024
@anupambhatnagar
Contributor

@lessw2020 Can you please add a test case for this? Also, please resolve the pre-commit errors.

@anupambhatnagar anupambhatnagar self-requested a review March 5, 2024 18:55
@fengxizhou fengxizhou self-assigned this Mar 8, 2024
@fengxizhou fengxizhou requested review from fengxizhou and removed request for anupambhatnagar March 23, 2024 18:39
@amazloumi

amazloumi commented Jul 22, 2024

I came across this issue while working on distributed torch profiling: get_gpu_kernel_breakdown() does not show any communication breakdown. Do you have any plans to merge the fix in this PR soon?

I have two different types of NCCL-related items in my torch trace file (collected using torch.profiler.tensorboard_trace_handler), neither of which is captured in the COMM category by TraceAnalysis.get_gpu_kernel_breakdown(). The items are as follows:

  1. The items with "cat": "user_annotation" and "name" starting with "nccl:" such as nccl:_all_gather_base, nccl:_reduce_scatter_base, nccl:reduce and nccl:all_reduce
  2. The items with "cat": "kernel" and "name" starting with "ncclDevKernel_" such as ncclDevKernel_AllReduce_Sum_f32_RING_LL, ncclDevKernel_Reduce_Sum_f32_RING_LL, ncclDevKernel_ReduceScatter_Sum_f32_RING_LL and ncclDevKernel_AllGather_RING_LL

I am not sure whether the first item type also needs to be counted as kernel communication cost; checking TensorBoard, I believe the torch.profiler plugin there is counting the items in 1.
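To see which of these two event types a trace actually contains, a quick count over the raw Chrome-trace JSON can help. A sketch, assuming a trace file exported by torch.profiler (the category and prefix strings are taken from the events listed above):

```python
import json
from collections import Counter


def count_nccl_events(trace_path: str) -> Counter:
    """Count NCCL-related events in a torch.profiler Chrome trace,
    grouped by the two patterns described above."""
    with open(trace_path) as f:
        events = json.load(f).get("traceEvents", [])
    counts = Counter()
    for ev in events:
        name, cat = ev.get("name", ""), ev.get("cat", "")
        if cat == "user_annotation" and name.startswith("nccl:"):
            counts["user_annotation"] += 1
        elif cat == "kernel" and name.startswith("ncclDevKernel_"):
            counts["kernel"] += 1
    return counts
```

If the count for "kernel" is nonzero but get_gpu_kernel_breakdown() still shows no COMM category, the trace is hitting the prefix issue this PR fixes.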
