@Chao1Han Chao1Han commented Oct 15, 2025

Following pytorch/pytorch@ab557421a47, this PR implements the event cache, adds the start event, and enables event timing (event timing no longer causes hangs after version 2021.17). The timing is consumed by getDurationFromEvent, which is part of the FR feature and is triggered by the watchdog thread; therefore, no test case is currently available.

Usage

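    import pickle
    import time

    import torch
    import torch.distributed as dist

    # Assumes the default process group is already initialized and `device` is an XPU device.
    pg = dist.distributed_c10d._get_default_group()
    pg._enable_collectives_timing()
    x = torch.ones([2, 2]).to(device)
    num_repeats = 10
    for _ in range(num_repeats):
        dist.all_reduce(x)
    time.sleep(1)  # leave time for the watchdog thread to compute durations
    t = pickle.loads(torch._C._distributed_c10d._dump_xccl_trace())
    for seq in range(num_repeats):
        duration = t["entries"][seq]["duration_ms"]
        print(duration)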
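
Once the trace has been unpickled, the per-collective durations can be summarized rather than printed one by one. The following is a minimal sketch that relies only on what the usage snippet shows, namely that the trace is a dict with an "entries" list whose items may carry a "duration_ms" value. The helper name summarize_durations and the sample dict are hypothetical, and skipping entries without a duration is an assumption about how unfinished timings might appear.

```python
from statistics import mean


def summarize_durations(trace):
    """Return (count, mean_ms, max_ms) over entries that carry a duration."""
    durations = [
        entry["duration_ms"]
        for entry in trace.get("entries", [])
        if entry.get("duration_ms") is not None
    ]
    if not durations:
        return 0, None, None
    return len(durations), mean(durations), max(durations)


# Hand-written stand-in for an unpickled trace; the values are made up.
sample_trace = {
    "entries": [
        {"duration_ms": 0.5},
        {"duration_ms": 1.5},
        {},  # an entry whose timing was not recorded (assumed possible)
    ]
}
count, mean_ms, max_ms = summarize_durations(sample_trace)
```

In a real run, the dict returned by pickle.loads in the usage snippet would be passed in place of sample_trace.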

Copilot AI review requested due to automatic review settings October 15, 2025 05:09
Copilot AI left a comment

Pull Request Overview

This PR adds event timing support to XCCL (XPU Collective Communication Library) by introducing event caching and timing capabilities. The changes enable duration measurement for XPU collective operations and reuse of event objects through a cache.

Key changes:

  • Introduces XPUEventCache class for efficient event object reuse and timing support
  • Adds timing functionality to WorkXCCL with start/end events and duration calculation
  • Updates point-to-point communication operations to support timing and preprocessing/postprocessing hooks
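
The start/end event pattern named in the key changes can be illustrated in plain Python. This is a conceptual sketch only, not the WorkXCCL code: HostEvent and TimedWork are hypothetical names, host timestamps from time.perf_counter stand in for device events, and creating the start event only when timing is enabled is an assumption about the design.

```python
import time


class HostEvent:
    """Hypothetical stand-in for a device event; records a host timestamp."""

    def __init__(self):
        self.timestamp = None

    def record(self):
        self.timestamp = time.perf_counter()

    def elapsed_ms(self, end):
        if self.timestamp is None or end.timestamp is None:
            raise RuntimeError("both events must be recorded first")
        return (end.timestamp - self.timestamp) * 1000.0


class TimedWork:
    """Hypothetical work handle that brackets an operation with two events."""

    def __init__(self, timing_enabled):
        # Assumption: the start event exists only when timing is enabled.
        self.start_event = HostEvent() if timing_enabled else None
        self.end_event = HostEvent()

    def run(self, fn):
        if self.start_event is not None:
            self.start_event.record()
        result = fn()
        self.end_event.record()
        return result

    def duration_ms(self):
        # Without a start event there is nothing to measure against.
        if self.start_event is None:
            return None
        return self.start_event.elapsed_ms(self.end_event)


work = TimedWork(timing_enabled=True)
work.run(lambda: sum(range(1000)))
untimed = TimedWork(timing_enabled=False)
untimed.run(lambda: None)
```

Here work.duration_ms() yields a non-negative number of milliseconds, while untimed.duration_ms() returns None because no start event was recorded.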

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File                            Description
src/xccl/XPUEventCache.hpp      Defines the XPUEventCache class interface for managing cached XPU events
src/xccl/XPUEventCache.cpp      Implements event caching logic with timing support and thread-local device mapping
src/xccl/ProcessGroupXCCL.hpp   Adds timing support fields and template method overloads for point-to-point operations
src/xccl/ProcessGroupXCCL.cpp   Integrates event caching, timing functionality, and refactors point-to-point operations
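
The idea of an event cache with a thread-local device mapping can be sketched in Python as follows. This is an illustration under stated assumptions, not the XPUEventCache implementation: FakeEvent, EventCache, and get_cache are hypothetical names, keeping separate free lists for timing and non-timing events is an assumption, and the lock is there on the assumption that an event may be returned from a different thread than the one that acquired it.

```python
import threading


class FakeEvent:
    """Hypothetical placeholder for a device event object."""

    def __init__(self, timing):
        self.timing = timing


class EventCache:
    """Hands out events and takes them back for reuse, split by timing flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._free = {True: [], False: []}

    def acquire(self, timing):
        with self._lock:
            pool = self._free[timing]
            if pool:
                return pool.pop()  # reuse a previously released event
        return FakeEvent(timing)  # nothing cached, create a new one

    def release(self, event):
        with self._lock:
            self._free[event.timing].append(event)


_local = threading.local()


def get_cache(device_index):
    """Return this thread's cache for a device, creating it on first use."""
    caches = getattr(_local, "caches", None)
    if caches is None:
        caches = _local.caches = {}
    if device_index not in caches:
        caches[device_index] = EventCache()
    return caches[device_index]


cache = get_cache(0)
first = cache.acquire(timing=True)
cache.release(first)
second = cache.acquire(timing=True)   # served from the free list
other = cache.acquire(timing=False)   # different pool, so a new object
```

After the release, the second acquire with timing=True returns the same object as the first, which is the reuse the cache is meant to provide.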

@guangyey guangyey left a comment

I’m curious about how much performance improvement we could get by using the event cache. I guess nothing...
Overall LGTM.
I need some more time to go through the code.
