Skip to content

Conversation

@tushar00jain
Copy link

Summary:
we're making changes to flight recorder's record and reset_id api's. update the usage accordingly.

@tushar00jain tushar00jain marked this pull request as ready for review November 5, 2025 02:39
Copilot AI review requested due to automatic review settings November 5, 2025 02:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the flight recorder integration in ProcessGroupXCCL to track trace reset epochs alongside trace IDs. The changes ensure that when retiring trace IDs, the corresponding reset epoch is also provided to maintain proper lifecycle management of flight recorder entries.

  • Added trace_reset_epoch_ field to WorkXCCL class to store the reset epoch value
  • Updated all FlightRecorderXCCL::record() call sites to capture and store both the trace ID and reset epoch
  • Modified retire_id() calls to pass the reset epoch parameter

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/xccl/ProcessGroupXCCL.hpp Added trace_reset_epoch_ optional field to track reset epoch in WorkXCCL class
src/xccl/ProcessGroupXCCL.cpp Updated record/retire logic to capture and pass trace reset epochs in initWork(), collective(), and pointToPoint() methods

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Summary:
we're making changes to flight recorder's record and reset_id api's. update the usage accordingly.
pgStatus_,
isP2P);

r->trace_id_ = traceId.id;
Copy link
Contributor

@zhangxiaoli73 zhangxiaoli73 Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change depends on PyTorch PR pytorch/pytorch#166970, and 166970 also depends on this PR for merging, creating a circular dependency. Could you please add a check to see if traceId is of type uint64_t? If it is, assign trace_id_ directly to r->trace_id_; otherwise, parse it using traceId.id.

FlightRecorderXCCL::get()->retire_id(id, /*compute_duration*/ false);
[id, reset_epoch](at::ivalue::Future&) {
FlightRecorderXCCL::get()->retire_id(
id, reset_epoch, /*compute_duration*/ false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s not pass it for now to maintain compatibility with both before and after the 166970 merge.

FlightRecorderXCCL::get()->retire_id(id, /*compute_duration*/ false);
[id, reset_epoch](at::ivalue::Future&) {
FlightRecorderXCCL::get()->retire_id(
id, reset_epoch, /*compute_duration*/ false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
id, reset_epoch, /*compute_duration*/ false);
id, /*compute_duration*/ false);

options_->timeout,
pgStatus_,
true);
work->trace_id_ = traceId.id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Could you please add a check to see if traceId is of type uint64_t? If it is, assign trace_id_ directly to r->trace_id_; otherwise, parse it using traceId.id.

@zhangxiaoli73
Copy link
Contributor

@tushar00jain This change depends on PyTorch PR pytorch/pytorch#166970, and 166970 also depends on this PR for merging, creating a circular dependency. My suggestion is to make some check to maintain compatibility with both before and after the 166970 merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants