-
Notifications
You must be signed in to change notification settings - Fork 62
update flight recorder usage #2292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates the flight recorder integration in ProcessGroupXCCL to track trace reset epochs alongside trace IDs. The changes ensure that when retiring trace IDs, the corresponding reset epoch is also provided to maintain proper lifecycle management of flight recorder entries.
- Added
trace_reset_epoch_field toWorkXCCLclass to store the reset epoch value - Updated all
FlightRecorderXCCL::record()call sites to capture and store both the trace ID and reset epoch - Modified
retire_id()calls to pass the reset epoch parameter
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/xccl/ProcessGroupXCCL.hpp | Added trace_reset_epoch_ optional field to track reset epoch in WorkXCCL class |
| src/xccl/ProcessGroupXCCL.cpp | Updated record/retire logic to capture and pass trace reset epochs in initWork(), collective(), and pointToPoint() methods |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary: we're making changes to flight recorder's record and reset_id api's. update the usage accordingly.
| pgStatus_, | ||
| isP2P); | ||
|
|
||
| r->trace_id_ = traceId.id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change depends on PyTorch PR pytorch/pytorch#166970, and 166970 also depends on this PR for merging, creating a circular dependency. Could you please add a check to see if traceId is of type uint64_t? If it is, assign trace_id_ directly to r->trace_id_; otherwise, parse it using traceId.id.
| FlightRecorderXCCL::get()->retire_id(id, /*compute_duration*/ false); | ||
| [id, reset_epoch](at::ivalue::Future&) { | ||
| FlightRecorderXCCL::get()->retire_id( | ||
| id, reset_epoch, /*compute_duration*/ false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let’s not pass it for now to maintain compatibility with both before and after the 166970 merge.
| FlightRecorderXCCL::get()->retire_id(id, /*compute_duration*/ false); | ||
| [id, reset_epoch](at::ivalue::Future&) { | ||
| FlightRecorderXCCL::get()->retire_id( | ||
| id, reset_epoch, /*compute_duration*/ false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| id, reset_epoch, /*compute_duration*/ false); | |
| id, /*compute_duration*/ false); |
| options_->timeout, | ||
| pgStatus_, | ||
| true); | ||
| work->trace_id_ = traceId.id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. Could you please add a check to see if traceId is of type uint64_t? If it is, assign trace_id_ directly to r->trace_id_; otherwise, parse it using traceId.id.
|
@tushar00jain This change depends on PyTorch PR pytorch/pytorch#166970, and 166970 also depends on this PR for merging, creating a circular dependency. My suggestion is to make some check to maintain compatibility with both before and after the 166970 merge. |
Summary:
we're making changes to flight recorder's record and reset_id api's. update the usage accordingly.