Fix MPI-Checker warning: Add MPIRequest destructor for proper resource cleanup #37915
Conversation
Co-authored-by: blozano-tt <181790211+blozano-tt@users.noreply.github.com>
/codeowners ping Please check for validity. Maybe we can fix the resource leaks observed in CI?
CodeOwners Group Analysis
This PR requires approval from one member of each of the following groups.
Summary: 1 pending group, 0 approved groups.
Note: At least one approval from each group is sufficient.
Hi Allan Liu (@aliuTT), Joseph Chu (@cfjchu), this PR "Fix MPI-Checker warning: Add MPIRequest destructor for proper resource cleanup" by Copilot (@Copilot) needs your approval/review to merge.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
tt_metal/distributed/multihost/mpi_distributed_context.hpp:52
- Rule-of-5 violation: Adding a custom destructor without deleting or defining the copy/move constructors and assignment operators creates a risk of double-free bugs. Since `MPIRequest` manages an `MPI_Request` resource that should not be copied (cancelling/freeing the same MPI request twice is undefined behavior), you should either:
  - Delete the copy/move operations: `MPIRequest(const MPIRequest&) = delete; MPIRequest& operator=(const MPIRequest&) = delete;`
  - Or implement proper copy/move semantics (though this is complex with MPI requests).

  While the class is currently always used through `std::shared_ptr`, this protection is not enforced at the type level and could be accidentally violated in future code.
```cpp
class MPIRequest : public Request {
public:
    explicit MPIRequest(MPI_Request req) : req_(req) {}
    ~MPIRequest() override;

    Status wait() override;
    std::optional<Status> test() override;
    void cancel() override;
    bool active() const override;

private:
    mutable MPI_Request req_{};
    bool done_{};
};
```
```cpp
MPIRequest::~MPIRequest() {
    if (done_ || was_mpi_finalized()) {
        return;
    }
    // Cancel and free the incomplete request.
    // Not using MPI_CHECK here to avoid throwing in destructor.
    // Errors are ignored because we cannot throw from a destructor
    // and the request will be freed regardless.
    [[maybe_unused]] int cancel_err = MPI_Cancel(&req_);
    [[maybe_unused]] int free_err = MPI_Request_free(&req_);
}
```
Potential race condition: The `done_` flag is checked without synchronization, but in a multi-threaded MPI environment (this code uses MPI_THREAD_MULTIPLE), another thread could be concurrently calling `wait()`, `test()`, or `cancel()`, which also access `done_`. This could lead to:

- Double-free: the destructor runs while `cancel()` is executing.
- Use-after-free: the destructor frees `req_` while `wait()` is using it.

Consider making `done_` a `std::atomic<bool>`, or using a mutex to protect both `req_` and `done_` access across all methods. Note that `cancel()` also has this issue (lines 162-169), but the destructor makes the race more likely since it can be called from any thread when the shared_ptr refcount drops to zero.
Ticket
N/A
Problem description
Clang Static Analyzer's MPI-Checker flagged an `optin.mpi.MPI-Checker` warning at line 254: "Request 'req' has no matching wait." `MPIRequest` wraps an `MPI_Request` but lacked a destructor. Per the MPI standard, requests must be completed via `MPI_Wait()` or `MPI_Test()`, or cancelled/freed before program exit. Without a destructor, incomplete requests leak when `MPIRequest` objects are destroyed without explicit `wait()`/`cancel()` calls.

What's changed
Added a `~MPIRequest()` destructor that:

- skips cleanup when the `done_` flag is set
- skips cleanup after MPI shutdown, via the `was_mpi_finalized()` check
- otherwise calls `MPI_Cancel()` + `MPI_Request_free()`

Mirrors the existing `cancel()` method pattern. 13 LOC added across 2 files.

Checklist
Model tests
If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers `models-mandatory` and `models-extended` presets. The former includes a minimal set of tests, to be run always. The latter extends that with additional ones; use your best judgement in deciding which is the most appropriate for your PR.

- `models-mandatory` preset (runs: Device perf regressions and Frequent model and ttnn tests); `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Unit tests); `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Quick tests); `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)

Original prompt