fix(duplication): fix duplication core because of removing or pausing #2243
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What problem does this PR solve?
#2211
What is changed and how does it work?
Background
During duplicationof a table, if the commands dup remove/pause are executed or a balance operation is performed at the same time, there is a chance that a node may core dump with signal ID 11. The core dump locations vary, but they all have one thing in common: they occur during memory allocation or deallocation.
Analysis
Based on extensive testing, the following conclusions can be drawn:
a. The issue only reproduces when there is write traffic. The difference between having and not having write traffic is: It adds the ship and load_private_log tasks.
b. The core dump occurs during the execution of cancel_all().
c. The issue occurs with low probability (approximately 1 in 100).
Through analysis using ASAN (AddressSanitizer):
dup_remove_asan.txt
Based on ASAN analysis, the following conclusions can be drawn:
a. The memory corruption occurs during the ship process. The mutations obtained from replaying the plog are passed to ship, leading to the issue.
b. _load_mutations is captured by a lambda expression and then passed to a std::function. Since std::move is used, the lifetime of _load_mutations is tied to that of the std::function.
c. The cancel_all() function is executed in the default thread pool. At this point, the following function is called. When the std::function is set to nullptr, it will release the memory it manages.
incubator-pegasus/src/task/task.h
Line 341 in e64faa7
d. However, each task executes exec_internal() in its own thread pool, and eventually calls release_ref(), which results in delete this.
incubator-pegasus/src/task/task.cpp
Line 224 in e64faa7
Conclusion
Both task.cancel() and task.exec_internal() destruct the std::function member inside the task object. These two operations are executed in different threads, and there is no mechanism in place to prevent race conditions between them. As a result, it is possible for both threads to attempt to destruct the same std::function, which can lead to a double deletion of the memory associated with _load_mutations. This ultimately causes memory corruption.
Solution
Replace cancel_all() with wait_all().
Tests
Based on 20 test runs, the issue was successfully avoided, confirming the effectiveness of the solution.