fix(duplication): fix duplication core because of removing or pausing #2243

lengyuexuexuan · 2025-04-29T09:19:58Z

What problem does this PR solve?

#2211

What is changed and how does it work?

Background
During duplication of a table, if the dup remove/pause commands are executed, or a balance operation runs at the same time, a node may occasionally core dump with signal 11 (SIGSEGV). The core dump locations vary, but they all have one thing in common: they occur during memory allocation or deallocation.
Analysis
Based on extensive testing, the following conclusions can be drawn:
a. The issue only reproduces when there is write traffic. The difference is that write traffic adds the ship and load_private_log tasks.
b. The core dump occurs during the execution of cancel_all().
c. The issue occurs with low probability (approximately 1 in 100).

Through analysis using ASAN (AddressSanitizer):
dup_remove_asan.txt
Based on ASAN analysis, the following conclusions can be drawn:
a. The memory corruption occurs during the ship process. The mutations obtained from replaying the plog are passed to ship, leading to the issue.
b. _load_mutations is captured by a lambda expression and then passed to a std::function. Since std::move is used, the lifetime of _load_mutations is tied to that of the std::function.
c. The cancel_all() function is executed in the default thread pool. At this point, the following function is called. When the std::function is set to nullptr, it will release the memory it manages.

incubator-pegasus/src/task/task.h

Line 341 in e64faa7

void clear_non_trivial_on_task_end() override { _cb = nullptr; }

d. However, each task executes exec_internal() in its own thread pool, and eventually calls release_ref(), which results in delete this.

incubator-pegasus/src/task/task.cpp

Line 224 in e64faa7

this->release_ref(); // added in enqueue(pool)
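
Points b and c can be illustrated in isolation. The following is a minimal, single-threaded sketch, not Pegasus code: mutation_list and the function name are hypothetical stand-ins. It only shows that once data is moved into a lambda stored in a std::function, assigning nullptr to that std::function frees the data.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the mutations loaded from the private log.
using mutation_list = std::vector<int>;

// Returns true if the captured data is alive while the callback holds it
// and is freed as soon as the callback is reset to nullptr.
bool clearing_callback_frees_captured_state()
{
    auto mutations = std::make_shared<mutation_list>(mutation_list{1, 2, 3});
    std::weak_ptr<mutation_list> observer = mutations;

    // std::move hands ownership to the lambda, so the callback becomes the sole owner.
    std::function<void()> cb = [owned = std::move(mutations)]() { (void)owned->size(); };
    const bool alive_before = !observer.expired();

    // Same kind of assignment as `_cb = nullptr` in clear_non_trivial_on_task_end().
    cb = nullptr;
    const bool freed_after = observer.expired();

    return alive_before && freed_after;
}
```

If a second thread destroys the same std::function at the same moment, the captured data can be released twice, which matches the symptom in the ASAN report.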
Conclusion
Both task.cancel() and task.exec_internal() destruct the std::function member inside the task object. These two operations are executed in different threads, and there is no mechanism in place to prevent race conditions between them. As a result, it is possible for both threads to attempt to destruct the same std::function, which can lead to a double deletion of the memory associated with _load_mutations. This ultimately causes memory corruption.
Solution
Replace cancel_all() with wait_all().
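
The idea behind the change can be sketched as follows. This is an illustration under stated assumptions, not the Pegasus task_tracker: simple_tracker, spawn and sum_after_wait_all are hypothetical names, and std::async stands in for the thread pools. The point is that waiting lets each task finish and release its own callback, so no other thread clears that callback while the task is still running.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical tracker, NOT the Pegasus task_tracker.
class simple_tracker
{
public:
    void spawn(std::function<void()> cb)
    {
        // The callback is moved into the task; nothing else touches it while it runs.
        _pending.push_back(std::async(std::launch::async, [cb = std::move(cb)]() { cb(); }));
    }

    // Blocks until every spawned task has run to completion.
    void wait_all()
    {
        for (auto &f : _pending) {
            f.get();
        }
        _pending.clear();
    }

private:
    std::vector<std::future<void>> _pending;
};

// Spawns four tasks that each own a copy of the data, then waits for all of them.
int sum_after_wait_all()
{
    std::atomic<int> sum{0};
    const std::vector<int> mutations{1, 2, 3, 4};
    simple_tracker tracker;
    for (int i = 0; i < 4; ++i) {
        tracker.spawn([&sum, batch = mutations]() {
            for (int v : batch) {
                sum += v;
            }
        });
    }
    tracker.wait_all(); // every task has finished before sum goes out of scope
    return sum.load();
}
```

The trade-off of waiting instead of cancelling is that remove/pause blocks until in-flight tasks complete.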

Tests

Manual test
The issue did not reproduce in 20 test runs with this change applied, which supports the effectiveness of the solution.

fix: fix duplication core because of removing or pausing

6380cbb

github-actions bot added the cpp label Apr 29, 2025

lengyuexuexuan changed the title ~~fix: fix duplication core because of removing or pausing~~ fix(duplication): fix duplication core because of removing or pausing Apr 29, 2025
