fix(duplication): fix duplication core because of removing or pausing #2243

Open
wants to merge 1 commit into
base: master

lengyuexuexuan
Collaborator

What problem does this PR solve?

#2211

What is changed and how does it work?

  1. Background
    During duplication of a table, if a dup remove/pause command is executed or a balance operation runs at the same time, a node may core dump with signal 11 (SIGSEGV). The crash locations vary, but they share one trait: each occurs during memory allocation or deallocation.

  2. Analysis
    Based on extensive testing, the following conclusions can be drawn:
    a. The issue only reproduces when there is write traffic. Write traffic makes the difference because it adds the ship and load_private_log tasks.
    b. The core dump occurs during the execution of cancel_all().
    c. The issue occurs with low probability (approximately 1 in 100).

    Through analysis using ASAN (AddressSanitizer):
    dup_remove_asan.txt
    Based on ASAN analysis, the following conclusions can be drawn:
    a. The memory corruption occurs in the ship step, which receives the mutations obtained from replaying the plog.
    b. _load_mutations is captured by a lambda expression that is then stored in a std::function. Because std::move is used for the capture, the lifetime of _load_mutations is tied to that of the std::function.
    c. cancel_all() runs in the default thread pool and calls the function below. Setting the std::function to nullptr releases the memory it manages, including the captured _load_mutations.

    void clear_non_trivial_on_task_end() override { _cb = nullptr; }

    d. However, each task executes exec_internal() in its own thread pool and eventually calls release_ref(), which results in delete this:

    this->release_ref(); // added in enqueue(pool)
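
To make points b and c concrete, here is a minimal single-threaded sketch of how a buffer moved into a lambda capture lives and dies with the std::function that stores it. The toy_task type and the capture_freed_on_reset helper are hypothetical stand-ins written for this illustration, not the real task class. A shared_ptr is used only so that a weak_ptr can observe when the buffer is freed.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a task that owns a callback, loosely modeled on
// the _cb member quoted above. This is not the real task class.
struct toy_task {
    std::function<void()> _cb;
    void clear_non_trivial_on_task_end() { _cb = nullptr; }
};

// Returns true if the captured buffer is alive while the callback is stored
// and is freed as soon as the callback is reset to nullptr.
bool capture_freed_on_reset() {
    auto mutations = std::make_shared<std::vector<int>>(1024, 7);
    std::weak_ptr<std::vector<int>> observer = mutations;

    toy_task t;
    // The lambda takes ownership; after this line only _cb keeps the buffer alive.
    t._cb = [m = std::move(mutations)]() { (void)m->size(); };
    const bool alive_before = !observer.expired();

    // Resetting the std::function destroys the lambda and its captures.
    t.clear_non_trivial_on_task_end();
    const bool freed_after = observer.expired();

    return alive_before && freed_after;
}
```

In this sketch the reset happens on the only thread involved, so it is safe. The problem described in this PR arises when a second thread can destroy the same std::function concurrently.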

  3. Conclusion
    Both task.cancel() and task.exec_internal() destroy the std::function member inside the task object. The two operations run in different threads, and nothing synchronizes them. As a result, both threads can attempt to destroy the same std::function, which can lead to a double deletion of the memory associated with _load_mutations. This ultimately causes memory corruption.

  4. Solution
    Replace cancel_all() with wait_all().
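
The idea behind waiting instead of cancelling can be sketched as follows: the cleanup path blocks until every pending task has finished, so each callback is only ever destroyed by the thread that executed it. The toy_tracker class and the run_and_wait helper are hypothetical and built on std::async purely for illustration; the project's real task tracker and thread pools work differently.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical tracker used only for this sketch.
class toy_tracker {
public:
    void spawn(std::function<void()> cb) {
        _pending.push_back(
            std::async(std::launch::async, [cb = std::move(cb)]() mutable {
                cb();
                cb = nullptr; // captures are released by the executing thread only
            }));
    }

    // Blocks until every spawned task has run to completion.
    void wait_all() {
        for (auto& f : _pending) {
            f.wait();
        }
        _pending.clear();
    }

private:
    std::vector<std::future<void>> _pending;
};

// Spawns n tasks that each own a payload, waits for all of them, and
// returns how many completed.
int run_and_wait(int n) {
    std::atomic<int> done{0};
    toy_tracker tracker;
    for (int i = 0; i < n; ++i) {
        auto payload = std::make_shared<std::vector<int>>(256, i);
        tracker.spawn([payload, &done]() { done += payload->empty() ? 0 : 1; });
    }
    tracker.wait_all(); // no other thread touches the callbacks
    return done.load();
}
```

The trade-off in this sketch is that the caller blocks until in-flight work completes, rather than returning immediately as a cancel would.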

Tests
  • Manual test
    Across 20 test runs with this change, the core dump did not occur, which supports the effectiveness of the solution.

@github-actions github-actions bot added the cpp label Apr 29, 2025
@lengyuexuexuan lengyuexuexuan changed the title fix: fix duplication core because of removing or pausing fix(duplication): fix duplication core because of removing or pausing Apr 29, 2025