Race condition: disk writes/deletes during second full_sync can cause duplication or SEGFAULT on follower #1806

@viktorerlingsson

Description

Describe the bug
Disk mutations (message publishing, queue purge/delete) are not atomic with the replication actions sent to followers. During the second full_sync (which holds @lock), another thread can write to or delete a file on disk while the corresponding append/delete_file call blocks waiting for @lock. This can cause:

  • Message duplication — a message written to disk is included in the synced file and then also sent as an append action after the follower is marked synced.
  • SEGFAULT — a queue purge/delete closes and unmaps an MFile that full_sync is reading via file.to_slice.

Timeline (duplication)

  1. Thread A (follower sync) — starts second full_sync, acquires @lock
  2. Thread A — files_with_hash computes the hash for file_1 (current state)
  3. Thread B (publisher) — writes msg_x to file_1 on disk
  4. Thread B — calls replicator.append(file_1, msg_x); each_follower blocks waiting for @lock
  5. Thread A — follower requests file_1, receives it (now including msg_x)
  6. Thread A — marks follower as synced, releases @lock
  7. Thread B — acquires @lock, follower is now synced, sends append(file_1, msg_x)
  8. Result — file_1 on the follower contains msg_x twice
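The timeline above can be reproduced in miniature with two threads and a lock. This is a hedged sketch in Python, not LavinMQ code: the dicts, sleeps, and thread bodies are illustrative stand-ins for the disk file, @lock, full_sync, and the publisher.

```python
import threading
import time

lock = threading.Lock()             # stands in for @lock
disk = {"file_1": ["msg_a"]}        # on-disk state
follower = {"synced": False, "files": {}}

def full_sync():                    # Thread A: second full_sync under @lock
    with lock:
        time.sleep(0.1)             # window in which Thread B writes msg_x
        # transfer the file as it exists *now* on disk — msg_x is included
        follower["files"]["file_1"] = list(disk["file_1"])
        follower["synced"] = True   # mark synced, then release @lock

def publish():                      # Thread B: publisher
    disk["file_1"].append("msg_x")  # step 3: disk write, outside the lock
    with lock:                      # step 4: replication action, blocks on @lock
        if follower["synced"]:      # step 7: follower looks synced now
            follower["files"]["file_1"].append("msg_x")  # duplicate append

a = threading.Thread(target=full_sync)
b = threading.Thread(target=publish)
a.start()
time.sleep(0.01)                    # let Thread A grab the lock first
b.start()
a.join()
b.join()
print(follower["files"]["file_1"])  # ['msg_a', 'msg_x', 'msg_x'] — msg_x twice
```

The duplicate appears because the disk write and the replication action straddle the critical section that full_sync holds.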

Root cause

The disk write (step 3) and the replication action (step 4) are not atomic with respect to the second full_sync holding @lock. The message is written to disk before each_follower is called, so the second full_sync can observe the new data in the file and transfer it to the follower. When @lock is released, the pending append action goes through to the now-synced follower, duplicating the data.
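One possible direction, sketched below with hypothetical names (RecordingReplicator, publish, replication_lock are not LavinMQ's actual API): hold the replication lock across both the disk mutation and the replication action, so full_sync can never observe a disk write whose append action is still queued behind @lock.

```python
import threading

replication_lock = threading.Lock()  # plays the role of @lock

class RecordingReplicator:
    """Hypothetical stand-in that just records replication actions."""
    def __init__(self):
        self.actions = []

    def append(self, file, msg):
        self.actions.append(("append", file, msg))

def publish(disk, replicator, file, msg):
    # One critical section covers both steps: a concurrent full_sync either
    # runs before (and sees neither) or after (and sees both).
    with replication_lock:
        disk.setdefault(file, []).append(msg)  # disk mutation
        replicator.append(file, msg)           # replication action

disk = {}
replicator = RecordingReplicator()
publish(disk, replicator, "file_1", "msg_x")
```

The trade-off is that every publish now contends on the lock that full_sync holds, so the real fix may need something finer-grained.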

Notes

  • The same class of issue exists for delete_file — a file could be deleted from disk (e.g. queue purge/delete) before each_follower runs. The second full_sync may then try to read an MFile that has been closed/unmapped, causing a SEGFAULT. Or the follower receives a delete for a file it never got.
  • This affects 2.7.0 where follower sync runs on a parallel execution context (@mt), making the race window more likely. In 2.6.x (single-threaded fibers), the window is narrower but still theoretically possible at yield points within full_sync.
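The delete_file hazard can be modeled with Python's mmap module, under the assumption that it is a fair analog: Crystal's MFile#to_slice hands out a raw pointer into the mapping, so a read after unmap segfaults, whereas Python's mmap detects the closed mapping and raises, which is enough to show the ordering problem.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"message data")
m = mmap.mmap(fd, 0)       # full_sync maps the file for reading

m.close()                  # queue purge/delete closes and unmaps it...

try:
    m[:5]                  # ...while full_sync still tries to read the slice
except ValueError as e:
    print("read after unmap:", e)  # Python raises; a raw pointer segfaults

os.close(fd)
os.remove(path)
```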
