fix(sync): fix SnapshotReader goroutine race and fsync ordering #1166
**PR Build Metrics:** ✅ All clear — no issues detected. No dependency changes.
*force-pushed `f311865` → `f32e404`*

*force-pushed `f32e404` → `8905efd`*
### Testing against the reproduction repo

We used the schema and access patterns from @fuchstim's reproduction repo to build a deterministic test. The reproduction repo creates a table with 5 indexes and 6 triggers, then runs 100 ops/sec (inserts, updates, soft-deletes, outbox cleanup). Each transaction touches many WAL pages across different B-tree structures — exactly the scenario that maximizes the checkpoint race window.

#### What the test does
#### Test output

The old approach produces 10/10 corrupted pages when the WAL is modified — this is what was happening in production. The new approach produces 0/10 corrupted pages because the data was already captured during the initial read.

Thanks again to @fuchstim for the detailed trace logs, reproduction repo, and all the debugging — it made pinpointing this race condition much faster.
*force-pushed `8905efd` → `2a735f6`*
Add `f.Sync()` before `f.Truncate()` in `applyLTXFile()` to ensure pages are durably written before the file is truncated. Without this ordering, a crash between truncate and the implicit close-sync could leave the LTX file with missing pages.

Add sync→restore integrity tests that exercise concurrent writes, checkpoints, and syncs, then verify the restored database passes `PRAGMA integrity_check` with correct row counts. These tests reproduce the scenario from issue #1164.

Ref #1164
`SnapshotReader()` launches a goroutine to encode a full database snapshot into an LTX stream via an `io.Pipe`. Previously, `chkMu.RLock()` was acquired in the outer function with `defer RUnlock()`. When the outer function returned the `io.PipeReader` to the caller, the defer fired and released the lock — while the goroutine was still reading WAL and database pages. This allowed a concurrent checkpoint to modify pages mid-read, producing a corrupted snapshot.

Move `chkMu.RLock()`/`RUnlock()` and the commit-size computation (`f.Stat()`) into the goroutine so the read lock is held for the entire duration of page encoding. This has zero performance impact — it is the same read lock, held for the correct duration rather than released early.

Analysis of the two replication code paths:

- `sync()` (line 1470): `chkMu.RLock` is held synchronously for the entire function. `db.rtx` prevents TRUNCATE/RESTART checkpoints. No race.
- `SnapshotReader()` (line 1841): `chkMu.RLock` was released when the outer function returned, before the goroutine finished. Race exists. This is what this commit fixes.

Add `TestSoakReplicateRestore` integration test that mirrors the exact setup from issue #1164: WAL-mode DB with indexed tables, ~100 writes/sec, periodic stop→restore→integrity_check cycles against MinIO S3.

Ref #1164
*force-pushed `1d27456` → `f8beb02`*
Fix YAML config for restore in `TestSoakReplicateRestore` — use the proper `replicas:` list format instead of the `CreateSoakConfig` helper, which generates invalid indentation for the restore command.

Fix undefined `criticalErrors` in `minio_soak_test.go` and `comprehensive_soak_test.go` — use `len(errors)` from `CheckForErrors()`.
## Summary
Fixes two bugs related to database corruption during replication (issue #1164).
### 1. SnapshotReader goroutine race (`db.go`)

`SnapshotReader()` launches a goroutine to encode a full database snapshot into an LTX stream via `io.Pipe`. The outer function acquired `chkMu.RLock()` with `defer RUnlock()`, but the goroutine outlived the function — when the outer function returned the `io.PipeReader`, the defer fired and released the lock while the goroutine was still reading WAL and database pages. A concurrent checkpoint could then modify pages mid-read, producing a corrupted snapshot.

**Fix:** Move `chkMu.RLock()`/`RUnlock()` and the commit-size computation (`f.Stat()`) into the goroutine so the read lock is held for the entire duration of page encoding.

**Performance impact:** zero. Same read lock, correct duration. No memory copies, no allocations.
### 2. fsync ordering in applyLTXFile (`replica.go`)

Added `f.Sync()` before `f.Truncate()` to ensure all page writes are durable before the file is truncated. A crash between the page writes and the truncate could otherwise lose page data during restore.

### What was investigated and ruled out
We initially investigated a TOCTOU race in the `sync()` path (the incremental WAL replication). Analysis confirmed this path is safe:

- `db.rtx` (read transaction) prevents WAL TRUNCATE/RESTART checkpoints
- `chkMu.RLock()` is held synchronously for the entire `sync()` function

The earlier WAL-snapshot-into-memory approach (copying the entire WAL into a `[]byte`) was reverted because it solved a non-existent race in `sync()` and would not scale for large WALs. The actual race was in `SnapshotReader()`, which has a different execution model (the goroutine outlives the function scope).

Ref #1164
## Test Plan

Unit tests (all pass with `-race`):

- `TestSyncRestoreIntegrity` — full sync→restore→integrity check flow with concurrent writes
- `TestSyncRestoreIntegrity_WithCheckpoints` — stress variant forcing PASSIVE checkpoints between syncs
- `TestApplyLTXFile_TruncatesAfterWrite` — verifies fsync-before-truncate ordering
- `TestApplyLTXFile_MultiplePages` — multi-page LTX application
- `go test -race ./...` passes

Integration soak test (`TestSoakReplicateRestore`):

- Tables with `_uid`, `_resource_version`, and other indexed columns
- Periodic stop → restore → `PRAGMA integrity_check` → restart cycles
- Run with: `go test -tags "integration,soak,docker" -run TestSoakReplicateRestore -v -timeout 10m ./tests/integration/`