
fix(sync): fix SnapshotReader goroutine race and fsync ordering#1166

Open
corylanou wants to merge 3 commits into main from issue-1164-database-gets-corrupted-randomly

Conversation


@corylanou corylanou commented Feb 25, 2026

Summary

Fixes two bugs related to database corruption during replication (issue #1164).

1. SnapshotReader goroutine race (db.go)

SnapshotReader() launches a goroutine to encode a full database snapshot into an LTX stream via io.Pipe. The outer function acquired chkMu.RLock() with defer RUnlock(), but the goroutine outlived the function — when the outer function returned the io.PipeReader, the defer fired and released the lock while the goroutine was still reading WAL and database pages. A concurrent checkpoint could then modify pages mid-read, producing a corrupted snapshot.

Fix: Move chkMu.RLock()/RUnlock() and the commit-size computation (f.Stat()) into the goroutine so the read lock is held for the entire duration of page encoding.

Performance impact: zero. Same read lock, correct duration. No memory copies, no allocations.

2. fsync ordering in applyLTXFile (replica.go)

Added f.Sync() before f.Truncate() to ensure all page writes are durable before the file is truncated. A crash between the page writes and truncate could otherwise lose page data during restore.

What was investigated and ruled out

We initially investigated a TOCTOU race in the sync() path (the incremental WAL replication). Analysis confirmed this path is safe:

  • db.rtx (read transaction) prevents WAL TRUNCATE/RESTART checkpoints
  • chkMu.RLock() is held synchronously for the entire sync() function
  • PASSIVE checkpoints only copy frames to the DB file without overwriting WAL frames

The earlier WAL-snapshot-into-memory approach (copying the entire WAL into a []byte) was reverted because it solved a non-existent race in sync() and would not scale for large WALs.

The actual race was in SnapshotReader(), which has a different execution model (goroutine outlives function scope).

Ref #1164

Test Plan

Unit tests (all pass with -race):

  • TestSyncRestoreIntegrity — Full sync→restore→integrity check flow with concurrent writes
  • TestSyncRestoreIntegrity_WithCheckpoints — Stress variant forcing PASSIVE checkpoints between syncs
  • TestApplyLTXFile_TruncatesAfterWrite — Verifies fsync-before-truncate ordering
  • TestApplyLTXFile_MultiplePages — Multi-page LTX application
  • Full test suite: go test -race ./... passes

Integration soak test (TestSoakReplicateRestore):

  • Mirrors fuchstim's exact setup from issue #1164 ("Database corrupted when restoring from backup")
  • WAL-mode SQLite DB with tables + indexes (_uid, _resource_version, indexed columns)
  • ~100 writes/sec concurrent load against MinIO S3
  • Every 30 seconds: stop replication → restore from S3 → PRAGMA integrity_check → restart
  • Tracks write latency P50/P95/P99/max, fails if P99 > 500ms
  • Run: go test -tags "integration,soak,docker" -run TestSoakReplicateRestore -v -timeout 10m ./tests/integration/


github-actions bot commented Feb 25, 2026

PR Build Metrics

All clear — no issues detected

Check            Status summary
Binary size      35.82 MB (0.0 KB / 0.00%)
Dependencies     No changes
Vulnerabilities  None detected
Go toolchain     1.24.13 (latest)
Module graph     1204 edges (0)

Binary size details
                 Size       Change
Base (0b3facd)   35.82 MB
PR (e8ac2e9)     35.82 MB   0.0 KB (0.00%)

Dependency changes
No dependency changes.

govulncheck output
No vulnerabilities found.

Build info
Metric      Value
Build time  41s
Go version  go1.24.13
Commit      e8ac2e9

History (8 previous)
Commit   Updated                Status summary
f036905  2026-03-03 19:45 UTC   35.82 MB (0.0 KB / 0.00%)
d51bf25  2026-02-27 19:57 UTC   35.82 MB (0.0 KB / 0.00%)
9187490  2026-02-27 15:58 UTC   35.82 MB (0.0 KB / 0.00%)
3155dbe  2026-02-27 15:31 UTC   35.82 MB (0.0 KB / 0.00%)
7dae982  2026-02-26 23:24 UTC   35.82 MB (0.0 KB / 0.00%)
e2ad940  2026-02-26 17:37 UTC   35.82 MB (0.0 KB / 0.00%)
3d9be0d  2026-02-26 16:56 UTC   35.82 MB (+4.0 KB / +0.01%)
32e60c5  2026-02-25 22:46 UTC   35.82 MB (+8.0 KB / +0.02%)

🤖 Updated on each push.

On Feb 25, 2026, corylanou changed the title from "fix(restore): add post-restore integrity validation and fsync ordering" to "fix(sync): prevent WAL checkpoint race causing replica corruption".
On February 26, 2026 at 16:54, corylanou force-pushed the issue-1164-database-gets-corrupted-randomly branch from f311865 to f32e404.
On Feb 26, 2026, corylanou changed the title from "fix(sync): prevent WAL checkpoint race causing replica corruption" to "fix(sync): buffer WAL pages to prevent checkpoint race causing replica corruption".
On February 26, 2026 at 17:35, corylanou force-pushed the branch from f32e404 to 8905efd.
corylanou (Collaborator, Author) commented:

Testing against the reproduction repo

We used the schema and access patterns from @fuchstim's reproduction repo to build a deterministic test (TestSync_WALRaceCondition) that mechanically demonstrates the race condition.

The reproduction repo creates a table with 5 indexes and 6 triggers, then runs 100 ops/sec (inserts, updates, soft-deletes, outbox cleanup). Each transaction touches many WAL pages across different B-tree structures — exactly the scenario that maximizes the checkpoint race window.

What the test does

  1. Creates a DB using the same schema (table data with _uid, _resource_version, _deleted_at, name, data_json, is_active columns + all the indexes from the repro)
  2. Inserts 200 rows to generate many WAL pages across table and index B-trees
  3. Reads the WAL using both approaches:
    • OLD: Records page offsets (map[uint32]int64) — what the code did before this PR
    • NEW: Buffers page data (map[uint32][]byte) — what the code does now
  4. Overwrites all WAL frames with garbage (simulating a PASSIVE checkpoint rewriting the WAL)
  5. Compares results
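The core of steps 3–5 can be sketched as follows (simplified, with assumed names; the real test is TestSync_WALRaceCondition and operates on actual SQLite WAL frames, not 4-byte pages):

```go
package main

// Tiny pages stand in for real WAL frames.
const pageSize = 4

// OLD approach: record only each page's offset into the shared WAL
// buffer. The bytes are re-read later, so a checkpoint that rewrites
// the WAL in the meantime corrupts the result.
func readByOffset(wal []byte, offsets map[uint32]int64, pgno uint32) []byte {
	off := offsets[pgno]
	return wal[off : off+pageSize] // whatever is in the WAL *now*
}

// NEW approach: copy the page bytes at read time. Later rewrites of
// the WAL cannot affect the captured data.
func bufferPage(wal []byte, off int64) []byte {
	page := make([]byte, pageSize)
	copy(page, wal[off:off+pageSize])
	return page
}
```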

Test output

buffered PageMap: 10 pages, commit=13, maxOffset=1709832
old offset map: 10 page offsets recorded
corrupted 416 WAL frames to simulate checkpoint

OLD offset-based approach: 10/10 pages corrupted by checkpoint simulation
NEW buffered approach: 0/10 pages corrupted

LTX file valid: 10 pages encoded and decoded successfully

The old approach produces 10/10 corrupted pages when the WAL is modified — this is what was happening in production. The new approach produces 0/10 corrupted pages because the data was already captured during the initial read.

Thanks again to @fuchstim for the detailed trace logs, reproduction repo, and all the debugging — it made pinpointing this race condition much faster.

On February 26, 2026 at 23:22, corylanou force-pushed the issue-1164-database-gets-corrupted-randomly branch from 8905efd to 2a735f6.
On Feb 26, 2026, corylanou changed the title from "fix(sync): buffer WAL pages to prevent checkpoint race causing replica corruption" to "fix(sync): snapshot WAL into memory to prevent checkpoint race causing replica corruption".
On Feb 27, 2026, corylanou changed the title from "fix(sync): snapshot WAL into memory to prevent checkpoint race causing replica corruption" to "fix(follow): sync pages to disk before truncate".
Add f.Sync() before f.Truncate() in applyLTXFile() to ensure pages are
durably written before the file is truncated. Without this ordering, a
crash between truncate and the implicit close-sync could leave the LTX
file with missing pages.

Add sync→restore integrity tests that exercise concurrent writes,
checkpoints, and syncs then verify the restored database passes
PRAGMA integrity_check with correct row counts. These tests reproduce
the scenario from issue #1164.

Ref #1164
SnapshotReader() launches a goroutine to encode a full database snapshot
into an LTX stream via an io.Pipe. Previously, chkMu.RLock() was
acquired in the outer function with defer RUnlock(). When the outer
function returned the io.PipeReader to the caller, the defer fired and
released the lock — while the goroutine was still reading WAL and
database pages. This allowed a concurrent checkpoint to modify pages
mid-read, producing a corrupted snapshot.

Move chkMu.RLock()/RUnlock() and the commit-size computation (f.Stat())
into the goroutine so the read lock is held for the entire duration of
page encoding. This has zero performance impact — it is the same read
lock held for the correct duration rather than being released early.

Analysis of the two replication code paths:
- sync() (line 1470): chkMu.RLock held synchronously for the entire
  function. db.rtx prevents TRUNCATE/RESTART checkpoints. No race.
- SnapshotReader() (line 1841): chkMu.RLock was released when the
  outer function returned, before the goroutine finished. Race exists.
  This is what this commit fixes.

Add TestSoakReplicateRestore integration test that mirrors the exact
setup from issue #1164: WAL-mode DB with indexed tables, ~100 writes/sec,
periodic stop→restore→integrity_check cycles against MinIO S3.

Ref #1164
On February 27, 2026 at 19:56, corylanou force-pushed the issue-1164-database-gets-corrupted-randomly branch from 1d27456 to f8beb02.
On Feb 27, 2026, corylanou changed the title from "fix(follow): sync pages to disk before truncate" to "fix(sync): fix SnapshotReader goroutine race and fsync ordering".
Fix YAML config for restore in TestSoakReplicateRestore — use proper
replicas: list format instead of the CreateSoakConfig helper which
generates invalid indentation for the restore command.

Fix undefined criticalErrors in minio_soak_test.go and
comprehensive_soak_test.go — use len(errors) from CheckForErrors().
