
fix(db): skip unnecessary snapshots after checkpoint when WAL fully synced #1169

Closed
corylanou wants to merge 6 commits into main from
issue-1165-snapshot-sized-ltx-s-created-every-few-minutes-after-upgradi

Conversation


@corylanou (Collaborator) commented Feb 25, 2026

Description

When all WAL frames were already synced before a checkpoint restarts the WAL, verify() no longer forces a full snapshot via the forceNextSnapshot path. The syncedToWALEnd flag is now used as a guard: if the previous sync captured all WAL frames, the checkpoint's WAL restart is expected and an incremental sync suffices.

This prevents snapshot-sized LTX files (~119MB for a ~119MB database) from being created every Litestream checkpoint cycle (~5 minutes), which was a regression introduced in v0.5.9 by commit 48ecd53 (fix(db): detect WAL changes via shm mxFrame (#1087)).

The fix is scoped to the forceNextSnapshot path only (Litestream's own checkpoints). The WAL truncation path (external checkpoints) continues to always force a snapshot, since syncedToWALEnd can be stale when writes occur between the last sync and an external checkpoint.

Motivation and Context

Fixes #1165
Fixes #1171
Fixes #1175

After upgrading to v0.5.9, users reported ~119MB snapshot-sized LTX files created every few minutes in both the meta folder and replica ltx/0, without corresponding "snapshot complete" log entries. The root cause was that commit 48ecd53 added a forceNextSnapshot mechanism that unconditionally forces full snapshots after every Litestream checkpoint, even when all WAL frames were already captured.

This same root cause was independently reported as high PUT volume on idle databases (#1171) and runaway replication with restore failures (#1175).

How Has This Been Tested?

  • go test -race -v -run "TestDB_Verify_ForceSnapshot|TestDB_CheckpointDoesNotCreate" . — all 3 targeted tests pass
  • go test -race ./... — full test suite passes
  • go build ./... — builds cleanly

Tests:

  1. TestDB_Verify_ForceSnapshotSkippedWhenSyncedToWALEnd (new) — forceNextSnapshot=true + syncedToWALEnd=true → no snapshot
  2. TestDB_CheckpointDoesNotCreateSnapshotWhenFullySynced (new) — integration test: full checkpoint cycle produces small incremental LTX, not snapshot-sized
  3. TestDB_Verify_ForceSnapshotAfterCheckpointWALRestart (updated) — explicitly sets syncedToWALEnd=false to test the unsynced case where snapshot IS needed

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (would cause existing functionality to not work as expected)

Checklist

  • My code follows the code style of this project (go fmt, go vet)
  • I have tested my changes (go test ./...)
  • I have updated the documentation accordingly (if needed)


github-actions bot commented Feb 25, 2026

PR Build Metrics

All clear — no issues detected

Check Status Summary

  • Binary size: 35.82 MB (+4.0 KB / +0.01%)
  • Dependencies: No changes
  • Vulnerabilities: None detected
  • Go toolchain: 1.24.13 (latest)
  • Module graph: 1204 edges (0)

Binary size details

  Base (0b3facd): 35.82 MB
  PR (f29d9fa):   35.82 MB (+4.0 KB / +0.01%)

Dependency changes

No dependency changes.

govulncheck output

No vulnerabilities found.

Build info

  • Build time: 40s
  • Go version: go1.24.13
  • Commit: f29d9fa

History (6 previous)

  649d9f4 | 2026-03-03 19:56 UTC | 35.82 MB (0.0 KB / 0.00%)
  ce27ee8 | 2026-02-26 19:38 UTC | 35.82 MB (0.0 KB / 0.00%)
  5b0569b | 2026-02-26 18:51 UTC | 35.82 MB (0.0 KB / 0.00%)
  931a2f9 | 2026-02-26 18:07 UTC | 35.82 MB (+8.0 KB / +0.02%)
  609f1d9 | 2026-02-26 15:47 UTC | 35.82 MB (+8.0 KB / +0.02%)
  09dee99 | 2026-02-25 21:07 UTC | 35.82 MB (+8.0 KB / +0.02%)

fix(db): skip unnecessary snapshots after checkpoint when WAL fully synced

When all WAL frames were already synced before a checkpoint restarts the
WAL, verify() no longer forces a full snapshot. The syncedToWALEnd flag
is now used as a guard in both the forceNextSnapshot and WAL truncation
paths, avoiding snapshot-sized LTX files every checkpoint cycle (~5 min).

Fixes #1165

The WAL truncation path cannot safely use syncedToWALEnd as a guard
because the flag can be stale when writes occur between the last sync
and an external checkpoint. Only the forceNextSnapshot path is safe
because Litestream always syncs before its own checkpoints.

Address review finding: document why checking syncedToWALEnd is safe
in the forceNextSnapshot path: checkpoint() always calls
verifyAndSync() immediately before, so the flag is never stale.
@corylanou force-pushed the issue-1165-snapshot-sized-ltx-s-created-every-few-minutes-after-upgradi branch from 6e5dfec to 7666e32 on February 26, 2026 at 18:06 UTC
Add a WAL size check between the pre-checkpoint sync and checkpoint
execution. If the WAL grew (concurrent writer appended frames after
our sync), clear syncedToWALEnd so the post-checkpoint verify forces
a full snapshot. This prevents missing pages that get checkpointed
into the DB without being captured in an LTX.
…ToWALEnd

After a WAL restart, the physical WAL file retains stale data from the
previous generation while valid content starts from offset 32. The
previous code compared finalOffset against walFileSize() which included
this stale data, causing syncedToWALEnd to be incorrectly false.

This caused every checkpoint to force a full snapshot even when all WAL
frames had already been synced, resulting in snapshot-sized LTX files
every few minutes.

Changes:
- sync(): Set syncedToWALEnd=true unconditionally since WALReader reads
  all valid frames (stopping at salt mismatch or EOF)
- checkpoint(): Replace walFileSize() concurrent writer check with frame
  salt validation — read the frame header at lastSyncedWALOffset and
  verify salts match the current WAL generation

Fixes #1165
…flag

Address PR review: the in-memory syncedToWALEnd flag was problematic
because it didn't survive Litestream restarts. Users running apps that
continue when Litestream crashes would lose this state.

Now verify() determines if all WAL frames were synced by:
1. Reading the last LTX file's wal_offset + wal_size and salts
2. Checking if there's a valid frame at that position with old salts
3. If matching salts exist, unsynced frames remain → force snapshot
4. If salts don't match, all frames were synced → skip snapshot

This approach is persistent and survives restarts since it reads from
the LTX file which is stored on disk.

Changes:
- verify(): Rewrite forceNextSnapshot path to use LTX-based detection
- sync(): Remove syncedToWALEnd assignment (no longer needed)
- checkpoint(): Remove concurrent writer check (now handled in verify)
- Remove syncedToWALEnd field from DB struct
- Update tests to work with new approach
- Remove obsolete TestDB_Checkpoint_ConcurrentWriterClearsSyncedToWALEnd

Fixes #1165
@corylanou (Collaborator, Author) commented:

@codex review

@benbjohnson (Owner) commented:

I reverted the PR that caused this bug (#1185) and we're re-evaluating the fix.

@benbjohnson closed this Mar 5, 2026