fix(db): skip unnecessary snapshots after checkpoint when WAL fully synced#1169
Closed
Conversation
…ynced

When all WAL frames were already synced before a checkpoint restarts the WAL, `verify()` no longer forces a full snapshot. The `syncedToWALEnd` flag is now used as a guard in both the `forceNextSnapshot` and WAL truncation paths, avoiding snapshot-sized LTX files every checkpoint cycle (~5 min).

Fixes #1165
The WAL truncation path cannot safely use `syncedToWALEnd` as a guard because the flag can be stale when writes occur between the last sync and an external checkpoint. Only the `forceNextSnapshot` path is safe because Litestream always syncs before its own checkpoints.
Address review finding: document why checking `syncedToWALEnd` is safe in the `forceNextSnapshot` path — `checkpoint()` always calls `verifyAndSync()` immediately before, so the flag is never stale.
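The guard these commits describe can be sketched minimally in Go. The field names follow the PR discussion, but the surrounding `db` structure here is a simplification, not Litestream's actual type:

```go
package main

import "fmt"

// Sketch of the guard described above (simplified structure; field names
// follow the PR). A checkpoint-driven WAL restart only requires a full
// snapshot when the last sync did NOT reach the end of the WAL.
type db struct {
	forceNextSnapshot bool // set after a checkpoint restarts the WAL
	syncedToWALEnd    bool // set when the previous sync captured every frame
}

func (d *db) shouldSnapshot() bool {
	// Some WAL frames were never captured: an incremental sync would
	// miss pages that the checkpoint already moved into the database.
	if d.forceNextSnapshot && !d.syncedToWALEnd {
		return true
	}
	// All frames were synced before the restart; incremental suffices.
	return false
}

func main() {
	fmt.Println((&db{forceNextSnapshot: true, syncedToWALEnd: true}).shouldSnapshot())  // false
	fmt.Println((&db{forceNextSnapshot: true, syncedToWALEnd: false}).shouldSnapshot()) // true
}
```

This is safe only because, per the commit above, `checkpoint()` syncs immediately before checking the flag; an external checkpoint offers no such ordering guarantee.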
Force-pushed from 6e5dfec to 7666e32
Add a WAL size check between the pre-checkpoint sync and checkpoint execution. If the WAL grew (a concurrent writer appended frames after our sync), clear `syncedToWALEnd` so the post-checkpoint verify forces a full snapshot. This prevents missing pages that get checkpointed into the DB without being captured in an LTX file.
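The size check itself reduces to a single comparison. This sketch uses hypothetical names (`needsFullSnapshot`, the two size parameters); the real code reads the sizes from the filesystem:

```go
package main

import "fmt"

// Sketch of the concurrent-writer check described above (hypothetical
// names). If the WAL file grew between the pre-checkpoint sync and the
// checkpoint itself, a writer appended frames we have not captured, so
// the post-checkpoint verify must force a full snapshot.
func needsFullSnapshot(sizeAtSync, sizeAtCheckpoint int64) bool {
	// Any growth means frames exist beyond our last synced offset.
	return sizeAtCheckpoint > sizeAtSync
}

func main() {
	fmt.Println(needsFullSnapshot(4096, 4096)) // false: no concurrent writes
	fmt.Println(needsFullSnapshot(4096, 8192)) // true: WAL grew after our sync
}
```

As the next commit notes, this check turned out to be unreliable after a WAL restart, because the physical file size can include stale frames from a previous generation.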
…ToWALEnd

After a WAL restart, the physical WAL file retains stale data from the previous generation while valid content starts from offset 32. The previous code compared `finalOffset` against `walFileSize()`, which included this stale data, causing `syncedToWALEnd` to be incorrectly false. Every checkpoint then forced a full snapshot even when all WAL frames had already been synced, resulting in snapshot-sized LTX files every few minutes.

Changes:
- `sync()`: set `syncedToWALEnd=true` unconditionally, since `WALReader` reads all valid frames (stopping at a salt mismatch or EOF)
- `checkpoint()`: replace the `walFileSize()` concurrent-writer check with frame salt validation — read the frame header at `lastSyncedWALOffset` and verify its salts match the current WAL generation

Fixes #1165
benbjohnson requested changes on Mar 3, 2026
…flag

Address PR review: the in-memory `syncedToWALEnd` flag was problematic because it didn't survive Litestream restarts. Users running apps that continue while Litestream is down would lose this state. Now `verify()` determines whether all WAL frames were synced by:

1. Reading the last LTX file's `wal_offset` + `wal_size` and salts
2. Checking whether there is a valid frame at that position with the old salts
3. If matching salts exist, unsynced frames remain → force a snapshot
4. If the salts don't match, all frames were synced → skip the snapshot

This approach is persistent and survives restarts, since it reads from the LTX file, which is stored on disk.

Changes:
- `verify()`: rewrite the `forceNextSnapshot` path to use LTX-based detection
- `sync()`: remove the `syncedToWALEnd` assignment (no longer needed)
- `checkpoint()`: remove the concurrent-writer check (now handled in `verify()`)
- Remove the `syncedToWALEnd` field from the `DB` struct
- Update tests to work with the new approach
- Remove the obsolete `TestDB_Checkpoint_ConcurrentWriterClearsSyncedToWALEnd`

Fixes #1165
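The decision the four steps above arrive at can be reduced to one predicate. This is a sketch of the logic only (the parameter names are illustrative; the real code derives both inputs by reading the WAL at the LTX file's `wal_offset + wal_size`):

```go
package main

import "fmt"

// Sketch of the persistent LTX-based detection described above
// (illustrative names). After locating the position just past the last
// synced frame, we inspect the physical WAL there:
//   - a valid frame carrying the OLD salts means frames existed past our
//     last sync point and were checkpointed uncaptured => force snapshot
//   - no frame, or salts from a newer generation => all frames were
//     synced => skip the snapshot
func mustForceSnapshot(frameExists, frameSaltsMatchLTX bool) bool {
	return frameExists && frameSaltsMatchLTX
}

func main() {
	fmt.Println(mustForceSnapshot(true, true))   // true: unsynced frames were checkpointed away
	fmt.Println(mustForceSnapshot(true, false))  // false: frame belongs to a newer WAL generation
	fmt.Println(mustForceSnapshot(false, false)) // false: nothing past the synced offset
}
```

Because both inputs come from on-disk state (the LTX file and the WAL), the check gives the same answer after a Litestream restart, which is exactly what the in-memory flag could not do.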
Author (Collaborator):
@codex review

Owner:
I reverted the PR that caused this bug (#1185) and we're re-evaluating the fix.
Description
When all WAL frames were already synced before a checkpoint restarts the WAL, `verify()` no longer forces a full snapshot via the `forceNextSnapshot` path. The `syncedToWALEnd` flag is now used as a guard: if the previous sync captured all WAL frames, the checkpoint's WAL restart is expected and an incremental sync suffices.

This prevents snapshot-sized LTX files (~119MB for a ~119MB database) from being created every Litestream checkpoint cycle (~5 minutes), which was a regression introduced in v0.5.9 by commit 48ecd53 (fix(db): detect WAL changes via shm mxFrame (#1087)).

The fix is scoped to the `forceNextSnapshot` path only (Litestream's own checkpoints). The WAL truncation path (external checkpoints) continues to always force a snapshot, since `syncedToWALEnd` can be stale when writes occur between the last sync and an external checkpoint.

Motivation and Context
Fixes #1165
Fixes #1171
Fixes #1175
After upgrading to v0.5.9, users reported ~119MB snapshot-sized LTX files created every few minutes in both the meta folder and the replica's `ltx/0`, without corresponding "snapshot complete" log entries. The root cause was that commit 48ecd53 added a `forceNextSnapshot` mechanism that unconditionally forces full snapshots after every Litestream checkpoint, even when all WAL frames were already captured.

This same root cause was independently reported as high PUT volume on idle databases (#1171) and runaway replication with restore failures (#1175).
How Has This Been Tested?
- `go test -race -v -run "TestDB_Verify_ForceSnapshot|TestDB_CheckpointDoesNotCreate" .` — all 3 targeted tests pass
- `go test -race ./...` — full test suite passes
- `go build ./...` — builds cleanly

Tests:
- `TestDB_Verify_ForceSnapshotSkippedWhenSyncedToWALEnd` (new) — `forceNextSnapshot=true` + `syncedToWALEnd=true` → no snapshot
- `TestDB_CheckpointDoesNotCreateSnapshotWhenFullySynced` (new) — integration test: a full checkpoint cycle produces a small incremental LTX, not a snapshot-sized one
- `TestDB_Verify_ForceSnapshotAfterCheckpointWALRestart` (updated) — explicitly sets `syncedToWALEnd=false` to test the unsynced case where a snapshot IS needed

Types of changes
Checklist
- [x] Code follows the project style (`go fmt`, `go vet`)
- [x] All tests pass (`go test ./...`)