
fix(db): skip unnecessary snapshots after checkpoint when WAL fully synced #1169

Closed
corylanou wants to merge 6 commits into main from
issue-1165-snapshot-sized-ltx-s-created-every-few-minutes-after-upgradi

Conversation


@corylanou (Collaborator) commented Feb 25, 2026

Description

When all WAL frames were already synced before a checkpoint restarts the WAL, verify() no longer forces a full snapshot via the forceNextSnapshot path. The syncedToWALEnd flag is now used as a guard: if the previous sync captured all WAL frames, the checkpoint's WAL restart is expected and an incremental sync suffices.

This prevents snapshot-sized LTX files (~119MB for a ~119MB database) from being created every Litestream checkpoint cycle (~5 minutes), which was a regression introduced in v0.5.9 by commit 48ecd53 (fix(db): detect WAL changes via shm mxFrame (#1087)).

The fix is scoped to the forceNextSnapshot path only (Litestream's own checkpoints). The WAL truncation path (external checkpoints) continues to always force a snapshot, since syncedToWALEnd can be stale when writes occur between the last sync and an external checkpoint.

Motivation and Context

Fixes #1165
Fixes #1171
Fixes #1175

After upgrading to v0.5.9, users reported ~119MB snapshot-sized LTX files created every few minutes in both the meta folder and replica ltx/0, without corresponding "snapshot complete" log entries. The root cause was that commit 48ecd53 added a forceNextSnapshot mechanism that unconditionally forces full snapshots after every Litestream checkpoint, even when all WAL frames were already captured.

This same root cause was independently reported as high PUT volume on idle databases (#1171) and runaway replication with restore failures (#1175).

How Has This Been Tested?

  • go test -race -v -run "TestDB_Verify_ForceSnapshot|TestDB_CheckpointDoesNotCreate" . — all 3 targeted tests pass
  • go test -race ./... — full test suite passes
  • go build ./... — builds cleanly

Tests:

  1. TestDB_Verify_ForceSnapshotSkippedWhenSyncedToWALEnd (new) — forceNextSnapshot=true + syncedToWALEnd=true → no snapshot
  2. TestDB_CheckpointDoesNotCreateSnapshotWhenFullySynced (new) — integration test: full checkpoint cycle produces small incremental LTX, not snapshot-sized
  3. TestDB_Verify_ForceSnapshotAfterCheckpointWALRestart (updated) — explicitly sets syncedToWALEnd=false to test the unsynced case where snapshot IS needed

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (would cause existing functionality to not work as expected)

Checklist

  • My code follows the code style of this project (go fmt, go vet)
  • I have tested my changes (go test ./...)
  • I have updated the documentation accordingly (if needed)


github-actions bot commented Feb 25, 2026

PR Build Metrics

All clear — no issues detected

Check Status Summary

  • Binary size: 35.82 MB (+4.0 KB / +0.01%)
  • Dependencies: No changes
  • Vulnerabilities: None detected
  • Go toolchain: 1.24.13 (latest)
  • Module graph: 1204 edges (0)

Binary size details

  Base (0b3facd): 35.82 MB
  PR (f29d9fa):   35.82 MB (+4.0 KB / +0.01%)

Dependency changes

No dependency changes.

govulncheck output

No vulnerabilities found.

Build info

  • Build time: 40s
  • Go version: go1.24.13
  • Commit: f29d9fa

History (6 previous)

  649d9f4 | 2026-03-03 19:56 UTC | 35.82 MB (0.0 KB / 0.00%)
  ce27ee8 | 2026-02-26 19:38 UTC | 35.82 MB (0.0 KB / 0.00%)
  5b0569b | 2026-02-26 18:51 UTC | 35.82 MB (0.0 KB / 0.00%)
  931a2f9 | 2026-02-26 18:07 UTC | 35.82 MB (+8.0 KB / +0.02%)
  609f1d9 | 2026-02-26 15:47 UTC | 35.82 MB (+8.0 KB / +0.02%)
  09dee99 | 2026-02-25 21:07 UTC | 35.82 MB (+8.0 KB / +0.02%)

fix(db): skip unnecessary snapshots after checkpoint when WAL fully synced

When all WAL frames were already synced before a checkpoint restarts the
WAL, verify() no longer forces a full snapshot. The syncedToWALEnd flag
is now used as a guard in both the forceNextSnapshot and WAL truncation
paths, avoiding snapshot-sized LTX files every checkpoint cycle (~5 min).

Fixes #1165

The WAL truncation path cannot safely use syncedToWALEnd as a guard
because the flag can be stale when writes occur between the last sync
and an external checkpoint. Only the forceNextSnapshot path is safe
because Litestream always syncs before its own checkpoints.

Address review finding: document why checking syncedToWALEnd is safe
in the forceNextSnapshot path: checkpoint() always calls
verifyAndSync() immediately before, so the flag is never stale.
@corylanou force-pushed the issue-1165-snapshot-sized-ltx-s-created-every-few-minutes-after-upgradi branch from 6e5dfec to 7666e32 on February 26, 2026 at 18:06 UTC
Add a WAL size check between the pre-checkpoint sync and checkpoint
execution. If the WAL grew (concurrent writer appended frames after
our sync), clear syncedToWALEnd so the post-checkpoint verify forces
a full snapshot. This prevents missing pages that get checkpointed
into the DB without being captured in an LTX.
…ToWALEnd

After a WAL restart, the physical WAL file retains stale data from the
previous generation while valid content starts from offset 32. The
previous code compared finalOffset against walFileSize() which included
this stale data, causing syncedToWALEnd to be incorrectly false.

This caused every checkpoint to force a full snapshot even when all WAL
frames had already been synced, resulting in snapshot-sized LTX files
every few minutes.

Changes:
- sync(): Set syncedToWALEnd=true unconditionally since WALReader reads
  all valid frames (stopping at salt mismatch or EOF)
- checkpoint(): Replace walFileSize() concurrent writer check with frame
  salt validation — read the frame header at lastSyncedWALOffset and
  verify salts match the current WAL generation

Fixes #1165
…flag

Address PR review: the in-memory syncedToWALEnd flag was problematic
because it didn't survive Litestream restarts. Users running apps that
continue when Litestream crashes would lose this state.

Now verify() determines if all WAL frames were synced by:
1. Reading the last LTX file's wal_offset + wal_size and salts
2. Checking if there's a valid frame at that position with old salts
3. If matching salts exist, unsynced frames remain → force snapshot
4. If salts don't match, all frames were synced → skip snapshot

This approach is persistent and survives restarts since it reads from
the LTX file which is stored on disk.

Changes:
- verify(): Rewrite forceNextSnapshot path to use LTX-based detection
- sync(): Remove syncedToWALEnd assignment (no longer needed)
- checkpoint(): Remove concurrent writer check (now handled in verify)
- Remove syncedToWALEnd field from DB struct
- Update tests to work with new approach
- Remove obsolete TestDB_Checkpoint_ConcurrentWriterClearsSyncedToWALEnd

Fixes #1165
@corylanou (Collaborator, Author) commented:

@codex review

@benbjohnson (Owner) commented:

I reverted the PR that caused this bug (#1185) and we're re-evaluating the fix.

@benbjohnson closed this Mar 5, 2026