core/storage: Fix truncate checkpoint error handling#7384
Draft
penberg wants to merge 7 commits into
Draft
Conversation
af4ca3d to
9b4e0b1
Compare
43243c4 to
23675b8
Compare
…checkpoint truncate_log's truncate and post-truncation sync completion callbacks only logged I/O failures and the function returned Ok unconditionally, so a failed truncate or sync was reported to the caller as a successfully emptied log. Capture each completion's error in a OnceLock cell (matching the existing InflightWriteBatch pattern) and return it as a checkpoint error, the same way SQLite's TRUNCATE checkpoint propagates the result of sqlite3OsTruncate. The in-memory wal_max_frame/wal_total_backfilled reset stays ahead of the on-disk truncate, mirroring SQLite's walRestartHdr ordering, which deliberately makes the wal-index header authoritative before the file is truncated. Found by Aretta.
… survive it The ftruncate() fault-injection plumbing existed but truncate_fault_probability was left at its 0.0 default in every preset and exposed by no CLI flag, so the explore/whopper entry point never actually injected a truncate fault. Wire it into the fast (1%), chaos (1%), and ragnarök (1%) presets so the WAL truncate/sync paths get exercised, and print the active probability at startup like the cosmic-ray knob. With injection on, an orphaned-frame WAL truncate in prepare_wal_start can fail. The engine already propagates that as a clean LimboError (via the completion group surfaced through wait_for_completion and the VDBE driver's get_error check), but perform_work treated every IoError as fatal and aborted the run on the first injected fault. Treat a CompletionError as recoverable - rolling the operation back, matching the engine's own abort - but only while fault injection is active, so genuine I/O errors still fail the run when it is off. Also clarify the prepare_wal_start truncate callback comment: the warn-only callback looks like it swallows the error, but the error is propagated via the returned CompletionGroup.
…ors mid-flight A TRUNCATE checkpoint defers the WAL-file truncate to the pager and only releases its checkpoint guard in the Finalize phase. When the checkpoint runs via blocking_checkpoint, io.block drives the state machine and surfaces a completion error (e.g. a failed WAL ftruncate) straight out of io.wait without re-entering checkpoint(), so Finalize never runs. The CheckpointResult is left in checkpoint_state holding the guard, which owns the exclusive WRITER and read-mark-0 locks. Every later reader then fails to take a shared lock on read-mark 0, spins, and returns Busy - bricking the database after an otherwise recoverable I/O error. Reset the checkpoint state on a blocking-checkpoint error, dropping the lingering CheckpointResult and releasing the locks. The data is already durable in the synced DB file; any un-truncated WAL frames are stale and are cleaned up on the next write. Found by the concurrent simulator with ftruncate fault injection (seed 18150844671727610101).
… injection An injected ftruncate fault during a checkpoint surfaces as a deliberately propagated LimboError::CheckpointFailed (the write is durable in the WAL, only the checkpoint maintenance op failed). The in-process perform_work had no arm for it and crashed on `_ => return Err`, even though the multiprocess simulator already tolerates it. Handle it the same way as the other recoverable errors, gated on fault injection so a genuine CheckpointFailed still aborts the run when injection is off. Verified the async checkpoint path does not leak the checkpoint guard the way the blocking path did: the VDBE driver calls abort() (which releases the checkpoint state), so the harness tolerance alone lets the run continue cleanly. Found by the concurrent simulator (seed 15400526715844541172).
3b8b8b9 to
4aeec67
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Propagate ftruncate() and fsync() errors to callers on truncate checkpoint errors. As the issue was discovered by Aretta, add ftruncate() fault injection to
unreliable-libcandconcurrent-simulator. Also add truncate checkpoint support toturso-stress.