core/storage: Fix truncate checkpoint error handling by penberg · Pull Request #7384 · tursodatabase/turso

penberg · 2026-06-05T08:39:08Z

Propagate ftruncate() and fsync() errors to callers on truncate checkpoint errors. As the issue was discovered by Aretta, add ftruncate() fault injection to unreliable-libc and concurrent-simulator. Also add truncate checkpoint support to turso-stress.

…checkpoint truncate_log's truncate and post-truncation sync completion callbacks only logged I/O failures and the function returned Ok unconditionally, so a failed truncate or sync was reported to the caller as a successfully emptied log. Capture each completion's error in a OnceLock cell (matching the existing InflightWriteBatch pattern) and return it as a checkpoint error, the same way SQLite's TRUNCATE checkpoint propagates the result of sqlite3OsTruncate. The in-memory wal_max_frame/wal_total_backfilled reset stays ahead of the on-disk truncate, mirroring SQLite's walRestartHdr ordering, which deliberately makes the wal-index header authoritative before the file is truncated. Found by Aretta.

… survive it The ftruncate() fault-injection plumbing existed but truncate_fault_probability was left at its 0.0 default in every preset and exposed by no CLI flag, so the explore/whopper entry point never actually injected a truncate fault. Wire it into the fast (1%), chaos (1%), and ragnarök (1%) presets so the WAL truncate/sync paths get exercised, and print the active probability at startup like the cosmic-ray knob. With injection on, an orphaned-frame WAL truncate in prepare_wal_start can fail. The engine already propagates that as a clean LimboError (via the completion group surfaced through wait_for_completion and the VDBE driver's get_error check), but perform_work treated every IoError as fatal and aborted the run on the first injected fault. Treat a CompletionError as recoverable - rolling the operation back, matching the engine's own abort - but only while fault injection is active, so genuine I/O errors still fail the run when it is off. Also clarify the prepare_wal_start truncate callback comment: the warn-only callback looks like it swallows the error, but the error is propagated via the returned CompletionGroup.

…ors mid-flight A TRUNCATE checkpoint defers the WAL-file truncate to the pager and only releases its checkpoint guard in the Finalize phase. When the checkpoint runs via blocking_checkpoint, io.block drives the state machine and surfaces a completion error (e.g. a failed WAL ftruncate) straight out of io.wait without re-entering checkpoint(), so Finalize never runs. The CheckpointResult is left in checkpoint_state holding the guard, which owns the exclusive WRITER and read-mark-0 locks. Every later reader then fails to take a shared lock on read-mark 0, spins, and returns Busy - bricking the database after an otherwise recoverable I/O error. Reset the checkpoint state on a blocking-checkpoint error, dropping the lingering CheckpointResult and releasing the locks. The data is already durable in the synced DB file; any un-truncated WAL frames are stale and are cleaned up on the next write. Found by the concurrent simulator with ftruncate fault injection (seed 18150844671727610101).

… injection An injected ftruncate fault during a checkpoint surfaces as a deliberately propagated LimboError::CheckpointFailed (the write is durable in the WAL, only the checkpoint maintenance op failed). The in-process perform_work had no arm for it and crashed on `_ => return Err`, even though the multiprocess simulator already tolerates it. Handle it the same way as the other recoverable errors, gated on fault injection so a genuine CheckpointFailed still aborts the run when injection is off. Verified the async checkpoint path does not leak the checkpoint guard the way the blocking path did: the VDBE driver calls abort() (which releases the checkpoint state), so the harness tolerance alone lets the run continue cleanly. Found by the concurrent simulator (seed 15400526715844541172).

github-actions Bot added the antithesis label Jun 5, 2026

penberg changed the title ~~truncate checkpoint errors~~ Fix truncate checkpoint error handling Jun 5, 2026

penberg marked this pull request as draft June 5, 2026 08:39

penberg force-pushed the truncate-checkpoint-errors branch from af4ca3d to 9b4e0b1 Compare June 5, 2026 10:38

github-actions Bot added core Storage labels Jun 5, 2026

penberg changed the title ~~Fix truncate checkpoint error handling~~ core/storage: Fix truncate checkpoint error handling Jun 5, 2026

penberg force-pushed the truncate-checkpoint-errors branch 4 times, most recently from 43243c4 to 23675b8 Compare June 6, 2026 08:39

penberg added 7 commits June 9, 2026 09:21

unreliable-libc: Add fault injection to ftruncate()

bb389e3

concurrent-simulator: Add ftruncate() fault injection

25adfa9

stress: Add truncate checkpoint support

37770c5

penberg force-pushed the truncate-checkpoint-errors branch from 3b8b8b9 to 4aeec67 Compare June 9, 2026 06:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core/storage: Fix truncate checkpoint error handling#7384

core/storage: Fix truncate checkpoint error handling#7384
penberg wants to merge 7 commits into
mainfrom
truncate-checkpoint-errors

penberg commented Jun 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

penberg commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

penberg commented Jun 5, 2026 •

edited

Loading