Remove blocking IO from mvcc bootstrap and recovery paths #7349

Open: PThorpe92 wants to merge 5 commits into tursodatabase:main from PThorpe92:nonblock

@PThorpe92

Collaborator

Lift blocking IO part 2: schema reparse + MVCC bootstrap/recovery

Removes synchronous io.block/wait_for_completion from the MVCC open path so it no longer hangs on backends without a synchronous IO pump (e.g. WASM, where io.step() is a no-op). The whole bootstrap → checkpoint-reconcile → schema
reparse → metadata-table init → logical-log replay sequence now yields IO cooperatively via IOResult instead of blocking.

Major changes

Non-blocking statement runner

  • Statement::run_ignore_rows_nonblock / run_with_row_callback_nonblock:
    drive a statement to completion returning IOResult, surfacing the pending
    completion (via take_io_completions) instead of pumping io.step(). The
    VDBE interpreter already yields StepResult::IO; these just bridge it to the
    IOResult model. No interpreter changes.
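
A minimal sketch of the bridging idea, not the actual turso_core code: the `Statement`, `StepResult`, `IOResult`, and `Completion` types below are simplified stand-ins, and the scripted `step` is invented purely for illustration. It shows a runner that hands the pending completion back to the caller when the interpreter yields IO, rather than pumping the IO loop itself.

```rust
use std::collections::VecDeque;

// Hypothetical stand-ins for illustration; not the real turso_core types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Completion(pub u32);

#[derive(Debug, PartialEq)]
pub enum StepResult {
    Row,
    IO,
    Done,
}

#[derive(Debug, PartialEq)]
pub enum IOResult<T> {
    Done(T),
    IO(Completion),
}

/// Toy statement that replays a scripted sequence of step outcomes.
pub struct Statement {
    script: VecDeque<StepResult>,
    pending: Option<Completion>,
    next_id: u32,
}

impl Statement {
    pub fn new(script: Vec<StepResult>) -> Self {
        Self { script: script.into(), pending: None, next_id: 0 }
    }

    /// One interpreter step; an IO step parks a completion on the statement.
    fn step(&mut self) -> StepResult {
        let result = self.script.pop_front().unwrap_or(StepResult::Done);
        if result == StepResult::IO {
            self.pending = Some(Completion(self.next_id));
            self.next_id += 1;
        }
        result
    }

    fn take_io_completions(&mut self) -> Option<Completion> {
        self.pending.take()
    }

    /// Runs until done, but returns to the caller on IO instead of pumping it.
    pub fn run_ignore_rows_nonblock(&mut self) -> IOResult<()> {
        loop {
            match self.step() {
                StepResult::Row => continue, // rows are discarded
                StepResult::Done => return IOResult::Done(()),
                StepResult::IO => {
                    if let Some(completion) = self.take_io_completions() {
                        return IOResult::IO(completion);
                    }
                    // Nothing outstanding: step again.
                }
            }
        }
    }
}
```

The caller re-invokes the runner once the returned completion has finished; the statement keeps its own position, so no work is repeated. The real functions presumably also carry an error channel, which is omitted here.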

Schema reparse

  • reparse_schema / reparse_schema_with_cookie lifted to a ReparseSchemaState
    machine (read cookie → scan sqlite_schema → load custom types → refresh
    stats), carrying the half-built schema + captured table-valued functions +
    reparse guard across yields.
  • parse_schema_rows, refresh_analyze_stats, and read_current_schema_cookie
    lifted to IOResult accordingly.
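
The phase order above can be sketched as an enum whose variants own the half-built schema. This is an illustrative model only: the `Schema` fields, the `Poll` type, and the hard-coded values are assumptions, and the captured table-valued functions and reparse guard are left out for brevity.

```rust
// Hypothetical stand-ins; only the phase order comes from the description.
#[derive(Debug, Default, PartialEq)]
pub struct Schema {
    pub cookie: u32,
    pub tables: Vec<String>,
    pub custom_types: Vec<String>,
    pub stats_fresh: bool,
}

#[derive(Debug, PartialEq)]
pub enum Poll<T> {
    Pending,
    Ready(T),
}

pub enum ReparseSchemaState {
    ReadCookie,
    ScanSchema { schema: Schema },
    LoadCustomTypes { schema: Schema },
    RefreshStats { schema: Schema },
    Finished,
}

impl ReparseSchemaState {
    /// Advances one phase per call; each `Pending` models one IO yield.
    pub fn step(&mut self) -> Poll<Schema> {
        use ReparseSchemaState::*;
        // Take ownership of the current state so the schema can move forward.
        match std::mem::replace(self, Finished) {
            ReadCookie => {
                let schema = Schema { cookie: 42, ..Schema::default() };
                *self = ScanSchema { schema };
                Poll::Pending
            }
            ScanSchema { mut schema } => {
                schema.tables.push("users".to_string());
                *self = LoadCustomTypes { schema };
                Poll::Pending
            }
            LoadCustomTypes { mut schema } => {
                schema.custom_types.push("money".to_string());
                *self = RefreshStats { schema };
                Poll::Pending
            }
            RefreshStats { mut schema } => {
                schema.stats_fresh = true;
                Poll::Ready(schema) // state stays `Finished`
            }
            Finished => panic!("stepped after completion"),
        }
    }
}
```

Because the schema lives inside the state variant, a yield between phases loses nothing: the next call picks up the same partially populated value.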

MVCC bootstrap

  • bootstrap_nonblock reworked into a linear state machine that drives every
    step without blocking: interrupted-checkpoint reconciliation, schema
    reparse, the persistent_tx_ts_max read, metadata-table create/seed, and the
    metadata IO chain. Dead blocking shims removed.

Logical-log recovery

  • maybe_recover_logical_log lifted to a RecoverLogicalLogState machine. Setup
    phases (header / tx-ts / cookie / sqlite_schema scan) yield IO; the replay
    loop carries its accumulators in RecoverCtx across per-frame next_frame
    yields. The per-frame replay body is unchanged in behavior. Driven from a new
    bootstrap Recover phase.
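
A rough sketch of the replay loop shape, under heavy simplification: the `LogReader`, the `NextFrame` enum, and the two `RecoverCtx` fields are hypothetical, and a frame is reduced to a single commit timestamp. The point is only that the accumulators live outside the loop body, so a yield from `next_frame` does not reset them.

```rust
/// Outcome of asking the (toy) log reader for the next frame.
pub enum NextFrame {
    Io,         // read not finished; the caller must yield
    Frame(u64), // a decoded frame, reduced here to its commit timestamp
    Eof,
}

/// Toy reader that yields IO once before every frame.
pub struct LogReader {
    frames: Vec<u64>,
    pos: usize,
    io_done: bool,
}

impl LogReader {
    pub fn new(frames: Vec<u64>) -> Self {
        Self { frames, pos: 0, io_done: false }
    }

    pub fn next_frame(&mut self) -> NextFrame {
        if self.pos == self.frames.len() {
            return NextFrame::Eof;
        }
        if !self.io_done {
            self.io_done = true;
            return NextFrame::Io;
        }
        self.io_done = false;
        let ts = self.frames[self.pos];
        self.pos += 1;
        NextFrame::Frame(ts)
    }
}

/// Accumulators that must survive across yields (fields are assumptions).
#[derive(Debug, Default, PartialEq)]
pub struct RecoverCtx {
    pub frames_replayed: usize,
    pub max_commit_ts: u64,
}

/// Returns `None` when IO is pending, `Some(())` when replay has finished.
pub fn replay_step(reader: &mut LogReader, ctx: &mut RecoverCtx) -> Option<()> {
    loop {
        match reader.next_frame() {
            NextFrame::Io => return None,
            NextFrame::Eof => return Some(()),
            NextFrame::Frame(ts) => {
                ctx.frames_replayed += 1;
                ctx.max_commit_ts = ctx.max_commit_ts.max(ts);
            }
        }
    }
}
```

The caller keeps `RecoverCtx` alive between calls and simply re-enters `replay_step` after each pending IO completes.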

Checkpoint hang fix (mvcc/database/checkpoint_state_machine.rs path)

  • maybe_complete_interrupted_checkpoint's DriveEarlyTruncate was recreating
    its CheckpointResult on every re-entry, so truncate_wal (which tracks
    truncate/sync progress through that struct) never completed and spun under
    io.block. The result now persists in the state variant.
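
The failure mode can be reproduced in a toy model. Everything below is illustrative: `CheckpointResult` is reduced to two assumed progress flags, `truncate_wal` is a stand-in that needs three calls against the same result to finish, and `State`, `step_buggy`, `step_fixed`, and `drive` are invented names.

```rust
/// Hypothetical progress record, standing in for the role described above.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CheckpointResult {
    pub truncate_sent: bool,
    pub sync_sent: bool,
}

/// Toy truncate: returns true when done, false when it yielded for IO.
/// It only finishes if it sees the progress recorded by earlier calls.
pub fn truncate_wal(result: &mut CheckpointResult) -> bool {
    if !result.truncate_sent {
        result.truncate_sent = true;
        return false;
    }
    if !result.sync_sent {
        result.sync_sent = true;
        return false;
    }
    true
}

pub enum State {
    DriveEarlyTruncate { result: CheckpointResult },
    Done,
}

/// Buggy shape: a fresh result on every re-entry discards recorded progress.
pub fn step_buggy(state: &mut State) -> bool {
    let finished = match state {
        State::DriveEarlyTruncate { .. } => truncate_wal(&mut CheckpointResult::default()),
        State::Done => true,
    };
    if finished {
        *state = State::Done;
    }
    finished
}

/// Fixed shape: the result stored in the state variant is reused.
pub fn step_fixed(state: &mut State) -> bool {
    let finished = match state {
        State::DriveEarlyTruncate { result } => truncate_wal(result),
        State::Done => true,
    };
    if finished {
        *state = State::Done;
    }
    finished
}

/// Calls `step` up to `limit` times; returns how many calls it took to finish.
pub fn drive(step: fn(&mut State) -> bool, limit: usize) -> Option<usize> {
    let mut state = State::DriveEarlyTruncate { result: CheckpointResult::default() };
    (1..=limit).find(|_| step(&mut state))
}
```

In this model `drive(step_fixed, 100)` finishes on the third call, while `drive(step_buggy, 100)` never finishes within the limit, which mirrors the spin under io.block.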

@PThorpe92

PThorpe92 commented Jun 3, 2026

Collaborator Author

Will kick off an Antithesis run from this branch once the CI is green.

Whopper seems to have surfaced an issue here with recovery, fixing

Will address the perf regression as best I can too

@codspeed-hq

codspeed-hq Bot commented Jun 3, 2026

Merging this PR will not alter performance

✅ 638 untouched benchmarks
⏩ 105 skipped benchmarks [1]


Comparing PThorpe92:nonblock (8551c0d) with main (c70a032)

Footnotes

  1. 105 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Ok(Some((value, bytes, len))) => (value, bytes, len),
Ok(None) => return Ok(PayloadParseResult::Eof),
Err(err) => return Err(err),
match return_if_io!(self.consume_varint_bytes()) {

@PThorpe92 PThorpe92 Jun 4, 2026

Collaborator Author

note to self: looks to be that this function is not safely re-entrant

Correction: it was re-entrant, it was just stupid about it, because it re-parsed the whole frame if any frame yielded mid-way, so a real state machine was added.
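
For illustration, a resumable decoder that keeps partial progress instead of restarting might look like the sketch below. The encoding is an assumption (a LEB128-style varint with 7 data bits per byte, low bits first); the log's actual varint format and the real `consume_varint_bytes` signature are not shown in this thread, and `VarintDecoder` and `Decoded` are invented names.

```rust
/// Resumable decoder for a LEB128-style varint: 7 data bits per byte, with
/// the high bit set on every byte except the last. Encoding is an assumption.
#[derive(Debug, Default)]
pub struct VarintDecoder {
    value: u64,
    len: usize,
}

#[derive(Debug, PartialEq)]
pub enum Decoded {
    NeedMore,
    Done { value: u64, len: usize },
}

impl VarintDecoder {
    /// Consumes bytes from `chunk` until one varint completes. Partial
    /// progress is kept in `self`, so a later call continues where this one
    /// stopped instead of starting over. Bytes after the completed varint
    /// are left for the caller to handle.
    pub fn feed(&mut self, chunk: &[u8]) -> Result<Decoded, &'static str> {
        for &byte in chunk {
            if self.len == 10 {
                return Err("varint longer than 10 bytes");
            }
            self.value |= u64::from(byte & 0x7f) << (7 * self.len);
            self.len += 1;
            if byte & 0x80 == 0 {
                let done = Decoded::Done { value: self.value, len: self.len };
                *self = Self::default(); // ready for the next varint
                return Ok(done);
            }
        }
        Ok(Decoded::NeedMore)
    }
}
```

Feeding the two bytes of 300 (`0xAC`, `0x02`) in separate calls yields `NeedMore` and then the finished value, with the first byte parsed only once.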

SJYX added a commit to SJYX/Momo-Study-Agent that referenced this pull request Jun 6, 2026
When wait > 300s, now checks:
- Active TCP connections at connect() return → server-side long-poll
- No connections + network_sent ≈ 0 → local blocking (checkpoint/reconcile/schema)

Previous report incorrectly said 'long-poll' when data shows no TCP
connections and zero bytes sent — the 318s is pyturso internal
blocking IO (cf. tursodatabase/turso#7349), not waiting for server.