feat(checkpoint): add idempotent commit support for Delta #6872
chenghuichen wants to merge 5 commits into
Greptile Summary
This PR extracts the connector-agnostic checkpoint-commit orchestration from the Iceberg write path into a shared
Confidence Score: 3/5
Not safe to merge without addressing the history-limit cap that can cause silent data duplication on recovery.
One P1 correctness defect (the history limit in the Delta Lake recovery check can produce duplicate commits) pulls the score below the P1 ceiling of 4; the rest are P2 style issues.
daft/dataframe/dataframe.py, specifically the
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant commit_with_checkpoint
participant CheckpointStore
participant decode_files
participant refresh_and_check_committed
participant commit_files
Caller->>commit_with_checkpoint: write_df, checkpoint, callbacks
commit_with_checkpoint->>CheckpointStore: list_checkpoints()
alt No pending entries (fresh run)
commit_with_checkpoint->>commit_with_checkpoint: write_df.collect()
commit_with_checkpoint->>CheckpointStore: list_checkpoints() [again]
end
alt Still no pending entries
commit_with_checkpoint-->>Caller: _empty_write_result()
end
commit_with_checkpoint->>CheckpointStore: get_checkpointed_files()
commit_with_checkpoint->>decode_files: file_metadata blobs
decode_files-->>commit_with_checkpoint: connector file objects
alt No file objects
commit_with_checkpoint->>CheckpointStore: mark_committed()
commit_with_checkpoint-->>Caller: _empty_write_result()
end
loop up to max_retries
commit_with_checkpoint->>refresh_and_check_committed: store_path, query_id
alt Already committed (recovery)
refresh_and_check_committed-->>commit_with_checkpoint: True
commit_with_checkpoint->>CheckpointStore: mark_committed()
commit_with_checkpoint-->>Caller: _build_result_df()
else Not yet committed
refresh_and_check_committed-->>commit_with_checkpoint: False
commit_with_checkpoint->>commit_files: files, store_path, query_id
alt Commit succeeds
commit_files-->>commit_with_checkpoint: OK
commit_with_checkpoint->>CheckpointStore: mark_committed()
commit_with_checkpoint-->>Caller: _build_result_df()
else Retryable error
commit_files-->>commit_with_checkpoint: error
Note over commit_with_checkpoint: retry
end
end
end
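The flow in the diagram above can be sketched in plain Python. This is a minimal illustration, not the PR's code: `InMemoryCheckpointStore`, `RetryableCommitError`, and the `run_pipeline` callable are hypothetical stand-ins, and the callbacks are simplified to closures instead of receiving `store_path` and `query_id` as in the diagram.

```python
from dataclasses import dataclass, field


class RetryableCommitError(Exception):
    """Hypothetical marker for commit failures that are safe to retry."""


@dataclass
class InMemoryCheckpointStore:
    """Hypothetical stand-in for CheckpointStore (not Daft's real class)."""

    pending: list = field(default_factory=list)
    committed: bool = False

    def list_checkpoints(self):
        # Entries stop being "pending" once they are marked committed.
        return [] if self.committed else list(self.pending)

    def get_checkpointed_files(self):
        return list(self.pending)

    def mark_committed(self):
        self.committed = True


def commit_with_checkpoint(run_pipeline, store, decode_files,
                           refresh_and_check_committed, commit_files,
                           max_retries=3):
    # Fresh run: nothing is staged yet, so run the write pipeline to stage files.
    if not store.list_checkpoints():
        run_pipeline()
    # Still nothing staged: the write produced no files.
    if not store.list_checkpoints():
        return []
    files = decode_files(store.get_checkpointed_files())
    if not files:
        store.mark_committed()
        return []
    for _ in range(max_retries):
        # Recovery: an earlier attempt may have committed before crashing.
        if refresh_and_check_committed():
            store.mark_committed()
            return files
        try:
            commit_files(files)
        except RetryableCommitError:
            continue  # re-check the table, then try again
        store.mark_committed()
        return files
    raise RuntimeError("commit did not succeed within max_retries")
```

Re-running the committed check at the top of every retry is what keeps the loop idempotent in this sketch: if a commit landed but its acknowledgement was lost, the next iteration sees it and skips the second commit.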
Merging this PR will not alter performance
Hey @chenghuichen, apologies for sitting on this. I was unwell + OOO for a few days, and in the meantime we shaped a design change (PR #6905) that affects the foundations this PR is built on. Wanted to give you a heads-up before you spend more time on it.
Why the change
The original design conflated execution identity (which run staged a row) with commit identity (which logical commit a snapshot represents). They're independent concerns:
Decoupling them lets recovery be keyed on a single user-supplied token (the
What changed in #6905
1. API shape: paired kwargs collapsed into a single strong type:
# Before (this PR):
df.write_iceberg(table, checkpoint=ckpt, idempotence_key="run-1")
# After:
df.write_iceberg(
table,
checkpoint=daft.IdempotentCommit(store=ckpt, idempotence_key="run-1"),
)
2. Snapshot marker:
3. Per-entry
4. Check-first commit flow. Walk snapshot history for the marker before running the pipeline. Found → mark Checkpointed → Committed and bail (no pipeline run). Not found → run pipeline, commit from store, mark.
What this means for #6872
Happy to pair on this once #6905 merges. Sorry again for the delay.
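For reference, the check-first flow from item 4 could look roughly like the sketch below. It is an illustration under assumptions, not the #6905 implementation: snapshot history is modelled as an iterable of summary dicts, and `marker_property`, `commit_from_store`, and `mark_committed` are hypothetical names. The actual marker property is deliberately left as a parameter.

```python
def check_first_commit(idempotence_key, snapshot_history, marker_property,
                       run_pipeline, commit_from_store, mark_committed):
    """Check the table for the marker first; run the pipeline only if it is absent.

    snapshot_history: iterable of dicts of snapshot summary properties
                      (hypothetical shape, for illustration only).
    marker_property:  name of the summary property holding the idempotence key
                      (hypothetical; supplied by the caller).
    """
    # Walk the full history: a capped scan could miss an older marker
    # and lead to a duplicate commit on recovery.
    for summary in snapshot_history:
        if summary.get(marker_property) == idempotence_key:
            # Found: the logical commit already landed, so only update the store.
            mark_committed()
            return "recovered"
    # Not found: run the pipeline, commit what was staged, then mark.
    run_pipeline()
    commit_from_store(idempotence_key)
    mark_committed()
    return "committed"
```

Calling this twice with the same `idempotence_key` runs the pipeline once. The second call finds the marker in history and returns without touching the table.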
Quick update on this PR, @chenghuichen — change of plan from my last comment. Rather than ask you to do the rework against PR #6905 yourself, we'll consolidate the work into a follow-up PR we drive. Two reasons:
The architectural direction in this PR is what we're following. Specifically:
What changes (mostly mechanics, since #6905 reshaped the underlying API):
Plan:
Apologies again for the calendar friction; your contribution is meaningfully shaping what we're shipping. Will tag you on the consolidated PR when it's up.
@rohitkulshreshtha Hope you're feeling better! The execution-identity vs commit-identity split makes a lot of sense; wish I'd caught that distinction during my original review.
Changes Made
Extract shared checkpoint boilerplate (pending detection, query_id validation, file decode) into lightweight utility functions in _checkpoint_commit.py, then apply the same idempotent commit mechanism to write_deltalake: crash recovery via commit-info markers, skip-on-restart, incremental dedup. Iceberg's existing checkpoint flow is preserved; it just calls the shared utilities instead of inlining the same bookkeeping.
Related Issues