fix(dedup): serialize in-process Delta commits to unwedge dedup sweeps #68
tonyalaribe wants to merge 1 commit into
dedup_partition's replace_where commit carries bare-string timestamp
bounds — the only predicate form delta-rs can stringify. Whenever a flush
append lands between dedup's snapshot and commit, the OCC checker
re-evaluates that predicate with delta-kernel and errors ('arrow_cast
should have been simplified to cast'), aborting the commit. On a busy
table some append always interleaves, so every sweep failed after 4x15
retries, materializing and uploading chunk parquet each attempt and
abandoning it — every 5 minutes, per probe-positive project. Observed in
prod as 165 OCC failures/20min, RSS climbing to the 69.7GB memcg ceiling
(3 kernel OOM kills today) and the ~35min restart loop; queries hung
behind the churn (60s+ for a 10-min-window LIMIT 501 that runs in <1s on
a fresh container).
Fix: a process-wide delta_commit_lock serializing flush appends and dedup
replace_where commits. With no commit able to interleave, the rebase sees
no newer versions and the conflict checker (and its predicate-eval bug)
never runs. Lock is dropped before retry backoff sleeps so peers don't
queue behind a sleeping writer.
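
The locking pattern described above (hold the lock for one commit attempt, release it before the backoff sleep) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it uses `std::sync::Mutex` and OS threads to stay dependency-free, whereas the PR describes a tokio `Mutex` in async code. `DELTA_COMMIT_LOCK`, `commit_with_retry`, and `max_overlap` are hypothetical names introduced only for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the process-wide commit lock.
static DELTA_COMMIT_LOCK: Mutex<()> = Mutex::new(());

/// Runs `attempt` under the lock, retrying up to `max_attempts` times.
/// Returns the number of the attempt that succeeded.
pub fn commit_with_retry<F>(
    max_attempts: u32,
    backoff: Duration,
    mut attempt: F,
) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for n in 1..=max_attempts {
        let result = {
            // Held only for the duration of one commit attempt.
            let _guard = DELTA_COMMIT_LOCK
                .lock()
                .unwrap_or_else(|poisoned| poisoned.into_inner());
            attempt()
        }; // Guard dropped here, before any sleeping.
        match result {
            Ok(()) => return Ok(n),
            Err(e) => last_err = e,
        }
        if n < max_attempts {
            // Peers can take the lock while this writer backs off.
            thread::sleep(backoff);
        }
    }
    Err(last_err)
}

/// Spawns `writers` threads that each commit `commits_each` times and
/// returns the largest number of commits ever observed in flight at once.
pub fn max_overlap(writers: usize, commits_each: usize) -> usize {
    let in_flight = Arc::new(AtomicUsize::new(0));
    let max_seen = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..writers)
        .map(|_| {
            let in_flight = Arc::clone(&in_flight);
            let max_seen = Arc::clone(&max_seen);
            thread::spawn(move || {
                for _ in 0..commits_each {
                    commit_with_retry(3, Duration::from_millis(1), || {
                        let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                        max_seen.fetch_max(now, Ordering::SeqCst);
                        thread::sleep(Duration::from_micros(200)); // simulated commit work
                        in_flight.fetch_sub(1, Ordering::SeqCst);
                        Ok(())
                    })
                    .expect("commit should succeed");
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    max_seen.load(Ordering::SeqCst)
}
```

Scoping the guard to an inner block is what makes the release-before-sleep behavior explicit; in async code the same shape also avoids holding the guard across the backoff `.await`.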
Regression test runs dedup under continuous concurrent appends: fails on
master, deterministic pass with the lock.
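
For readers unfamiliar with this kind of test, the general shape (a background appender racing a dedup pass, stopped by a flag) might look something like the sketch below. It is a toy under stated assumptions: the "table" is an in-memory `Vec<u64>` behind a `Mutex`, not a Delta table, and `dedup_under_appends` is a hypothetical name. The wait for the first append before starting dedup is an extra step added here so that the race is guaranteed to have begun.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Runs a toy dedup while an appender thread keeps committing rows.
/// Returns (copies of the duplicated id left after dedup, rows appended).
pub fn dedup_under_appends() -> (usize, usize) {
    const DUP_ID: u64 = 7;
    // Toy table seeded with three copies of the same row id.
    let table = Arc::new(Mutex::new(vec![DUP_ID, DUP_ID, DUP_ID]));
    let stop = Arc::new(AtomicBool::new(false));
    let appended = Arc::new(AtomicUsize::new(0));

    let appender = {
        let (table, stop, appended) =
            (Arc::clone(&table), Arc::clone(&stop), Arc::clone(&appended));
        thread::spawn(move || {
            let mut next_id = 1_000u64; // unique ids, never equal to DUP_ID
            // Relaxed is enough for the flag: it guards no other data, and
            // join() below provides the final synchronization.
            while !stop.load(Ordering::Relaxed) {
                table.lock().unwrap().push(next_id);
                appended.fetch_add(1, Ordering::Relaxed);
                next_id += 1;
                thread::sleep(Duration::from_micros(100));
            }
        })
    };

    // Pre-flight: do not start dedup until at least one append has landed.
    while appended.load(Ordering::Relaxed) == 0 {
        thread::yield_now();
    }

    // Toy "dedup": the mutex serializes it against appends.
    {
        let mut rows = table.lock().unwrap();
        rows.sort_unstable();
        rows.dedup();
    }

    stop.store(true, Ordering::Relaxed);
    appender.join().unwrap();

    let dups = table.lock().unwrap().iter().filter(|&&id| id == DUP_ID).count();
    (dups, appended.load(Ordering::Relaxed))
}
```

Calling `dedup_under_appends()` should yield exactly one surviving copy of the duplicated id and a nonzero append count, regardless of how the scheduler interleaves the two threads.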
Code Review: fix(dedup): serialize in-process Delta commits

Summary: The fix is sound, a per- […]

🔴 Critical: Accidental root-level file pollution
The diff adds two large files at the repository root that should not be there:
[…]
These appear to be working copies accidentally committed (perhaps from a […]

    git rm database.rs dedup_compaction_test.rs

🟡 Moderate: Lock granularity serializes all tables, not just the racing pair
The OCC predicate bug is specific to […] This matters most if the process manages several actively-written tables simultaneously.

🟢 Lock correctness: looks right
The critical detail is that […]

🟡 Moderate: Test race assertion could be fragile

    assert!(appended > 0, "appender must have raced at least one commit");

On a heavily-loaded CI runner with […] Consider a small pre-flight: wait until the appender has committed at least one row before starting dedup, or insert a […]

🟢 Minor: Atomic ordering on the stop flag

    stop.load(std::sync::atomic::Ordering::Relaxed)
    stop.store(true, std::sync::atomic::Ordering::Relaxed)

🟢 Minor: Missing failure message on post-dedup assertion

    assert_eq!(post[0].column(0).as_primitive::<Int64Type>().value(0), 1);

The earlier assertions in the same test pass a message string. This one doesn't, making CI failures harder to diagnose. Suggest:

    assert_eq!(
        post[0].column(0).as_primitive::<Int64Type>().value(0), 1,
        "post-dedup: dup_id row should be collapsed to 1"
    );

✅ Positives
[…]

Verdict: The core fix is correct and well-tested. The root-level file pollution is the only blocker; everything else is improvement feedback. Clean up those two stray files and this is good to merge.
Closing as superseded. The identical […] Rebasing this branch onto current master produces an empty diff (identical tree, zero net commits); there is nothing left to merge. The fix is live in master.
Problem (active prod outage driver)
`dedup_partition`'s `replace_where` commit carries bare-string timestamp bounds, the only predicate form delta-rs can stringify. Whenever a flush append lands between dedup's snapshot and its commit, the OCC conflict checker re-evaluates that predicate via delta-kernel and errors (`arrow_cast should have been simplified to cast`), aborting the commit. On a busy table some append always interleaves, so every sweep fails after 4×15 retries, materializing the chunk, uploading parquet to R2, and abandoning it on each attempt, every 5 minutes, per probe-positive project.

Observed on prod today (image ae5a9ab):
... ORDER BY timestamp DESC LIMIT 501) hanging 60s+ behind the churn, vs <1s on a freshly restarted container

Fix
A process-wide `delta_commit_lock` (tokio `Mutex`) serializing the two in-process commit paths: flush appends (`insert_batch` write loop) and dedup `replace_where`. With commits unable to interleave, dedup's rebase sees no newer versions and the conflict checker (and its predicate-eval bug) never runs. The lock is dropped before retry backoff sleeps so peers don't queue behind a sleeping writer. Appends-vs-appends pay a short queue wait (sub-second commits vs the 10-min flush cadence).

Test
`dedup_commits_despite_concurrent_appends`: runs `dedup_partition` while a task commits appends continuously into the same partition date. Fails on master (`dedup_partition write failed: ... arrow_cast should have been simplified`), passes deterministically with the lock.

Full `dedup_compaction_test` suite + lib tests pass (`test_batch_queue_under_load` pre-existing failure on master).
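
The reasoning in the Fix section (no interleaved commit means no newer version at commit time, so the conflict checker is never reached) can be illustrated with a deliberately tiny model of optimistic concurrency. This is a sketch, not delta-rs: `ToyTable`, `interleaved`, and `serialized` are hypothetical, every conflict check is modeled as failing to mimic the predicate-eval bug, and the model assumes each writer's snapshot and commit both happen inside its critical section.

```rust
/// Toy table log: a version counter plus a count of conflict checks run.
#[derive(Default)]
pub struct ToyTable {
    version: u64,
    conflict_checks: u64,
}

impl ToyTable {
    /// Version a writer reads before preparing its commit.
    pub fn snapshot(&self) -> u64 {
        self.version
    }

    /// Commits work prepared against `read_version`.
    /// If newer versions exist, the conflict checker has to run; here it
    /// always fails, standing in for the predicate re-evaluation error.
    pub fn commit(&mut self, read_version: u64) -> Result<u64, String> {
        if self.version > read_version {
            self.conflict_checks += 1;
            return Err(format!(
                "conflict check failed: read v{}, table at v{}",
                read_version, self.version
            ));
        }
        self.version += 1;
        Ok(self.version)
    }
}

/// Unserialized: an append lands between dedup's snapshot and commit.
/// Returns (dedup committed?, conflict checks run).
pub fn interleaved() -> (bool, u64) {
    let mut table = ToyTable::default();
    let dedup_read = table.snapshot();
    let append_read = table.snapshot();
    table.commit(append_read).unwrap(); // append wins the race
    let dedup_ok = table.commit(dedup_read).is_ok();
    (dedup_ok, table.conflict_checks)
}

/// Serialized: each writer snapshots and commits inside its own
/// critical section, so nothing can land in between.
pub fn serialized() -> (bool, u64) {
    let mut table = ToyTable::default();
    // Append's critical section.
    let append_read = table.snapshot();
    table.commit(append_read).unwrap();
    // Dedup's critical section.
    let dedup_read = table.snapshot();
    let dedup_ok = table.commit(dedup_read).is_ok();
    (dedup_ok, table.conflict_checks)
}
```

In this model `interleaved()` reports a failed dedup with one conflict check, while `serialized()` reports a successful dedup with zero checks, which mirrors the argument that serialization sidesteps the buggy code path rather than fixing it.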