Commit aa067db

Add blog: Conflict-Free CDC into Apache Iceberg
1 parent 700586d commit aa067db

1 file changed

Lines changed: 1 addition & 1 deletion

blog/2026-05-13-conflict-free-cdc-into-apache-iceberg.mdx

@@ -116,7 +116,7 @@ To address the read amplification caused by Equality Deletes, OLake Fusion runs
Resolving broad Equality Deletes on the fly requires the query engine to execute expensive dynamic anti-joins in memory. Fusion converts these logical rules into specific Position Deletes, specifying the exact file paths and row offsets to skip for the query engine. Additionally, it repairs fragmentation by rolling small micro-batches into larger segments (typically close to 1/8th of the target 256MB or 512MB file size).

**Medium Compaction (The Physical Purge):**
- A separate bin-packing process reduces these segments to the final target file size. Crucially, it physically removes the deleted rows from the Parquet blocks entirely. This restores contiguous memory layouts, allowing the query engine to scan only active data and maintain its SIMD vectorization efficiency.
+ A separate bin-packing process packs these segments into files of nearly the final target file size. It does not guarantee that every data file will match the target file size after compaction. Crucially, it physically removes the deleted rows from the Parquet blocks entirely. This restores contiguous memory layouts, allowing the query engine to scan only active data and maintain its SIMD vectorization efficiency.

**Full Compaction (Global Reorganization):**
While Medium compaction handles localized cleanup, Full Compaction is the heavyweight background process that completely resets a partition's technical debt. Rather than relying on advanced metadata-level pruning or complex clustering, OLake focuses on raw structural efficiency. It comprehensively rewrites all base data and delete files into perfectly sized, pristine Parquet blocks. This eliminates all read-time reconciliation overhead, allowing the downstream execution engines to query pure, unfragmented data.
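
The two mechanisms this hunk describes can be illustrated with a small Python sketch. It is not OLake Fusion's implementation: the names `PositionDelete`, `to_position_deletes`, and `bin_pack` are hypothetical, data files are modeled as in-memory lists of rows rather than Parquet, Iceberg's sequence-number scoping of equality deletes is ignored, and first-fit-decreasing is used only as a stand-in bin-packing heuristic. The second function shows why the changed line says "nearly": packed files approach the target size without being guaranteed to reach it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionDelete:
    """One row to skip: a data file path plus a row offset in that file."""
    file_path: str
    row_offset: int


def to_position_deletes(data_files, deleted_keys, key="id"):
    """Resolve equality deletes (a set of key values) into position deletes.

    data_files maps a file path to its rows (a list of dicts) and stands in
    for real Parquet files. Assumption: every delete applies to every file.
    """
    return [
        PositionDelete(path, offset)
        for path, rows in data_files.items()
        for offset, row in enumerate(rows)
        if row[key] in deleted_keys
    ]


def bin_pack(file_sizes_mb, target_mb=256):
    """Group input files with first-fit-decreasing; one group per output file.

    A group never exceeds target_mb, but nothing forces it to reach it.
    """
    groups = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new output file.
            groups.append([size])
    return groups


files = {
    "a.parquet": [{"id": 1}, {"id": 2}],
    "b.parquet": [{"id": 2}, {"id": 3}],
}
deletes = to_position_deletes(files, {2})

# Ten 30 MB segments against a 256 MB target.
packed_sizes = [sum(group) for group in bin_pack([30] * 10)]
```

With ten 30 MB segments and a 256 MB target, the sketch produces one 240 MB file and one 60 MB file, so neither output matches the target exactly.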
