feat: support incremental scan between 2 snapshots #13
Conversation
crates/iceberg/src/scan.rs (outdated)
```rust
if !snapshot_ids.contains(&manifest_file.added_snapshot_id) {
    continue;
}
let manifest = object_cache.get_manifest(manifest_file).await?;
let entries = manifest.entries().iter().filter(|entry| {
    matches!(entry.status(), ManifestStatus::Added)
        && (
            // Is it possible that the snapshot id here is not contained?
            entry.snapshot_id().is_none()
                || snapshot_ids.contains(&entry.snapshot_id().unwrap())
        )
});
```
I think simply checking manifest status `Added` is not enough, since compaction also adds manifest entries with `Added` status, and we don't want to consume a compaction snapshot. We should detect whether a snapshot is a compaction snapshot and skip it if so. To detect a compaction snapshot, I think we can just check whether any manifest entry has `ManifestStatus::Deleted`.
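A minimal, self-contained sketch of that heuristic (the types below are stand-ins for the iceberg crate's `ManifestStatus`/`ManifestEntry`, kept only to make the check runnable):

```rust
// Stand-in for the crate's ManifestStatus; the real enum also has Existing.
#[derive(PartialEq)]
enum ManifestStatus {
    Added,
    Deleted,
}

// Stand-in for a manifest entry, reduced to the one field the check needs.
struct ManifestEntry {
    status: ManifestStatus,
}

/// Heuristic suggested above: a compaction (rewrite) snapshot deletes the
/// files it rewrites, so its manifests contain `Deleted` entries, while a
/// pure append never does.
fn looks_like_compaction(entries: &[ManifestEntry]) -> bool {
    entries.iter().any(|e| e.status == ManifestStatus::Deleted)
}

fn main() {
    let append = [ManifestEntry { status: ManifestStatus::Added }];
    let compaction = [
        ManifestEntry { status: ManifestStatus::Added },
        ManifestEntry { status: ManifestStatus::Deleted },
    ];
    assert!(!looks_like_compaction(&append));
    assert!(looks_like_compaction(&compaction));
}
```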
We have filtered `Append` snapshots above:

```rust
let append_snapshots =
    ancestors_between(table_metadata, latest_snapshot_id, oldest_snapshot_id)
        .filter(|snapshot| matches!(snapshot.summary().operation, Operation::Append))
```

The definition of snapshot operation is below (so a compaction snapshot should be `Replace`):
```rust
/// The operation field is used by some operations, like snapshot expiration, to skip processing certain snapshots.
pub enum Operation {
    /// Only data files were added and no files were removed.
    Append,
    /// Data and delete files were added and removed without changing table data;
    /// i.e., compaction, changing the data file format, or relocating data files.
    Replace,
    /// Data and delete files were added and removed in a logical overwrite operation.
    Overwrite,
    /// Data files were removed and their contents logically deleted and/or delete files were added to delete rows.
    Delete,
}
```
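For context, the `ancestors_between` helper used above boils down to walking parent links from the newer snapshot back to the older one. A self-contained sketch under that assumption (the `Snapshot` struct here is a stand-in, not the iceberg crate's type):

```rust
use std::collections::HashMap;

// Stand-ins for the table-metadata types, just enough to show the walk.
#[derive(Clone, Copy, PartialEq)]
#[allow(dead_code)]
enum Operation {
    Append,
    Replace,
    Overwrite,
    Delete,
}

#[derive(Clone)]
struct Snapshot {
    parent_id: Option<i64>,
    operation: Operation,
}

/// Walk the lineage from `latest` (inclusive) back to `oldest` (exclusive)
/// via parent links, keeping only `Append` snapshots, like the filter above.
fn append_ancestors_between(
    snapshots: &HashMap<i64, Snapshot>,
    latest: i64,
    oldest: i64,
) -> Vec<Snapshot> {
    let mut result = Vec::new();
    let mut current = Some(latest);
    while let Some(id) = current {
        if id == oldest {
            break;
        }
        let Some(snapshot) = snapshots.get(&id) else {
            break;
        };
        if snapshot.operation == Operation::Append {
            result.push(snapshot.clone());
        }
        current = snapshot.parent_id;
    }
    result
}
```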
cc @Li0k
Found a problem: it seems delete files are also included in an `Append` snapshot. 🤔
Encountered an entry for a delete file in a data file manifest:

```text
ManifestEntry { status: Added, snapshot_id: Some(4746599406424318107), sequence_number: Some(4), file_sequence_number: Some(4), data_file: DataFile { content: EqualityDeletes, file_path: "s3://hummock001/iceberg_connection/public/t/data/11-00077-eq-del-0195a491-109c-75b1-a93c-e29d371c2d0b.parquet", file_format: Parquet, partition: Struct { fields: [], null_bitmap: BitVec<usize, bitvec::order::Lsb0> { addr: 0x8, head: 000000, bits: 0, capacity: 0 } [] }, record_count: 2, file_size_in_bytes: 680, column_sizes: {1: 82}, value_counts: {1: 2}, null_value_counts: {1: 0}, nan_value_counts: {}, lower_bounds: {1: Datum { type: Long, literal: Long(524235685507891200) }}, upper_bounds: {1: Datum { type: Long, literal: Long(524235873517572096) }}, key_metadata: None, split_offsets: [4], equality_ids: [1], sort_order_id: None } }
```

The Java code uses an `ignoreDeleted` call to further filter the manifests:
https://github.com/apache/iceberg/blob/6ec3de390d3fa6e797c6975b1eaaea41719db0fe/core/src/main/java/org/apache/iceberg/BaseIncrementalAppendScan.java#L86
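Given an entry like the one above, the extra filtering discussed later in this thread would roughly mean keeping only data-file content. A sketch with stand-in types (`DataContentType` mirrors the `content: EqualityDeletes` field in the debug output; it is not a verified crate API):

```rust
// Stand-ins reduced to the fields the filter needs.
#[derive(PartialEq)]
#[allow(dead_code)]
enum DataContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

#[derive(PartialEq)]
#[allow(dead_code)]
enum ManifestStatus {
    Added,
    Deleted,
}

struct ManifestEntry {
    status: ManifestStatus,
    content: DataContentType,
}

/// An incremental *append* scan would keep an entry only if it was newly
/// added in one of the tracked snapshots and it points at a data file,
/// skipping equality/position delete files like the one shown above.
fn keep_for_incremental_append_scan(entry: &ManifestEntry) -> bool {
    entry.status == ManifestStatus::Added && entry.content == DataContentType::Data
}
```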
```rust
/// Only data files were added and no files were removed.
Append,
```

Indeed, by definition there should be no delete files in an append snapshot. However, RisingWave's Iceberg sink includes delete files in the append snapshot. Shall we change the snapshot operation to `Overwrite` when the commit is not append-only? But once we change it to `Overwrite`, it means we can't read any incremental data between the snapshots, so it seems we should include the added data of `Overwrite` snapshots in the incremental read as well.
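In other words, one possible reading of that suggestion (a sketch, not the crate's API): at the snapshot level the scan would accept both `Append` and `Overwrite` operations, and the delete-file handling would happen at the entry level as sketched earlier. `Operation` below abridges the enum quoted above.

```rust
// Abridged from the Operation enum quoted earlier in the thread.
#[allow(dead_code)]
enum Operation {
    Append,
    Replace,
    Overwrite,
    Delete,
}

/// Snapshot-level check for an incremental read: appends always contribute
/// rows; overwrites contribute only their *added* data files (their delete
/// files are ignored); Replace (compaction) and Delete snapshots are skipped.
fn snapshot_contributes_added_data(op: &Operation) -> bool {
    matches!(op, Operation::Append | Operation::Overwrite)
}
```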
Some facts I found:
- Changelog scan with deletes is not supported yet: Support changelog scan for table with delete files apache/iceberg#10935
- Changelog scan is not used by Flink
Oh, it's used in SparkChangelogTable https://iceberg.apache.org/docs/latest/spark-procedures/?h=changes#carry-over-rows
cc @ZENOTME If we commit delete files to Iceberg, it seems we should change the Iceberg snapshot operation to `Overwrite` instead of `Append`.
> Oh, it's used in SparkChangelogTable https://iceberg.apache.org/docs/latest/spark-procedures/?h=changes#carry-over-rows

I think we should handle `Overwrite` as well. Ignoring delete files is acceptable.
+1. kind of like "force-append-only"
LGTM!
Merged 4373389 into risingwavelabs:dev_rebase_main_20250307
```rust
manifest_entry_data_ctx_tx
    .clone()
    .send(manifest_entry_context)
    .await
```
I was looking over the code, and I’m wondering if there might be a chance for the logic to get stuck here when the channel’s buffer size is smaller than the number of entries.
It seems like the channel could block once the buffer fills up, and since the consumers only start after all entries are sent, it might cause a hang.
I could be missing something, though—do you think this could be an issue, or is there a detail I might not be seeing? I’d really appreciate your thoughts!
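A minimal, hypothetical reproduction of that concern with a tokio bounded channel (illustrative names only, not the PR's actual code): if every `send` happens before any receiver is polled and the number of entries exceeds the channel capacity, the task blocks on `send` forever; running the producer concurrently avoids it.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity deliberately smaller than the number of items to send.
    let (tx, mut rx) = mpsc::channel::<u64>(2);

    // Deadlock-prone shape (the concern above), all on one task:
    //
    //     for i in 0..10u64 {
    //         tx.send(i).await.unwrap(); // 3rd send waits forever: nothing drains yet
    //     }
    //     while rx.recv().await.is_some() {}
    //
    // Non-blocking shape: run the producer on its own task so the consumer
    // can drain the buffer while sends are still in flight.
    let producer = tokio::spawn(async move {
        for i in 0..10u64 {
            tx.send(i).await.unwrap();
        }
        // `tx` is dropped here, which lets `recv` return `None` below.
    });

    while let Some(item) = rx.recv().await {
        println!("got {item}");
    }
    producer.await.unwrap();
}
```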
Hi, curious how you found your way here 👀
I came across your PR while browsing through this repo, and it caught my attention because I'm looking for a way to implement incremental ingestion for an Iceberg table using Rust. Great work on it!
Thanks, I think your insight is correct here. I've refined the implementation in #36.
* fixup! feat: support incremental scan between 2 snapshots (#13)
* refactor: refactor incremental scan
* clippy
* add some check

Signed-off-by: xxchan <[email protected]>
Related code in Java