Commit 7ab3ca5

beryllw and luoyuxia authored
[docs][lake/iceberg] Add a part about streaming union read in icebeg doc (#1774)
--------- Co-authored-by: luoyuxia <[email protected]>
1 parent 85904f2 commit 7ab3ca5

2 files changed: +41 -3 lines changed

website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md

Lines changed: 38 additions & 0 deletions
@@ -406,6 +406,44 @@ All Iceberg tables created by Fluss include three system columns:
406406

407407
## Read Tables
408408

409+
### 🐿️ Reading with Apache Flink
410+
411+
When a table has the configuration `table.datalake.enabled = 'true'`, its data exists in two layers:
412+
413+
- Fresh data is retained in Fluss
414+
- Historical data is tiered to Iceberg
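
As a minimal sketch of how a table might opt into this two-layer layout, the statement below creates a log table with tiering enabled. The column names and types are hypothetical (only `fluss_access_log` and `visit_count` appear in the example query in this section), and it assumes a Fluss catalog is already the current catalog in Flink SQL.

```sql
-- Hypothetical schema: only the table name and visit_count come from this page
CREATE TABLE fluss_access_log (
    user_id BIGINT,
    page_url STRING,
    visit_count INT
) WITH (
    -- Enables tiering so that historical data is written to the data lake
    'table.datalake.enabled' = 'true'
);
```

No primary key is declared, so this is a log table, which matches the batch-mode restriction described in this section.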
415+
416+
#### Union Read of Data in Fluss and Iceberg
417+
You can query a combined view of both layers with second-level latency. This is called union read.
418+
419+
##### Prerequisites
420+
421+
Place the JARs that Flink needs to read Iceberg data into `${FLINK_HOME}/lib`. For detailed dependencies and JAR preparation instructions, refer to [🚀 Start Tiering Service to Iceberg](#-start-tiering-service-to-iceberg).
422+
423+
##### Union Read
424+
425+
To read the full dataset, which includes both Fluss (fresh) and Iceberg (historical) data, simply query the table without any suffix. The following example illustrates this:
426+
427+
```sql
-- Set the execution mode to streaming or batch; batch is used here as an example
SET 'execution.runtime-mode' = 'batch';

-- The query unions data from Fluss and Iceberg
SELECT SUM(visit_count) FROM fluss_access_log;
```
434+
435+
Union read supports both batch and streaming modes, using Iceberg for historical data and Fluss for fresh data:
435+
437+
- **Batch mode** (log tables only)
438+
439+
- **Streaming mode** (primary key tables and log tables)
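
For comparison with the batch query shown earlier, a streaming-mode variant might look like the sketch below. It assumes the same `fluss_access_log` table and changes only the runtime mode. In streaming mode the aggregate is expected to be emitted as a continuously updating result rather than a single final value.

```sql
-- Switch the session to streaming execution
SET 'execution.runtime-mode' = 'streaming';

-- Same table name, no suffix: the query still unions Iceberg and Fluss data
SELECT SUM(visit_count) FROM fluss_access_log;
```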
440+
441+
Flink first reads the latest Iceberg snapshot (tiered via tiering service), then switches to Fluss starting from the log offset matching that snapshot. This design minimizes Fluss storage requirements (reducing costs) while using Iceberg as a complete historical archive.
442+
443+
Key behavior for data retention:
444+
- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Iceberg if previously tiered
445+
- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Iceberg if previously tiered
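
The two retention options above are table properties. The sketch below shows where they might be set on a partitioned log table. The table name, columns, and all option values are hypothetical, and the `table.auto-partition.enabled` and `table.auto-partition.time-unit` options and the `'7d'` duration syntax are assumptions that should be checked against the Fluss configuration reference.

```sql
-- Hypothetical partitioned log table; option values are illustrative only
CREATE TABLE fluss_access_log_by_day (
    user_id BIGINT,
    visit_count INT,
    dt STRING
) PARTITIONED BY (dt) WITH (
    'table.datalake.enabled' = 'true',
    -- Fluss may drop log data older than this TTL
    'table.log.ttl' = '7d',
    -- Assumed companion options for automatic partition management
    'table.auto-partition.enabled' = 'true',
    'table.auto-partition.time-unit' = 'day',
    -- Number of historical partitions retained in Fluss
    'table.auto-partition.num-retention' = '7'
);
```

With settings like these, data older than the TTL or outside the retained partitions would be served from Iceberg only, provided it was tiered before Fluss removed it.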
446+
409447
### 🔍 Reading with Other Engines
410448

411449
Since data tiered to Iceberg from Fluss is stored as standard Iceberg tables, you can use any Iceberg-compatible engine. Below is an example using [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/):

website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md

Lines changed: 3 additions & 3 deletions
@@ -119,9 +119,9 @@ It supports both batch and streaming modes, using Paimon for historical data and
119119
Flink first reads the latest Paimon snapshot (tiered via tiering service), then switches to Fluss starting from the log offset aligned with that snapshot, ensuring exactly-once semantics.
120120
This design enables Fluss to store only a small portion of the dataset in the Fluss cluster, reducing costs, while Paimon serves as the source of complete historical data when needed.
121121

122-
More precisely, if Fluss log data is removed due to TTL expiration—controlled by the `table.log.ttl` configuration—it can still be read by Flink through its Union Read capability, as long as the data has already been tiered to Paimon.
123-
For partitioned tables, if a partition is cleaned up—controlled by the `table.auto-partition.num-retention` configuration—the data in that partition remains accessible from Paimon, provided it has been tiered there beforehand.
124-
122+
Key behavior for data retention:
123+
- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Paimon if previously tiered
124+
- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Paimon if previously tiered
125125

126126
### Reading with other Engines
127127
