website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
38 additions & 0 deletions
@@ -406,6 +406,44 @@ All Iceberg tables created by Fluss include three system columns:
## Read Tables
### 🐿️ Reading with Apache Flink
When a table has the configuration `table.datalake.enabled = 'true'`, its data exists in two layers:

- Fresh data is retained in Fluss
- Historical data is tiered to Iceberg
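As a minimal sketch, a table with lake tiering enabled might be declared in Flink SQL as follows. The table name `fluss_access_log` and the `visit_count` column are taken from the query example in this document; the other columns are hypothetical, and the statement assumes a Fluss catalog is already the current catalog.

```sql
-- Hypothetical schema: only fluss_access_log and visit_count appear in this document.
-- No primary key is declared, so this is a log table.
CREATE TABLE fluss_access_log (
    user_id BIGINT,
    page_url STRING,
    visit_count BIGINT
) WITH (
    -- Enables tiering of this table's data to the configured data lake
    'table.datalake.enabled' = 'true'
);
```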
#### Union Read of Data in Fluss and Iceberg
You can query a combined view of both layers with latency on the order of seconds. This is called a union read.
##### Prerequisites
Place the JARs that Flink needs to read Iceberg data into `${FLINK_HOME}/lib`. For detailed dependencies and JAR preparation instructions, refer to [🚀 Start Tiering Service to Iceberg](#-start-tiering-service-to-iceberg).
##### Union Read
To read the full dataset, which includes both Fluss (fresh) and Iceberg (historical) data, simply query the table without any suffix. The following example illustrates this:
```sql
-- Set the execution mode to streaming or batch; batch is used here as an example
SET 'execution.runtime-mode' = 'batch';

-- The query unions data from Fluss and Iceberg
SELECT SUM(visit_count) FROM fluss_access_log;
```
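The same table can also be read in streaming mode. The sketch below assumes the `fluss_access_log` table from the example above; the query runs continuously, reading the tiered history first and then newly arriving records from Fluss.

```sql
-- Switch the session to streaming mode
SET 'execution.runtime-mode' = 'streaming';

-- Continuous query: the result updates as new records arrive in Fluss
SELECT SUM(visit_count) FROM fluss_access_log;
```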
Union read supports both batch and streaming modes, using Iceberg for historical data and Fluss for fresh data:

- **Batch mode** (log tables only)
- **Streaming mode** (primary key tables and log tables)

Flink first reads the latest Iceberg snapshot (tiered by the tiering service), then switches to Fluss, starting from the log offset that matches that snapshot. This design minimizes how much data Fluss must store, which reduces cost, while Iceberg serves as the complete historical archive.

Key behavior for data retention:

- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Iceberg if it was tiered before it expired
- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Iceberg if they were tiered before cleanup
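The two retention options named above are table properties. As an illustration only, they might be set at table creation as shown below. The table name, columns, and all values are hypothetical, and the `table.auto-partition.enabled` and `table.auto-partition.time-unit` options are assumed here to be what turns on automatic partitioning; check them against the Fluss configuration reference before relying on them.

```sql
-- Hypothetical partitioned log table; names and values are for illustration only
CREATE TABLE fluss_access_log_partitioned (
    user_id BIGINT,
    visit_count BIGINT,
    dt STRING
) PARTITIONED BY (dt) WITH (
    'table.datalake.enabled' = 'true',
    -- Log data older than this is removed from Fluss
    'table.log.ttl' = '7 d',
    -- Assumed options for enabling automatic, day-based partitioning
    'table.auto-partition.enabled' = 'true',
    'table.auto-partition.time-unit' = 'day',
    -- Number of recent partitions retained in Fluss
    'table.auto-partition.num-retention' = '7'
);
```

With settings like these, data that has aged out of Fluss would still be returned by a union read, provided it had been tiered to Iceberg first.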
### 🔍 Reading with Other Engines
Since data tiered to Iceberg from Fluss is stored as standard Iceberg tables, you can use any Iceberg-compatible engine. Below is an example using [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/):
website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
3 additions & 3 deletions
@@ -119,9 +119,9 @@ It supports both batch and streaming modes, using Paimon for historical data and
Flink first reads the latest Paimon snapshot (tiered via tiering service), then switches to Fluss starting from the log offset aligned with that snapshot, ensuring exactly-once semantics.
This design enables Fluss to store only a small portion of the dataset in the Fluss cluster, reducing costs, while Paimon serves as the source of complete historical data when needed.

-More precisely, if Fluss log data is removed due to TTL expiration (controlled by the `table.log.ttl` configuration), it can still be read by Flink through its Union Read capability, as long as the data has already been tiered to Paimon.
-For partitioned tables, if a partition is cleaned up (controlled by the `table.auto-partition.num-retention` configuration), the data in that partition remains accessible from Paimon, provided it has been tiered there beforehand.
-
+Key behavior for data retention:
+- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Paimon if previously tiered
+- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Paimon if previously tiered