[docs][lake/iceberg] Add a part about streaming union read in iceberg doc (#1774)

beryllw · luoyuxia · web-flow · commit 7ab3ca51130c · 2025-09-30T18:01:08.000+08:00
---------

Co-authored-by: luoyuxia <luoyuxia@alumni.sjtu.edu.cn>
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md b/website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
@@ -406,6 +406,44 @@ All Iceberg tables created by Fluss include three system columns:
 
 ## Read Tables
 
+### 🐿️ Reading with Apache Flink
+
+When a table has the configuration `table.datalake.enabled = 'true'`, its data exists in two layers:
+
+- Fresh data is retained in Fluss
+- Historical data is tiered to Iceberg
+
+#### Union Read of Data in Fluss and Iceberg
+You can query a combined view of both layers with second-level latency; this is called union read.
+
+##### Prerequisites
+
+Place the JARs that Iceberg requires for reading data into `${FLINK_HOME}/lib`. For detailed dependencies and JAR preparation instructions, refer to [🚀 Start Tiering Service to Iceberg](#-start-tiering-service-to-iceberg).
+
+##### Union Read
+
+To read the full dataset, which includes both Fluss (fresh) and Iceberg (historical) data, simply query the table without any suffix. The following example illustrates this:
+
+```sql
+-- Set the execution mode to streaming or batch; batch is used here as an example
+SET 'execution.runtime-mode' = 'batch';
+
+-- Query will union data from Fluss and Iceberg
+select SUM(visit_count) from fluss_access_log;
+```
+
+It supports both batch and streaming modes, utilizing Iceberg for historical data and Fluss for fresh data:
+
+- **Batch mode** (only log table)
+
+- **Streaming mode** (primary key table and log table)
+
+  Flink first reads the latest Iceberg snapshot (tiered via tiering service), then switches to Fluss starting from the log offset matching that snapshot. This design minimizes Fluss storage requirements (reducing costs) while using Iceberg as a complete historical archive.
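+
+As a minimal sketch of the streaming case, assuming the same `fluss_access_log` table used in the batch example, a union read should only need the runtime mode switched:
+
+```sql
+-- Assumption: reuses the fluss_access_log table from the batch example
+SET 'execution.runtime-mode' = 'streaming';
+
+-- Reads the latest Iceberg snapshot first, then continues with fresh data from Fluss
+SELECT * FROM fluss_access_log;
+```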
+
+Key behavior for data retention:
+- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Iceberg if previously tiered
+- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Iceberg if previously tiered
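+
+A hedged sketch of where these retention options could be set, assuming a hypothetical partitioned log table named `access_log_by_day` created through the Fluss catalog in Flink SQL. The table name, columns, and retention values are illustrative assumptions only:
+
+```sql
+-- Hypothetical table: name, columns, and option values are assumptions for illustration.
+-- 'table.log.ttl' bounds how long log data stays in Fluss; tiered data stays in Iceberg.
+-- 'table.auto-partition.num-retention' bounds how many historical partitions Fluss keeps.
+CREATE TABLE access_log_by_day (
+  dt STRING,
+  page_id BIGINT,
+  visit_count BIGINT
+) PARTITIONED BY (dt) WITH (
+  'table.datalake.enabled' = 'true',
+  'table.log.ttl' = '7 d',
+  'table.auto-partition.enabled' = 'true',
+  'table.auto-partition.time-unit' = 'day',
+  'table.auto-partition.num-retention' = '7'
+);
+```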
+
 ### 🔍 Reading with Other Engines
 
 Since data tiered to Iceberg from Fluss is stored as standard Iceberg tables, you can use any Iceberg-compatible engine. Below is an example using [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/):
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
@@ -119,9 +119,9 @@ It supports both batch and streaming modes, using Paimon for historical data and
   Flink first reads the latest Paimon snapshot (tiered via tiering service), then switches to Fluss starting from the log offset aligned with that snapshot, ensuring exactly-once semantics.
   This design enables Fluss to store only a small portion of the dataset in the Fluss cluster, reducing costs, while Paimon serves as the source of complete historical data when needed. 
 
-  More precisely, if Fluss log data is removed due to TTL expiration—controlled by the `table.log.ttl` configuration—it can still be read by Flink through its Union Read capability, as long as the data has already been tiered to Paimon.
-  For partitioned tables, if a partition is cleaned up—controlled by the `table.auto-partition.num-retention` configuration—the data in that partition remains accessible from Paimon, provided it has been tiered there beforehand. 
-
+Key behavior for data retention:
+- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Paimon if previously tiered
+- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Paimon if previously tiered
 
 ### Reading with other Engines