Commit 0673881

[doc] Add streaming union read part for paimon document (#1747)
1 parent 223a3c4 commit 0673881

1 file changed

+23
-1
lines changed
  • website/docs/streaming-lakehouse/integrate-data-lakes

website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md

Lines changed: 23 additions & 1 deletion
@@ -72,6 +72,10 @@ You can choose between two views of the table:
7272

7373
#### Read Data Only in Paimon
7474

75+
##### Prerequisites
76+
Download the [paimon-flink.jar](https://paimon.apache.org/docs/1.2/) that matches your Flink version, and place it in the `${FLINK_HOME}/lib` directory.
77+
78+
##### Read Paimon Data
7579
To read only data stored in Paimon, use the `$lake` suffix in the table name. The following example demonstrates this:
7680

7781
```sql title="Flink SQL"
@@ -92,14 +96,32 @@ For further information, refer to Paimon’s [SQL Query documentation](https://p
9296

9397
#### Union Read of Data in Fluss and Paimon
9498

99+
##### Prerequisites
100+
Download the [fluss-lake-paimon-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar), and place it into `${FLINK_HOME}/lib`.
101+
102+
##### Union Read
95103
To read the full dataset, which includes both Fluss (fresh) and Paimon (historical) data, simply query the table without any suffix. The following example illustrates this:
96104

97105
```sql title="Flink SQL"
106+
-- Set the execution mode to streaming or batch; batch is used here as an example
107+
SET 'execution.runtime-mode' = 'batch';
108+
98109
-- Query will union data from Fluss and Paimon
99110
SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
100111
```
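
For comparison, the same union read can be sketched in streaming mode. This is a minimal illustration rather than part of the original change: it reuses the `ads_nation_purchase_power` table from the example above and assumes the table is already being tiered to Paimon. The query does not terminate on its own, because it keeps consuming fresh data from Fluss.

```sql title="Flink SQL"
-- Switch the session to streaming mode (the example above uses batch)
SET 'execution.runtime-mode' = 'streaming';

-- Continuous union read: historical rows come from Paimon,
-- fresh rows from Fluss; results keep updating as new data arrives
SELECT * FROM ads_nation_purchase_power;
```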
112+
Union Read supports both batch and streaming modes, using Paimon for historical data and Fluss for fresh data:
113+
- In batch mode
114+
115+
The query may run slower than reading only from Paimon because it needs to merge rows from both Paimon and Fluss. However, it returns the most up-to-date results. Multiple executions of the query may produce different outputs due to continuous data ingestion.
116+
117+
- In streaming mode
118+
119+
Flink first reads the latest Paimon snapshot (tiered by the tiering service), then switches to Fluss, starting from the log offset aligned with that snapshot, which ensures exactly-once semantics.
120+
This design lets the Fluss cluster store only a small portion of the dataset, reducing costs, while Paimon serves as the source of complete historical data when needed.
121+
122+
More precisely, if Fluss log data is removed due to TTL expiration (controlled by the `table.log.ttl` configuration), Flink can still read it through Union Read, as long as the data has already been tiered to Paimon.
123+
For partitioned tables, if a partition is cleaned up (controlled by the `table.auto-partition.num-retention` configuration), the data in that partition remains accessible from Paimon, provided it was tiered there beforehand.
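
As a rough illustration of where these two retention settings might be declared, the sketch below creates a partitioned, lake-enabled table. The table name, columns, and all option values are hypothetical and chosen only for the example. The `table.datalake.enabled` and other `table.auto-partition.*` options are not part of this change and are assumed here so the example is complete; check them against the Fluss configuration reference for your version before use.

```sql title="Flink SQL"
-- Hypothetical table; names and values are illustrative only
CREATE TABLE orders_tiered_example (
    order_id BIGINT,
    order_count BIGINT,
    dt STRING,
    PRIMARY KEY (order_id, dt) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
    -- Assumption: tiering to the lake must be enabled for Union Read
    'table.datalake.enabled' = 'true',
    -- Fluss log data older than this may be removed from the cluster
    'table.log.ttl' = '3d',
    -- Assumption: auto-partitioning by day, keeping the 7 most recent partitions
    'table.auto-partition.enabled' = 'true',
    'table.auto-partition.time-unit' = 'day',
    'table.auto-partition.num-retention' = '7'
);
```

With settings like these, rows older than the log TTL and partitions beyond the retention count would be served from Paimon during a union read, as long as they were tiered before being cleaned up.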
101124

102-
This query may run slower than reading only from Paimon, but it returns the most up-to-date data. If you execute the query multiple times, you may observe different results due to continuous data ingestion.
103125

104126
### Reading with other Engines
105127
