website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
+23 -1 (23 additions & 1 deletion)
@@ -72,6 +72,10 @@ You can choose between two views of the table:
 
 #### Read Data Only in Paimon
 
+##### Prerequisites
+Download the [paimon-flink.jar](https://paimon.apache.org/docs/1.2/) that matches your Flink version, and place it in the `${FLINK_HOME}/lib` directory.
+
+##### Read Paimon Data
 To read only data stored in Paimon, use the `$lake` suffix in the table name. The following example demonstrates this:
 
```sql title="Flink SQL"
@@ -92,14 +96,32 @@ For further information, refer to Paimon’s [SQL Query documentation](https://p
 
 #### Union Read of Data in Fluss and Paimon
 
+##### Prerequisites
+Download the [fluss-lake-paimon-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar), and place it into `${FLINK_HOME}/lib`.
+
+##### Union Read
 To read the full dataset, which includes both Fluss (fresh) and Paimon (historical) data, simply query the table without any suffix. The following example illustrates this:
 
```sql title="Flink SQL"
+-- Set the execution mode to streaming or batch; batch is used here as an example
+SET 'execution.runtime-mode' = 'batch';
+
 -- Query will union data from Fluss and Paimon
 SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
```
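As a hedged sketch, the same union query can also be run in streaming mode. It assumes the `ads_nation_purchase_power` table from the example above and a Flink SQL session that already uses the Fluss catalog; only the runtime mode changes.

```sql title="Flink SQL"
-- Sketch only: switch the session to streaming mode
SET 'execution.runtime-mode' = 'streaming';

-- The same union query; in streaming mode the aggregate is emitted as a
-- continuously updating result instead of a single final value
SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
```

In streaming mode the job does not terminate on its own, so the result shown in the SQL client keeps changing as new rows arrive.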
+Union Read supports both batch and streaming modes, using Paimon for historical data and Fluss for fresh data:
+- In batch mode
+
+  The query may run slower than reading only from Paimon because it must merge rows from both Paimon and Fluss. However, it returns the most up-to-date results. Because data is ingested continuously, repeated executions of the query may return different results.
+
+- In streaming mode
+
+  Flink first reads the latest Paimon snapshot (tiered by the tiering service), then switches to Fluss, starting from the log offset aligned with that snapshot, which ensures exactly-once semantics.
+  This design lets Fluss keep only a small portion of the dataset in the Fluss cluster, reducing costs, while Paimon serves as the source of complete historical data when needed.
+
+  More precisely, if Fluss log data is removed due to TTL expiration (controlled by the `table.log.ttl` configuration), Flink can still read it through Union Read, as long as the data has already been tiered to Paimon.
+  For partitioned tables, if a partition is cleaned up (controlled by the `table.auto-partition.num-retention` configuration), the data in that partition remains accessible from Paimon, provided it was tiered there beforehand.
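The two retention settings named above are table options, so they can be illustrated with a table definition. This is a minimal sketch, not taken from the PR: the table name `orders_log`, its columns, and all option values are hypothetical, and the companion options `table.datalake.enabled`, `table.auto-partition.enabled`, and `table.auto-partition.time-unit` are assumed to be the ones needed to turn on tiering and auto-partitioning in your Fluss version.

```sql title="Flink SQL"
-- Hypothetical table: names and values are illustrative assumptions
CREATE TABLE orders_log (
  order_id    BIGINT,
  order_count INT,
  dt          STRING
) PARTITIONED BY (dt) WITH (
  -- Assumed switch that lets the tiering service copy data to Paimon
  'table.datalake.enabled' = 'true',
  -- Log data older than this may be removed from the Fluss cluster
  'table.log.ttl' = '7d',
  -- Assumed auto-partitioning settings; only num-retention is named in the text
  'table.auto-partition.enabled' = 'true',
  'table.auto-partition.time-unit' = 'DAY',
  -- Number of historical partitions kept in Fluss before cleanup
  'table.auto-partition.num-retention' = '7'
);
```

With settings like these, data older than the retention window would be served from Paimon by Union Read, provided it was tiered before Fluss cleaned it up.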
 
-This query may run slower than reading only from Paimon, but it returns the most up-to-date data. If you execute the query multiple times, you may observe different results due to continuous data ingestion.