# Paimon

[Apache Paimon](https://paimon.apache.org/) innovatively combines a lake format with an LSM (Log-Structured Merge-tree) structure, bringing efficient updates into the lake architecture.

To integrate Fluss with Paimon, you must enable lakehouse storage and configure Paimon as the lakehouse storage. For more details, see [Enable Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).

## Introduction

When a table with the option `'table.datalake.enabled' = 'true'` is created or altered in Fluss, Fluss automatically creates a corresponding Paimon table with the same table path.
The schema of the Paimon table matches that of the Fluss table, except for three system columns appended at the end: `__bucket`, `__offset`, and `__timestamp`.
These system columns help Fluss clients consume data from Paimon in a streaming fashion, such as seeking within a specific bucket by offset or timestamp.

```sql title="Flink SQL"
USE CATALOG fluss_catalog;

CREATE TABLE fluss_order_with_lake (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `ptime` AS PROCTIME(),
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s'
);
```
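
Since `table.datalake.freshness` is set through the regular `WITH` options clause, it should also be adjustable later via Flink's `ALTER TABLE ... SET` statement. The statement below is a sketch under that assumption; verify that the tiering service honors the new value in your version before relying on it:

```sql title="Flink SQL"
-- Sketch (assumption): relax the tiering freshness target of the
-- table created above from 30 seconds to 1 minute.
ALTER TABLE fluss_order_with_lake SET ('table.datalake.freshness' = '1min');
```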

Then, the datalake tiering service continuously tiers data from Fluss to Paimon. The parameter `table.datalake.freshness` controls how soon data written to Fluss should be tiered to Paimon; by default, this delay is 3 minutes.
For primary key tables, change logs are also generated in Paimon format, enabling stream-based consumption via Paimon APIs.

Since Fluss version 0.7, you can also specify Paimon table properties when creating a datalake-enabled Fluss table by using the `paimon.` prefix within the Fluss table properties clause.

```sql title="Flink SQL"
CREATE TABLE fluss_order_with_lake (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `ptime` AS PROCTIME(),
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s',
    'paimon.file.format' = 'orc',
    'paimon.deletion-vectors.enabled' = 'true'
);
```

For example, you can specify the Paimon property `file.format` to change the file format of the Paimon table, or set `deletion-vectors.enabled` to enable or disable deletion vectors.

## Read Tables

### Read by Flink

For a table with the option `'table.datalake.enabled' = 'true'`, its data exists in two layers: one part remains in Fluss, and the other has already been tiered to Paimon.
You can choose between two views of the table:

- A **Paimon-only view**, which offers minute-level latency but better analytics performance.
- A **combined view** of both Fluss and Paimon data, which provides second-level latency but may result in slightly degraded query performance.

#### Read Data Only in Paimon

To read only data stored in Paimon, use the `$lake` suffix in the table name. The following example demonstrates this:

```sql title="Flink SQL"
-- Assume we have a table named `orders`

-- Read from Paimon
SELECT COUNT(*) FROM orders$lake;
```
```sql title="Flink SQL"
-- We can also query the system tables
SELECT * FROM orders$lake$snapshots;
```
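
The `__bucket`, `__offset`, and `__timestamp` system columns described in the introduction are stored as ordinary columns of the Paimon table, so a `$lake` query should be able to project them directly. A sketch, assuming the `orders` table from the examples above:

```sql title="Flink SQL"
-- Sketch: inspect the bucket, log offset, and write timestamp
-- recorded for each tiered row.
SELECT `__bucket`, `__offset`, `__timestamp`
FROM orders$lake
LIMIT 10;
```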

When you specify the `$lake` suffix in a query, the table behaves like a standard Paimon table and inherits all its capabilities.
This allows you to take full advantage of Flink's query support and optimizations on Paimon, such as querying system tables, time travel, and more.
For further information, refer to Paimon's [SQL Query documentation](https://paimon.apache.org/docs/0.9/flink/sql-query/#sql-query).

#### Union Read of Data in Fluss and Paimon

To read the full dataset, which includes both Fluss and Paimon data, simply query the table without any suffix. The following example illustrates this:

```sql title="Flink SQL"
-- Query will union data from Fluss and Paimon
SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
```

This query may run slower than reading only from Paimon, but it returns the most up-to-date data. If you execute the query multiple times, you may observe different results due to continuous data ingestion.

### Read by Other Engines

Since the data tiered to Paimon from Fluss is stored as a standard Paimon table, you can use [any engine](https://paimon.apache.org/docs/0.9/engines/overview/) that supports Paimon to read it. Below is an example using [StarRocks](https://paimon.apache.org/docs/master/engines/starrocks/):

First, create a Paimon catalog in StarRocks:

```sql title="StarRocks SQL"
CREATE EXTERNAL CATALOG paimon_catalog
PROPERTIES
(
    ...
);
```

> **NOTE**: The configuration values for `paimon.catalog.type` and `paimon.catalog.warehouse` must match those used when configuring Paimon as the lakehouse storage for Fluss in `server.yaml`.

Then, you can query the `orders` table using StarRocks:

```sql title="StarRocks SQL"
-- The table is in the database `fluss`
SELECT COUNT(*) FROM paimon_catalog.fluss.orders;
```

```sql title="StarRocks SQL"
-- Query the system tables to view snapshots of the table
```