Commit 6b0352b (parent a5f651d)

[lake/docs] Update paimon docs in lakehouse section

1 file changed: +78 −42 lines changed

website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
# Paimon

[Apache Paimon](https://paimon.apache.org/) innovatively combines a lake format with an LSM (Log-Structured Merge-tree) structure, bringing efficient updates into the lake architecture.

To integrate Fluss with Paimon, you must enable lakehouse storage and configure Paimon as the lakehouse storage. For more details, see [Enable Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).

## Introduction
When a table with the option `'table.datalake.enabled' = 'true'` is created or altered in Fluss, Fluss will automatically create a corresponding Paimon table with the same table path.
The schema of the Paimon table matches that of the Fluss table, except for the addition of three system columns at the end: `__bucket`, `__offset`, and `__timestamp`.
These system columns help Fluss clients consume data from Paimon in a streaming fashion, such as seeking by a specific bucket using an offset or timestamp.

```sql title="Flink SQL"
USE CATALOG fluss_catalog;

CREATE TABLE fluss_order_with_lake (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `ptime` AS PROCTIME(),
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s');
```
Then, the datalake tiering service continuously tiers data from Fluss to Paimon. The parameter `table.datalake.freshness` controls how soon data written to Fluss should be tiered to Paimon; by default, this delay is 3 minutes.
For primary key tables, change logs are also generated in Paimon format, enabling stream-based consumption via Paimon APIs.
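Such a Paimon-format changelog can then be consumed with an ordinary streaming Flink query. The sketch below assumes the `fluss_order_with_lake` table from the example above and uses Paimon's standard `scan.mode` scan option via a dynamic table hint; the `$lake` suffix is explained in the Read Tables section:

```sql title="Flink SQL"
-- Illustrative sketch (assumes Flink streaming execution mode):
-- continuously read new changes from the tiered Paimon table
SELECT * FROM fluss_order_with_lake$lake /*+ OPTIONS('scan.mode' = 'latest') */;
```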
Since Fluss version 0.7, you can also specify Paimon table properties when creating a datalake-enabled Fluss table by using the `paimon.` prefix within the Fluss table properties clause.

```sql title="Flink SQL"
CREATE TABLE fluss_order_with_lake (
    `order_key` BIGINT,
    `cust_key` INT NOT NULL,
    `total_price` DECIMAL(15, 2),
    `order_date` DATE,
    `order_priority` STRING,
    `clerk` STRING,
    `ptime` AS PROCTIME(),
    PRIMARY KEY (`order_key`) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s',
    'paimon.file.format' = 'orc',
    'paimon.deletion-vectors.enabled' = 'true');
```
For example, you can specify the Paimon property `file.format` to change the file format of the Paimon table, or set `deletion-vectors.enabled` to enable or disable deletion vectors for the Paimon table.
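If you want to verify which Paimon properties took effect, one option is to inspect the Paimon-side table from Flink. This is only a sketch: it assumes the `$lake` suffix described in the Read Tables section below can be combined with `DESCRIBE`:

```sql title="Flink SQL"
-- Illustrative: show the Paimon-side schema of the datalake-enabled table,
-- including the appended __bucket, __offset and __timestamp system columns
DESCRIBE fluss_order_with_lake$lake;
```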
## Read Tables
### Read by Flink
For a table with the option `'table.datalake.enabled' = 'true'`, its data exists in two layers: one remains in Fluss, and the other has already been tiered to Paimon.
You can choose between two views of the table:
- A **Paimon-only view**, which offers minute-level latency but better analytics performance.
- A **combined view** of both Fluss and Paimon data, which provides second-level latency but may result in slightly degraded query performance.
#### Read Data Only in Paimon
To read only data stored in Paimon, use the `$lake` suffix in the table name. The following example demonstrates this:
```sql title="Flink SQL"
-- Assume we have a table named `orders`

-- Read from Paimon
SELECT COUNT(*) FROM orders$lake;
```

```sql title="Flink SQL"
-- We can also query the system tables
SELECT * FROM orders$lake$snapshots;
```

When you specify the `$lake` suffix in a query, the table behaves like a standard Paimon table and inherits all its capabilities.
This allows you to take full advantage of Flink's query support and optimizations on Paimon, such as querying system tables, time travel, and more.
For further information, refer to Paimon's [SQL Query documentation](https://paimon.apache.org/docs/0.9/flink/sql-query/#sql-query).
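For instance, time travel can be expressed with Paimon's scan options through a Flink dynamic table hint. The snapshot id `1` below is only a placeholder; pick a real id from the `$snapshots` system table:

```sql title="Flink SQL"
-- Illustrative: read the `orders` table as of a specific Paimon snapshot
SELECT COUNT(*) FROM orders$lake /*+ OPTIONS('scan.snapshot-id' = '1') */;
```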
#### Union Read of Data in Fluss and Paimon
To read the full dataset, which includes both Fluss and Paimon data, simply query the table without any suffix. The following example illustrates this:
```sql title="Flink SQL"
-- Query will union data from Fluss and Paimon
SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
```
This query may run slower than reading only from Paimon, but it returns the most up-to-date data. If you execute the query multiple times, you may observe different results due to continuous data ingestion.
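Since the union read behaves like any other Flink query, you can also run it as a continuous streaming job instead of repeated batch runs; the result then keeps updating as new data arrives in Fluss. A sketch using standard Flink SQL settings:

```sql title="Flink SQL"
-- Illustrative: run the union read as a continuous streaming query
SET 'execution.runtime-mode' = 'streaming';
SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
```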
### Read by Other Engines
Since the data tiered to Paimon from Fluss is stored as a standard Paimon table, you can use any engine that supports Paimon to read it. Below is an example using [StarRocks](https://paimon.apache.org/docs/master/engines/starrocks/):

First, create a Paimon catalog in StarRocks:
```sql title="StarRocks SQL"
CREATE EXTERNAL CATALOG paimon_catalog
PROPERTIES
(
    ...
);
```

> **NOTE**: The configuration values for `paimon.catalog.type` and `paimon.catalog.warehouse` must match those used when configuring Paimon as the lakehouse storage for Fluss in `server.yaml`.
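For orientation, the matching Fluss-side configuration in `server.yaml` might look like the sketch below. The key names and values here are assumptions based on the Enable Lakehouse Storage guide, not a verbatim copy; treat that guide as authoritative:

```yaml title="server.yaml"
# Hypothetical sketch: Paimon as the Fluss lakehouse storage.
# Key names and values are assumptions; verify against the Fluss docs.
datalake.format: paimon
datalake.paimon.metastore: filesystem
datalake.paimon.warehouse: /tmp/paimon_warehouse
```

With a configuration like this, the StarRocks catalog would use `paimon.catalog.type = filesystem` and `paimon.catalog.warehouse = /tmp/paimon_warehouse`.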
Then, you can query the `orders` table using StarRocks:

```sql title="StarRocks SQL"
-- The table is in the database `fluss`
SELECT COUNT(*) FROM paimon_catalog.fluss.orders;
```

```sql title="StarRocks SQL"
-- Query the system tables to view snapshots of the table
SELECT * FROM paimon_catalog.fluss.enriched_orders$snapshots;
```
## Data Type Mapping
When integrating with Paimon, Fluss automatically converts between Fluss data types and Paimon data types.
The following table shows the mapping between [Fluss data types](table-design/data-types.md) and Paimon data types:

| Fluss Data Type | Paimon Data Type |
|-------------------------------|-------------------------------|
| ...                           | ...                           |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMP WITH LOCAL TIMEZONE | TIMESTAMP WITH LOCAL TIMEZONE |
| BINARY | BINARY |
| BYTES | BYTES |
