Commit 5c4225e

gyang94 and polyzos authored

[blog]: hands on fluss lakehouse (apache#1279)

* [blog]: hands on fluss lakehouse
* feat: hands on fluss lakehouse with paimon s3
* make a few adjustments

Co-authored-by: ipolyzos <[email protected]>

1 parent 2dd0d95 · commit 5c4225e

File tree: 6 files changed (+370, -1 lines)

Lines changed: 369 additions & 0 deletions (new file)
---
slug: hands-on-fluss-lakehouse
title: "Hands-on Fluss Lakehouse with Paimon S3"
authors: [gyang94]
toc_max_heading_level: 5
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Hands-on Fluss Lakehouse with Paimon S3

Fluss stores historical data in a lakehouse storage layer while keeping real-time data in the Fluss server. Its built-in tiering service continuously moves fresh events into the lakehouse, allowing various query engines to analyze both hot and cold data. The real magic happens with Fluss's union-read capability, which lets Flink jobs seamlessly query both the Fluss cluster and the lakehouse for truly integrated real-time processing.

![](assets/hands_on_fluss_lakehouse/streamhouse.png)

In this hands-on tutorial, we'll walk you through setting up a local Fluss lakehouse environment, running some practical data operations, and getting first-hand experience with the complete Fluss lakehouse architecture. By the end, you'll have a working environment for experimenting with Fluss's powerful data processing capabilities.

## Integrate Paimon S3 Lakehouse

For this tutorial, we'll use **Fluss 0.7** and **Flink 1.20** to run the tiering service on a local cluster. We'll configure **Paimon** as our lake format and **S3** as the storage backend. Let's get started:
### Minio Setup

1. Install Minio object storage locally.

Check out the official [guide](https://min.io/docs/minio/macos/index.html) for detailed instructions.
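If you're on macOS with Homebrew, one way to install the server is via Minio's Homebrew tap (a minimal sketch; see the guide above for other platforms):

```bash
# install the Minio server binary through the official Homebrew tap
brew install minio/stable/minio
```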
2. Start the Minio server.

Run this command, specifying a local path to store your Minio data:

```
minio server /tmp/minio-data
```

3. Verify the Minio WebUI.

When your Minio server is up and running, you'll see endpoint information and login credentials:

```
API: http://192.168.2.236:9000 http://127.0.0.1:9000
RootUser: minioadmin
RootPass: minioadmin

WebUI: http://192.168.2.236:61832 http://127.0.0.1:61832
RootUser: minioadmin
RootPass: minioadmin
```

Open the WebUI link and log in with these credentials.
4. Create a `fluss` bucket through the WebUI.

![](assets/hands_on_fluss_lakehouse/fluss-bucket.png)
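Alternatively, if you have the Minio `mc` client installed, the bucket can be created from the terminal using the credentials printed when the server started:

```bash
# point mc at the local Minio server
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin

# create the bucket that will back the Paimon warehouse
mc mb local/fluss
```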
### Fluss Cluster Setup

1. Download Fluss

Grab the Fluss 0.7 binary release from the [Fluss official site](https://fluss.apache.org/downloads/).

2. Add Dependencies

Download the `fluss-fs-s3-0.7.0.jar` from the [Fluss official site](https://fluss.apache.org/downloads/) and place it in your `<FLUSS_HOME>/lib` directory.

Next, download the `paimon-s3-1.0.1.jar` from the [Paimon official site](https://paimon.apache.org/docs/1.0/project/download/) and add it to `<FLUSS_HOME>/plugins/paimon`.
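Assuming both jars were downloaded into the current directory, placing them looks roughly like this (`<FLUSS_HOME>` stands for your Fluss installation path):

```bash
# S3 filesystem support for Fluss itself
cp fluss-fs-s3-0.7.0.jar <FLUSS_HOME>/lib/

# Paimon's S3 support belongs in the paimon plugin directory
cp paimon-s3-1.0.1.jar <FLUSS_HOME>/plugins/paimon/
```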
3. Configure the Data Lake

Edit your `<FLUSS_HOME>/conf/server.yaml` file and add these settings:

```yaml
data.dir: /tmp/fluss-data
remote.data.dir: /tmp/fluss-remote-data

datalake.format: paimon
datalake.paimon.metastore: filesystem
datalake.paimon.warehouse: s3://fluss/data
datalake.paimon.s3.endpoint: http://localhost:9000
datalake.paimon.s3.access-key: minioadmin
datalake.paimon.s3.secret-key: minioadmin
datalake.paimon.s3.path.style.access: true
```

This configures Paimon as the datalake format and points the warehouse at the `fluss` bucket we created in Minio; the endpoint and credentials match the Minio server started earlier.
4. Start Fluss

```bash
<FLUSS_HOME>/bin/local-cluster.sh start
```
### Flink Cluster Setup

1. Download Flink

Download the Flink 1.20 binary package from the [Flink downloads page](https://flink.apache.org/downloads/).

2. Add the Fluss Connector

Download `fluss-flink-1.20-0.7.0.jar` from the [Fluss official site](https://fluss.apache.org/downloads/) and copy it to:

```
<FLINK_HOME>/lib
```

3. Add Paimon Dependencies

- Download `paimon-flink-1.20-1.0.1.jar` and `paimon-s3-1.0.1.jar` from the [Paimon official site](https://paimon.apache.org/docs/1.0/project/download/) and place them in `<FLINK_HOME>/lib`.
- Copy these Paimon plugin jars from Fluss into `<FLINK_HOME>/lib`:

```
<FLUSS_HOME>/plugins/paimon/fluss-lake-paimon-0.7.0.jar
<FLUSS_HOME>/plugins/paimon/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
```
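Putting steps 2 and 3 together, copying everything into `<FLINK_HOME>/lib` looks roughly like this (a sketch, assuming the downloaded jars sit in the current directory):

```bash
# connector and Paimon jars downloaded above
cp fluss-flink-1.20-0.7.0.jar paimon-flink-1.20-1.0.1.jar paimon-s3-1.0.1.jar <FLINK_HOME>/lib/

# Paimon plugin jars shipped with Fluss
cp <FLUSS_HOME>/plugins/paimon/fluss-lake-paimon-0.7.0.jar \
   <FLUSS_HOME>/plugins/paimon/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar \
   <FLINK_HOME>/lib/
```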
4. Increase Task Slots

Edit `<FLINK_HOME>/conf/config.yaml` to increase the available task slots (in Flink 1.20's config.yaml the option sits under the `taskmanager` section):

```yaml
taskmanager:
  numberOfTaskSlots: 5
```
5. Start Flink

```bash
<FLINK_HOME>/bin/start-cluster.sh
```

6. Verify

Open your browser to `http://localhost:8081/` and make sure the cluster is running.
### Launching the Tiering Service

1. Get the Tiering Job Jar

Download the `fluss-flink-tiering-0.7.0.jar` from the [Fluss official site](https://fluss.apache.org/downloads/).
2. Submit the Job

```bash
<FLINK_HOME>/bin/flink run \
    <path_to_jar>/fluss-flink-tiering-0.7.0.jar \
    --fluss.bootstrap.servers localhost:9123 \
    --datalake.format paimon \
    --datalake.paimon.metastore filesystem \
    --datalake.paimon.warehouse s3://fluss/data \
    --datalake.paimon.s3.endpoint http://localhost:9000 \
    --datalake.paimon.s3.access-key minioadmin \
    --datalake.paimon.s3.secret-key minioadmin \
    --datalake.paimon.s3.path.style.access true
```

3. Confirm Deployment

Check the Flink UI for the **Fluss Lake Tiering Service** job. Once it's running, your local tiering pipeline is good to go.

![](assets/hands_on_fluss_lakehouse/tiering-serivce-job.png)
## Data Processing

Now let's dive into some actual data processing. We'll use the Flink SQL Client to interact with our Fluss lakehouse and run both batch and streaming queries.

1. Launch the SQL Client

```bash
<FLINK_HOME>/bin/sql-client.sh
```

2. Create the Catalog and Table

```sql
CREATE CATALOG fluss_catalog WITH (
    'type' = 'fluss',
    'bootstrap.servers' = 'localhost:9123'
);

USE CATALOG fluss_catalog;

CREATE TABLE t_user (
    `id` BIGINT,
    `name` STRING NOT NULL,
    `age` INT,
    `birth` DATE,
    PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s'
);
```
3. Write Some Data

Let's insert a couple of records:

```sql
SET 'execution.runtime-mode' = 'batch';
SET 'sql-client.execution.result-mode' = 'tableau';

INSERT INTO t_user(id, name, age, birth) VALUES
    (1, 'Alice', 18, DATE '2000-06-10'),
    (2, 'Bob', 20, DATE '2001-06-20');
```
4. Union Read

Now run a simple query to retrieve data from the table. By default, Flink will automatically combine data from both the Fluss cluster and the lakehouse:

```sql
Flink SQL> select * from t_user;
+----+-------+-----+------------+
| id | name  | age | birth      |
+----+-------+-----+------------+
|  1 | Alice |  18 | 2000-06-10 |
|  2 | Bob   |  20 | 2001-06-20 |
+----+-------+-----+------------+
```

If you want to read data only from the lake table, simply append `$lake` after the table name:

```sql
Flink SQL> select * from t_user$lake;
+----+-------+-----+------------+----------+----------+----------------------------+
| id | name  | age | birth      | __bucket | __offset | __timestamp                |
+----+-------+-----+------------+----------+----------+----------------------------+
|  1 | Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
|  2 | Bob   |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
+----+-------+-----+------------+----------+----------+----------------------------+
```

Great! Our records have been successfully synced to the data lake by the tiering service.

Notice the three system columns in the Paimon lake table: `__bucket`, `__offset`, and `__timestamp`. The `__bucket` column shows which bucket contains a row, while `__offset` and `__timestamp` record the offset and timestamp at which each row was written to the table and are mainly used for streaming reads.
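As an illustration, the system columns can be selected explicitly alongside the data columns when querying the lake table:

```sql
-- project the Paimon system columns next to the user data
SELECT id, name, `__bucket`, `__offset`, `__timestamp` FROM t_user$lake;
```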
5. Streaming Inserts

Let's switch to streaming mode and add two more records:

```sql
Flink SQL> SET 'execution.runtime-mode' = 'streaming';

Flink SQL> INSERT INTO t_user(id, name, age, birth) VALUES
    (3, 'Catlin', 25, DATE '2002-06-10'),
    (4, 'Dylan', 28, DATE '2003-06-20');
```

Now query the lake again:

```sql
Flink SQL> select * from t_user$lake;
+----+----+--------+-----+------------+----------+----------+----------------------------+
| op | id | name   | age | birth      | __bucket | __offset | __timestamp                |
+----+----+--------+-----+------------+----------+----------+----------------------------+
| +I |  1 | Alice  |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
| +I |  2 | Bob    |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |

Flink SQL> select * from t_user$lake;
+----+----+--------+-----+------------+----------+----------+----------------------------+
| op | id | name   | age | birth      | __bucket | __offset | __timestamp                |
+----+----+--------+-----+------------+----------+----------+----------------------------+
| +I |  1 | Alice  |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
| +I |  2 | Bob    |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
| +I |  3 | Catlin |  25 | 2002-06-10 |        0 |        2 | 2025-07-19 19:03:54.150000 |
| +I |  4 | Dylan  |  28 | 2003-06-20 |        0 |        3 | 2025-07-19 19:03:54.150000 |
```

The first time we queried, our new records hadn't been synced to the lake table yet. After waiting a moment (recall the table's datalake freshness of 30 seconds), they appeared.

Notice that the `__offset` and `__timestamp` values for these new records are no longer the default values. They now show the actual offset and timestamp at which the records were added to the table.
6. Inspect the Paimon Files

Open the Minio WebUI, and you'll see the Paimon files in your bucket:

![](assets/hands_on_fluss_lakehouse/fluss-bucket-data.png)

You can also check the Parquet files and manifests in your local filesystem under `/tmp/minio-data`. (Minio stores each object as a directory containing an `xl.meta` file, which is why the Parquet files show up as directories in the tree below.)
```
/tmp/minio-data ❯ tree .
.
└── fluss
    └── data
        ├── default.db__XLDIR__
        │   └── xl.meta
        └── fluss.db
            └── t_user
                ├── bucket-0
                │   ├── changelog-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-0.parquet
                │   │   └── xl.meta
                │   ├── changelog-f1853f1c-2588-4035-8233-e4804b1d8344-0.parquet
                │   │   └── xl.meta
                │   ├── data-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-1.parquet
                │   │   └── xl.meta
                │   └── data-f1853f1c-2588-4035-8233-e4804b1d8344-1.parquet
                │       └── xl.meta
                ├── manifest
                │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-0
                │   │   └── xl.meta
                │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-1
                │   │   └── xl.meta
                │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-0
                │   │   └── xl.meta
                │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-1
                │   │   └── xl.meta
                │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-0
                │   │   └── xl.meta
                │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-1
                │   │   └── xl.meta
                │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-2
                │   │   └── xl.meta
                │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-0
                │   │   └── xl.meta
                │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-1
                │   │   └── xl.meta
                │   └── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-2
                │       └── xl.meta
                ├── schema
                │   └── schema-0
                │       └── xl.meta
                └── snapshot
                    ├── LATEST
                    │   └── xl.meta
                    ├── snapshot-1
                    │   └── xl.meta
                    └── snapshot-2
                        └── xl.meta

28 directories, 19 files
```
7. View Snapshots

You can also check the snapshots from the system table by appending `$lake$snapshots` after the Fluss table name:

```sql
Flink SQL> select * from t_user$lake$snapshots;

+-------------+-----------+----------------------+-------------------------+-------------+----------+
| snapshot_id | schema_id | commit_user          | commit_time             | commit_kind | ...      |
+-------------+-----------+----------------------+-------------------------+-------------+----------+
|           1 |         0 | __fluss_lake_tiering | 2025-07-19 19:00:41.286 | APPEND      | ...      |
|           2 |         0 | __fluss_lake_tiering | 2025-07-19 19:04:38.964 | APPEND      | ...      |
+-------------+-----------+----------------------+-------------------------+-------------+----------+
2 rows in set (0.33 seconds)
```
## Summary

In this guide, we've explored the Fluss lakehouse architecture and set up a complete local environment with Fluss, Flink, Paimon, and S3. We've walked through practical examples of data processing that showcase how Fluss seamlessly integrates real-time and historical data. With this setup, you now have a solid foundation for experimenting with Fluss's powerful lakehouse capabilities on your own machine.

website/blog/authors.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -35,7 +35,7 @@ yuxia:
   image_url: https://github.com/luoyuxia.png
 
 gyang94:
-  name: GUO Yang
+  name: Yang Guo
   title: Fluss Contributor
   url: https://github.com/gyang94
   image_url: https://github.com/gyang94.png
```