---
slug: hands-on-fluss-lakehouse
title: "Hands-on Fluss Lakehouse with Paimon S3"
authors: [gyang94]
toc_max_heading_level: 5
---

<!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements. See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership. The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

# Hands-on Fluss Lakehouse with Paimon S3

Fluss stores historical data in a lakehouse storage layer while keeping real-time data in the Fluss server. Its built-in tiering service continuously moves fresh events into the lakehouse, allowing various query engines to analyze both hot and cold data. The real magic happens with Fluss's union-read capability, which lets Flink jobs seamlessly query both the Fluss cluster and the lakehouse for truly integrated real-time processing.



In this hands-on tutorial, we'll walk you through setting up a local Fluss lakehouse environment, running some practical data operations, and getting first-hand experience with the complete Fluss lakehouse architecture. By the end, you'll have a working environment for experimenting with Fluss's powerful data processing capabilities.

## Integrate Paimon S3 Lakehouse

For this tutorial, we'll use **Fluss 0.7** and **Flink 1.20** to run the tiering service on a local cluster. We'll configure **Paimon** as our lake format and **S3** as the storage backend. Let's get started.

### Minio Setup

1. Install Minio object storage locally.

   Check out the official [guide](https://min.io/docs/minio/macos/index.html) for detailed instructions.

2. Start the Minio server.

   Run this command, specifying a local path to store your Minio data:
   ```
   minio server /tmp/minio-data
   ```

3. Verify the Minio WebUI.

   When your Minio server is up and running, you'll see endpoint information and login credentials:

   ```
   API: http://192.168.2.236:9000 http://127.0.0.1:9000
   RootUser: minioadmin
   RootPass: minioadmin

   WebUI: http://192.168.2.236:61832 http://127.0.0.1:61832
   RootUser: minioadmin
   RootPass: minioadmin
   ```

   Open the WebUI link and log in with these credentials.

4. Create a `fluss` bucket through the WebUI (an `mc` command-line alternative is sketched after this list).

   

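If you'd rather stay in the terminal, the bucket can also be created with the MinIO client (`mc`). Below is a minimal sketch, assuming the default credentials and the API endpoint printed by the server above; adjust the alias name, endpoint, and credentials to match your setup:

```bash
# Register the local Minio server under the alias "local"
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin

# Create the bucket that will back the Paimon warehouse (s3://fluss/data)
mc mb local/fluss

# Confirm the bucket exists
mc ls local
```
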

### Fluss Cluster Setup

1. Download Fluss

   Grab the Fluss 0.7 binary release from the [Fluss official site](https://fluss.apache.org/downloads/).

2. Add Dependencies

   Download the `fluss-fs-s3-0.7.0.jar` from the [Fluss official site](https://fluss.apache.org/downloads/) and place it in your `<FLUSS_HOME>/lib` directory.

   Next, download the `paimon-s3-1.0.1.jar` from the [Paimon official site](https://paimon.apache.org/docs/1.0/project/download/) and add it to `<FLUSS_HOME>/plugins/paimon`.

3. Configure the Data Lake

   Edit your `<FLUSS_HOME>/conf/server.yaml` file and add these settings:
| 86 | + |
| 87 | + ```yaml |
| 88 | + data.dir: /tmp/fluss-data |
| 89 | + remote.data.dir: /tmp/fluss-remote-data |
| 90 | + |
| 91 | + datalake.format: paimon |
| 92 | + datalake.paimon.metastore: filesystem |
| 93 | + datalake.paimon.warehouse: s3://fluss/data |
| 94 | + datalake.paimon.s3.endpoint: http://localhost:9000 |
| 95 | + datalake.paimon.s3.access-key: minioadmin |
| 96 | + datalake.paimon.s3.secret-key: minioadmin |
| 97 | + datalake.paimon.s3.path.style.access: true |
| 98 | + ``` |
| 99 | +
|
| 100 | + This configures Paimon as the datalake format with S3 as the warehouse. |
| 101 | +
|
| 102 | +4. Start Fluss |
| 103 | +
|
| 104 | + ```bash |
| 105 | + <FLUSS_HOME>/bin/local-cluster.sh start |
| 106 | + ``` |

### Flink Cluster Setup

1. Download Flink

   Download the Flink 1.20 binary package from the [Flink downloads page](https://flink.apache.org/downloads/).

2. Add the Fluss Connector

   Download `fluss-flink-1.20-0.7.0.jar` from the [Fluss official site](https://fluss.apache.org/downloads/) and copy it to:

   ```
   <FLINK_HOME>/lib
   ```

3. Add Paimon Dependencies

   - Download `paimon-flink-1.20-1.0.1.jar` and `paimon-s3-1.0.1.jar` from the [Paimon official site](https://paimon.apache.org/docs/1.0/project/download/) and place them in `<FLINK_HOME>/lib`.
   - Copy these Paimon plugin jars from Fluss into `<FLINK_HOME>/lib`:

     ```
     <FLUSS_HOME>/plugins/paimon/fluss-lake-paimon-0.7.0.jar
     <FLUSS_HOME>/plugins/paimon/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
     ```

4. Increase Task Slots

   Edit `<FLINK_HOME>/conf/config.yaml` to increase the number of available task slots (the tiering job and your SQL queries each need slots to run):

   ```yaml
   taskmanager:
     numberOfTaskSlots: 5
   ```

5. Start Flink

   ```bash
   <FLINK_HOME>/bin/start-cluster.sh
   ```

6. Verify

   Open your browser to `http://localhost:8081/` and make sure the cluster is running. A command-line check is sketched right after this list.

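If you prefer the command line, Flink's REST API offers a quick health check. A minimal sketch, assuming the default REST port of 8081:

```bash
# Returns cluster-level stats (task managers, available slots, running jobs) as JSON
curl http://localhost:8081/overview
```
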
### Launching the Tiering Service

1. Get the Tiering Job Jar

   Download `fluss-flink-tiering-0.7.0.jar` from the [Fluss official site](https://fluss.apache.org/downloads/).

2. Submit the Job

   ```bash
   <FLINK_HOME>/bin/flink run \
     <path_to_jar>/fluss-flink-tiering-0.7.0.jar \
     --fluss.bootstrap.servers localhost:9123 \
     --datalake.format paimon \
     --datalake.paimon.metastore filesystem \
     --datalake.paimon.warehouse s3://fluss/data \
     --datalake.paimon.s3.endpoint http://localhost:9000 \
     --datalake.paimon.s3.access-key minioadmin \
     --datalake.paimon.s3.secret-key minioadmin \
     --datalake.paimon.s3.path.style.access true
   ```

3. Confirm Deployment

   Check the Flink UI for the **Fluss Lake Tiering Service** job. Once it's running, your local tiering pipeline is good to go.

   


## Data Processing

Now let's dive into some actual data processing. We'll use the Flink SQL Client to interact with our Fluss lakehouse and run both batch and streaming queries.

1. Launch the SQL Client

   ```bash
   <FLINK_HOME>/bin/sql-client.sh
   ```

2. Create the Catalog and Table

   ```sql
   CREATE CATALOG fluss_catalog WITH (
     'type' = 'fluss',
     'bootstrap.servers' = 'localhost:9123'
   );

   USE CATALOG fluss_catalog;

   CREATE TABLE t_user (
     `id` BIGINT,
     `name` STRING NOT NULL,
     `age` INT,
     `birth` DATE,
     PRIMARY KEY (`id`) NOT ENFORCED
   ) WITH (
     'table.datalake.enabled' = 'true',
     'table.datalake.freshness' = '30s'
   );
   ```

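   The `table.datalake.enabled` option opts the table into the tiering service, and `table.datalake.freshness` sets the target freshness, roughly how far the lake copy is allowed to lag behind the Fluss table. To double-check that the options took effect, you can inspect the table definition from the SQL client:

   ```sql
   -- The printed DDL should include the table.datalake.* options in its WITH clause
   SHOW CREATE TABLE t_user;
   ```
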
3. Write Some Data

   Let's insert a couple of records:

   ```sql
   SET 'execution.runtime-mode' = 'batch';
   SET 'sql-client.execution.result-mode' = 'tableau';

   INSERT INTO t_user(id, name, age, birth) VALUES
     (1, 'Alice', 18, DATE '2000-06-10'),
     (2, 'Bob', 20, DATE '2001-06-20');
   ```

4. Union Read

   Now run a simple query to retrieve data from the table. By default, Flink automatically combines data from both the Fluss cluster and the lakehouse:

   ```sql
   Flink SQL> select * from t_user;
   +----+-------+-----+------------+
   | id | name  | age | birth      |
   +----+-------+-----+------------+
   |  1 | Alice |  18 | 2000-06-10 |
   |  2 | Bob   |  20 | 2001-06-20 |
   +----+-------+-----+------------+
   ```

   If you want to read data only from the lake table, simply append `$lake` after the table name:

   ```sql
   Flink SQL> select * from t_user$lake;
   +----+-------+-----+------------+----------+----------+----------------------------+
   | id | name  | age | birth      | __bucket | __offset | __timestamp                |
   +----+-------+-----+------------+----------+----------+----------------------------+
   |  1 | Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
   |  2 | Bob   |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
   +----+-------+-----+------------+----------+----------+----------------------------+
   ```

   Great! Our records have been successfully synced to the data lake by the tiering service.

   Notice the three system columns in the Paimon lake table: `__bucket`, `__offset`, and `__timestamp`. The `__bucket` column shows which bucket a row belongs to, while `__offset` and `__timestamp` record the offset and the time at which a row was added; they are mainly used for streaming data processing. Since they behave like ordinary columns of the lake table, you can also filter on them, as sketched below.

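   For example, this hypothetical filter (the cutoff timestamp is purely for illustration) restricts a lake read to rows tiered after a given point in time:

   ```sql
   -- __timestamp is one of the system columns shown above
   SELECT id, name, __offset, __timestamp
   FROM t_user$lake
   WHERE __timestamp > TIMESTAMP '2025-01-01 00:00:00';
   ```
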
5. Streaming Inserts

   Let's switch to streaming mode and add two more records:

   ```sql
   Flink SQL> SET 'execution.runtime-mode' = 'streaming';

   Flink SQL> INSERT INTO t_user(id, name, age, birth) VALUES
     (3, 'Catlin', 25, DATE '2002-06-10'),
     (4, 'Dylan', 28, DATE '2003-06-20');
   ```

   Now query the lake again:

   ```sql
   Flink SQL> select * from t_user$lake;
   +----+----+--------+-----+------------+----------+----------+----------------------------+
   | op | id | name   | age | birth      | __bucket | __offset | __timestamp                |
   +----+----+--------+-----+------------+----------+----------+----------------------------+
   | +I |  1 | Alice  |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
   | +I |  2 | Bob    |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
   +----+----+--------+-----+------------+----------+----------+----------------------------+

   Flink SQL> select * from t_user$lake;
   +----+----+--------+-----+------------+----------+----------+----------------------------+
   | op | id | name   | age | birth      | __bucket | __offset | __timestamp                |
   +----+----+--------+-----+------------+----------+----------+----------------------------+
   | +I |  1 | Alice  |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
   | +I |  2 | Bob    |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
   | +I |  3 | Catlin |  25 | 2002-06-10 |        0 |        2 | 2025-07-19 19:03:54.150000 |
   | +I |  4 | Dylan  |  28 | 2003-06-20 |        0 |        3 | 2025-07-19 19:03:54.150000 |
   +----+----+--------+-----+------------+----------+----------+----------------------------+
   ```

   The first time we queried, our new records hadn't been synced to the lake table yet. After waiting a moment, they appeared.

   Notice that the `__offset` and `__timestamp` values for these new records are no longer the default values; they now show the actual offset and timestamp at which the records were added to the table.

6. Inspect the Paimon Files

   Open the Minio WebUI, and you'll see the Paimon files in your bucket:

   

   You can also check the Parquet files and manifests in your local filesystem under `/tmp/minio-data` (Minio stores each object as a directory containing an `xl.meta` file, which is why the Parquet files appear as directories here):

   ```
   /tmp/minio-data ❯ tree .
   .
   └── fluss
       └── data
           ├── default.db__XLDIR__
           │   └── xl.meta
           └── fluss.db
               └── t_user
                   ├── bucket-0
                   │   ├── changelog-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-0.parquet
                   │   │   └── xl.meta
                   │   ├── changelog-f1853f1c-2588-4035-8233-e4804b1d8344-0.parquet
                   │   │   └── xl.meta
                   │   ├── data-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-1.parquet
                   │   │   └── xl.meta
                   │   └── data-f1853f1c-2588-4035-8233-e4804b1d8344-1.parquet
                   │       └── xl.meta
                   ├── manifest
                   │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-0
                   │   │   └── xl.meta
                   │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-1
                   │   │   └── xl.meta
                   │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-0
                   │   │   └── xl.meta
                   │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-1
                   │   │   └── xl.meta
                   │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-0
                   │   │   └── xl.meta
                   │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-1
                   │   │   └── xl.meta
                   │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-2
                   │   │   └── xl.meta
                   │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-0
                   │   │   └── xl.meta
                   │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-1
                   │   │   └── xl.meta
                   │   └── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-2
                   │       └── xl.meta
                   ├── schema
                   │   └── schema-0
                   │       └── xl.meta
                   └── snapshot
                       ├── LATEST
                       │   └── xl.meta
                       ├── snapshot-1
                       │   └── xl.meta
                       └── snapshot-2
                           └── xl.meta

   28 directories, 19 files
   ```

7. View Snapshots

   You can also check the snapshots from the system table by appending `$lake$snapshots` after the Fluss table name (a direct Paimon query against the same warehouse is sketched after this list):

   ```sql
   Flink SQL> select * from t_user$lake$snapshots;

   +-------------+-----------+----------------------+-------------------------+-------------+-----+
   | snapshot_id | schema_id | commit_user          | commit_time             | commit_kind | ... |
   +-------------+-----------+----------------------+-------------------------+-------------+-----+
   |           1 |         0 | __fluss_lake_tiering | 2025-07-19 19:00:41.286 |      APPEND | ... |
   |           2 |         0 | __fluss_lake_tiering | 2025-07-19 19:04:38.964 |      APPEND | ... |
   +-------------+-----------+----------------------+-------------------------+-------------+-----+
   2 rows in set (0.33 seconds)
   ```

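Because the tiering service writes standard Paimon tables, you can also read the lake data directly through a Paimon catalog in Flink, bypassing Fluss entirely. The following is a minimal sketch, assuming the warehouse path and Minio credentials used throughout this tutorial and the Paimon jars already on `<FLINK_HOME>/lib`; the option spelling follows Paimon's Flink catalog and may vary by version:

```sql
CREATE CATALOG paimon_catalog WITH (
  'type' = 'paimon',
  'warehouse' = 's3://fluss/data',
  's3.endpoint' = 'http://localhost:9000',
  's3.access-key' = 'minioadmin',
  's3.secret-key' = 'minioadmin',
  's3.path.style.access' = 'true'
);

USE CATALOG paimon_catalog;

-- The tiered table lives in the `fluss` database of the warehouse (see the tree output above)
SELECT * FROM fluss.t_user;
```
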
## Summary

In this guide, we've explored the Fluss lakehouse architecture and set up a complete local environment with Fluss, Flink, Paimon, and S3. We've walked through practical examples of data processing that showcase how Fluss seamlessly integrates real-time and historical data. With this setup, you now have a solid foundation for experimenting with Fluss's powerful lakehouse capabilities on your own machine.