File: website/docs/streaming-lakehouse/integrate-data-lakes/iceberg.md
To integrate Fluss with Iceberg, you must enable lakehouse storage and configure Iceberg as the lakehouse storage.

> **NOTE**: Iceberg requires JDK11 or later. Please ensure that both your Fluss deployment and the Flink cluster used for tiering services are running on JDK11+.
## Configure Iceberg as LakeHouse Storage

### Configure Iceberg in Cluster Configurations
To configure Iceberg as the lakehouse storage, set the following options in `server.yaml`:
```yaml
datalake.iceberg.type: hadoop
datalake.iceberg.warehouse: /tmp/iceberg
```
#### Configuration Processing
Fluss strips the `datalake.iceberg.` prefix from these options and uses the remaining key-value pairs to initialize the Iceberg catalog.
This approach enables passing custom configurations for Iceberg catalog initialization. Check out the [Iceberg Catalog Properties](https://iceberg.apache.org/docs/1.9.1/configuration/#catalog-properties) for more details on available catalog configurations.
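As an illustration of that prefix handling, here is a small Python sketch (not Fluss's actual implementation; the option values are the ones from the example above):

```python
def strip_datalake_prefix(options, prefix="datalake.iceberg."):
    """Keep only the options under the prefix and drop the prefix itself,
    mirroring how the remaining keys reach the Iceberg catalog."""
    return {key[len(prefix):]: value
            for key, value in options.items()
            if key.startswith(prefix)}

server_yaml = {
    "datalake.iceberg.type": "hadoop",
    "datalake.iceberg.warehouse": "/tmp/iceberg",
}

catalog_props = strip_datalake_prefix(server_yaml)
print(catalog_props)  # {'type': 'hadoop', 'warehouse': '/tmp/iceberg'}
```

Any key that does not carry the `datalake.iceberg.` prefix is simply not forwarded to the catalog.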
#### Supported Catalog Types
Fluss supports all Iceberg-compatible catalog types.

Fluss only bundles catalog implementations included in the `iceberg-core` module.
The Iceberg version that Fluss bundles is based on `1.9.1`. Please ensure the JARs you add are compatible with `Iceberg-1.9.1`.
#### Important Notes
- Ensure all JAR files are compatible with Iceberg 1.9.1
- If using an existing Hadoop environment, it's recommended to use the `HADOOP_CLASSPATH` environment variable
- Configuration changes take effect after restarting the Fluss service
### Start Tiering Service to Iceberg
To tier Fluss's data to Iceberg, you must start the datalake tiering service. For guidance, you can refer to [Start The Datalake Tiering Service](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service). Although the example uses Paimon, the process is also applicable to Iceberg.
#### Prerequisites: Hadoop Dependencies
**Important**: Iceberg has a strong dependency on Hadoop. You must ensure Hadoop-related classes are available in the classpath before starting the tiering service.
##### Option 1: Use Existing Hadoop Environment (Recommended)
Follow the dependency management guidelines below for the [Prepare required jars](maintenance/tiered-storage/lakehouse-storage.md#prepare-required-jars) step:
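If you use an existing Hadoop environment, the usual way to expose it is the standard Flink/Hadoop idiom below (a sketch, not a Fluss-specific command; it assumes a `hadoop` binary on the `PATH`):

```shell
# Make the local Hadoop distribution's classes visible to the tiering service
export HADOOP_CLASSPATH=$(hadoop classpath)
```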
```
iceberg-aws-bundle-1.9.1.jar
failsafe-3.3.2.jar
```
#### Start Datalake Tiering Service
When following the [Start Datalake Tiering Service](maintenance/tiered-storage/lakehouse-storage.md#start-datalake-tiering-service) guide, use Iceberg-specific configurations as parameters when starting the Flink tiering job:
```
--datalake.iceberg.warehouse /tmp/iceberg
```
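The overall invocation shape can be sketched as follows (the jar path and bootstrap address are placeholders and assumptions, not taken from this guide; only the `datalake.iceberg.*` parameters come from the configuration above):

```shell
# Hypothetical sketch of launching the Flink tiering job with Iceberg parameters
${FLINK_HOME}/bin/flink run <path-to-fluss-flink-tiering-jar> \
    --fluss.bootstrap.servers localhost:9123 \
    --datalake.format iceberg \
    --datalake.iceberg.type hadoop \
    --datalake.iceberg.warehouse /tmp/iceberg
```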
#### Important Notes
- Ensure all JAR files are compatible with Iceberg 1.9.1
194
194
- Verify that all required dependencies are in the `${FLINK_HOME}/lib` directory
When a Fluss table is created or altered with the option `'table.datalake.enabled' = 'true'`, a corresponding Iceberg table is created automatically.
The schema of the Iceberg table matches that of the Fluss table, except for the addition of three system columns at the end: `__bucket`, `__offset`, and `__timestamp`.
These system columns help Fluss clients consume data from Iceberg in a streaming fashion, such as seeking by a specific bucket using an offset or timestamp.
### Basic Configuration
Here is an example using Flink SQL to create a table with data lake enabled:
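A minimal sketch of such a statement (the table and column names are illustrative; the `table.datalake.enabled` option is the one described above):

```sql
CREATE TABLE user_profiles (
    user_id BIGINT,
    name STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true'
);
```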
You can also specify Iceberg [table properties](https://iceberg.apache.org/docs/latest/configuration/#table-properties) when creating a datalake-enabled Fluss table by using the `iceberg.` prefix within the Fluss table properties clause.
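For instance, the following sketch passes a standard Iceberg table property through the `iceberg.` prefix (the table and columns are illustrative; `write.format.default` is a documented Iceberg table property):

```sql
CREATE TABLE user_profiles (
    user_id BIGINT,
    name STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'iceberg.write.format.default' = 'orc'
);
```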
Primary key tables in Fluss are mapped to Iceberg tables with:
```sql
CREATE TABLE user_profiles (
    ...
)
SORTED BY (__offset ASC);
```
### Log Tables
The table mapping for Fluss log tables varies depending on whether the bucket key is specified or not.
```sql
CREATE TABLE order_events (
    ...
)
SORTED BY (__offset ASC);
```
### Partitioned Tables
For Fluss partitioned tables, Iceberg first partitions by Fluss partition keys, then follows the above rules:
```sql
CREATE TABLE daily_sales (
    ...
)
SORTED BY (__offset ASC);
```
### System Columns
All Iceberg tables created by Fluss include three system columns:
| Column | Description |
|--------|-------------|
| `__bucket` | The Fluss bucket the record belongs to |
| `__offset` | The record's offset within the bucket's log |
| `__timestamp` | The record's timestamp in Fluss |
## Read Tables
### Reading with Apache Flink
When a table has the configuration `table.datalake.enabled = 'true'`, its data exists in two layers:
Key behavior for data retention:
- **Expired Fluss log data** (controlled by `table.log.ttl`) remains accessible via Iceberg if previously tiered
- **Cleaned-up partitions** in partitioned tables (controlled by `table.auto-partition.num-retention`) remain accessible via Iceberg if previously tiered
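The two layers can be queried from Flink SQL roughly as sketched below (an assumption based on Fluss's `$lake` suffix convention; the table name is illustrative):

```sql
-- Union read: fresh data from Fluss plus tiered data from Iceberg
SELECT COUNT(*) FROM user_profiles;

-- Read only the data already tiered to Iceberg
SELECT COUNT(*) FROM user_profiles$lake;
```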
### Reading with Other Engines
Since data tiered to Iceberg from Fluss is stored as standard Iceberg tables, you can use any Iceberg-compatible engine. Below is an example using [StarRocks](https://docs.starrocks.io/docs/data_source/catalog/iceberg/iceberg_catalog/):
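A sketch of the StarRocks side (an assumption, to be checked against your deployment: it uses a REST catalog, since StarRocks' Iceberg catalog types may not include the Hadoop catalog configured above; the catalog name, database, endpoint, and table are illustrative):

```sql
CREATE EXTERNAL CATALOG fluss_iceberg
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://localhost:8181"
);

SELECT COUNT(*) FROM fluss_iceberg.my_db.user_profiles;
```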
When integrating with Iceberg, Fluss automatically converts between Fluss data types and Iceberg data types.
## Maintenance and Optimization
### Auto Compaction
The table option `table.datalake.auto-compaction` (disabled by default) provides per-table control over automatic compaction.
When enabled for a specific table, compaction is automatically triggered during write operations to that table by the tiering service.
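For example (a sketch; the table and columns are illustrative, while the option names are the ones described above):

```sql
CREATE TABLE example_table (
    id BIGINT,
    payload STRING
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.auto-compaction' = 'true'
);
```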
Benefits of auto compaction:

- **Storage**: Optimizes storage usage by removing duplicate data
- **Maintenance**: Automatically handles data organization
### Snapshot Metadata
Fluss adds specific metadata to Iceberg snapshots for traceability:
For partitioned tables, the metadata structure includes partition information:

| Key | Description | Example |
|-----|-------------|---------|
| `offset` | Offset within the partition's log | `3`, `1000` |
## Current Limitations
- **Complex Types**: Array, Map, and Row types are not supported
- **Multiple bucket keys**: Not supported until Iceberg implements multi-argument partition transforms