Commit ce4081c

[lake/docs] Update lakehouse storage pages to adapt to new architecture (#1117)
1 parent 009120e commit ce4081c

File tree

1 file changed: +60 -23 lines changed

website/docs/maintenance/tiered-storage/lakehouse-storage.md

Lakehouse combines data lake scalability and cost-effectiveness with data warehouse performance and reliability.

Fluss leverages well-known Lakehouse storage solutions like Apache Paimon, Apache Iceberg, Apache Hudi, and Delta Lake as the tiered storage layer. Currently, only Apache Paimon is supported, but support for more kinds of Lakehouse storage is on the way.

Fluss's datalake tiering service continuously tiers Fluss's data to the Lakehouse storage. The data in Lakehouse storage can be read by Fluss's client in a streaming manner and accessed directly by external systems such as Flink, Spark, StarRocks, and others. With data tiered in Lakehouse storage, Fluss gains a significant reduction in storage cost and improved analytics performance.

## Enable Lakehouse Storage

Lakehouse Storage is disabled by default; you must enable it manually.

### Lakehouse Storage Cluster Configurations

#### Modify `server.yaml`

First, you must configure the lakehouse storage in `server.yaml`. Taking Paimon as an example, you must set the following configurations:

```yaml
# Paimon configuration
datalake.format: paimon

# the catalog config about Paimon, assuming using Filesystem catalog
datalake.paimon.metastore: filesystem
datalake.paimon.warehouse: /tmp/paimon_data_warehouse
```

Fluss processes these Paimon configurations by stripping the `datalake.paimon.` prefix and then uses the remaining options to create the Paimon catalog.
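The prefix-stripping rule can be illustrated with a small shell demo (this is only an illustration of the mapping, not Fluss's actual implementation):

```shell
# Keys under `datalake.paimon.` in server.yaml become plain Paimon
# catalog options once the prefix is removed.
printf '%s\n' \
  'datalake.paimon.metastore: filesystem' \
  'datalake.paimon.warehouse: /tmp/paimon_data_warehouse' \
  | sed 's/^datalake\.paimon\.//'
```

This prints `metastore: filesystem` and `warehouse: /tmp/paimon_data_warehouse`, which are the options Paimon's catalog actually receives.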

For example, if you want to use a Hive catalog, you can configure it as follows:

```yaml
datalake.format: paimon
datalake.paimon.metastore: hive
datalake.paimon.uri: thrift://<hive-metastore-host-name>:<port>
datalake.paimon.warehouse: hdfs:///path/to/warehouse
```

#### Add other jars required by datalake

While Fluss includes the core Paimon library, additional jars may still need to be added manually to `${FLUSS_HOME}/plugins/paimon/` according to your needs.
For example, for OSS filesystem support, you need to put the `paimon-oss` jar into the directory `${FLUSS_HOME}/plugins/paimon/`.

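As a concrete sketch, staging such a jar looks like the following (the paths and the jar version are placeholders for illustration, not values from the Fluss docs):

```shell
# Hypothetical locations; substitute the jar you actually downloaded.
FLUSS_HOME=${FLUSS_HOME:-/tmp/fluss}
mkdir -p "${FLUSS_HOME}/plugins/paimon"
touch /tmp/paimon-oss-0.9.0.jar                 # stand-in for the real downloaded jar
cp /tmp/paimon-oss-0.9.0.jar "${FLUSS_HOME}/plugins/paimon/"
ls "${FLUSS_HOME}/plugins/paimon/"
```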
### Start The Datalake Tiering Service

Then, you must start the datalake tiering service to tier Fluss's data to the lakehouse storage.

#### Prerequisites

- A running Flink cluster (currently only Flink is supported as the tiering backend)
- The [fluss-flink-tiering-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/com/alibaba/fluss/fluss-flink-tiering/$FLUSS_VERSION$/fluss-flink-tiering-$FLUSS_VERSION$.jar) downloaded

#### Prepare required jars

- Put the [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`; choose the connector version matching your Flink version. If you're using Flink 1.20, use `fluss-flink-1.20-$FLUSS_VERSION$.jar`.
- If you use [Amazon S3](http://aws.amazon.com/s3/), [Aliyun OSS](https://www.aliyun.com/product/oss) or [HDFS (Hadoop Distributed File System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote storage](maintenance/tiered-storage/remote-storage.md), download the corresponding [Fluss filesystem jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`.
- Put the [fluss-lake-paimon jar](https://repo1.maven.org/maven2/com/alibaba/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`. Currently only Paimon is supported, so `fluss-lake-paimon` is the only choice.
- [Download](https://flink.apache.org/downloads/) the pre-bundled Hadoop jar `flink-shaded-hadoop-2-uber-*.jar` and put it into `${FLINK_HOME}/lib`.
- Put Paimon's filesystem jar into `${FLINK_HOME}/lib`; if you use S3 to store Paimon data, put the `paimon-s3` jar into `${FLINK_HOME}/lib`.
- Any other jars that Paimon may require; for example, if you use HiveCatalog, you will need to put the Hive-related jars there as well.

#### Start Datalake Tiering Service

After the Flink cluster has been started, you can run `fluss-flink-tiering-$FLUSS_VERSION$.jar` with the following command to start the datalake tiering service:

```shell
<FLINK_HOME>/bin/flink run /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \
    --fluss.bootstrap.servers localhost:9123 \
    --datalake.format paimon \
    --datalake.paimon.metastore filesystem \
    --datalake.paimon.warehouse /tmp/paimon
```

**Note:**
- The `fluss.bootstrap.servers` should be the bootstrap server address of your Fluss cluster. You must pass every option with the `datalake.` prefix configured in [server.yaml](#modify-serveryaml) to run the tiering service. In this case, these parameters are `--datalake.format`, `--datalake.paimon.metastore`, and `--datalake.paimon.warehouse`.
- The Flink tiering service is stateless, and you can run multiple tiering services simultaneously to tier tables in Fluss. These tiering services are coordinated by the Fluss cluster to ensure exactly-once semantics when tiering data to the lake storage. This means you can freely scale the service up or down according to your workload.
- This follows the standard practice for [submitting jobs to Flink](https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/deployment/cli/), where you can use the `-D` parameter to specify Flink-related configurations. For example, if you want to set the tiering service job name to `My Fluss Tiering Service1` and use `3` as the job parallelism, you can use the following command:
```shell
<FLINK_HOME>/bin/flink run \
    -Dpipeline.name="My Fluss Tiering Service1" \
    -Dparallelism.default=3 \
    /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \
    --fluss.bootstrap.servers localhost:9123 \
    --datalake.format paimon \
    --datalake.paimon.metastore filesystem \
    --datalake.paimon.warehouse /tmp/paimon
```

### Enable Lakehouse Storage Per Table

To enable lakehouse storage for a table, the table must be created with the option `'table.datalake.enabled' = 'true'`.

Another option, `table.datalake.freshness`, allows per-table configuration of data freshness in the datalake. It defines the maximum amount of time that the datalake table's content should lag behind updates to the Fluss table. Based on this target freshness, the Fluss tiering service automatically moves data from the Fluss table and updates the datalake table, so that the data in the datalake table is kept up to date within this target. The default is `3min`; if the data does not need to be as fresh, you can specify a longer target freshness time to reduce costs.
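For instance, a table with lakehouse storage enabled and a relaxed freshness target could be declared like this (the table name and columns are illustrative; only the two `table.datalake.*` options come from this page):

```sql
CREATE TABLE my_datalake_table (
    id INT,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '10min'   -- longer than the 3min default, trading freshness for cost
);
```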
