
Commit 49eeec1

[lake/lance] add documentation for Lance connector (#1587)

Authored by: xx789633, luoyuxia, leonardBang
Co-authored-by: maxcwang <[email protected]>
Co-authored-by: luoyuxia <[email protected]>
Co-authored-by: Leonard Xu <[email protected]>

1 parent 2c73e76 commit 49eeec1

File tree

3 files changed: +130 -3 lines changed

website/docs/maintenance/tiered-storage/lakehouse-storage.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -18,7 +18,10 @@ can gain much storage cost reduction and analytics performance improvement.
 
 ## Enable Lakehouse Storage
 
-Lakehouse Storage is disabled by default, you must enable it manually.
+Lakehouse Storage is disabled by default; you must enable it manually.
+
+The following example uses Paimon for demonstration. Other data lake formats follow similar steps, but require different configuration settings and JAR files.
+You can refer to the documentation of the corresponding data lake format integration for the required configurations and JAR files.
 
 ### Lakehouse Storage Cluster Configurations
 #### Modify `server.yaml`
@@ -55,7 +58,7 @@ Then, you must start the datalake tiering service to tier Fluss's data to the la
 - Put the [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`. Choose a connector version matching your Flink version; if you're using Flink 1.20, use [fluss-flink-1.20-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_VERSION$/fluss-flink-1.20-$FLUSS_VERSION$.jar)
 - If you are using [Amazon S3](http://aws.amazon.com/s3/), [Aliyun OSS](https://www.aliyun.com/product/oss) or [HDFS (Hadoop Distributed File System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote storage](maintenance/tiered-storage/remote-storage.md), download the corresponding [Fluss filesystem jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`
-- Put [fluss-lake-paimon jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`, currently only paimon is supported, so you can only choose `fluss-lake-paimon`
+- Put the [fluss-lake-paimon jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`
 - [Download](https://flink.apache.org/downloads/) the pre-bundled Hadoop jar `flink-shaded-hadoop-2-uber-*.jar` and put it into `${FLINK_HOME}/lib`
 - Put Paimon's [filesystem jar](https://paimon.apache.org/docs/1.1/project/download/) into `${FLINK_HOME}/lib`; if you use S3 to store Paimon data, put the `paimon-s3` jar into `${FLINK_HOME}/lib`
 - Any other jars Paimon may require; for example, if you use HiveCatalog, you will need to add the Hive-related jars
```
website/docs/integrate-data-lakes/lance.md (new file)

Lines changed: 124 additions & 0 deletions

---
title: Lance
sidebar_position: 3
---

# Lance

[Lance](https://lancedb.github.io/lance/) is a modern table format optimized for machine learning and AI applications.
To integrate Fluss with Lance, you must enable lakehouse storage and configure Lance as the lakehouse storage. For more details, see [Enable Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).

## Configure Lance as LakeHouse Storage

### Configure Lance in Cluster Configurations

To configure Lance as the lakehouse storage, set the following options in `server.yaml`:

```yaml
# Lance configuration
datalake.format: lance

# Currently only the local file system and object stores such as AWS S3
# (and compatible stores) are supported as storage backends for Lance.
# To use S3 as the Lance storage backend, specify the following properties:
datalake.lance.warehouse: s3://<bucket>
datalake.lance.endpoint: <endpoint>
datalake.lance.allow_http: true
datalake.lance.access_key_id: <access_key_id>
datalake.lance.secret_access_key: <secret_access_key>

# To use the local file system as the Lance storage backend, you only need to
# specify the following property:
# datalake.lance.warehouse: /tmp/lance
```

When a table is created or altered with the option `'table.datalake.enabled' = 'true'`, Fluss will automatically create a corresponding Lance table at the path `<warehouse_path>/<database_name>/<table_name>.lance`.
The schema of the Lance table matches that of the Fluss table.
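The path convention above can be sketched in a few lines (a hypothetical illustration of the naming rule, not actual Fluss code; the function name is made up):

```python
# Hypothetical sketch of the Lance table path convention described above.
# This illustrates the naming rule only; it is not Fluss internals.
def lance_table_path(warehouse_path: str, database_name: str, table_name: str) -> str:
    # Each datalake-enabled table lives under the configured warehouse path.
    return f"{warehouse_path}/{database_name}/{table_name}.lance"

print(lance_table_path("s3://my-bucket", "fluss", "fluss_order_with_lake"))
# s3://my-bucket/fluss/fluss_order_with_lake.lance
```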

```sql title="Flink SQL"
USE CATALOG fluss_catalog;

CREATE TABLE fluss_order_with_lake (
    `order_id` BIGINT,
    `item_id` BIGINT,
    `amount` INT,
    `address` STRING
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s'
);
```

### Start Tiering Service to Lance

Then, you must start the datalake tiering service to tier Fluss's data to Lance. For guidance, refer to [Start The Datalake Tiering Service](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service). Although that example uses Paimon, the process also applies to Lance.

In the [Prepare required jars](maintenance/tiered-storage/lakehouse-storage.md#prepare-required-jars) step, however, follow this guidance instead:
- Put the [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`. Choose a connector version matching your Flink version; if you're using Flink 1.20, use [fluss-flink-1.20-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_VERSION$/fluss-flink-1.20-$FLUSS_VERSION$.jar)
- If you are using [Amazon S3](http://aws.amazon.com/s3/), [Aliyun OSS](https://www.aliyun.com/product/oss) or [HDFS (Hadoop Distributed File System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote storage](maintenance/tiered-storage/remote-storage.md), download the corresponding [Fluss filesystem jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`
- Put the [fluss-lake-lance jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-lance/$FLUSS_VERSION$/fluss-lake-lance-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`

Additionally, when following the [Start Datalake Tiering Service](maintenance/tiered-storage/lakehouse-storage.md#start-datalake-tiering-service) guide, make sure to pass Lance-specific configurations as parameters when starting the Flink tiering job:

```shell
<FLINK_HOME>/bin/flink run /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \
    --fluss.bootstrap.servers localhost:9123 \
    --datalake.format lance \
    --datalake.lance.warehouse s3://<bucket> \
    --datalake.lance.endpoint <endpoint> \
    --datalake.lance.allow_http true \
    --datalake.lance.secret_access_key <secret_access_key> \
    --datalake.lance.access_key_id <access_key_id>
```

> **NOTE**: Fluss v0.8 only supports tiering log tables to Lance.

The datalake tiering service then continuously tiers data from Fluss to Lance. The parameter `table.datalake.freshness` controls how frequently Fluss writes data to Lance tables; by default, the data freshness is 3 minutes.

You can also specify Lance table properties when creating a datalake-enabled Fluss table by using the `lance.` prefix within the Fluss table properties clause.

```sql title="Flink SQL"
CREATE TABLE fluss_order_with_lake (
    `order_id` BIGINT,
    `item_id` BIGINT,
    `amount` INT,
    `address` STRING
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s',
    'lance.max_row_per_file' = '512'
);
```

For example, you can specify the property `max_row_per_file` to control the write behavior when Fluss tiers data to Lance.
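The `lance.` prefix convention can be illustrated with a small sketch (this mimics the option-passing rule described above; it is not Fluss internals):

```python
# Illustration of the `lance.` prefix convention: table properties carrying
# the prefix in the Fluss table definition are handed to Lance with the
# prefix stripped. This is a sketch, not actual Fluss code.
fluss_table_properties = {
    "table.datalake.enabled": "true",
    "table.datalake.freshness": "30s",
    "lance.max_row_per_file": "512",
}

PREFIX = "lance."
lance_properties = {
    key[len(PREFIX):]: value
    for key, value in fluss_table_properties.items()
    if key.startswith(PREFIX)
}
print(lance_properties)  # {'max_row_per_file': '512'}
```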

## Reading with Lance ecosystem tools

Since the data tiered to Lance from Fluss is stored as a standard Lance table, you can use any tool that supports Lance to read it. Below is an example using [pylance](https://pypi.org/project/pylance/):

```python title="Lance Python"
import lance
ds = lance.dataset("<warehouse_path>/<database_name>/<table_name>.lance")
```

## Data Type Mapping

Lance internally stores data in Arrow format.
When integrating with Lance, Fluss automatically converts between Fluss data types and Lance data types.
The following table shows the mapping between [Fluss data types](table-design/data-types.md) and Lance data types:

| Fluss Data Type               | Lance Data Type |
|-------------------------------|-----------------|
| BOOLEAN                       | Bool            |
| TINYINT                       | Int8            |
| SMALLINT                      | Int16           |
| INT                           | Int32           |
| BIGINT                        | Int64           |
| FLOAT                         | Float32         |
| DOUBLE                        | Float64         |
| DECIMAL                       | Decimal128      |
| STRING                        | Utf8            |
| CHAR                          | Utf8            |
| DATE                          | Date            |
| TIME                          | Time            |
| TIMESTAMP                     | Timestamp       |
| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp       |
| BINARY                        | FixedSizeBinary |
| BYTES                         | Binary          |
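If you need this mapping programmatically, it can be expressed as a plain lookup table (a sketch mirroring the table above; it is not a Fluss API):

```python
# The Fluss -> Lance type mapping from the table above, as a plain dict.
# Note that CHAR and STRING both map to Utf8, and both TIMESTAMP variants
# map to Timestamp, so the mapping is not invertible.
FLUSS_TO_LANCE = {
    "BOOLEAN": "Bool",
    "TINYINT": "Int8",
    "SMALLINT": "Int16",
    "INT": "Int32",
    "BIGINT": "Int64",
    "FLOAT": "Float32",
    "DOUBLE": "Float64",
    "DECIMAL": "Decimal128",
    "STRING": "Utf8",
    "CHAR": "Utf8",
    "DATE": "Date",
    "TIME": "Time",
    "TIMESTAMP": "Timestamp",
    "TIMESTAMP WITH LOCAL TIMEZONE": "Timestamp",
    "BINARY": "FixedSizeBinary",
    "BYTES": "Binary",
}

print(FLUSS_TO_LANCE["BIGINT"])  # Int64
```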

website/docs/streaming-lakehouse/overview.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -44,4 +44,4 @@ Some powerful features it provided are:
 - **Analytical Streams**: The union reads give data streams powerful analytics capabilities. This reduces complexity when developing streaming applications, simplifies debugging, and allows for immediate access to live data insights.
 - **Connect to Lakehouse Ecosystem**: Fluss keeps the table metadata in sync with data lake catalogs while compacting data into the Lakehouse. This allows external engines like Spark, StarRocks, Flink, and Trino to read the data directly by connecting to the data lake catalog.
 
-Currently, Fluss supports [Paimon as Lakehouse Storage](integrate-data-lakes/paimon.md), more kinds of data lake formats are on the roadmap.
+Currently, Fluss supports [Paimon](integrate-data-lakes/paimon.md) and [Lance](integrate-data-lakes/lance.md) as Lakehouse Storage; more data lake formats are on the roadmap.
```
