---
title: Lance
sidebar_position: 3
---

# Lance

[Lance](https://lancedb.github.io/lance/) is a modern table format optimized for machine learning and AI applications.
To integrate Fluss with Lance, you must enable lakehouse storage and configure Lance as the lakehouse storage. For more details, see [Enable Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).

## Configure Lance as LakeHouse Storage

### Configure Lance in Cluster Configurations

To configure Lance as the lakehouse storage, set the following options in `server.yaml`:
```yaml
# Lance configuration
datalake.format: lance

# Currently, only the local file system and object stores such as AWS S3 (and S3-compatible stores)
# are supported as storage backends for Lance.
# To use S3 as the Lance storage backend, specify the following properties:
datalake.lance.warehouse: s3://<bucket>
datalake.lance.endpoint: <endpoint>
datalake.lance.allow_http: true
datalake.lance.access_key_id: <access_key_id>
datalake.lance.secret_access_key: <secret_access_key>

# To use the local file system as the Lance storage backend, you only need to specify the warehouse path:
# datalake.lance.warehouse: /tmp/lance
```

When a table is created or altered with the option `'table.datalake.enabled' = 'true'`, Fluss will automatically create a corresponding Lance table with path `<warehouse_path>/<database_name>/<table_name>.lance`.
The schema of the Lance table matches that of the Fluss table.
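As an illustration of this path convention, the snippet below derives the dataset location for a hypothetical warehouse, database, and table (all three values are placeholders, not defaults):

```python
# Hypothetical values -- substitute your own warehouse path, database, and table name
warehouse = "s3://my-bucket/lance"
database = "fluss"
table = "fluss_order_with_lake"

# Fluss lays out the Lance dataset as <warehouse_path>/<database_name>/<table_name>.lance
dataset_path = f"{warehouse}/{database}/{table}.lance"
print(dataset_path)  # s3://my-bucket/lance/fluss/fluss_order_with_lake.lance
```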

```sql title="Flink SQL"
USE CATALOG fluss_catalog;

CREATE TABLE fluss_order_with_lake (
    `order_id` BIGINT,
    `item_id` BIGINT,
    `amount` INT,
    `address` STRING
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s'
);
```

### Start Tiering Service to Lance

Next, start the datalake tiering service to tier Fluss's data to Lance. For guidance, see [Start The Datalake Tiering Service](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service). Although that example uses Paimon, the process also applies to Lance.

In the [Prepare required jars](maintenance/tiered-storage/lakehouse-storage.md#prepare-required-jars) step, follow this guidance instead:
- Put the [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`, choosing the connector version that matches your Flink version. For example, if you're using Flink 1.20, use [fluss-flink-1.20-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_VERSION$/fluss-flink-1.20-$FLUSS_VERSION$.jar).
- If you are using [Amazon S3](http://aws.amazon.com/s3/), [Aliyun OSS](https://www.aliyun.com/product/oss) or [HDFS (Hadoop Distributed File System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote storage](maintenance/tiered-storage/remote-storage.md),
  download the corresponding [Fluss filesystem jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`.
- Put the [fluss-lake-lance jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-lance/$FLUSS_VERSION$/fluss-lake-lance-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`.

Additionally, when following the [Start Datalake Tiering Service](maintenance/tiered-storage/lakehouse-storage.md#start-datalake-tiering-service) guide, make sure to pass Lance-specific configurations as parameters when starting the Flink tiering job:
```shell
<FLINK_HOME>/bin/flink run /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \
    --fluss.bootstrap.servers localhost:9123 \
    --datalake.format lance \
    --datalake.lance.warehouse s3://<bucket> \
    --datalake.lance.endpoint <endpoint> \
    --datalake.lance.allow_http true \
    --datalake.lance.secret_access_key <secret_access_key> \
    --datalake.lance.access_key_id <access_key_id>
```

> **NOTE**: Fluss v0.8 only supports tiering log tables to Lance.

The datalake tiering service then continuously tiers data from Fluss to Lance. The option `table.datalake.freshness` controls how frequently Fluss writes data to Lance tables; by default, the data freshness is 3 minutes.
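Freshness can also be adjusted after a table has been created. As a sketch (assuming the `fluss_order_with_lake` table from the earlier example), a standard Flink SQL `ALTER TABLE` statement updates the option:

```sql title="Flink SQL"
-- Hypothetical example: relax the freshness target of an existing table to 1 minute
ALTER TABLE fluss_order_with_lake SET ('table.datalake.freshness' = '1min');
```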

You can also specify Lance table properties when creating a datalake-enabled Fluss table by using the `lance.` prefix within the Fluss table properties clause.

```sql title="Flink SQL"
CREATE TABLE fluss_order_with_lake (
    `order_id` BIGINT,
    `item_id` BIGINT,
    `amount` INT,
    `address` STRING
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.freshness' = '30s',
    'lance.max_row_per_file' = '512'
);
```

For example, the property `max_row_per_file` controls the write behavior when Fluss tiers data to Lance.

## Reading with Lance Ecosystem Tools

Since the data tiered to Lance from Fluss is stored as a standard Lance table, you can use any tool that supports Lance to read it. Below is an example using [pylance](https://pypi.org/project/pylance/):

```python title="Lance Python"
import lance

# Open the Lance dataset written by the tiering service
ds = lance.dataset("<warehouse_path>/<database_name>/<table_name>.lance")

# Materialize the data as a PyArrow table
table = ds.to_table()
```

## Data Type Mapping

Lance internally stores data in Arrow format.
When integrating with Lance, Fluss automatically converts between Fluss data types and Lance data types.
The following table shows the mapping between [Fluss data types](table-design/data-types.md) and Lance data types:

| Fluss Data Type               | Lance Data Type |
|-------------------------------|-----------------|
| BOOLEAN                       | Bool            |
| TINYINT                       | Int8            |
| SMALLINT                      | Int16           |
| INT                           | Int32           |
| BIGINT                        | Int64           |
| FLOAT                         | Float32         |
| DOUBLE                        | Float64         |
| DECIMAL                       | Decimal128      |
| STRING                        | Utf8            |
| CHAR                          | Utf8            |
| DATE                          | Date            |
| TIME                          | Time            |
| TIMESTAMP                     | Timestamp       |
| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp       |
| BINARY                        | FixedSizeBinary |
| BYTES                         | Binary          |
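
For reference in application code, a subset of the mapping above can be captured as a simple lookup table; the snippet below is purely illustrative (it is not part of the Fluss API) and covers only some of the types:

```python
# Illustrative subset of the Fluss -> Lance (Arrow) type mapping from the table above
FLUSS_TO_LANCE = {
    "BOOLEAN": "Bool",
    "INT": "Int32",
    "BIGINT": "Int64",
    "DOUBLE": "Float64",
    "STRING": "Utf8",
    "BYTES": "Binary",
}

print(FLUSS_TO_LANCE["BIGINT"])  # Int64
```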