
[Feature] Refactor Lakehouse Storage Implementation #107

@luoyuxia

Description


Motivation

Currently, although we have implemented lakehouse storage and support Paimon as a lake format, there are some flaws in the implementation:

  1. The lakehouse storage is strongly coupled to Paimon, which makes it hard to support other data lake formats.

  2. The implementation is not efficient: it leverages a Flink job to compact Fluss's data into Paimon's data, reading Fluss's data as Flink row data and then writing that row data to Paimon. This requires row conversions between Fluss, Flink, and Paimon, plus a data shuffle. Ideally, no shuffle is needed: since Fluss and Paimon keep the same data distribution, we can compact the files of a Fluss bucket directly into the corresponding Paimon bucket.
    What's more, we expect to write Paimon's Parquet/ORC files directly and only commit the manifest, to speed up the compaction. Considering Fluss uses Arrow (by default) as its log format and there is an efficient conversion from Arrow to Parquet, the compaction can be made even more efficient (see the sketch after this list).
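To make the intended data path concrete, below is a minimal, hypothetical sketch of the per-bucket tiering idea. None of the interfaces shown are existing Fluss or Paimon APIs; they only illustrate that, because Fluss and Paimon share the same bucketing, tiering can copy the Arrow batches of one Fluss bucket into a Parquet data file of the matching Paimon bucket and then commit the file to the manifest, with no Flink row conversion and no shuffle.

```java
// A minimal sketch of the bucket-to-bucket tiering idea described above.
// All types below (FlussBucketLogReader, PaimonBucketWriter, PaimonManifestCommitter)
// are hypothetical placeholders, not existing Fluss or Paimon APIs.

import java.util.List;

public class BucketTieringSketch {

    /** Hypothetical reader over the Arrow log segments of a single Fluss bucket. */
    interface FlussBucketLogReader extends AutoCloseable {
        /** Returns the next Arrow record batch, or null when the bucket is exhausted. */
        byte[] nextArrowBatch() throws Exception;
        long nextLogOffset();
    }

    /** Hypothetical writer that appends Arrow batches to a Parquet file of one Paimon bucket. */
    interface PaimonBucketWriter extends AutoCloseable {
        void writeArrowBatch(byte[] arrowBatch) throws Exception;
        /** Returns the committable file metas (path, row count, stats) for the manifest. */
        List<Object> finish() throws Exception;
    }

    /** Hypothetical committer that adds the new data files to Paimon's manifest/snapshot. */
    interface PaimonManifestCommitter {
        void commit(int bucket, List<Object> committableFiles, long tieredLogOffset);
    }

    /**
     * Tiers one Fluss bucket into the Paimon bucket with the same id.
     * No repartitioning is needed because Fluss and Paimon share the bucketing.
     */
    static void tierBucket(int bucket,
                           FlussBucketLogReader reader,
                           PaimonBucketWriter writer,
                           PaimonManifestCommitter committer) throws Exception {
        try (reader; writer) {
            byte[] batch;
            while ((batch = reader.nextArrowBatch()) != null) {
                // Arrow -> Parquet is a columnar-to-columnar copy; no per-row conversion.
                writer.writeArrowBatch(batch);
            }
            committer.commit(bucket, writer.finish(), reader.nextLogOffset());
        }
    }
}
```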

Solution

Umbrella Tasks

- fluss-server
- fluss-lake/fluss-lake-common (see the interface sketch after this list)
- fluss-lake/fluss-lake-format-paimon
- fluss-lake/fluss-lake-format-iceberg
- fluss-lake/fluss-lake-format-hudi
- fluss-lake/fluss-lake-format-delta
- fluss-lake/fluss-lake-format-lance
- fluss-lake/fluss-lake-tiering-flink
- fluss-connectors/fluss-connector-flink
- Remove legacy module fluss-lakehouse
- Documentation
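The module split above implies a small plugin interface in fluss-lake/fluss-lake-common, with each fluss-lake-format-* module providing one implementation and fluss-lake-tiering-flink driving the tiering. The sketch below is only an assumption of what such an interface could look like; the names (LakeFormat, LakeBucketWriter, LakeCommitter) are hypothetical and not existing Fluss APIs.

```java
// A hypothetical sketch of the lake-format plugin interface that fluss-lake-common
// could define, so that fluss-lake-format-paimon/iceberg/hudi/delta/lance only supply
// format-specific pieces. These types do not exist in Fluss today; they are
// assumptions used to illustrate the decoupling.

import java.io.Serializable;
import java.util.List;

/** One pluggable lake format (Paimon, Iceberg, Hudi, Delta, Lance, ...). */
interface LakeFormat extends Serializable {

    /** Unique identifier used in table/cluster configuration, e.g. "paimon". */
    String identifier();

    /** Creates a writer that turns Fluss Arrow log batches of one bucket into lake data files. */
    LakeBucketWriter createBucketWriter(int bucket) throws Exception;

    /** Creates a committer that publishes finished data files atomically (e.g. a manifest commit). */
    LakeCommitter createCommitter() throws Exception;
}

interface LakeBucketWriter extends AutoCloseable {
    void writeArrowBatch(byte[] arrowBatch) throws Exception;

    /** Finishes the current files and returns serialized committables. */
    List<byte[]> finish() throws Exception;
}

interface LakeCommitter extends AutoCloseable {
    /** Commits all committables of one tiering round, together with the tiered log offsets. */
    void commit(List<byte[]> committables) throws Exception;
}
```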
