Description
Motivation
Currently, although we have implemented lakehouse storage and support Paimon as a lake storage, there are some flaws in the implementation:
- The lakehouse storage is strongly coupled to Paimon, which makes it hard to support other data lake formats.
- The implementation is not efficient: it leverages a Flink job to compact Fluss's data into Paimon's data, reading Fluss's data as Flink row data and writing that row data to Paimon. This requires row conversions between Fluss, Flink, and Paimon, as well as a data shuffle. Ideally, no shuffle is needed: since we keep the same data distribution between Fluss and Paimon, the files from a Fluss bucket can be compacted directly into the corresponding Paimon bucket (see the first sketch below).
What's more, we expect to write Paimon's Parquet/ORC files directly and commit manifests ourselves to speed up the compaction. Considering that Fluss uses Arrow (by default) as its log format and there is an efficient conversion from Arrow to Parquet, the compaction can be made even more efficient (see the second sketch below).
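To make the no-shuffle argument concrete, here is a minimal sketch. Everything in it is illustrative: `bucket_of` is a hypothetical stand-in for the deterministic bucketing hash the two systems would share, not Fluss's or Paimon's real function.

```python
NUM_BUCKETS = 4

def bucket_of(key: bytes, num_buckets: int = NUM_BUCKETS) -> int:
    # Hypothetical stand-in for a deterministic bucketing hash that both
    # Fluss and Paimon apply to the same bucket key.
    return sum(key) % num_buckets

# Because the writer (Fluss) and the lake table (Paimon) bucket records the
# same way, every record in Fluss bucket i belongs to Paimon bucket i, so
# tiering can rewrite files bucket-by-bucket with no shuffle in between.
for key in (b"user-1", b"user-2", b"user-3"):
    i = bucket_of(key)
    print(f"key={key!r} -> Fluss bucket {i} -> Paimon bucket {i}")
```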
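Likewise, a minimal sketch of the Arrow-to-Parquet path, using pyarrow purely for illustration (the column names, values, and file name are made up): a columnar Arrow batch can be encoded into Parquet column chunks directly, without materializing rows.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical Arrow batch, standing in for a Fluss log segment that is
# already in Arrow format.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "event": ["click", "view", "click"],
})

# Columnar write: the Arrow buffers are encoded straight into Parquet
# column chunks, with no per-row conversion and no shuffle.
pq.write_table(table, "bucket-0.parquet")
```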
Solution
- Design docs(EN): https://docs.google.com/document/d/1Ghw_Jb-yHztgGvO5OpRWgibmPClDivejp7UyLUgKxOc/edit?pli=1&tab=t.0
- Design docs(ZH): https://drive.google.com/file/d/1qzM2HYRVb-Z6uMlOjeP6ywFSriVLINy7/view?usp=drive_link