
Zero-copy converting for a location with many parquet files to fuse engine table #7381

Open
@BohuTANG

  • Load data in the background: users query as normal while the data is copied to Databend Cloud at the same time. Once the load is complete, users can query more efficiently.

No COPY is needed here; we can convert the Parquet files to fuse engine files directly. For example:

Users can create a table:

CREATE TABLE xx ... LOCATION='s3://<user-bucket-path>' CONNECTION=...

If the location contains Parquet files that were not created by the fuse engine, we can query them in the normal way:

  1. list all the Parquet files
  2. query them without any optimization (since they have no fuse indexes)
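
The unoptimized path above can be sketched roughly as follows. This is only an illustration in Python, not Databend code: `list_parquet_files`, `plan_full_scan`, and the in-memory key listing are hypothetical stand-ins for a real object-store listing.

```python
from typing import Iterable, List


def list_parquet_files(keys: Iterable[str], prefix: str) -> List[str]:
    """Step 1: keep only the Parquet objects under the table location."""
    return sorted(
        k for k in keys if k.startswith(prefix) and k.endswith(".parquet")
    )


def plan_full_scan(files: List[str]) -> List[str]:
    """Step 2: with no fuse indexes, every listed file has to be read."""
    return list(files)


# Hypothetical bucket listing; a real one would come from the object store.
keys = [
    "tbl/part-0.parquet",
    "tbl/part-1.parquet",
    "tbl/_SUCCESS",
    "other/part-9.parquet",
]
files = list_parquet_files(keys, "tbl/")
scan_plan = plan_full_scan(files)
```

Without index metadata the scan plan is simply the full file list, which is what the optimization step is meant to improve.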

If the user then runs an optimization such as:

optimize table xx; -- this statement syntax is a demo

We can:

  1. create min/max and all other fuse indexes for the Parquet files without loading them
  2. convert all Parquet files to fuse engine files, and store some metadata in metasrv
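
To illustrate why step 1 helps, here is a minimal sketch of min/max pruning in Python. It assumes per-file column statistics have already been collected (Parquet footers typically carry row-group min/max statistics, so this should not require reading the data pages). `FileStats` and `prune` are hypothetical names, and the actual fuse index layout is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FileStats:
    path: str
    # column name -> (min, max); assumed to come from file metadata
    ranges: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def prune(stats: List[FileStats], column: str, lo: int, hi: int) -> List[str]:
    """Return the files whose [min, max] for `column` may overlap [lo, hi]."""
    kept = []
    for s in stats:
        rng = s.ranges.get(column)
        # No statistics for this column: the file cannot be skipped safely.
        if rng is None or (rng[0] <= hi and rng[1] >= lo):
            kept.append(s.path)
    return kept


# Hypothetical statistics for three files.
stats = [
    FileStats("part-0.parquet", {"id": (1, 100)}),
    FileStats("part-1.parquet", {"id": (101, 200)}),
    FileStats("part-2.parquet"),  # no statistics recorded
]
selected = prune(stats, "id", 150, 160)
```

For a predicate like `id BETWEEN 150 AND 160`, only files whose range can overlap the predicate (or that lack statistics) remain in the plan; the rest are skipped without being opened.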

I think @dantengsky has some ideas on this.

Originally posted by @BohuTANG in #7211 (comment)

Labels: A-storage (Area: databend storage), C-feature (Category: feature)
