Compacting a delta table consistently undershoots the target_file_size, creating an unnecessary extra file #3855

@abhi-airspace-intelligence

Description

Environment

Delta-rs version: git latest

Binding: Rust

Environment:

  • Cloud provider: AWS
  • OS: Linux
  • Other: Writing Parquet 2.0 files with ZSTD

Bug

What happened:

When compacting a table, delta-rs consistently undershoots the target file size: it writes one file of ~98MB and then a second, much smaller file of roughly 1-2MB.

What you expected to happen:

It should write a single file of roughly 100MB, rather than splitting the data into a 98MB file and a ~1.8MB file.
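
To make the expectation concrete, here is a minimal sketch of the arithmetic. It is not delta-rs code: `expected_file_count` is a hypothetical helper, and the sizes are the approximate figures from this report. The combined output is below the 100MB target, so a single file would have been enough.

```rust
/// Hypothetical helper (not part of delta-rs): the minimum number of files
/// needed to hold `total_bytes` when each file may be at most `target_bytes`.
fn expected_file_count(total_bytes: u64, target_bytes: u64) -> u64 {
    assert!(target_bytes > 0, "target size must be non-zero");
    // Ceiling division: any partial file still counts as one file.
    total_bytes.div_ceil(target_bytes)
}

fn main() {
    const MB: u64 = 1024 * 1024;
    let target = 100 * MB;

    // Approximate sizes of the two files observed after compaction
    // (~98MB and ~1.8MB); illustrative values, not exact measurements.
    let observed = [98 * MB, 18 * MB / 10];
    let total: u64 = observed.iter().sum();

    println!(
        "observed files: {}, minimum files for {} bytes at target {}: {}",
        observed.len(),
        total,
        target,
        expected_file_count(total, target)
    );
}
```

With these numbers the helper returns 1, while compaction actually produced 2 files.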

How to reproduce it:

Working on a repro right now.

More details:

My guess is that this only occurs with Parquet V2, since it is rarely used in the wild, and the naming suggests that this code path is being hit: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/operations/write/writer.rs#L478-L486. It's also entirely possible that this is an upstream parquet-rs issue.

Labels: bug
