deletion of .json files in _delta_log directory #3096

@Curricane

Environment

Delta-rs version: v0.23.0

Binding:

Environment:

  • Cloud provider:
  • OS: CentOS 8
  • Other:

Bug

We are encountering performance issues due to the large number of .json files generated in the _delta_log directory when using Delta Lake with delta-rs. These files, which represent transaction logs, have grown significantly over time in our setup.

The excessive number of small .json files is causing the following issues:

  • Increased overhead in file system operations (e.g., scanning and metadata retrieval).
  • Slower table initialization and query performance, especially when the transaction logs are not regularly cleaned.
  • Significant performance degradation in environments with object stores or distributed file systems (e.g., S3, HDFS, OSS).

We are looking for an efficient way to quickly delete or clean up these .json files to improve the overall performance of the Delta table.
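To put a number on the problem, the log directory can be inspected directly. The sketch below is only an illustration and assumes a table on a local file system; for S3, HDFS, or OSS the listing would have to go through the object store instead. It relies on the Delta log naming convention (20-digit zero-padded versions, `.json` for commits, `.checkpoint` in the name for checkpoint parquet files). The names `LogSummary` and `summarize_log` are hypothetical helpers, not part of delta-rs.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// What a `_delta_log` directory contains (hypothetical helper type).
#[derive(Debug, Default, PartialEq)]
pub struct LogSummary {
    /// Number of `<version>.json` commit files.
    pub commit_files: usize,
    /// Highest version that has a checkpoint parquet file, if any.
    pub latest_checkpoint: Option<u64>,
}

/// Parses a 20-digit zero-padded version such as `00000000000000000010`.
fn parse_version(stem: &str) -> Option<u64> {
    if stem.len() == 20 && stem.bytes().all(|b| b.is_ascii_digit()) {
        stem.parse().ok()
    } else {
        None
    }
}

/// Counts commit files and finds the latest checkpoint in a local `_delta_log`.
pub fn summarize_log(dir: &Path) -> io::Result<LogSummary> {
    let mut summary = LogSummary::default();
    for entry in fs::read_dir(dir)? {
        let name = entry?.file_name().to_string_lossy().into_owned();
        if let Some(stem) = name.strip_suffix(".json") {
            // Only names that are a bare version count as commits.
            if parse_version(stem).is_some() {
                summary.commit_files += 1;
            }
        } else if name.ends_with(".parquet") && name.contains(".checkpoint") {
            // The version is the first dot-separated component of the name.
            if let Some(v) = name.split('.').next().and_then(parse_version) {
                summary.latest_checkpoint = summary.latest_checkpoint.max(Some(v));
            }
        }
    }
    Ok(summary)
}
```

Calling `summarize_log(Path::new("/path/to/table/_delta_log"))` (the path is a placeholder) before and after a cleanup attempt shows whether any commit files were actually removed.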

I tried to vacuum them with the following code:

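// Import paths as of deltalake 0.23; they may differ in other versions.
use deltalake::operations::transaction::CommitProperties;
use deltalake::operations::vacuum::VacuumBuilder;

// Vacuum with a zero retention period and ask the commit to clean up expired logs.
async fn vacuum(delta_table: deltalake::DeltaTable) {
    let snapshot = delta_table.snapshot().unwrap();
    match VacuumBuilder::new(delta_table.log_store(), snapshot.clone())
        .with_retention_period(chrono::Duration::zero())
        .with_enforce_retention_duration(false)
        .with_commit_properties(CommitProperties::default().with_cleanup_expired_logs(Some(true)))
        .await
    {
        Ok((_, metrics)) => println!("vacuum metrics: {:?}", metrics),
        Err(e) => println!("vacuum error: {:?}", e),
    }
}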

What happened:
The .json files are still there after the vacuum.
What you expected to happen:
Some .json files are deleted if a checkpoint exists.
How to reproduce it:

More details:
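For reference, the expected behavior can be modeled roughly as follows. This is a simplified sketch, not delta-rs's implementation: it assumes a commit file may be removed only when a later checkpoint exists and the file is older than a retention cutoff (in Delta Lake that cutoff is normally derived from the `delta.logRetentionDuration` table property). `CommitFile` and `expired_commits` are hypothetical names introduced for illustration.

```rust
/// One commit file in `_delta_log`, reduced to the fields the rule needs
/// (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CommitFile {
    /// Table version encoded in the file name.
    pub version: u64,
    /// Last-modified time in milliseconds since the Unix epoch.
    pub modified_ms: i64,
}

/// Returns the versions whose `.json` files could be deleted under the
/// assumed rule: strictly before the latest checkpoint AND older than the cutoff.
pub fn expired_commits(
    commits: &[CommitFile],
    latest_checkpoint: Option<u64>,
    cutoff_ms: i64,
) -> Vec<u64> {
    // Without a checkpoint every commit is still needed to rebuild the table state.
    let checkpoint = match latest_checkpoint {
        Some(v) => v,
        None => return Vec::new(),
    };
    let mut expired: Vec<u64> = commits
        .iter()
        .filter(|c| c.version < checkpoint && c.modified_ms < cutoff_ms)
        .map(|c| c.version)
        .collect();
    expired.sort_unstable();
    expired
}
```

Under this model, recently written commits stay in place even when a checkpoint exists, because they are newer than the cutoff. Setting the data-file retention period to zero in the vacuum call would not change that unless the log retention cutoff also moves.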

Labels: bug, mre-needed, question