
[Bug]: target_file_size not respected in rust ? #3881

@Matthieusalor


What happened?

I have been using the Python library extensively for a while, and I am now trying to use this library directly in Rust. However, I'm having a hard time getting it to work the way I expect.

Every time I write data in Rust, I end up with files of roughly 8192 rows each, and this happens whatever values I set for the properties below. They are completely ignored. I'm probably missing something very obvious here, but I can't figure it out from the documentation.

use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use deltalake_core::DeltaOps;
use anyhow::Result;
use url::Url;

#[tokio::main]
async fn main() -> Result<()> {
    // Generate 500,000 rows of data
    let num_rows = 500_000;
    let ids: Vec<i32> = (1..=num_rows).collect();

    // Create Arrow schema
    let id_field = Field::new("id", DataType::Int32, false);
    let schema = Arc::new(Schema::new(vec![id_field]));

    // Create Arrow array and RecordBatch
    let ids_array = Int32Array::from(ids);
    let batch = RecordBatch::try_new(schema, vec![Arc::new(ids_array)])?;

    println!("Created RecordBatch with {} rows", batch.num_rows());

    // Write to Delta Lake
    let table_path = "/path_to_delta_table/";
    std::fs::create_dir_all(table_path).ok();

    let ops = DeltaOps::try_from_uri(Url::parse(&format!("file://{}", table_path)).unwrap()).await?;
    let mut table = ops.write(vec![batch]);

    table = table.with_target_file_size(512_000_000);
    table = table.with_write_batch_size(512_000_000);
    table.await?;

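    // Read back and count only the Parquet data files
    // (the table directory also contains the _delta_log directory)
    let data_files = std::fs::read_dir(table_path)?
        .filter_map(|entry| entry.ok())
        .filter(|entry| entry.path().extension().map_or(false, |ext| ext == "parquet"))
        .count();
    println!("Wrote {} data files.", data_files);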

    Ok(())
}

In Python this works perfectly:

deltalake.write_deltalake(
    data=df,
    table_or_uri="deltalable",
    target_file_size=int(512e6)
)
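To compare what the two writers produce, it may help to list only the Parquet data files and their sizes, since a Delta table directory also holds a `_delta_log` directory. The following is a minimal sketch using only the standard library; the helper name `parquet_files` is hypothetical, and it assumes a non-partitioned table whose data files sit directly in the table directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns (file name, size in bytes) for every `.parquet` file located
/// directly under `table_dir`. Subdirectories such as `_delta_log` are skipped.
/// Hypothetical helper for illustration; not part of deltalake.
fn parquet_files(table_dir: &Path) -> io::Result<Vec<(String, u64)>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(table_dir)? {
        let entry = entry?;
        let path = entry.path();
        let is_parquet = path.extension().map_or(false, |ext| ext == "parquet");
        if path.is_file() && is_parquet {
            let size = entry.metadata()?.len();
            files.push((entry.file_name().to_string_lossy().into_owned(), size));
        }
    }
    // Sort so the listing is stable across runs.
    files.sort();
    Ok(files)
}

fn main() -> io::Result<()> {
    // Table directory is taken from the first argument, defaulting to ".".
    let dir = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let files = parquet_files(Path::new(&dir))?;
    for (name, size) in &files {
        println!("{name}: {size} bytes");
    }
    println!("{} data file(s)", files.len());
    Ok(())
}
```

Running it against the Rust-written table and the Python-written table should make the difference in file count and file size visible side by side.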

Expected behavior

I should have only one file.
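The arithmetic behind this expectation can be sketched as follows. It assumes 4 bytes per Int32 value and ignores Parquet compression and metadata overhead, so the byte count is only a rough upper bound on the column data; the helper `files_for` is hypothetical.

```rust
/// Number of files needed when each file holds at most `rows_per_file` rows
/// (ceiling division). Hypothetical helper for illustration.
fn files_for(total_rows: u64, rows_per_file: u64) -> u64 {
    (total_rows + rows_per_file - 1) / rows_per_file
}

fn main() {
    let total_rows: u64 = 500_000;
    let target_file_size: u64 = 512_000_000; // bytes, as passed to with_target_file_size

    // One non-nullable Int32 column: 4 bytes per row, uncompressed.
    let raw_bytes = total_rows * 4;
    println!("raw column data: {} bytes", raw_bytes);
    println!(
        "fits within one target-sized file: {}",
        raw_bytes < target_file_size
    );

    // What is observed instead: files of roughly 8192 rows each.
    println!("files at 8192 rows each: {}", files_for(total_rows, 8192));
}
```

Under these assumptions the data is about 2 MB, far below the 512 MB target, whereas splitting at roughly 8192 rows per file would yield about 62 files.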

Operating System

None

Binding

None

Bindings Version

deltalake-core = { version = "0.29.1"}

Steps to reproduce

No response

Labels

binding/rust (Issues for the Rust crate)
