Arrow schema issues for partition columns #74

@sebbegg

Hi,

hope this is an okay place to mention this...
I forked off this repo to play a bit with ballista+delta myself.

I built a small web API in front of the scheduler using axum+axum-streams.
When yielding results as arrow ipc streams I came across one issue:
abdolence/axum-streams-rs#80

When yielding results from delta tables, there seem to be cases where the schema reported by the dataframe and the schema of the individual batches don't match. For example:

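let df = ctx.sql("SELECT * FROM example").await.unwrap();

// Capture the schema first: in recent DataFusion versions collect() takes the DataFrame by value.
let df_schema = df.schema().clone();

let batches = df.collect().await.unwrap();
let batch_schema = batches.get(0).unwrap().schema();

println!("df.schema = {:?}", df_schema);
println!("batch_schema = {:?}", batch_schema);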

With datafusion alone I cannot reproduce the mismatch; it only shows up in combination with ballista. I guess the difference might come from ballista writing partition results to arrow and reading them back again?
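
To narrow down where two schemas diverge, a field-by-field comparison is usually easier to read than two `{:?}` dumps. Below is a minimal sketch using only the standard library. `FieldInfo`, `field` and `diff_schemas` are hypothetical helpers standing in for the real arrow types, and the example values are made up for illustration; they are not taken from the actual output.

```rust
/// Hypothetical stand-in for the parts of an Arrow field that typically differ.
#[derive(Debug, Clone, PartialEq)]
struct FieldInfo {
    name: String,
    data_type: String, // textual form of the type, e.g. "Utf8"
    nullable: bool,
}

/// Small constructor to keep the example data readable.
fn field(name: &str, data_type: &str, nullable: bool) -> FieldInfo {
    FieldInfo {
        name: name.to_string(),
        data_type: data_type.to_string(),
        nullable,
    }
}

/// Returns one human-readable line per difference between two field lists.
fn diff_schemas(expected: &[FieldInfo], actual: &[FieldInfo]) -> Vec<String> {
    let mut diffs = Vec::new();
    if expected.len() != actual.len() {
        diffs.push(format!("field count: {} vs {}", expected.len(), actual.len()));
    }
    // Compare fields position by position; extra fields are covered by the count check.
    for (i, (e, a)) in expected.iter().zip(actual.iter()).enumerate() {
        if e.name != a.name {
            diffs.push(format!("field {}: name {:?} vs {:?}", i, e.name, a.name));
        }
        if e.data_type != a.data_type {
            diffs.push(format!(
                "field {} ({}): type {} vs {}",
                i, e.name, e.data_type, a.data_type
            ));
        }
        if e.nullable != a.nullable {
            diffs.push(format!(
                "field {} ({}): nullable {} vs {}",
                i, e.name, e.nullable, a.nullable
            ));
        }
    }
    diffs
}

fn main() {
    // Made-up schemas: same columns, but one column differs in its type.
    let df_schema = vec![
        field("id", "Int64", true),
        field("date", "Dictionary(UInt16, Utf8)", true),
    ];
    let batch_schema = vec![field("id", "Int64", true), field("date", "Utf8", true)];

    for line in diff_schemas(&df_schema, &batch_schema) {
        println!("{}", line);
    }
}
```

With the real types, the same loop could run over the fields of `df.schema()` and of each batch's schema, which would show whether the mismatch sits in the data type, the nullability, or the field order.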

Glad about any insights!
