Keep arrow metadata in Delta Table metadata #1531

@j-bennet

Environment

Delta-rs version: 0.10.0

Binding: Python

Environment:

  • Cloud provider:
  • OS: macOS
  • Other:

Bug

delta-rs loses metadata for parquet written with pandas (example data is attached).

test.parquet.zip

from deltalake import DeltaTable
import pyarrow.parquet as pq

if __name__ == "__main__":
    # read it back with delta-rs
    dt = DeltaTable("test.parquet")
    print("\nDeltaTable schema:")
    print(dt.schema().to_pyarrow().to_string())

    # read it back with pyarrow
    table = pq.read_table("test.parquet")
    print("\nPyarrow schema:")
    print(table.schema.to_string())

This outputs:

DeltaTable schema:
col2: string
col1: int32

Pyarrow schema:
col2: dictionary<values=string, indices=int32, ordered=0>
col1: dictionary<values=int32, indices=int32, ordered=0>
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 509

The schema metadata section that pyarrow reports for the table is nowhere to be found in the DeltaTable schema. Is it present but not public? If so, how can it be accessed?

Labels: enhancement
