
feat(delta-lake): support column mapping for reads#7005

Open
aaron-ang wants to merge 2 commits into Eventual-Inc:main from aaron-ang:delta-kernel

Conversation

@aaron-ang
Contributor

@aaron-ang aaron-ang commented May 23, 2026

Changes Made

Read Delta Lake tables with delta.columnMapping.mode = id or name. Previously, such tables raised an error at scan-op construction.

How it works.
Builds a field_id → PyField(logical_name, dtype) map from delta-rs's column metadata and threads it through ParquetSourceConfig.field_id_mapping, reusing the Parquet field-id rename path originally added for Iceberg.
Stats columns from add_actions (keyed by physical names) are translated to logical names via a top-level physical → logical lookup.
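
A minimal sketch of the two lookups described above. Plain dicts and tuples stand in for delta-rs's schema objects and Daft's PyField, and the function names are hypothetical; only the two metadata keys come from the Delta protocol.

```python
# Metadata keys defined by the Delta protocol for column-mapped tables.
CM_ID_KEY = "delta.columnMapping.id"
CM_PHYSICAL_KEY = "delta.columnMapping.physicalName"

# Hypothetical top-level schema fields (stand-in for delta-rs Field objects).
fields = [
    {"name": "id", "type": "long", "metadata": {CM_ID_KEY: 1, CM_PHYSICAL_KEY: "col-a1b2"}},
    {"name": "name", "type": "string", "metadata": {CM_ID_KEY: 2, CM_PHYSICAL_KEY: "col-c3d4"}},
]


def build_maps(fields):
    """Return (field_id -> (logical_name, dtype), physical_name -> logical_name)."""
    by_id = {}
    physical_to_logical = {}
    for f in fields:
        meta = f.get("metadata") or {}
        if CM_ID_KEY in meta:
            by_id[meta[CM_ID_KEY]] = (f["name"], f["type"])
        if CM_PHYSICAL_KEY in meta:
            physical_to_logical[meta[CM_PHYSICAL_KEY]] = f["name"]
    return by_id, physical_to_logical


def rename_stats(stats, physical_to_logical):
    """Rekey per-file stats from physical to logical names; unknown keys pass through."""
    return {physical_to_logical.get(k, k): v for k, v in stats.items()}


field_id_mapping, physical_to_logical = build_maps(fields)
logical_stats = rename_stats({"col-a1b2": 0, "col-c3d4": "a"}, physical_to_logical)
```

In the PR itself the first map is what gets passed as ParquetSourceConfig.field_id_mapping, and the second is only applied to top-level stats columns.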

ARROW:schema fix.
After a field-id rename the embedded ARROW:schema hint still carries physical names and disagrees with the renamed parquet type tree, so arrow-rs errors with incompatible arrow schema, expected field named <physical> got <logical>.
Strip the hint so arrow-rs re-infers from the renamed parquet types. rebuild_file_metadata gains a strip_arrow_schema: bool flag; the raw-string strip path passes false (no rename, hint stays valid).
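
The flag's effect can be modelled in a few lines of Python. This is only an illustration of the behaviour described above, not the Rust code in metadata.rs: the function name, the list-of-pairs representation of the footer key-value metadata, and the `example.note` key are all assumptions.

```python
ARROW_SCHEMA_KEY = "ARROW:schema"


def rebuild_key_value_metadata(kv, strip_arrow_schema):
    """Return a copy of the footer key-value pairs, optionally dropping the Arrow schema hint."""
    if not strip_arrow_schema:
        # No rename happened, so the embedded hint still matches the parquet type tree.
        return list(kv)
    # After a field-id rename the hint names physical columns, so drop it and
    # let the reader infer the Arrow schema from the renamed parquet types.
    return [(k, v) for k, v in kv if k != ARROW_SCHEMA_KEY]


footer_kv = [
    ("ARROW:schema", "<serialized schema with physical names>"),
    ("example.note", "hypothetical extra key"),
]
field_id_rename_path = rebuild_key_value_metadata(footer_kv, strip_arrow_schema=True)
raw_string_strip_path = rebuild_key_value_metadata(footer_kv, strip_arrow_schema=False)
```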

Test fixture.
Hand-crafts _delta_log JSON + parquet because deltalake 1.5.x cannot enable column mapping through the Python writer (CommitFailedError: Unsupported table features required: [ColumnMapping]).
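
For readers unfamiliar with the log layout, a hand-crafted commit might look roughly like the sketch below. It is not the fixture from this PR: the helper name, the table id, and the physical column name are made up, and a complete fixture would also need an `add` action plus the parquet file it points to. The protocol versions 2 and 5 match the ones used in the test file.

```python
import json
import tempfile
from pathlib import Path


def write_commit(table_dir, mode):
    """Write a hypothetical version-0 commit that enables column mapping in `mode`."""
    schema = {
        "type": "struct",
        "fields": [
            {
                "name": "id",
                "type": "long",
                "nullable": True,
                "metadata": {
                    "delta.columnMapping.id": 1,
                    "delta.columnMapping.physicalName": "col-0001",  # made-up name
                },
            }
        ],
    }
    actions = [
        {"protocol": {"minReaderVersion": 2, "minWriterVersion": 5}},
        {
            "metaData": {
                "id": "00000000-0000-0000-0000-000000000000",  # placeholder table id
                "format": {"provider": "parquet", "options": {}},
                "schemaString": json.dumps(schema),
                "partitionColumns": [],
                "configuration": {
                    "delta.columnMapping.mode": mode,
                    "delta.columnMapping.maxColumnId": "1",
                },
            }
        },
    ]
    log_dir = Path(table_dir) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    # Commit files are named by zero-padded 20-digit version, one JSON action per line.
    commit = log_dir / f"{0:020d}.json"
    commit.write_text("\n".join(json.dumps(a) for a in actions) + "\n")
    return commit


with tempfile.TemporaryDirectory() as tmp:
    commit_path = write_commit(tmp, "name")
    lines = commit_path.read_text().splitlines()
```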

Coverage.
Both id and name modes across flat reads, projection, partitioning, top-level nested struct (exercises walk recursion + nested rename), and a malformed-table error path.

Related Issues

Closes #1955

Read Delta tables with `delta.columnMapping.mode = id|name` by routing the
existing Parquet field-id rename path through delta-rs's column metadata.

- `_iter_mapped_fields` walks the Delta schema (incl. nested structs/arrays/maps)
  and builds a `field_id -> PyField(logical_name, dtype)` map, passed to
  `ParquetSourceConfig.field_id_mapping` so the Rust reader renames physical
  parquet columns to logical names at read time.
- Stats columns from `add_actions` (keyed by physical name) get translated to
  logical names via a top-level `physical -> logical` map.
- Strip the embedded `ARROW:schema` hint after a field-id rename
  (`metadata.rs`): the hint still carries physical names and disagrees with the
  renamed parquet type tree, causing arrow-rs to fail with
  `incompatible arrow schema, expected field named <physical> got <logical>`.
- Tests hand-craft `_delta_log` JSON + parquet (delta-rs 1.5.x cannot enable
  columnMapping via the Python writer). Covers `id` / `name` modes across
  flat, projection, partitioned, nested-struct, and malformed-table cases.
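
The nested walk in the first bullet can be sketched over the JSON form of a Delta schema. This is an illustration under assumptions, not the `_iter_mapped_fields` code from the PR: it takes plain dicts rather than delta-rs types, and the sample schema is invented.

```python
CM_ID_KEY = "delta.columnMapping.id"


def iter_mapped_fields(dtype):
    """Yield every field dict under `dtype` that carries a column-mapping id."""
    if isinstance(dtype, str):
        # Primitive types are plain strings in the schema JSON; nothing to recurse into.
        return
    kind = dtype.get("type")
    if kind == "struct":
        for field in dtype["fields"]:
            if (field.get("metadata") or {}).get(CM_ID_KEY) is not None:
                yield field
            yield from iter_mapped_fields(field["type"])
    elif kind == "array":
        # List elements are anonymous, but their type may contain mapped struct fields.
        yield from iter_mapped_fields(dtype["elementType"])
    elif kind == "map":
        yield from iter_mapped_fields(dtype["keyType"])
        yield from iter_mapped_fields(dtype["valueType"])


# Invented schema: a flat column plus array<struct<sku>>.
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "metadata": {CM_ID_KEY: 1}},
        {
            "name": "items",
            "type": {
                "type": "array",
                "elementType": {
                    "type": "struct",
                    "fields": [{"name": "sku", "type": "string", "metadata": {CM_ID_KEY: 3}}],
                },
            },
            "metadata": {CM_ID_KEY: 2},
        },
    ],
}
mapped = {f["metadata"][CM_ID_KEY]: f["name"] for f in iter_mapped_fields(schema)}
```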

Closes Eventual-Inc#1955
@aaron-ang aaron-ang requested a review from a team as a code owner May 23, 2026 21:09
@github-actions github-actions Bot added the feat label May 23, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR enables reading Delta Lake tables with delta.columnMapping.mode = id or name by threading a field_id → PyField(logical_name, dtype) map through ParquetSourceConfig.field_id_mapping, reusing the existing Iceberg field-ID rename path. It also strips the ARROW:schema hint from parquet metadata after renaming to prevent arrow-rs from rejecting the renamed type tree, and translates physical stats column names to logical names in the scan task loop.

  • delta_lake_scan.py: Builds column-mapping lookups at construction and passes field_id_mapping to each ParquetSourceConfig; min/max stats keyed by physical names are renamed via _stats_physical_to_logical.
  • metadata.rs: rebuild_file_metadata gains a strip_arrow_schema flag; the field-ID rename path passes true, the string-stripping path passes false.
  • test_column_mapping.py: Hand-crafted fixtures cover flat reads, projection, nested struct, partitioning, and the malformed-table error path for both id and name modes.

Confidence Score: 5/5

Safe to merge; changes are well-scoped to column-mapped tables and reuse the existing Iceberg field-ID rename path without touching non-column-mapped reads.

Column-mapping logic is correctly isolated behind a cm_mode guard, existing table reads are unaffected, and the ARROW:schema stripping flag is explicitly threaded so it cannot regress the raw-string or Iceberg paths.

The stats physical→logical translation in delta_lake_scan.py is the one path not exercised by any test fixture; worth revisiting when stats-aware fixtures are added.

Important Files Changed

| Filename | Overview |
|---|---|
| daft/io/delta_lake/delta_lake_scan.py | Adds column-mapping support: builds field_id→PyField and physical→logical maps at construction, passes field_id_mapping to ParquetSourceConfig per scan task, and translates stats column names from physical to logical. |
| src/daft-parquet/src/metadata.rs | Adds strip_arrow_schema flag to rebuild_file_metadata; field-ID rename path passes true to drop the ARROW:schema hint, while string-stripping path passes false, preserving existing behaviour. |
| tests/integration/delta_lake/test_column_mapping.py | New integration tests hand-craft Delta log + parquet fixtures for both id and name modes; min/max stats translation path is not exercised because add actions lack statistics. |

Sequence Diagram

sequenceDiagram
    participant User
    participant DeltaLakeScanOperator
    participant delta_rs
    participant ParquetReader
    participant metadata_rs

    User->>DeltaLakeScanOperator: read_deltalake(path)
    DeltaLakeScanOperator->>delta_rs: DeltaTable(path)
    delta_rs-->>DeltaLakeScanOperator: table + schema
    DeltaLakeScanOperator->>DeltaLakeScanOperator: _column_mapping_maps(schema)
    DeltaLakeScanOperator->>delta_rs: get_add_actions()
    delta_rs-->>DeltaLakeScanOperator: add_actions (stats keyed by physical names)
    loop per add action
        DeltaLakeScanOperator->>DeltaLakeScanOperator: translate stats physical to logical
        DeltaLakeScanOperator-->>User: ScanTask with field_id_mapping
    end
    User->>ParquetReader: execute ScanTask
    ParquetReader->>metadata_rs: apply_field_ids_to_arrowrs_parquet_metadata
    metadata_rs->>metadata_rs: rewrite parquet type tree
    metadata_rs->>metadata_rs: strip ARROW schema hint
    metadata_rs-->>ParquetReader: renamed ParquetMetaData
    ParquetReader-->>User: DataFrame with logical column names

Reviews (2): Last reviewed commit: "style: hoist inline imports to top of fi..."

Comment on lines +43 to +78
def _delta_field_to_pyfield(field: deltalake.schema.Field) -> PyField:
    """Convert a Delta `Field` to a Daft `PyField` carrying its logical name and real dtype."""
    from deltalake.schema import Schema

    from daft.io.delta_lake._deltalake import delta_schema_to_pyarrow

    pa_field = delta_schema_to_pyarrow(Schema([field])).field(0)
    return PyField.create(field.name, DataType.from_arrow_type(pa_field.type)._dtype)


def _iter_mapped_fields(schema: deltalake.Schema) -> Iterator[deltalake.schema.Field]:
    """Yield every Delta `Field` in the schema that carries column-mapping metadata.

    Per Delta protocol, list elements / map keys / map values are anonymous (no
    `columnMapping.id`), but their *element type* may still contain mapped struct
    fields (e.g. `array<struct<...>>`). We recurse through container types but only
    yield fields that actually carry the mapping metadata.
    """
    from deltalake.schema import ArrayType, MapType, StructType

    def walk_type(t: object) -> Iterator[deltalake.schema.Field]:
        if isinstance(t, StructType):
            for sub in t.fields:
                if (sub.metadata or {}).get(_CM_ID_KEY) is not None:
                    yield sub
                yield from walk_type(sub.type)
        elif isinstance(t, ArrayType):
            yield from walk_type(t.element_type)
        elif isinstance(t, MapType):
            yield from walk_type(t.key_type)
            yield from walk_type(t.value_type)

    for field in schema.fields:
        if (field.metadata or {}).get(_CM_ID_KEY) is not None:
            yield field
        yield from walk_type(field.type)

P2 Inline imports inside module-level functions

_delta_field_to_pyfield and _iter_mapped_fields both place import statements inside their function bodies. Per the project's import style guide, imports should live at the top of the file rather than inline within functions or methods. The deltalake package is already imported at the module level on line 10, so ArrayType, MapType, StructType, and Schema from its sub-modules can be moved up. delta_schema_to_pyarrow from ._deltalake can also be hoisted (it's already imported inline in __init__ as a pre-existing violation, but new code should not extend the pattern). If deferring is intentional for circular-import reasons, a comment should explain it.

Rule Used: Import statements should be placed at the top of t... (source)

Learned From
Eventual-Inc/Daft#5078

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +226 to +228
"type": "long",
"nullable": True,
"metadata": {

P2 Inline import inside test function

from deltalake import DeltaTable appears inside the test body rather than at the top of the file. Even though pytest.importorskip("deltalake") guards the test, importorskip both skips and returns the module, so the import can be pulled to the module level or obtained from the importorskip return value. Having the import inline violates the project import style guide that applies here as well.

Rule Used: Import statements should be placed at the top of t... (source)

Learned From
Eventual-Inc/Daft#5078

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@madvart madvart requested a review from cckellogg May 26, 2026 20:59
@cckellogg
Contributor

@greptileai

Contributor

@cckellogg cckellogg left a comment


The physical→logical min/max stats rename (delta_lake_scan.py:361-384) isn't exercised by any test: none of the fixtures emit stats in the add actions, so that block never runs in CI. Could you add a predicate-pushdown test on a column-mapped table? For example, write two data files with disjoint ranges, filter with where(df["col"] == <value>), and assert the result.
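
To make the request concrete, here is a toy model of why such a test would catch a broken rename. It is not Daft's pushdown logic: the file list, the `prune` and `may_contain` helpers, and the physical name `col-0001` are all hypothetical.

```python
def rename_stats(stats, physical_to_logical):
    """Rekey stats from physical to logical column names; unknown keys pass through."""
    return {physical_to_logical.get(k, k): v for k, v in stats.items()}


def may_contain(file_stats, column, value):
    """Conservative check: keep the file unless min/max prove the value is absent."""
    lo = file_stats["min"].get(column)
    hi = file_stats["max"].get(column)
    if lo is None or hi is None:
        return True  # no usable stats for this column, so the file cannot be skipped
    return lo <= value <= hi


def prune(files, column, value, physical_to_logical):
    """Return paths of files that might hold rows where `column == value`."""
    kept = []
    for f in files:
        logical = {
            "min": rename_stats(f["min"], physical_to_logical),
            "max": rename_stats(f["max"], physical_to_logical),
        }
        if may_contain(logical, column, value):
            kept.append(f["path"])
    return kept


# Two data files with disjoint ranges; stats are keyed by the physical name.
files = [
    {"path": "part-0.parquet", "min": {"col-0001": 0}, "max": {"col-0001": 9}},
    {"path": "part-1.parquet", "min": {"col-0001": 10}, "max": {"col-0001": 19}},
]

kept_with_rename = prune(files, "col", 12, {"col-0001": "col"})
kept_without_rename = prune(files, "col", 12, {})
```

With the rename in place only the second file survives; without it the stats are never matched to the logical column and both files are scanned. A real test would therefore need to assert on pruning as well as on the query result to notice a regression.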

const FOOTER_SIZE: usize = 8;

/// Parquet key-value metadata key for the embedded Arrow schema hint written by
/// pyarrow/arrow-cpp. See `parquet-format-thrift` KeyValue.

Do other arrow clients write this as well?


actions = [
{
"protocol": {

Why versions 2 and 5?



Development

Successfully merging this pull request may close these issues.

[Catalogs] [Delta Lake] Add support for reading tables with column mappings

2 participants