Z-order has no effect on strings with identical prefix of length >= 14 #2844

Open
@cjolowicz

Description

Environment

Delta-rs version:

0.19.1

Binding:
Python and Rust

Environment:

  • Cloud provider: local filesystem and R2
  • OS: Linux
  • Other:

Bug

What happened:

When z-order is applied to a Delta table on a column whose string values share an identical prefix of at least 14 characters, the records in the rewritten Parquet files retain their original order instead of being sorted.

I first noticed this when z-ordering a large partition on ISO 8601 timestamps using delta-rs in Rust. I've since reproduced it with the Python bindings and a small data frame of strings containing zero-padded integers (see repro below).
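To make the shared-prefix condition concrete, here is a minimal sketch that measures the common prefix length of two sets of values: zero-padded integers built the same way as in the repro, and some ISO 8601 timestamps from the same hour. The helper `shared_prefix_len` and the timestamp values are hypothetical, made up for illustration; they are not taken from the affected table. The sketch does not call delta-rs at all, it only shows that both kinds of input fall into the "identical prefix of length >= 14" case.

```python
import os


def shared_prefix_len(values: list[str]) -> int:
    """Return the length of the longest prefix common to all values."""
    return len(os.path.commonprefix(values))


# Same construction as the repro: integers zero-padded to width 15.
padded = [f"{item:015}" for item in [2, 3, 1]]

# Hypothetical ISO 8601 timestamps within the same hour (illustrative only).
timestamps = [
    "2024-08-30T12:34:56Z",
    "2024-08-30T12:05:00Z",
    "2024-08-30T12:59:59Z",
]

print(shared_prefix_len(padded))      # 14
print(shared_prefix_len(timestamps))  # 14
```

Timestamps that differ only in minutes and seconds share the 14 characters up to and including the colon after the hour, which would be consistent with seeing the problem on a large partition of timestamps.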

What you expected to happen:

The resulting Parquet files are ordered by the column specified for z-ordering.

How to reproduce it:

# test_zorder.py
import shutil
import pandas
from deltalake import write_deltalake, DeltaTable

def test_zorder() -> None:
    table = "a"
    field = "b"
    items = [f"{item:015}" for item in [2, 3, 1]]

    shutil.rmtree(table, ignore_errors=True)

    write_deltalake(table, pandas.DataFrame({field: items}))

    DeltaTable(table).optimize.z_order([field])

    sorted_items = DeltaTable(table).to_pyarrow_table().to_pydict()[field]

    assert sorted(items) == sorted_items

Run this with uv:

# caveat: this removes a directory named `a` from the current directory
uvx --with deltalake --with pandas pytest -vv test_zorder.py

Output:

========================= test session starts ==========================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0 -- /home/claudio/.cache/uv/archive-v0/A-uQ68p-4BWRUFltJ5Mv2/bin/python
cachedir: .pytest_cache
rootdir: ...
collected 1 item                                                       

test_zorder.py::test_zorder FAILED                               [100%]

=============================== FAILURES ===============================
_____________________________ test_zorder ______________________________

...
    
>       assert sorted(items) == sorted_items
E       AssertionError: assert ['000000000000001', '000000000000002', '000000000000003'] == ['000000000000002', '000000000000003', '000000000000001']
E         
E         At index 0 diff: '000000000000001' != '000000000000002'
E         
E         Full diff:
E           [
E         +     '000000000000001',
E               '000000000000002',
E               '000000000000003',
E         -     '000000000000001',
E           ]

test_zorder.py:20: AssertionError
======================= short test summary info ========================
FAILED test_zorder.py::test_zorder - AssertionError: assert ['000000000000001', '000000000000002', '000000000000003'] == ['000000000000002', '000000000000003', '000000000000001']
  
  At index 0 diff: '000000000000001' != '000000000000002'
  
  Full diff:
    [
  +     '000000000000001',
        '000000000000002',
        '000000000000003',
  -     '000000000000001',
    ]
========================== 1 failed in 0.36s ===========================

More details:

N/A
