
Allow specifying per-column encoding when writing delta lake tables #3319

@ghost

Description

Allow specifying per-column encoding to achieve ~95% disk space reduction

According to the Parquet specification, several encodings are available beyond the default plain/dictionary ones, such as RLE, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, and BYTE_STREAM_SPLIT.

A user should be able to specify/override the default dictionary & RLE encoding currently used by delta-rs and choose an encoding better suited to their data; a hypothetical sketch of what this could look like follows below. Alternatively, or in addition, auto-detecting the encoding based on a sample of the data would be quite nifty.
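
For illustration only, here is one possible shape for such an API from Python, assuming WriterProperties/ColumnProperties roughly as they exist in deltalake today. The encoding keyword is the proposed (hypothetical) addition and does not exist in the current library:

from deltalake import write_deltalake, WriterProperties, ColumnProperties
import pyarrow as pa

table = pa.table({"timestamp": pa.array([1, 2, 3], type=pa.int64())})

write_deltalake(
    "example_table",
    data=table,
    writer_properties=WriterProperties(
        column_properties={
            "timestamp": ColumnProperties(
                # Existing option: turn off dictionary encoding for this column
                dictionary_enabled=False,
                # Hypothetical option: pick the physical encoding per column
                encoding="DELTA_BINARY_PACKED",
            ),
        },
    ),
)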

Use Case

Large time-series data sets often have numeric (int) columns that minimally change between rows and benefit greatly from DELTA_BINARY_PACKED encoding.
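
For intuition (a minimal illustration, not part of the original issue): when consecutive values barely change, the deltas between rows need only a few bits each, which is exactly what DELTA_BINARY_PACKED exploits:

import numpy as np

# Microsecond timestamps at a fixed 5 us cadence: every delta is 5,
# so each value can be bit-packed into ~3 bits instead of a full 64-bit integer.
ts = np.arange(0, 5 * 1_000_000, 5, dtype=np.int64)
deltas = np.diff(ts)
print(deltas.min(), deltas.max())        # 5 5
print(int(deltas.max()).bit_length())    # 3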

Example

Make some fake time-series data and write it out as datasets with both deltalake and pyarrow:

import pandas as pd
import numpy as np

from deltalake import write_deltalake

import pyarrow.parquet as pq
import pyarrow as pa

# Make some fake time series data
TOTAL_ROWS = 100_000_000
timestamps = pd.date_range(start=pd.Timestamp.now(), periods=TOTAL_ROWS, freq="5us")
timeline = np.linspace(0, len(timestamps), len(timestamps))
pat = pa.Table.from_pandas(
    pd.DataFrame(
        {
            # Microsecond timestamp
            "timestamp": (timestamps.astype("int") / 1000).astype("int"),
            # 3 decimals of precision, stored as int
            "timeseries_data": (
                np.round(
                    10 * np.sin(2 * np.pi * 50 * timeline),
                    3,
                )
                * 1000
            ).astype("int"),
            # 1 minute partitions
            "partition_label": timestamps.strftime("%Y%m%d_%H%M"),
        }
    )
)

output_path_normal = "example_deltalake"
write_deltalake(
    output_path_normal,
    data=pat,
    partition_by=["partition_label"],
    engine="rust",
    # Can't specify per-column encoding
)


output_path_delta_binary_packed_encoded = "example_pyarrow_delta_binary_packed_encoding"
pq.write_to_dataset(
    pat,
    output_path_delta_binary_packed_encoded,
    partition_cols=["partition_label"],
    # Ability to specify column encodings here
    column_encoding={
        "timestamp": "DELTA_BINARY_PACKED",
        "timeseries_data": "DELTA_BINARY_PACKED",
        "partition_label": "RLE",
    },
    use_dictionary=False,
)

The above produces Parquet and Delta datasets of the following sizes on disk:

6.4M	example_pyarrow_delta_binary_packed_encoding
423M	example_deltalake

The dataset written with DELTA_BINARY_PACKED encoding is 98.5% smaller!
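
To double-check which encodings actually ended up in the written files, the Parquet metadata can be inspected with pyarrow (the paths below refer to the datasets written in the example above):

import glob
import pyarrow.parquet as pq

# Grab one data file from each dataset and print the encodings used per column
for root in (output_path_normal, output_path_delta_binary_packed_encoded):
    sample = glob.glob(f"{root}/**/*.parquet", recursive=True)[0]
    row_group = pq.ParquetFile(sample).metadata.row_group(0)
    print(root)
    for i in range(row_group.num_columns):
        column = row_group.column(i)
        print(" ", column.path_in_schema, column.encodings)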

Related Issues & PRs
