
Allow specifying per-column encoding when writing delta lake tables #3319

@ghost

Description

Allow specifying per-column encoding to achieve ~95% disk space reduction

According to the Parquet specification, several encodings are available beyond the default plain/dictionary ones, such as RLE, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, and BYTE_STREAM_SPLIT.

A user should be able to specify/override the default dictionary & RLE encoding currently used by delta-rs and choose an encoding better suited to their data; a hypothetical sketch of what this could look like follows below. Alternatively, or in addition, auto-detecting the encoding based on a sample of the data would be quite nifty.
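
For illustration only, here is one possible shape for such an API from Python, assuming WriterProperties/ColumnProperties roughly as they exist in deltalake today. The encoding keyword is the proposed (hypothetical) addition and does not exist in the current library:

from deltalake import write_deltalake, WriterProperties, ColumnProperties
import pyarrow as pa

table = pa.table({"timestamp": pa.array([1, 2, 3], type=pa.int64())})

write_deltalake(
    "example_table",
    data=table,
    writer_properties=WriterProperties(
        column_properties={
            "timestamp": ColumnProperties(
                # Existing option: turn off dictionary encoding for this column
                dictionary_enabled=False,
                # Hypothetical option: pick the physical encoding per column
                encoding="DELTA_BINARY_PACKED",
            ),
        },
    ),
)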

Use Case

Large time-series data sets often have numeric (int) columns that minimally change between rows and benefit greatly from DELTA_BINARY_PACKED encoding.
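
For intuition (a minimal illustration, not part of the original issue): when consecutive values barely change, the deltas between rows need only a few bits each, which is exactly what DELTA_BINARY_PACKED exploits:

import numpy as np

# Microsecond timestamps at a fixed 5 us cadence: every delta is 5,
# so each value can be bit-packed into ~3 bits instead of a full 64-bit integer.
ts = np.arange(0, 5 * 1_000_000, 5, dtype=np.int64)
deltas = np.diff(ts)
print(deltas.min(), deltas.max())        # 5 5
print(int(deltas.max()).bit_length())    # 3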

Example

Make some fake time-series data and write it out as datasets with both deltalake and pyarrow:

import pandas as pd
import numpy as np

from deltalake import write_deltalake

import pyarrow.parquet as pq
import pyarrow as pa

# Make some fake time series data
TOTAL_ROWS = 100_000_000
timestamps = pd.date_range(start=pd.Timestamp.now(), periods=TOTAL_ROWS, freq="5us")
timeline = np.linspace(0, len(timestamps), len(timestamps))
pat = pa.Table.from_pandas(
    pd.DataFrame(
        {
            # Microsecond timestamp
            "timestamp": (timestamps.astype("int") / 1000).astype("int"),
            # 3 decimals of precision, stored as int
            "timeseries_data": (
                np.round(
                    10 * np.sin(2 * np.pi * 50 * timeline),
                    3,
                )
                * 1000
            ).astype("int"),
            # 1 minute partitions
            "partition_label": timestamps.strftime("%Y%m%d_%H%M"),
        }
    )
)

output_path_normal = "example_deltalake"
write_deltalake(
    output_path_normal,
    data=pat,
    partition_by=["partition_label"],
    engine="rust",
    # Can't specify per-column encoding
)


output_path_delta_binary_packed_encoded = "example_pyarrow_delta_binary_packed_encoding"
pq.write_to_dataset(
    pat,
    output_path_delta_binary_packed_encoded,
    partition_cols=["partition_label"],
    # Ability to specify column encodings here
    column_encoding={
        "timestamp": "DELTA_BINARY_PACKED",
        "timeseries_data": "DELTA_BINARY_PACKED",
        "partition_label": "RLE",
    },
    use_dictionary=False,
)

The above produces Parquet and Delta datasets of the following sizes on disk:

6.4M	example_pyarrow_delta_binary_packed_encoding
423M	example_deltalake

The dataset written with DELTA_BINARY_PACKED encoding is 98.5% smaller!
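
To double-check which encodings actually ended up in the written files, the Parquet metadata can be inspected with pyarrow (the paths below refer to the datasets written in the example above):

import glob
import pyarrow.parquet as pq

# Grab one data file from each dataset and print the encodings used per column
for root in (output_path_normal, output_path_delta_binary_packed_encoded):
    sample = glob.glob(f"{root}/**/*.parquet", recursive=True)[0]
    row_group = pq.ParquetFile(sample).metadata.row_group(0)
    print(root)
    for i in range(row_group.num_columns):
        column = row_group.column(i)
        print(" ", column.path_in_schema, column.encodings)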

Related Issues & PRs
