Description
Allow specifying per-column encoding to achieve ~95% disk space reduction
According to the Parquet specification docs there are many types of encodings available (e.g. PLAIN, RLE, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT).
A user should be able to override the default dictionary & RLE encoding currently used by delta-rs and specify a different encoding that is better suited to their data. Alternatively, or in addition, auto-detecting the encoding based on a sample of the data would be quite nifty.
Use Case
Large time-series data sets often have numeric (int) columns that change minimally between rows and therefore benefit greatly from DELTA_BINARY_PACKED encoding.
Example
Make some fake time-series data and use deltalake and apache pyarrow to write datasets:
import pandas as pd
import numpy as np
from deltalake import write_deltalake
import pyarrow.parquet as pq
import pyarrow as pa
# Make some fake time series data
TOTAL_ROWS = 100_000_000
timestamps = pd.date_range(start=pd.Timestamp.now(), periods=TOTAL_ROWS, freq="5us")
timeline = np.linspace(0, len(timestamps), len(timestamps))
pat = pa.Table.from_pandas(
    pd.DataFrame(
        {
            # Microsecond timestamp
            "timestamp": (timestamps.astype("int") / 1000).astype("int"),
            # 3 decimals of precision, stored as int
            "timeseries_data": (
                np.round(
                    10 * np.sin(2 * np.pi * 50 * timeline),
                    3,
                )
                * 1000
            ).astype("int"),
            # 1 minute partitions
            "partition_label": timestamps.strftime("%Y%m%d_%H%M"),
        }
    )
)
output_path_normal = "example_deltalake"
write_deltalake(
    output_path_normal,
    data=pat,
    partition_by=["partition_label"],
    engine="rust",
    # Can't specify per-column encoding
)
output_path_delta_binary_packed_encoded = "example_pyarrow_delta_binary_packed_encoding"
pq.write_to_dataset(
    pat,
    output_path_delta_binary_packed_encoded,
    partition_cols=["partition_label"],
    # Ability to specify column encodings here
    column_encoding={
        "timestamp": "DELTA_BINARY_PACKED",
        "timeseries_data": "DELTA_BINARY_PACKED",
        "partition_label": "RLE",
    },
    use_dictionary=False,
)
The above produces parquet & delta datasets:
6.4M example_pyarrow_delta_binary_packed_encoding
423M example_deltalake
The dataset written with DELTA_BINARY_PACKED encoding is 98.5% smaller!