# Working with Partitions in Delta Lake

Partitions in Delta Lake let you organize data based on specific columns (for example, date columns or country columns). Partitioning can significantly speed up queries that filter on those columns, because unneeded partitions can be skipped entirely.

Below, we demonstrate how to create, query, and update partitioned Delta tables. The examples use Python; equivalent Rust APIs are noted where relevant.

## Creating a Partitioned Table

To create a partitioned Delta table, specify one or more partition columns when writing the data. If you're using Python, pass `partition_by=[<column>]` to the [write_deltalake()][deltalake.write_deltalake] function. In Rust, you can use `with_partition_columns(...)` on the builder when creating the table.

```python
from deltalake import write_deltalake
import pandas as pd

df = pd.DataFrame({
    "num": [1, 2, 3],
    "letter": ["a", "b", "c"],
    "country": ["US", "US", "CA"]
})

# Create a table partitioned by the "country" column
write_deltalake("tmp/partitioned-table", df, partition_by=["country"])
```

The layout of the `tmp/partitioned-table` folder shows how Delta Lake organizes data by the partition column. The `_delta_log` folder holds transaction metadata, while each `country=<value>` subfolder contains the Parquet files for rows matching that partition value. This layout allows efficient queries and updates on partitioned data.

```plaintext
tmp/partitioned-table/
├── _delta_log/
│   └── 00000000000000000000.json
├── country=CA/
│   └── part-00000-<uuid>.parquet
├── country=US/
│   └── part-00001-<uuid>.parquet
```

## Querying Partitioned Data

### Filtering by partition columns

Because partition columns are part of the storage path, queries that filter on those columns can skip reading unneeded partitions. You can specify partition filters when reading data with [DeltaTable.to_pandas()][deltalake.table.DeltaTable.to_pandas], [DeltaTable.to_pyarrow_table()][deltalake.table.DeltaTable.to_pyarrow_table], or [DeltaTable.to_pyarrow_dataset()][deltalake.table.DeltaTable.to_pyarrow_dataset].

```python
from deltalake import DeltaTable

dt = DeltaTable("tmp/partitioned-table")

# Only read files from partitions where country = 'US'
pdf = dt.to_pandas(partitions=[("country", "=", "US")])
print(pdf)
```

```plaintext
   num letter country
0    1      a      US
1    2      b      US
```
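
You can also check which underlying data files a partition filter selects. A minimal sketch using `DeltaTable.files()`, which accepts the same filter tuples (the file names shown are illustrative):

```python
from deltalake import DeltaTable

dt = DeltaTable("tmp/partitioned-table")

# List only the data files under the country=US partition
print(dt.files([("country", "=", "US")]))
# e.g. ['country=US/part-00001-<uuid>.parquet']
```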

### Partition Columns in Table Metadata

Partition columns can also be inspected via metadata:

```python
from deltalake import DeltaTable

dt = DeltaTable("tmp/partitioned-table")
print(dt.metadata().partition_columns)
```

```plaintext
['country']
```

## Appending and Overwriting Partitions

### Appending to a Partitioned Table

Simply write additional data with `mode="append"`; the partition columns will be used to place the data in the correct partition directories.

```python
import pandas as pd

from deltalake import write_deltalake

new_data = pd.DataFrame({
    "num": [10, 20, 30],
    "letter": ["x", "y", "z"],
    "country": ["CA", "DE", "DE"]
})

write_deltalake("tmp/partitioned-table", new_data, mode="append")
```
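
Because the appended rows include a new country value, a `country=DE` directory appears alongside the existing ones, and the append adds a new entry to the transaction log. The resulting layout looks roughly like this (file names are illustrative):

```plaintext
tmp/partitioned-table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── country=CA/
├── country=DE/
├── country=US/
```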

### Overwriting an Entire Partition

You can overwrite a specific partition, leaving the other partitions intact. Pass `mode="overwrite"` together with `partition_filters`.

```python
import pandas as pd

from deltalake import DeltaTable, write_deltalake

df_overwrite = pd.DataFrame({
    "num": [900, 1000],
    "letter": ["m", "n"],
    "country": ["DE", "DE"]
})

dt = DeltaTable("tmp/partitioned-table")
write_deltalake(
    dt,
    df_overwrite,
    partition_filters=[("country", "=", "DE")],
    mode="overwrite",
)
```

This will remove only the `country=DE` partition files and overwrite them with the new data.
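
To verify, you can read back just that partition; under the assumptions of this example, only the two new rows remain:

```python
from deltalake import DeltaTable

dt = DeltaTable("tmp/partitioned-table")

# Row order may differ
print(dt.to_pandas(partitions=[("country", "=", "DE")]))
```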

### Overwriting Parts of the Table Using a Predicate

If you have a more fine-grained condition than a partition filter, you can use the [predicate argument][deltalake.write_deltalake] (sometimes called `replaceWhere`) to overwrite only rows matching a specific condition.

(See the "Overwriting part of the table data using a predicate" section in the Writing Delta Tables docs for more details.)
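
A minimal sketch, reusing `df_overwrite` from above and assuming a `deltalake` version whose `write_deltalake` supports the `predicate` parameter:

```python
from deltalake import write_deltalake

# Replace only the rows where country = 'DE'; all other rows are untouched
write_deltalake(
    "tmp/partitioned-table",
    df_overwrite,
    mode="overwrite",
    predicate="country = 'DE'",
)
```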

## Updating Partitioned Tables with Merge

You can perform merge operations on partitioned tables in the same way you do on non-partitioned ones: simply provide a matching predicate that references partition columns if needed.

For example, you can match on both the partition column (`country`) and some other condition:

```python
import pyarrow as pa

from deltalake import DeltaTable

dt = DeltaTable("tmp/partitioned-table")

# Source data referencing an existing partition "US"
source_data = pa.table({"num": [100, 101], "letter": ["A", "B"], "country": ["US", "US"]})

(
    dt.merge(
        source=source_data,
        predicate="target.country = source.country AND target.num = source.num",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update(
        updates={"letter": "source.letter"}
    )
    .when_not_matched_insert_all()
    .execute()
)
```

If the partition does not exist (say, for a new country value), a new partition folder will be created automatically, as in the sketch below.
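
For instance, merging source rows with a country value not yet in the table (a hypothetical `FR`) creates a `country=FR/` directory automatically:

```python
import pyarrow as pa

# No existing rows match, so the insert clause adds a new FR partition
new_country = pa.table({"num": [7], "letter": ["q"], "country": ["FR"]})

(
    dt.merge(
        source=new_country,
        predicate="target.country = source.country AND target.num = source.num",
        source_alias="source",
        target_alias="target",
    )
    .when_not_matched_insert_all()
    .execute()
)
```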

(See more in the docs on merging tables.)

## Query Optimizations with Partitions

Partitions allow data skipping for queries that filter on the partition columns. For example, if your partition column is `date`, any query with a clause like `WHERE date = '2023-01-01'` or `WHERE date >= '2023-01-01' AND date < '2023-01-10'` can skip reading all files outside those partitions.

You can trigger partition-based skipping by passing partition filters when reading:

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")
df = dt.to_pandas(partitions=[("date", "=", "2023-01-01")])

Pushdown predicates from engines such as DataFusion or DuckDB (in Rust or Python) achieve the same partition skipping. (See more details in the Querying Delta Tables docs.)
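
As an illustration, a filter on the partition column can be pushed down through a PyArrow dataset, so only matching partition directories are scanned (this sketch assumes the `date`-partitioned table above):

```python
import pyarrow.dataset as ds

from deltalake import DeltaTable

dt = DeltaTable("path/to/table")
dataset = dt.to_pyarrow_dataset()

# The filter is applied at the dataset level, skipping other partitions
table = dataset.to_table(filter=ds.field("date") == "2023-01-01")
```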

## Deleting Partition Data

You may want to delete all rows from a specific partition. For example:

```python
from deltalake import DeltaTable

dt = DeltaTable("tmp/partitioned-table")

# Delete all rows from the 'US' partition:
dt.delete("country = 'US'")
```

This command logically deletes the data by committing a new transaction; the underlying Parquet files are not physically removed until you vacuum the table. (See docs on deleting rows for more.)
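
You can confirm the delete was recorded by inspecting the most recent entry in the table history (a sketch; the exact metadata fields vary by version):

```python
# The latest commit should report a DELETE operation
print(dt.history(1))
```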

## Maintaining Partitioned Tables

### Optimize & Vacuum

Partitioned tables can suffer from many small files if they are frequently appended to. If needed, you can run compaction on a specific partition:

```python
# Compact small files within the country=US partition only
dt.optimize.compact(partition_filters=[("country", "=", "US")])
```

Then optionally vacuum the table to remove older, unreferenced files.
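
A minimal vacuum sketch: by default `vacuum()` is a dry run that only lists removable files, and files must be older than the retention period (7 days, i.e. 168 hours, by default) to be deleted:

```python
# Dry run: list files that would be deleted
print(dt.vacuum(retention_hours=168))

# Actually delete unreferenced files older than the retention period
dt.vacuum(retention_hours=168, dry_run=False)
```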

### Handling High-Cardinality Columns

Partitioning can be very powerful, but be mindful of high-cardinality columns (columns with many unique values). Partitioning on them can create an excessive number of directories and hurt performance. For example, partitioning by `date` is typically better than partitioning by `user_id` if `user_id` has many distinct values.