Commit 3f2e4b5

Liam Brannigan authored and committed
Add standalone page on working with partitions
Signed-off-by: Liam Brannigan <[email protected]>
1 parent 2f1a5ac commit 3f2e4b5

1 file changed: +219 -0 lines changed

docs/usage/working-with-partitions.md

# Working with Partitions in Delta Lake

Partitions in Delta Lake let you organize data based on specific columns (for example, date columns or country columns). Partitioning can significantly speed up queries that filter on those columns, because unneeded partitions can be skipped entirely.

Below, we demonstrate how to create, query, and update partitioned Delta tables with Python examples.

## Creating a Partitioned Table

To create a partitioned Delta table, specify one or more partition columns when creating the table. Here we partition by the `country` column.

```python
from deltalake import write_deltalake, DeltaTable
import pandas as pd

df = pd.DataFrame({
    "num": [1, 2, 3],
    "letter": ["a", "b", "c"],
    "country": ["US", "US", "CA"]
})

# Create a table partitioned by the "country" column
write_deltalake("tmp/partitioned-table", df, partition_by=["country"])
```

The structure in the `tmp/partitioned-table` folder shows how Delta Lake organizes data by the partition column. The `_delta_log` folder holds transaction metadata, while each `country=<value>` subfolder contains the Parquet files for rows matching that partition value. This layout allows efficient queries and updates on partitioned data.

```plaintext
tmp/partitioned-table/
├── _delta_log/
│   └── 00000000000000000000.json
├── country=CA/
│   └── part-00000-<uuid>.parquet
├── country=US/
│   └── part-00001-<uuid>.parquet
```

## Querying Partitioned Data

### Filtering by Partition Columns

Because partition columns are part of the storage path, queries that filter on those columns can skip reading unneeded partitions. You can specify partition filters when reading data with [DeltaTable.to_pandas()](../../delta_table/#deltalake.DeltaTable.to_pandas).

In this example we restrict our query to the `country="US"` partition.

```python
dt = DeltaTable("tmp/partitioned-table")

pdf = dt.to_pandas(partitions=[("country", "=", "US")])
print(pdf)
```

```plaintext
   num letter country
0    1      a      US
1    2      b      US
```

### Partition Columns in Table Metadata

Partition columns can also be inspected via metadata on a `DeltaTable`.

```python
dt = DeltaTable("tmp/partitioned-table")
print(dt.metadata().partition_columns)
```

```plaintext
['country']
```
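
The partition values also appear in the paths of the data files that back the table. As a quick check (a minimal sketch; the exact file names will differ because they contain generated IDs), you can list those paths with `DeltaTable.files()`:

```python
dt = DeltaTable("tmp/partitioned-table")

# Each returned path starts with its partition directory,
# for example "country=US/part-00001-<uuid>.parquet"
for file_path in dt.files():
    print(file_path)
```
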
## Appending and Overwriting Partitions

### Appending to a Partitioned Table

You can write additional data to existing partitions (or create new partitions) with `mode="append"`; the partition columns are used to place the data in the correct partition directories.

```python
new_data = pd.DataFrame({
    "num": [10, 20, 30],
    "letter": ["x", "y", "z"],
    "country": ["CA", "DE", "DE"]
})

write_deltalake("tmp/partitioned-table", new_data, mode="append")

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
   num letter country
0   20      y      DE
1   30      z      DE
2   10      x      CA
3    3      c      CA
4    1      a      US
5    2      b      US
```

### Overwriting a Partition

To overwrite a specific partition (or set of partitions), set `mode="overwrite"` together with a predicate string that specifies which partitions are present in the new data. The predicate lets `deltalake` skip the other partitions entirely.

In this example we overwrite the `DE` partition with new data.

```python
df_overwrite = pd.DataFrame({
    "num": [900, 1000],
    "letter": ["m", "n"],
    "country": ["DE", "DE"]
})

dt = DeltaTable("tmp/partitioned-table")
write_deltalake(
    dt,
    df_overwrite,
    predicate="country = 'DE'",
    mode="overwrite",
)

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
    num letter country
0   900      m      DE
1  1000      n      DE
2    10      x      CA
3     3      c      CA
4     1      a      US
5     2      b      US
```

## Updating Partitioned Tables with Merge

You can perform merge operations on partitioned tables in the same way you do on non-partitioned ones. If only a subset of existing partitions is present in the source (i.e. new) data, `deltalake` can skip reading the partitions that are not present in the source data. You do this by providing a predicate that specifies which partition values are in the source data.

This example shows an upsert merge operation:

- The merge condition (`predicate`) matches rows between source and target based on the partition column and specifies which partitions are present in the source data
- If a match is found between a source row and a target row, the `"letter"` column is updated with the source data
- Otherwise, if no match is found for a source row, the row is inserted, creating a new partition if necessary

```python
dt = DeltaTable("tmp/partitioned-table")

source_data = pd.DataFrame({"num": [1, 101], "letter": ["A", "B"], "country": ["US", "CH"]})

(
    dt.merge(
        source=source_data,
        predicate="target.country = source.country AND target.country in ('US','CH')",
        source_alias="source",
        target_alias="target"
    )
    .when_matched_update(
        updates={"letter": "source.letter"}
    )
    .when_not_matched_insert_all()
    .execute()
)

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
    num letter country
0   101      B      CH
1     1      A      US
2     2      A      US
3   900      m      DE
4  1000      n      DE
5    10      x      CA
6     3      c      CA
```

## Deleting Partition Data

You may want to delete all rows from a specific partition. For example:

```python
dt = DeltaTable("tmp/partitioned-table")

dt.delete("country = 'US'")

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
    num letter country
0   101      B      CH
1   900      m      DE
2  1000      n      DE
3    10      x      CA
4     3      c      CA
```

This command logically deletes the data by creating a new transaction; the underlying Parquet files are only removed later, for example when you vacuum the table.
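
To confirm that the delete was recorded as its own transaction, you can inspect the table history with `DeltaTable.history()` (a minimal sketch; the exact fields in each history entry can vary between `deltalake` versions):

```python
dt = DeltaTable("tmp/partitioned-table")

# Most recent commits first; the delete shows up as a separate DELETE operation
for commit in dt.history(limit=3):
    print(commit.get("version"), commit.get("operation"))
```
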
## Maintaining Partitioned Tables

### Optimize & Vacuum

Partitioned tables can accumulate many small files if a partition is frequently appended to. You can compact these into larger files with [`optimize.compact`](../../delta_table/#deltalake.DeltaTable.optimize).

If we want to target compaction at specific partitions, we can include partition filters.

```python
dt.optimize.compact(partition_filters=[("country", "=", "CA")])
```

Then optionally [`vacuum`](../../delta_table/#deltalake.DeltaTable.vacuum) the table to remove older, unreferenced files.
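
For example (a minimal sketch; by default `vacuum` performs a dry run that only lists removable files, so actually deleting anything requires opting in explicitly):

```python
dt = DeltaTable("tmp/partitioned-table")

# Dry run: list files that are no longer referenced by the table
print(dt.vacuum())

# Actually remove them. Shortening the retention period like this is only for
# demonstration; on a real table it can break time travel and concurrent readers.
dt.vacuum(retention_hours=0, dry_run=False, enforce_retention_duration=False)
```
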

### Handling High-Cardinality Columns

Partitioning can be useful for reducing the time it takes to update and query a table, but be mindful of partitioning on high-cardinality columns (columns with many unique values). Doing so can create an excessive number of partition directories, which can hurt performance. For example, partitioning by date is typically better than partitioning by `user_id` if `user_id` has millions of unique values.
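
As a sketch of this idea (the `events` table and its columns are hypothetical and not part of the example above), you can derive a low-cardinality column such as a date and partition on that instead of the raw identifier:

```python
import pandas as pd
from deltalake import write_deltalake

# Hypothetical event data keyed by a high-cardinality user_id
events = pd.DataFrame({
    "user_id": [101, 102, 103],
    "event_time": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 11:30", "2024-01-02 09:15"
    ]),
    "value": [1.0, 2.5, 0.3],
})

# Partition on a derived, low-cardinality date column rather than on user_id
events["date"] = events["event_time"].dt.strftime("%Y-%m-%d")
write_deltalake("tmp/events-table", events, partition_by=["date"])
```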
