
Commit d7e13eb

Liam Brannigan authored and committed

Update file

Signed-off-by: Liam Brannigan <[email protected]>
1 parent cd4bef2 commit d7e13eb

File tree (1 file changed: +79, -40 lines changed)

docs/usage/working-with-partitions.md


## Creating a Partitioned Table

To create a partitioned Delta table, specify one or more partition columns when creating the table. Here we partition by the country column.

```python
from deltalake import DeltaTable, write_deltalake
import pandas as pd

df = pd.DataFrame({
    "num": [1, 2, 3],
    "letter": ["a", "b", "c"],
    "country": ["US", "US", "CA"]
})

# Create a table partitioned by the "country" column
write_deltalake("tmp/partitioned-table", df, partition_by=["country"])
```

The structure in the "tmp/partitioned-table" folder shows how Delta Lake organizes data by the partition column. The "_delta_log" folder holds transaction metadata, while each "country=<value>" subfolder contains the Parquet files for rows matching that partition value. This layout allows efficient queries and updates on partitioned data.

```plaintext
tmp/partitioned-table/
├── _delta_log/
│   └── ...
├── country=CA/
│   └── <parquet files>
└── country=US/
    └── <parquet files>
```
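
You can also confirm this layout from Python. A small sketch, assuming the table created above exists locally: `DeltaTable.files()` returns the data file paths recorded in the transaction log, which begin with the partition directories.

```python
dt = DeltaTable("tmp/partitioned-table")

# Paths are relative to the table root and start with the partition directory,
# e.g. "country=US/<filename>.parquet"
print(dt.files())
```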

### Filtering by partition columns

Because partition columns are part of the storage path, queries that filter on those columns can skip reading unneeded partitions. You can specify partition filters when reading data with [DeltaTable.to_pandas()](../../delta_table/#deltalake.DeltaTable.to_pandas).

In this example we restrict our query to the `country="US"` partition.

```python
dt = DeltaTable("tmp/partitioned-table")

# Only read files from partitions where country = 'US'
pdf = dt.to_pandas(partitions=[("country", "=", "US")])
print(pdf)
```
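
The same `partitions` filter can be used when reading into Arrow rather than pandas. A brief sketch, assuming the PyArrow-based readers available on `DeltaTable` in recent deltalake releases:

```python
# Read only the country=US partition as a PyArrow table
tbl = dt.to_pyarrow_table(partitions=[("country", "=", "US")])
print(tbl.num_rows)
```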

### Partition Columns in Table Metadata

Partition columns can also be inspected via metadata on a `DeltaTable`.

```python
dt = DeltaTable("tmp/partitioned-table")
print(dt.metadata().partition_columns)
```
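
For the table above this prints `['country']`. If you also want to see how the underlying files are distributed across partitions, you can inspect the add actions in the transaction log. A sketch, assuming `get_add_actions` (available on recent `DeltaTable` releases) and the flattened `partition.<column>` naming it uses:

```python
# Each add action records one data file along with its partition values
actions = dt.get_add_actions(flatten=True).to_pandas()
print(actions[["path", "partition.country"]])
```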

### Appending to a Partitioned Table

You can write additional data to partitions (or create new partitions) with `mode="append"`, and the partition columns will be used to place data in the correct partition directories.

```python
new_data = pd.DataFrame({
    "num": [10, 20, 30],
    "letter": ["x", "y", "z"],
    "country": ["CA", "DE", "DE"]
})

write_deltalake("tmp/partitioned-table", new_data, mode="append")

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
   num letter country
0   20      y      DE
1   30      z      DE
2   10      x      CA
3    3      c      CA
4    1      a      US
5    2      b      US
```

### Overwriting a Partition

You can overwrite a specific partition, leaving the other partitions intact. Pass in `mode="overwrite"` together with a predicate string.

In this example we overwrite the `DE` partition with new data.

```python
df_overwrite = pd.DataFrame({
    "num": [900, 1000],
    "letter": ["m", "n"],
    "country": ["DE", "DE"]
})

dt = DeltaTable("tmp/partitioned-table")
write_deltalake(
    dt,
    df_overwrite,
    predicate="country = 'DE'",
    mode="overwrite",
)

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
    num letter country
0   900      m      DE
1  1000      n      DE
2    10      x      CA
3     3      c      CA
4     1      a      US
5     2      b      US
```

This removes only the `country=DE` partition files and replaces them with the new data; the other partitions are untouched.
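
The predicate is not limited to whole partitions. If you need something finer-grained, the same `predicate` argument (sometimes called replaceWhere) can also reference non-partition columns. A hedged sketch; the values are illustrative, and it assumes the engine accepts non-partition columns in the predicate:

```python
# Replace only the DE rows with num >= 1000, leaving other DE rows in place
df_fix = pd.DataFrame({"num": [1000], "letter": ["p"], "country": ["DE"]})

write_deltalake(
    "tmp/partitioned-table",
    df_fix,
    predicate="country = 'DE' AND num >= 1000",
    mode="overwrite",
)
```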

## Updating Partitioned Tables with Merge

You can perform merge operations on partitioned tables in the same way you do on non-partitioned ones. Simply provide a matching predicate that references partition columns if needed.

You can match on both the partition column (country) and some other condition. This example shows a merge operation that checks both the partition column ("country") and a numeric column ("num") when merging:

- The merge condition (predicate) matches target rows where both "country" and "num" align with the source.
- When a match occurs, it updates the "letter" column; otherwise, it inserts the new row.

```python
dt = DeltaTable("tmp/partitioned-table")

# New data that references an existing partition "US"
source_data = pd.DataFrame({"num": [1, 101], "letter": ["A", "B"], "country": ["US", "US"]})

(
    dt.merge(
        source=source_data,
        predicate="target.country = source.country AND target.num = source.num",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update(updates={"letter": "source.letter"})
    .when_not_matched_insert_all()
    .execute()
)

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
    num letter country
0   101      B      US
1     1      A      US
2     2      b      US
3   900      m      DE
4  1000      n      DE
5    10      x      CA
6     3      c      CA
```

This approach ensures that only rows in the relevant partition ("US") are processed, keeping operations efficient.

## Deleting Partition Data

You may want to delete all rows from a specific partition. For example:

```python
dt = DeltaTable("tmp/partitioned-table")

# Delete all rows from the 'US' partition
dt.delete("country = 'US'")

dt = DeltaTable("tmp/partitioned-table")
pdf = dt.to_pandas()
print(pdf)
```

```plaintext
    num letter country
0   900      m      DE
1  1000      n      DE
2    10      x      CA
3     3      c      CA
```

This command logically deletes the data by creating a new transaction; the underlying Parquet files stay in storage until they are vacuumed.
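
Because the delete is logical, earlier versions of the table remain readable. A short sketch, assuming the time-travel helpers on `DeltaTable` (`history` and `load_as_version` in recent deltalake releases):

```python
dt = DeltaTable("tmp/partitioned-table")
print(dt.history())  # the DELETE shows up as the most recent operation

dt.load_as_version(dt.version() - 1)  # point at the version just before the delete
print(dt.to_pandas())                 # the deleted US rows are visible again
```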

## Maintaining Partitioned Tables

### Optimize & Vacuum

Partitioned tables can accumulate many small files if a partition is frequently appended to. You can compact these into larger files on a specific partition with [`optimize.compact`](../../delta_table/#deltalake.DeltaTable.optimize).

```python
dt.optimize.compact(partition_filters=[("country", "=", "CA")])
```

Then optionally [`vacuum`](../../delta_table/#deltalake.DeltaTable.vacuum) the table to remove older, unreferenced files.
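
A minimal sketch of the vacuum step, assuming default retention settings; `vacuum()` performs a dry run unless `dry_run=False` is passed:

```python
dt = DeltaTable("tmp/partitioned-table")
print(dt.vacuum())        # dry run: lists the files that would be deleted
dt.vacuum(dry_run=False)  # actually delete files older than the retention period
```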

### Handling High-Cardinality Columns

Partitioning can be very powerful, but be mindful of partitioning on high-cardinality columns (columns with many unique values). This can create an excessive number of directories and hurt performance. For example, partitioning by date is typically better than partitioning by user_id if user_id has millions of unique values.
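
As a sketch of that advice (the table path and column names here are illustrative, not from the docs above), derive a low-cardinality column such as a date and partition on it instead of the unique identifier:

```python
events = pd.DataFrame({
    "user_id": [101, 102, 103],  # high cardinality: avoid as a partition column
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],  # low cardinality
})

# Partition by the derived date column rather than by user_id
write_deltalake("tmp/events-table", events, partition_by=["event_date"])
```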
