Predicate pull-up optimization

https://duckdb.org/2024/11/14/optimizers.html#filter-pull-up--filter-pushdown has a nice description of filter pull up, an optimization in DuckDB that I'd like to implement in dask-expr as a learning exercise. dask-expr currently implements predicate push down, where a `read_parquet` followed by a filter on the values of a column is translated into a `read_parquet` with the `filters` set appropriately:

```python
df = dd.read_parquet("file.parquet")
df = df[df["name"] == "Alice"]
```

is optimized to

```python
df = dd.read_parquet("file.parquet", filters=[("name", "=", "Alice")])
```

The idea of a predicate pull-up is similar, but applied to the other side of an join / merge:

```python
left = pd.read_parquet("left.parquet")
right = pd.read_parquet("right.parquet")

left = left[left["name"] == "Alice"]

# Both left and right should get the `filters=` pushed down to the read_parquet
result = left.merge(right, on="name", how="inner")
```

We know that the result of the inner join will only have `name=="Alice"` since the `left` DataFrame will only have `name=="Alice"` thanks to the preceding filter. Because the filter column is also the join column, dask-expr should be able to pull that filter all the way up through the merge and then push it down to the `right = pd.read_parquet`, just like it does for the `left` side.

https://github.com/dask/dask-expr/compare/main...TomAugspurger:dask-expr:tom/predicate-pull-up?expand=1 has an initial cut at this that I'll turn into a PR soon. The basic idea is to implement `Merge._simplify_down` to check for the prerequisites for this optimization (I've only implemented a very specific case, but I think the main / exclusive requirement is that there's a join and filter on the same column). I *think* that `_simplify_down` is the appropriate place to do this. IIUC, the `Merge` is "above" the `read_parquet`. We want to take the `Filter` that's on some side of the `Merge`, pull it up to the `Merge, and push it down the other side.

(I don't know whether this is an important optimization in practice, but it seemed like a pretty good problem to learn a bit about how dask-expr works).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Predicate pull-up optimization #1169

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Predicate pull-up optimization #1169

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions