-
-
Notifications
You must be signed in to change notification settings - Fork 27
Description
https://duckdb.org/2024/11/14/optimizers.html#filter-pull-up--filter-pushdown has a nice description of filter pull up, an optimization in DuckDB that I'd like to implement in dask-expr as a learning exercise. dask-expr currently implements predicate push down, where a read_parquet followed by a filter on the values of a column is translated into a read_parquet with the filters set appropriately:
df = dd.read_parquet("file.parquet")
df = df[df["name"] == "Alice"]is optimized to
df = dd.read_parquet("file.parquet", filters=[("name", "=", "Alice")])The idea of a predicate pull-up is similar, but applied to the other side of an join / merge:
left = pd.read_parquet("left.parquet")
right = pd.read_parquet("right.parquet")
left = left[left["name"] == "Alice"]
# Both left and right should get the `filters=` pushed down to the read_parquet
result = left.merge(right, on="name", how="inner")We know that the result of the inner join will only have name=="Alice" since the left DataFrame will only have name=="Alice" thanks to the preceding filter. Because the filter column is also the join column, dask-expr should be able to pull that filter all the way up through the merge and then push it down to the right = pd.read_parquet, just like it does for the left side.
https://github.com/dask/dask-expr/compare/main...TomAugspurger:dask-expr:tom/predicate-pull-up?expand=1 has an initial cut at this that I'll turn into a PR soon. The basic idea is to implement Merge._simplify_down to check for the prerequisites for this optimization (I've only implemented a very specific case, but I think the main / exclusive requirement is that there's a join and filter on the same column). I think that _simplify_down is the appropriate place to do this. IIUC, the Merge is "above" the read_parquet. We want to take the Filter that's on some side of the Merge, pull it up to the `Merge, and push it down the other side.
(I don't know whether this is an important optimization in practice, but it seemed like a pretty good problem to learn a bit about how dask-expr works).