dynamic filter refactor #15685


Draft · wants to merge 2 commits into main

Conversation

jayzhan211 (Contributor)

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the physical-expr (Changes to the physical-expr crates) label Apr 11, 2025
adriangb (Contributor) left a comment:

I think for any alternative proposal to be viable it needs to work with the current tests without calling new methods, or making copies differently, etc. The test was crafted that way because it reflects what actually happens during execution based on #15301

```rust
let dynamic_filter_1 = dynamic_filter.with_schema(Arc::clone(&filter_schema_1));
let snap_1 = dynamic_filter_1.snapshot().unwrap().unwrap();
insta::assert_snapshot!(format!("{snap_1:?}"), @r#"BinaryExpr { left: Column { name: "a", index: 0 }, op: Eq, right: Literal { value: Int32(42) }, fail_on_overflow: false }"#);
let dynamic_filter_2 = dynamic_filter.with_schema(Arc::clone(&filter_schema_2));
```
adriangb (Contributor):

What is expected to call with_schema? This seems like a new method on DynamicFilterPhysicalExpr. We need reassign_predicate_columns to work; that's what gets called from within ParquetSource, etc.

jayzhan211 (Contributor, Author):

When you need reassign_predicate_columns, it basically re-projects columns based on the provided schema.

adriangb (Contributor):

Right but the constraint is that we can't modify what reassign_predicate_columns and similar do internally, otherwise that's a ton of API churn. The existing design works within the confines of the existing APIs.

jayzhan211 (Contributor, Author):

> We need reassign_predicate_columns to work, that's what gets called from within ParquetSource, etc

We can also change reassign_predicate_columns inside Parquet if that brings us to a better state. From my view, reassign_predicate_columns is the root cause of why you end up with a with_new_children that doesn't actually update the "children".

jayzhan211 (Contributor, Author):

> otherwise that's a ton of API churn.

We only use it in DatafusionArrowPredicate; are there any other places we need to change?

The main point is that with_new_children on the main branch isn't doing the right thing: it should update the source filter, but it only updates the remapped filter schema. I think the filter schema is a "parameter" for remapping column indexes; it doesn't need to be part of the filter expression at all.

jayzhan211 (Contributor, Author):

If you really want to keep reassign_predicate_columns for whatever reason, you should pass the inner of DynamicFilterPhysicalExpr instead, so you are only modifying the inner and not the whole DynamicFilterPhysicalExpr. The difference is that with_new_children is then not called at the DynamicFilterPhysicalExpr level.

Comment on lines -162 to -163

```diff
-        let inner =
-            Self::remap_children(&self.children, self.remapped_children.as_ref(), inner)?;
```
adriangb (Contributor):

This means that evaluate no longer uses a version with remapped children, right?

jayzhan211 (Contributor, Author):

Now you evaluate based on the snapshot. The snapshot is the remapped filter built with your filter schema.

adriangb (Contributor):

I feel like we're going in circles... this is the thing that was expected to be used to produce the snapshots... now we're using snapshots to produce it?

jayzhan211 (Contributor, Author):

I think the difference is that:

Your version

  1. we have source filter A
  2. create dynamic filters a and b with different filter schemas
  3. create snapshots a and b from them
  4. evaluate batches with the snapshots
  5. update source filter A to B
  6. dynamic filters a and b remap based on source filter B when you call evaluate

My version (sketched in code after this list)

  1. we have source filter A
  2. create snapshots based on source filter A + filter schemas A and B
  3. evaluate batches with the snapshots
  4. update source filter A to B
  5. create snapshots based on source filter B + filter schemas A and B
  6. evaluate batches with the snapshots

jayzhan211 (Contributor, Author):

A snapshot is essentially the dynamic filter materialized for a given schema.

Basically, what you have is just the filter expression. You provide the schema to remap the column indexes, and you get yet another filter expression.

Comment on lines 48 to 40

```diff
-    inner: Arc<RwLock<PhysicalExprRef>>,
+    inner: PhysicalExprRef,
```
adriangb (Contributor):

@jayzhan211 how can this have multiple readers and a writer updating with some sort of write lock?

jayzhan211 (Contributor, Author):

I think we don't need it. Given a source filter, you create a snapshot with the schema, then you evaluate based on the remapped filter. When you need a new source filter, instead of updating it, just create a new one.

adriangb (Contributor):

But how do you pipe the new filter down into other operators?

The whole point is that you can create a filter at planning time, bind it to a ParquetSource and a SortExec (for example) and then the SortExec can dynamically update it at runtime.

jayzhan211 (Contributor, Author):

> The whole point is that you can create a filter at planning time, bind it to a ParquetSource and a SortExec (for example) and then the SortExec can dynamically update it at runtime.

Instead of sending the filter down, my change sends the filter schema down. It is used to create another filter (a snapshot) in SortExec dynamically at runtime.

Comment on lines +134 to +135

```rust
pub fn update(&mut self, filter: PhysicalExprRef) {
    self.inner = filter;
```
adriangb (Contributor):

How will writers have mutable access to this if they have to package it up in an Arc?

jayzhan211 (Contributor, Author) commented Apr 12, 2025

Following up on #15568 (comment): why the change is equivalent to yours in the high-level idea.

> 1. DynamicFilterPhysicalExpr gets initialized at planning time with a known set of children but a placeholder expression (lit(true))

The same.

> 2. with_new_children is called making a new DynamicFilterPhysicalExpr but with the children replaced (let's ignore how that happens internally for now)

We need to replace children, and we can achieve this and get the result via snapshot.

> 3. update is called on the original reference with an expression that references the original children. This is propagated to all references, including those with new children, because of the Arc<RwLock<...>>.

We can keep the Arc<RwLock<T>> and update the inner, or even create another new source filter.

> 4. evaluate is called on one of the references that previously had with_new_children called on it. Since update was called, which swapped out inner, the children of this new inner need to be remapped to the children that we currently expose externally.

We can call evaluate on the snapshot because the snapshot is already remapped. Your version needs to remap on each evaluate call, but in my version the snapshot does it once, and we evaluate on it without remapping.

The improvements of this change:

  1. We have a correct with_new_children because we now update the source filter.
  2. DynamicFilterPhysicalExpr is basically a filter expression: Arc<dyn PhysicalExpr>. We have a simpler interface with the same capability.

Concerns

  1. reassign_predicate_columns is replaced by snapshot, but you think we can't make this kind of change because of API churn. I think this is not an issue because it is only used in DatafusionArrowPredicate.
  2. Do we need a lock for the source filter? I think we could just create a new DynamicFilterPhysicalExpr instead, but maybe there are reasons we can't; we can discuss this further. (A compressed sketch of the overall shape follows this list.)

adriangb (Contributor):

Hey @jayzhan211 thank you for putting the work into trying to clarify this.

At this point I think it would be best to wait for #15566 or a PR that replaces it to be merged so that we can work against an actual use case / implementation of these dynamic filters. Otherwise I think it's a bit hard to communicate in such abstract terms. Once we're looking at a concrete use case it will be easier to make a PR to replace this implementation.

Would it be okay with you to wait until that happens to continue this discussion?

Sorry if merging a PR with a bad implementation becomes problematic... luckily it's problematic for us, not end users, since this is all private implementations.
