dynamic filter refactor #15685
Conversation
I think for any alternative proposal to be viable, it needs to work with the current tests without calling new methods, making copies differently, etc. The test was crafted that way because it reflects what actually happens during execution, based on #15301.
```rust
let dynamic_filter_1 = dynamic_filter.with_schema(Arc::clone(&filter_schema_1));
let snap_1 = dynamic_filter_1.snapshot().unwrap().unwrap();
insta::assert_snapshot!(format!("{snap_1:?}"), @r#"BinaryExpr { left: Column { name: "a", index: 0 }, op: Eq, right: Literal { value: Int32(42) }, fail_on_overflow: false }"#);
let dynamic_filter_2 = dynamic_filter.with_schema(Arc::clone(&filter_schema_2));
```
What is expected to call `with_schema`? This seems like a new method on `DynamicFilterPhysicalExpr`. We need `reassign_predicate_columns` to work; that's what gets called from within `ParquetSource`, etc.
Regarding `reassign_predicate_columns`: basically, it re-projects columns based on the provided schema.
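For context, here is a minimal sketch of what that re-projection amounts to. This is an illustration, not DataFusion's actual `reassign_predicate_columns` implementation; error handling is simplified (e.g. no `ignore_not_found` behavior):

```rust
use std::sync::Arc;

use arrow_schema::Schema;
use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::Result;
use datafusion_physical_expr::expressions::Column;
use datafusion_physical_expr::PhysicalExpr;

/// Rewrite every `Column` in `pred` so its index matches where that
/// column lives in `schema` (e.g. the parquet file schema).
fn remap_columns(
    pred: Arc<dyn PhysicalExpr>,
    schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
    pred.transform(|expr| {
        if let Some(col) = expr.as_any().downcast_ref::<Column>() {
            // Look the column up by name in the target schema and
            // rebuild it with the index it has there.
            let idx = schema.index_of(col.name())?;
            let new_col: Arc<dyn PhysicalExpr> = Arc::new(Column::new(col.name(), idx));
            Ok(Transformed::yes(new_col))
        } else {
            Ok(Transformed::no(expr))
        }
    })
    .map(|t| t.data)
}
```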
Right, but the constraint is that we can't modify what `reassign_predicate_columns` and similar do internally; otherwise that's a ton of API churn. The existing design works within the confines of the existing APIs.
> We need `reassign_predicate_columns` to work; that's what gets called from within `ParquetSource`, etc.

We can also change `reassign_predicate_columns` inside Parquet if that brings us to a better state. From my view, `reassign_predicate_columns` is the root cause of why you end up with a `with_new_children` that doesn't actually update the "children".
> otherwise that's a ton of API churn.

We only use it in `DatafusionArrowPredicate`; are there any other places we need to change?

The main point is that `with_new_children` in the main branch isn't doing the right thing: it should update the source filter, but you only update the remapped filter schema. I think the filter schema is a "parameter" for remapping column indexes; it doesn't need to be part of the filter expression at all.
If you really want to keep `reassign_predicate_columns` for whatever reason, you should pass the `inner` of `DynamicFilterPhysicalExpr` instead, so you are only modifying the `inner` and not the whole `DynamicFilterPhysicalExpr`. The difference is that `with_new_children` is not called at the `DynamicFilterPhysicalExpr` level.
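If I read the suggestion right, it amounts to something like this (a hedged sketch: `current()` is assumed here to be the accessor for the inner expression, and the `reassign_predicate_columns` call is approximated from how `ParquetSource` uses it):

```rust
// Sketch only: remap the *inner* expression rather than the whole
// DynamicFilterPhysicalExpr, so with_new_children never runs at the
// wrapper level.
let inner = dynamic_filter.current()?; // assumed accessor for the inner expr
let remapped = reassign_predicate_columns(inner, &file_schema, false)?;
// `remapped` is an ordinary PhysicalExpr and can be evaluated directly.
let mask = remapped.evaluate(&batch)?;
```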
```rust
let inner =
    Self::remap_children(&self.children, self.remapped_children.as_ref(), inner)?;
```
This means that `evaluate` no longer uses a version with remapped children, right?
Now you evaluate based on the snapshot. The snapshot is the remapped filter with your filter schema.
I feel like we're going in circles... this is the thing that was expected to be used to produce the snapshots... now we're using snapshots to produce it?
I think the difference is this (a rough sketch follows the two lists):

Your version

- we have source filter A
- create dynamic filters a and b with different filter schemas
- create snapshots a and b from them
- evaluate batches with the snapshots
- update source filter A to B
- dynamic filters a and b remap based on source filter B when you call `evaluate`

My version

- we have source filter A
- create snapshots based on source filter A + filter schemas A and B
- evaluate batches with the snapshots
- update source filter A to B
- create snapshots based on source filter B + filter schemas A and B
- evaluate batches with the snapshots
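To make the contrast concrete, here is a rough sketch; the names (`with_schema`, `update`, a free `snapshot` function) follow the discussion above and are not a settled API:

```rust
// "Your version": dynamic filters a and b keep a handle to the shared
// source filter, so after update() their next evaluate() remaps B.
let source = DynamicFilterPhysicalExpr::new(filter_a);
let dyn_a = source.with_schema(Arc::clone(&schema_a));
let dyn_b = source.with_schema(Arc::clone(&schema_b));
let _ = dyn_a.evaluate(&batch)?; // lazily remaps filter A to schema_a
source.update(filter_b)?;        // e.g. SortExec pushes a tighter filter
let _ = dyn_a.evaluate(&batch)?; // now lazily remaps filter B

// "My version": a snapshot is a plain, immutable expression; a new
// source filter means creating new snapshots, not mutating old ones.
let snap_a = snapshot(&filter_a, &schema_a)?; // filter A + schema A
let _ = snap_a.evaluate(&batch)?;
let snap_a2 = snapshot(&filter_b, &schema_a)?; // filter B + schema A
let _ = snap_a2.evaluate(&batch)?;
```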
The snapshot is actually the dynamic filter evaluated against the schema. Basically, what you have is just the filter expression; you provide the schema to remap the column indexes, and you get yet another filter expression.
```diff
-    inner: Arc<RwLock<PhysicalExprRef>>,
+    inner: PhysicalExprRef,
```
@jayzhan211 how can this have multiple readers and a writer updating it with some sort of write lock?
I think we don't need it. Given a source filter, you create a snapshot with the schema; then you evaluate based on the remapped filter. When you need a new source filter, instead of updating it, just create a new one.
But how do you pipe the new filter down into other operators?

The whole point is that you can create a filter at planning time, bind it to a `ParquetSource` and a `SortExec` (for example), and then the `SortExec` can dynamically update it at runtime.
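For reference, a toy sketch of the shared-handle pattern this design relies on (the types are simplified stand-ins, not DataFusion's real `DynamicFilterPhysicalExpr`, which wraps `Arc<dyn PhysicalExpr>` rather than `String`):

```rust
use std::sync::{Arc, RwLock};

#[derive(Clone)]
struct DynamicFilter {
    inner: Arc<RwLock<String>>,
}

impl DynamicFilter {
    fn new(initial: &str) -> Self {
        Self { inner: Arc::new(RwLock::new(initial.to_string())) }
    }
    /// Called by SortExec at runtime as it learns tighter bounds.
    fn update(&self, new_filter: &str) {
        *self.inner.write().unwrap() = new_filter.to_string();
    }
    /// Called by ParquetSource per batch; always sees the latest filter.
    fn current(&self) -> String {
        self.inner.read().unwrap().clone()
    }
}

fn main() {
    // Planning time: one filter, two handles, one per operator.
    let filter = DynamicFilter::new("true");
    let parquet_handle = filter.clone();
    let sort_handle = filter;

    // Runtime: SortExec tightens the filter; ParquetSource observes it.
    sort_handle.update("a < 42");
    assert_eq!(parquet_handle.current(), "a < 42");
}
```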
> The whole point is that you can create a filter at planning time, bind it to a ParquetSource and a SortExec (for example) and then the SortExec can dynamically update it at runtime.

Instead of sending the filter down, the change I have is sending the filter schema down. It is used to create another filter (a snapshot) in `SortExec` dynamically at runtime.
```rust
pub fn update(&mut self, filter: PhysicalExprRef) {
    self.inner = filter;
```
How will writers have mutable access to this if they have to package it up in an `Arc`?
> why the change is equivalent to yours in the high level idea.

The same. We need to replace children, and we can achieve this and get the result by

We can keep

We can call

The improvement of this change

Concern
Hey @jayzhan211, thank you for putting the work into trying to clarify this. At this point I think it would be best to wait for #15566, or a PR that replaces it, to be merged so that we can work against an actual use case / implementation of these dynamic filters; otherwise I think it's a bit hard to communicate in such abstract terms. Once we're looking at a concrete use case it will be easier to make a PR to replace this implementation. Would it be okay with you to wait until that happens to continue this discussion? Sorry if merging a PR with a bad implementation becomes problematic... luckily it's problematic for us, not end users, since this is all private implementation.
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?