
merge does not use column stats to skip files? #2701

Answered by ion-elgreco
Zan-L asked this question in Q&A

Merge and write have supported streamed execution since v0.25, which should reduce memory pressure a lot! For write it's always enabled because there is no downside to it. For merge you can opt out by setting streamed_exec=False. The reason is that the existing merge implementation can't use the stats from the source data to prune the target further when execution is streamed, since deriving those stats would require materializing the source table in memory. So when you have streamed_exec=True, it's important to be more explicit with your partition predicate (e.g. partition in ('foo', 'bar')).
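To make the "be more explicit" advice concrete, here is a rough sketch of a streamed merge that names the target partitions in the predicate itself. It assumes deltalake >= 0.25 is installed. The helper names (`partition_predicate`, `streamed_upsert`), the join key `id`, the partition column `region`, and the aliases `s`/`t` are all made up for illustration; adjust them to your table.

```python
from typing import Iterable


def partition_predicate(
    join_condition: str,
    partition_column: str,
    values: Iterable[str],
    target_alias: str = "t",
) -> str:
    """Append an explicit IN-list on the target partition column to a merge predicate.

    Hypothetical helper, not part of deltalake. Single quotes in values are
    doubled so they stay valid SQL string literals.
    """
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"{join_condition} AND {target_alias}.{partition_column} IN ({quoted})"


def streamed_upsert(table_uri: str, source, partitions: Iterable[str]):
    """Upsert `source` into the table at `table_uri` with streamed execution.

    With streamed_exec=True the source is not materialized, so no source stats
    are available for pruning; the partition filter in the predicate is what
    limits which target files are considered.
    """
    # Imported lazily so the predicate helper works without deltalake installed.
    from deltalake import DeltaTable

    predicate = partition_predicate("s.id = t.id", "region", partitions)
    return (
        DeltaTable(table_uri)
        .merge(
            source=source,
            predicate=predicate,
            source_alias="s",
            target_alias="t",
            streamed_exec=True,
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )


if __name__ == "__main__":
    # Show the predicate that would be sent for two known partitions.
    print(partition_predicate("s.id = t.id", "region", ["foo", "bar"]))
```

If you don't know the affected partitions up front and the source fits in memory, setting streamed_exec=False instead lets merge fall back to deriving stats from the source, as described above.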

#1984 you linked is about using multipart writer and a potential refactor to use more datafusion com…

Answer selected by ion-elgreco