fix: re-enable timestamp column for data skipping based on max values stats #1333
Conversation
@zachschuermann - I’ve implemented the improvement item you created, and I’m wondering if you can review it. We’ve found this functionality to be really critical for our use case. I think this is a very common use case, and it would be great to improve it.

awesome, thanks for tackling this @sgrebnov! Should be able to get to a review today :)
let max_ts_adjusted = timestamp_subtract(val, 999);
tracing::debug!(
    "Adjusted timestamp value for col {col} for max stat comparison from {val:?} to {max_ts_adjusted:?}"
);
return self.eval_partial_cmp(ord, max, &max_ts_adjusted, inverted);
should this always be subtracting the val?
Suppose the max is 2500, but was truncated to 2000.
Suppose that value is 2250. => adjusted to 1251
for expression: value < max
actual: 2250 < 2500 == true
adjusted/truncated: 1251 < 2000 == true.
=> correctly not filtered
for expression: value > max
actual: 2250 > 2500 == false
adjusted/truncated: 1251 > 2000 == false
=> correctly filtered
now suppose value is 2750 => adjusted to 1751
for expression: value < max:
actual: 2750 < 2500 == false
adjusted/truncated: 1751 < 2000 == true
=> not filtered (safe)
for expression: value > max:
actual: 2750 > 2500 == true
adjusted/truncated: 1751 > 2000 == false
=> filtered (not safe)❗ ❗
Am I missing something here? Could you add some extra details/context in comments to serve as a proof?
My guess is that we're never doing a comparison value > max since we only ever want to check value <= max => ! (value < max).
EDIT: should be value <= max => ! (value > max).
@OussamaSaoudi - thank you for the deep dive and review! Yeah, I was thinking the same thing when working on the fix: we only do `max > value` or `!(max < value)`, and decreasing `val` helps keep more records in both cases. Please let me know if you would like me to be more specific and add a match for ord=Greater, inverted=false and ord=Less, inverted=true only.
Other reasoning I used:
- If min is always truncated to a lower value and we don't do special logic, then the same should work with max increased (but we increase max by decreasing the `val` used for comparison).
- If we can increase max, for example have (max + 999), and this is valid (keeps more records): if `(max + 999) operator val` holds, then `(max + 999 - 999) operator (val - 999)` is valid as well, so we have our current logic, the `max operator (val - 999)` case.
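For the second point, here is a tiny sanity check (plain `i64` values, not kernel code) of the equivalence between inflating `max` by 999 and decreasing `val` by 999:

```rust
// Tiny sanity check (plain i64 values, not kernel code): comparing the stored
// max against `val - 999` is equivalent to comparing `max + 999` against `val`.
fn main() {
    for max in 0i64..50 {
        for val in 0i64..50 {
            assert_eq!(max + 999 > val, max > val - 999);
            assert_eq!(max + 999 >= val, max >= val - 999);
        }
    }
    println!("(max + 999) op val is equivalent to max op (val - 999)");
}
```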
Please let me know if you would like me to be more specific and add match for ord=Greater, inverted=false and ord=Less, inverted=true only
Good idea! Let's handle by cases.
1. `max > value` or `max >= value`. Represented by: (ord=Greater, inverted=false) and (ord=Less, inverted=true)
   - Proof: If `max >= value + 999`, it must be the case that `max >= value`
2. `max < value` or `max <= value`. (ord=Less, inverted=false) and (ord=Greater, inverted=true)
   - Proof: If `max <= value - 999`, then it must be the case that `max <= value`
   - We can adjust this to `max + 999 <= value` to avoid underflow
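A rough sketch (illustrative names and signature, not the kernel API) of how that (ord, inverted) case split could look, using the `val - 999` adjustment the PR applies for the max-keeps cases and leaving everything else indeterminate to stay conservative:

```rust
use std::cmp::Ordering;

// `max` is the millisecond-truncated max stat and `val` the predicate literal,
// both as i64 microseconds. Returning None never causes a file to be skipped.
fn eval_max_stat_cmp(ord: Ordering, inverted: bool, max: i64, val: i64) -> Option<bool> {
    match (ord, inverted) {
        // `max > val`: the stored max under-reports the true max by at most
        // 999us, so comparing against `val - 999` only ever keeps more files.
        (Ordering::Greater, false) => Some(max > val - 999),
        // `NOT(max < val)`, i.e. `max >= val`: same adjustment, same reasoning.
        (Ordering::Less, true) => Some(!(max < val - 999)),
        // All other combinations (including equality) stay indeterminate here.
        _ => None,
    }
}
```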
CC @scovich Would love your eyes on this.
AFAIK, we only have four rewrites for inequalities:
- `col < val` => `stats.min.col < val`
- `NOT(col < val)` => `NOT(stats.max.col < val)`
- `col > val` => `stats.max.col > val`
- `NOT(col > val)` => `NOT(stats.min.col > val)`
If I understand correctly, the two involving max stats would be changed to:
- `col > val` => `stats.max.col > val - 999`
- `NOT(col < val)` => `NOT(stats.max.col < val - 999)`
And because this is data skipping, we only care whether the new rewrite might produce a FALSE where the old rewrite did not produce FALSE. Because that corresponds to wrongly skipping a file.
For the first: if stats.max.col > val - 999 is FALSE, then the max-value is "too small" and stats.max.col > val must also return FALSE.
For the second, let's simplify a bit by pushing down the NOT:
`NOT(stats.max.col < val - 999)` => `stats.max.col >= val - 999`
If stats.max.col >= val - 999 is FALSE, then the max-value is again "too small" and stats.max.col >= val must also return FALSE.
AFAICT, the rewrite is sound, because any time it returns FALSE the original predicate also returned FALSE.
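A small brute-force check of that argument, assuming plain `i64` microsecond timestamps and millisecond truncation (illustrative only, not kernel code):

```rust
// Verify: the adjusted max-stat comparison returns FALSE only when the
// comparison against the true (untruncated) max is also FALSE.
fn main() {
    for true_max in 0i64..3000 {
        let stored_max = (true_max / 1000) * 1000; // ms-truncated stat
        for val in 0i64..3000 {
            // col > val  =>  stats.max.col > val - 999
            if !(stored_max > val - 999) {
                assert!(!(true_max > val), "wrongly skipped: max={true_max}, val={val}");
            }
            // NOT(col < val)  =>  stats.max.col >= val - 999
            if !(stored_max >= val - 999) {
                assert!(!(true_max >= val), "wrongly skipped: max={true_max}, val={val}");
            }
        }
    }
    println!("adjusted max-stat comparisons never skip a qualifying file");
}
```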
Codecov Report

@@            Coverage Diff            @@
##              main    #1333    +/-   ##
==========================================
  Coverage    84.38%   84.39%
==========================================
  Files          112      112
  Lines        27773    27766        -7
  Branches     27773    27766        -7
==========================================
- Hits         23437    23433        -4
+ Misses        3202     3199        -3
  Partials      1134     1134
A couple of high-level thoughts:

- My gut feeling is that it would be better to put the checks directly in the `DataSkippingPredicateEvaluator::eval_pred_[le|gt|eq]` methods, because the Ordering is hard-wired in each of those three places. Helper functions to reduce boilerplate are always encouraged when appropriate (a rough helper sketch follows this list). What do others think of that idea?
- What about strings? IIRC, Delta-spark truncates string stats to some maximum number of characters, which causes similar headaches for data skipping over the max-stat. Do we even try to handle that today in kernel? I didn't see any obvious code, not sure if there's even an issue tracking it. Meanwhile, I wonder if we could use a similar trick, by ensuring that the `val` we compare is always at least one character shorter than the max-stat value? Except we don't know that length beforehand, and it could vary from file to file. Hmm.
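On the first point, a hypothetical shared helper (the name and placement are made up, not the kernel API) that the three call sites could use:

```rust
// Adjust a microsecond predicate literal before comparing it against a
// millisecond-truncated max stat, so each eval_pred_* call site stays simple.
fn adjust_val_for_truncated_max(val_micros: i64) -> i64 {
    // The stored max can sit up to 999us below the true max; shifting the
    // literal down by the same amount keeps every file the true max would keep.
    // saturating_sub sidesteps the underflow concern raised earlier.
    val_micros.saturating_sub(999)
}
```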
return self.eval_partial_cmp(ord, max, &max_ts_adjusted, inverted);
}
// Equality comparison can't be applied as max stats is truncated to milliseconds, so actual microsecond value is unknown.
Ordering::Equal => return None,
Equality is anyway rewritten as a pair of inequalities (one on min-stat, and another on max-stat).
See DataSkippingPredicateEvaluator::eval_pred_eq.
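For illustration, a minimal sketch of that pair of inequalities with plain `i64` microsecond values (not the kernel's actual types or code), including the max-side slack discussed above for millisecond-truncated stats:

```rust
// A file can contain `col == val` only if `min <= val <= max`, so equality is
// checked as a pair of inequalities over the file stats.
fn may_contain_eq(min: i64, max: i64, val: i64) -> bool {
    min <= val && val <= max
}

// With a millisecond-truncated max stat, the max-side inequality gets the same
// `val - 999` slack: the true max may exceed the stored one by up to 999us.
fn may_contain_eq_truncated_max(min: i64, max: i64, val: i64) -> bool {
    min <= val && val - 999 <= max
}
```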
This seems to still be used for the != predicate (eval_pred_eq with inverted: true):
https://github.com/delta-io/delta-kernel-rs/blob/main/kernel/src/kernel_predicates/mod.rs#L898
Yes, so we never actually use ==, only !=. Also, IIRC, the != case can be expressed as a pair of range inequalities, which would simplify things because then eval_pred_eq would just reuse whatever eval_pred_lt and eval_pred_gt did, as needed.
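A minimal sketch (illustrative `i64` stats, not kernel code) of the `!=` case written as that pair of range inequalities:

```rust
// col != val  <=>  col < val OR col > val, which rewrites over file stats to
// `min < val OR max > val`. Only a file where every row equals `val`
// (so min == max == val) can be skipped.
fn may_contain_ne(min: i64, max: i64, val: i64) -> bool {
    min < val || max > val
}
```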
Thank you for the review @scovich 🙏 1. I'll rely on the team here and will be looking for guidance/a decision; one concern with moving this logic directly into
Yes, but each site is significantly simpler and clearer. My intuition is that it will be a net win.
@sgrebnov any updates here? how can we help?
@zachschuermann, @scovich - thank you for the additional details and suggestions - I’ll go through them in detail and follow up over the next few days.

awesome thanks! just wanted to check in :)
What changes are proposed in this pull request?
This PR implements re-enable file skipping on timestamp columns #1002 by subtracting 999 microseconds from the predicate value used for filtering, keeping potentially good entries. Note: `DataSkippingPredicateEvaluator` only performs initial filtering; final filtering is done based on the actual data content/metadata, so it is valid to relax the filter here. This is an important optimization for fetching only the data that corresponds to a specific time window, for example the last X days.
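To illustrate the idea with plain `i64` microsecond values (a hypothetical example, not kernel code): a max stat truncated to milliseconds can under-report the true max by up to 999 µs, so comparing it against `val - 999` keeps files that the true max would keep.

```rust
fn main() {
    // True max in the file is ...123_456 us; the stored max stat is ms-truncated.
    let true_max: i64 = 123_456;
    let stored_max: i64 = (true_max / 1000) * 1000; // 123_000

    // Predicate `col > val`, with val between the truncated and the true max.
    let val: i64 = 123_400;

    // Comparing the truncated stat directly would wrongly skip the file:
    assert!(!(stored_max > val));
    // The PR instead compares against val - 999, which keeps the file:
    assert!(stored_max > val - 999);
    // ...and the file really does contain qualifying data:
    assert!(true_max > val);
}
```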
The following approach was also reviewed/considered: instead of modifying `val`, generate a comparison expression that adds 999 µs to the max value. However, this ran into a limitation: DataFusion does not support `Timestamp + Int64` operations (it requires an `Interval`). Since the Delta Kernel/Delta spec does not have an `Interval` type, introducing a new type for `Expr` seemed unreasonable.

How was this change tested?

Updated unit tests and tested manually using a Delta Lake table (Databricks liquid table).