add variants of functions optimized for sorted keys #77
base: main
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   82.62%   83.73%   +1.10%
==========================================
  Files          15       15
  Lines        1180     1260      +80
  Branches     1180     1260      +80
==========================================
+ Hits          975     1055      +80
  Misses        144      144
  Partials       61       61

☔ View full report in Codecov by Sentry.
I feel split about this optimization.
We're looking at, at best, an average ~50% faster lookup step (I'm not sure how much that dominates total execution time), but it adds a lot of complexity to this codebase, probably something like a 3x multiple on output code size, and it doesn't degrade nicely: if users get the sortedness heuristic wrong it straight up produces the wrong result with no way to avoid that. That sucks imo.
It feels like if we know we want to have invariants about the data that optimize lookup (e.g. sortedness), then we have to do up-front data prep which might as well be building an optimized binary format (e.g. batson / JSONB-like I guess). That'd be even more performant than this here, without the drawbacks.
On balance, maybe we use this in the short term until we have a binary format ready but that doesn't feel like a good reason to merge this.
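To make the failure mode concrete, here is a minimal sketch (not this crate's actual lookup code) of an early-exit scan that assumes keys are in ascending order. On genuinely sorted objects it can stop early; on unsorted objects it silently reports a present key as missing instead of erroring.

use std::cmp::Ordering;

/// Early-exit lookup that assumes `pairs` is sorted by key in ascending order.
fn lookup_assuming_sorted<'a>(pairs: &[(&'a str, &'a str)], key: &str) -> Option<&'a str> {
    for &(k, v) in pairs {
        match k.cmp(key) {
            Ordering::Equal => return Some(v),
            // We are past where `key` could appear in a sorted object, so stop early.
            Ordering::Greater => return None,
            Ordering::Less => {}
        }
    }
    None
}

fn main() {
    let sorted = [("a", "1"), ("b", "2"), ("z", "3")];
    let unsorted = [("z", "3"), ("a", "1"), ("b", "2")];

    assert_eq!(lookup_assuming_sorted(&sorted, "b"), Some("2")); // correct
    // Silently wrong: "b" is present, but the scan bailed out at "z".
    assert_eq!(lookup_assuming_sorted(&unsorted, "b"), None);
}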
r"Get the keys of a JSON object as an array.", | ||
Sortedness::Unspecified | ||
); | ||
|
||
make_udf_function!( | ||
JsonObjectKeys, | ||
json_keys_sorted, | ||
json_data path, | ||
r"Get the keys of a JSON object as an array; assumes the JSON object's keys are sorted.", | ||
Sortedness::TopLevel | ||
); | ||
|
make_udf_function!(
    JsonObjectKeys,
    json_keys_recursive_sorted,
    json_data path,
    r"Get the keys of a JSON object as an array; assumes all objects' keys are sorted.",
    Sortedness::Recursive
);
Probably needs some bit about "at the given path" in the docs here.
Also the names are inconsistent; should it be json_object_keys_top_level_sorted etc.?
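For context, calling the proposed variants from SQL would look roughly like the sketch below. This is illustrative only: json_keys_sorted is the name proposed in this PR (and still under discussion above), and register_all is assumed to be the crate's existing registration entry point.

use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = SessionContext::new();
    datafusion_functions_json::register_all(&mut ctx)?;

    // Users opt in to the sorted variant manually; nothing verifies that the
    // stored JSON objects actually have sorted keys.
    let df = ctx
        .sql(r#"SELECT json_keys_sorted('{"a": 1, "b": 2, "c": 3}') AS keys"#)
        .await?;
    df.show().await?;
    Ok(())
}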
 // Replace nested JSON functions e.g. `json_get(json_get(col, 'foo'), 'bar')` with `json_get(col, 'foo', 'bar')`
 fn unnest_json_calls(func: &ScalarFunction) -> Option<Transformed<Expr>> {
-    if !matches!(
-        func.func.name(),
-        "json_get"
-            | "json_get_bool"
-            | "json_get_float"
-            | "json_get_int"
-            | "json_get_json"
-            | "json_get_str"
-            | "json_as_text"
-    ) {
+    if !JSON_FUNCTION_NAMES.contains(&func.func.name().to_owned()) {
         return None;
     }
     let mut outer_args_iter = func.args.iter();
I feel like there's some subtlety here, e.g. you can't trivially merge functions with incompatible sortedness. For json_get_recursive_sorted(json_get_top_level_sorted(col, 'foo'), 'bar'), is the corresponding merge even well defined? Should it be recursive sorted all the way, or ???
Hmm yeah good point, we shouldn't merge across sortedness. Guess this needs a test.
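A rough sketch of what "don't merge across sortedness" could look like in the rewrite (a hypothetical helper, not code from this PR; the Recursive variant name is assumed): only unnest nested calls when the inner and outer functions make the same sortedness assumption.

/// Hypothetical mirror of the crate's Sortedness setting, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Sortedness {
    Unspecified,
    TopLevel,
    Recursive, // assumed variant name
}

/// Only merge nested calls when both levels make the same sortedness assumption;
/// a single merged call would otherwise have to pick one assumption for both paths.
fn can_merge(outer: Sortedness, inner: Sortedness) -> bool {
    outer == inner
}

fn main() {
    assert!(can_merge(Sortedness::TopLevel, Sortedness::TopLevel));
    assert!(!can_merge(Sortedness::Recursive, Sortedness::TopLevel));
    assert!(!can_merge(Sortedness::Unspecified, Sortedness::Recursive));
}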
At the moment the rewrite is unchanged in this PR. How do the new functions get selected (other than manually by users)?
Real-world testing in our system showed a 5x performance improvement on some common queries (searching for …). The big advantage of this over creating a binary format is that it's backward compatible with JSON text: basically anyone with a JSON text column can benefit from this without breaking any downstream systems that might need the data to be JSON text rather than a custom binary format.
Included a benchmark demonstrating the optimization.
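The PR's actual benchmark isn't reproduced above; a minimal criterion-style sketch of the kind of comparison it could make (purely illustrative: a 1000-key object and an early-exit lookup like the one sketched earlier, not the crate's real code) might look like this.

use std::cmp::Ordering;

use criterion::{criterion_group, criterion_main, Criterion};

/// Plain scan: checks every key until it finds a match.
fn lookup_linear(pairs: &[(String, u64)], key: &str) -> Option<u64> {
    pairs.iter().find(|(k, _)| k == key).map(|(_, v)| *v)
}

/// Early-exit scan that assumes keys are in ascending order.
fn lookup_sorted(pairs: &[(String, u64)], key: &str) -> Option<u64> {
    for (k, v) in pairs {
        match k.as_str().cmp(key) {
            Ordering::Equal => return Some(*v),
            Ordering::Greater => return None,
            Ordering::Less => {}
        }
    }
    None
}

fn bench_lookups(c: &mut Criterion) {
    let pairs: Vec<(String, u64)> = (0..1000u64).map(|i| (format!("key_{i:04}"), i)).collect();

    // Looking up an early key lets the sorted variant bail out almost immediately.
    c.bench_function("linear_scan", |b| b.iter(|| lookup_linear(&pairs, "key_0010")));
    c.bench_function("sorted_early_exit", |b| b.iter(|| lookup_sorted(&pairs, "key_0010")));
}

criterion_group!(benches, bench_lookups);
criterion_main!(benches);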