Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add variants of functions optimized for sorted keys #77

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

adriangb
Copy link
Collaborator

@adriangb adriangb commented Feb 21, 2025

Included a benchmark demonstrating optimization.

json_get_str_negative    time:   [9.4697 µs 9.5514 µs 9.6921 µs]
json_get_str_sorted_negative    time:   [53.016 ns 53.298 ns 53.610 ns]

@adriangb adriangb requested review from davidhewitt and samuelcolvin and removed request for davidhewitt February 21, 2025 07:02
@codecov-commenter
Copy link

codecov-commenter commented Feb 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.73%. Comparing base (7cecdd1) to head (6009223).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   82.62%   83.73%   +1.10%     
==========================================
  Files          15       15              
  Lines        1180     1260      +80     
  Branches     1180     1260      +80     
==========================================
+ Hits          975     1055      +80     
  Misses        144      144              
  Partials       61       61              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@adriangb adriangb requested a review from davidhewitt February 21, 2025 07:10
Copy link
Collaborator

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel split about this optimization.

We're looking at maybe at best an on average 50% faster result for the lookup step (not sure how much that dominates total execution time), but it's adding a lot of complexity to this codebase, probably leads to 3x multiple on output code size and it doesn't degrade nicely; if users get the sortedness heuristic wrong it straight up produces the wrong result with no way to avoid that. That sucks imo.

It feels like if we know we want to have invariants about the data that optimize lookup (e.g. sortedness), then we have to do up-front data prep which might as well be building an optimized binary format (e.g. batson / JSONB-like I guess). That'd be even more performant than this here, without the drawbacks.

On balance, maybe we use this in the short term until we have a binary format ready but that doesn't feel like a good reason to merge this.

Comment on lines +19 to +35
r"Get the keys of a JSON object as an array.",
Sortedness::Unspecified
);

make_udf_function!(
JsonObjectKeys,
json_keys_sorted,
json_data path,
r"Get the keys of a JSON object as an array; assumes the JSON object's keys are sorted.",
Sortedness::TopLevel
);

make_udf_function!(
JsonObjectKeys,
json_keys_recursive_sorted,
json_data path,
r"Get the keys of a JSON object as an array; assumes all object's keys are sorted.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably needs some bit about "at the given path" in the docs here.

Also names are inconsistent, should it be json_object_keys_top_level_sorted etc?

Comment on lines 79 to 84
// Replace nested JSON functions e.g. `json_get(json_get(col, 'foo'), 'bar')` with `json_get(col, 'foo', 'bar')`
fn unnest_json_calls(func: &ScalarFunction) -> Option<Transformed<Expr>> {
if !matches!(
func.func.name(),
"json_get"
| "json_get_bool"
| "json_get_float"
| "json_get_int"
| "json_get_json"
| "json_get_str"
| "json_as_text"
) {
if !JSON_FUNCTION_NAMES.contains(&func.func.name().to_owned()) {
return None;
}
let mut outer_args_iter = func.args.iter();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's some subtlety here, e.g. you can't trivially merge functions with incompatible sortedness. e.g. json_get_recursive_sorted(json_get_top_level_sorted(col, 'foo'), 'bar')) , is the corresponding merge poorly defined, should it be recursive sorted all the way, or ???

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah good point, we shouldn't merge across sortedness. Guess this needs a test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment the rewrite is unchanged in this PR. How do the new functions get selected (other than manually by users)?

@adriangb
Copy link
Collaborator Author

Real world testing in our system showed a 5x performance improvement on some common queries (searching for url.query which is alphabetically not even that favorable of a key; the whole query was 5x faster not just a small part of it).

The big advantage of this over creating a binary format is that it's backward compatible with JSON text. Basically anyone with a JSON text column can benefit from this without breaking any downstream systems that might need this to be JSON text and not a custom binary format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants