add variants of functions optimized for sorted keys #77

adriangb · 2025-02-21T07:01:58Z

Included a benchmark demonstrating optimization.

json_get_str_negative    time:   [9.4697 µs 9.5514 µs 9.6921 µs]
json_get_str_sorted_negative    time:   [53.016 ns 53.298 ns 53.610 ns]

codecov-commenter · 2025-02-21T07:05:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.73%. Comparing base (7cecdd1) to head (6009223).

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   82.62%   83.73%   +1.10%     
==========================================
  Files          15       15              
  Lines        1180     1260      +80     
  Branches     1180     1260      +80     
==========================================
+ Hits          975     1055      +80     
  Misses        144      144              
  Partials       61       61

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

davidhewitt

I feel split about this optimization.

We're looking at maybe at best an on average 50% faster result for the lookup step (not sure how much that dominates total execution time), but it's adding a lot of complexity to this codebase, probably leads to 3x multiple on output code size and it doesn't degrade nicely; if users get the sortedness heuristic wrong it straight up produces the wrong result with no way to avoid that. That sucks imo.

It feels like if we know we want to have invariants about the data that optimize lookup (e.g. sortedness), then we have to do up-front data prep which might as well be building an optimized binary format (e.g. batson / JSONB-like I guess). That'd be even more performant than this here, without the drawbacks.

On balance, maybe we use this in the short term until we have a binary format ready but that doesn't feel like a good reason to merge this.

davidhewitt · 2025-02-27T09:40:54Z

src/json_object_keys.rs

+    r"Get the keys of a JSON object as an array.",
+    Sortedness::Unspecified
+);
+
+make_udf_function!(
+    JsonObjectKeys,
+    json_keys_sorted,
+    json_data path,
+    r"Get the keys of a JSON object as an array; assumes the JSON object's keys are sorted.",
+    Sortedness::TopLevel
+);
+
+make_udf_function!(
+    JsonObjectKeys,
+    json_keys_recursive_sorted,
+    json_data path,
+    r"Get the keys of a JSON object as an array; assumes all object's keys are sorted.",


Probably needs some bit about "at the given path" in the docs here.

Also names are inconsistent, should it be json_object_keys_top_level_sorted etc?

davidhewitt · 2025-02-27T09:57:37Z

src/rewrite.rs

 // Replace nested JSON functions e.g. `json_get(json_get(col, 'foo'), 'bar')` with `json_get(col, 'foo', 'bar')`
 fn unnest_json_calls(func: &ScalarFunction) -> Option<Transformed<Expr>> {
-    if !matches!(
-        func.func.name(),
-        "json_get"
-            | "json_get_bool"
-            | "json_get_float"
-            | "json_get_int"
-            | "json_get_json"
-            | "json_get_str"
-            | "json_as_text"
-    ) {
+    if !JSON_FUNCTION_NAMES.contains(&func.func.name().to_owned()) {
        return None;
    }
    let mut outer_args_iter = func.args.iter();


I feel like there's some subtlety here, e.g. you can't trivially merge functions with incompatible sortedness. e.g. json_get_recursive_sorted(json_get_top_level_sorted(col, 'foo'), 'bar')) , is the corresponding merge poorly defined, should it be recursive sorted all the way, or ???

Hmm yeah good point, we shouldn't merge across sortedness. Guess this needs a test.

davidhewitt · 2025-02-27T10:03:34Z

src/rewrite.rs

At the moment the rewrite is unchanged in this PR. How do the new functions get selected (other than manually by users)?

adriangb · 2025-02-27T13:35:00Z

Real world testing in our system showed a 5x performance improvement on some common queries (searching for url.query which is alphabetically not even that favorable of a key; the whole query was 5x faster not just a small part of it).

The big advantage of this over creating a binary format is that it's backward compatible with JSON text. Basically anyone with a JSON text column can benefit from this without breaking any downstream systems that might need this to be JSON text and not a custom binary format.

adriangb · 2025-03-24T18:53:54Z

Closing as we've decided to not prioritize this stepping stone optimization and instead focus on Open Variant shredding / binary formats.

adriangb requested review from davidhewitt and samuelcolvin and removed request for davidhewitt February 21, 2025 07:02

adriangb requested a review from davidhewitt February 21, 2025 07:10

davidhewitt reviewed Feb 27, 2025

View reviewed changes

adriangb added 6 commits March 20, 2025 11:42

add variants of functions optimized for sorted keys

6a7827a

lint

2a8acb8

lint

a85e124

fix optimizations

b1215fb

fix bug in unnest

58f56cd

revert

1dc59bd

adriangb force-pushed the sorted-json branch from e5cf9bf to 1dc59bd Compare March 20, 2025 16:42

adriangb added 3 commits March 20, 2025 13:01

Prepare for DataFusion 47

aff3e71

update

a739fd7

Merge branch 'update-methods' into sorted-json

6009223

adriangb closed this Mar 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add variants of functions optimized for sorted keys #77

add variants of functions optimized for sorted keys #77

Uh oh!

adriangb commented Feb 21, 2025 •

edited

Loading

Uh oh!

codecov-commenter commented Feb 21, 2025 •

edited

Loading

Uh oh!

davidhewitt left a comment

Uh oh!

davidhewitt Feb 27, 2025

Uh oh!

davidhewitt Feb 27, 2025

Uh oh!

adriangb Feb 27, 2025

Uh oh!

davidhewitt Feb 27, 2025

Uh oh!

adriangb commented Feb 27, 2025

Uh oh!

adriangb commented Mar 24, 2025

Uh oh!

Uh oh!

add variants of functions optimized for sorted keys #77

add variants of functions optimized for sorted keys #77

Uh oh!

Conversation

adriangb commented Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

davidhewitt left a comment

Choose a reason for hiding this comment

Uh oh!

davidhewitt Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

davidhewitt Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

davidhewitt Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb commented Feb 27, 2025

Uh oh!

adriangb commented Mar 24, 2025

Uh oh!

Uh oh!

adriangb commented Feb 21, 2025 •

edited

Loading

codecov-commenter commented Feb 21, 2025 •

edited

Loading