Skip to content

spec: Variant lower/upper bounds #12658

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 21, 2025

Conversation

aihuaxu
Copy link
Contributor

@aihuaxu aihuaxu commented Mar 26, 2025

This is to revise the bounds specification for Variant. In summary:

The writer determines which fields to collect bounds for in a Variant column. Field bounds are stored as serialized Variant objects, where each key is a normalized JSON path identifying a field, and each value is the corresponding lower or upper bound.

E.g.

For a Variant column with the schema as follows:

{
  "event_type": "login",
  "user.name": "Alex", 
  "tags": ["action", "drama"]
}

The collected bound object looks like:

{
  "$['event_type']": "login",
  "$['user.name']": "Alex",
  "$['tags']": "action"
}

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Mar 26, 2025
@aihuaxu
Copy link
Contributor Author

aihuaxu commented Mar 26, 2025

Copy link
Contributor

@xxubai xxubai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danielcweeks
Copy link
Contributor

@aihuaxu and @rdblue is there a reason we need to explicitly restrict the lower/upper bounds to shredded fields? I would think that the stats pruning would be useful for any field that a writer would want to include in the bound (regardless of whether it was shredded or not).

@aihuaxu
Copy link
Contributor Author

aihuaxu commented Apr 4, 2025

@aihuaxu and @rdblue is there a reason we need to explicitly restrict the lower/upper bounds to shredded fields? I would think that the stats pruning would be useful for any field that a writer would want to include in the bound (regardless of whether it was shredded or not).

What we were thinking is that the bounds are collected from shredded column stats during shredding process. But it does seem reasonable to me to bounds and shredding can be separated: if a writer has the knowledge of the bounds and chooses not to shred, the bounds can still be used in pruning.

Improve the wording

Co-authored-by: Ryan Blue <[email protected]>
Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I think we need to clarify a bit here

format/spec.md Outdated
* `$` -- the Variant root value
* `$['user.name']` -- the field `"user.name"` in the root value that is a Variant object
* `$['location']['latitude']` -- the field `latitude` in a nested `location` object
* `$['ids']` -- the `ids` array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I read the thread you raised on the dev list, I liked that you used "tags" as the example. Maybe we should change some of these to match the examples in the variant shredding spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me do that to match as much as possible.

aihuaxu and others added 2 commits April 18, 2025 16:46
@aihuaxu aihuaxu force-pushed the variant-bound-spec branch from b16f660 to 9a99971 Compare April 19, 2025 16:37
@amogh-jahagirdar
Copy link
Contributor

Thanks @RussellSpitzer @rdblue @Fokko @flyrain @huaxingao @XBaith for reviewing and everyone for voting. Since the vote passed, I'll go ahead and merge

@amogh-jahagirdar amogh-jahagirdar merged commit cc2e0ff into apache:main Apr 21, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants