Spec: Allow the use of `source-id` in V3 #12644

base: main

Conversation
@@ -1414,12 +1414,16 @@ Each partition field in `fields` is stored as a JSON object with the following properties:

| V1       | V2       | V3      | Field           | JSON representation | Example |
|----------|----------|---------|-----------------|---------------------|---------|
| required | required | omitted | **`source-id`** | `JSON int`          | 1       |
---
Can't we still have a reader that reads both `source-id` and `source-ids`? Basically, in our motto of reading potentially out-of-spec metadata but writing in-spec, we can always check whether `source-id` is set as well as `source-ids`. The only time we need to break is if the two don't match.
---
What should the value be if `source-ids` is present?
---
These are implementation details, but I would suggest the following when `source-id` and `source-ids` are both present:

- `source-ids == list(source-id)`: fine on read; on write, omit `source-id`
- `source-ids != list(source-id)`: error on read
- `source-id == x && source-ids == null`: fine on read; promote to `list(source-id)` on write
- `source-id == null && source-ids == list`: spec-compliant config
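These rules could be sketched roughly like this (a hypothetical helper for illustration, not actual Iceberg library code):

```python
def resolve_source_ids(source_id, source_ids):
    """Resolve a partition field's source field IDs on read.

    Either argument may be None, meaning the key was absent in the JSON.
    Applies the read-side rules suggested above.
    """
    if source_ids is not None:
        if source_id is not None and source_ids != [source_id]:
            # The two fields disagree: refuse to read this metadata.
            raise ValueError(
                f"source-id={source_id} conflicts with source-ids={source_ids}"
            )
        return source_ids
    if source_id is not None:
        return [source_id]  # promote the single ID to a list
    raise ValueError("either source-id or source-ids must be present")
```

On write, a compliant writer would then serialize only one of the two keys, so the conflicting branch can never be produced by in-spec code.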
---
Yes, I agree that we should break if they don't match. We have a similar situation with `schema` and `schemas` with `current-schema-id`, which might point to two different schemas. When we write both, we can resolve that, but I think we don't need them both?
---
Yeah, my suggestion is that we still say in V3 that `source-id` is omitted, but in the implementation we can still handle out-of-spec metadata if we come across it. That way the spec reader code doesn't have to be version dependent.
---
If both `source-ids` and `source-id` are written for a single-arg transform, a v3-capable reader can take a shortcut on `source-ids` and only fall back to `source-id` when `source-ids` is missing. Isn't this a desirable feature? Though someone may argue that we have to check the consistency between them, as in the discussion above.
---
> That way the spec reader code doesn't have to be version dependent.

My goal was to be completely independent of the table version, also on the write side. Why not remove this constraint entirely by writing `source-id` or `source-ids` based on the number of arguments? :) I think they should be written mutually exclusively, to avoid any confusion.
---
Oh, sorry for the confusion; my question is about multi-arg transforms like `bucket(a, b, c)`: which value should we write for `source-id`, since it is now changed to required?
---
I think we can be strict on the write side ... unless we want to keep both fields indefinitely, which I'm not sure is super beneficial?
@@ -540,7 +540,7 @@ Notes:

2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
3. Strings are truncated to a valid UTF-8 string with no more than `L` code points.
4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.
5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`.
---
I think it would be good to be explicit about how to handle nulls and dropped columns. Should it be calculated as if `hash(null) == 0`?
It also might be worth adding parentheses for those who don't recall the precedence order between XOR and modulo.
Also - good to meet you, and thanks for working on this! I'm a developer at Snowflake working on Iceberg functionality, and we really appreciate how detailed the spec is!
---
Based on the fact that `bucket(null)` returns `null`: if any value is null, should the entire hash value then be null? I agree that this should be explicit.
---
Great to meet you as well @sfc-gh-bhannel, thanks for jumping in here.
> It also might be worth adding parentheses for those who don't recall the precedence order between XOR and modulo.
Thanks, there was a bracket missing actually ;)
The spec states (just below the table):

> All transforms must return `null` for a `null` input value.

I think we can amend this for the multi-arg transforms so that we leave out the `null` values but still hash the other non-null fields.
Suggested change:

5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`.

becomes:

5. For multi-argument bucketing, the hashes for the not-null input values are `xor`'ed: `(hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`. The transform will return `null` when all input values are `null`.
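The suggested semantics could be sketched as follows; `hash_fn` is passed in as a stand-in here, whereas Iceberg's actual bucket transform uses a specific 32-bit Murmur3 hash:

```python
def multi_arg_bucket(values, width, hash_fn):
    """XOR the hashes of the non-null inputs, then take the modulo.

    Returns None when every input is null, matching the single-arg
    behavior where bucket(null) returns null.
    """
    hashes = [hash_fn(v) for v in values if v is not None]
    if not hashes:
        return None  # all inputs were null
    acc = 0
    for h in hashes:
        acc ^= h
    return acc % width
```

Whether null inputs are skipped or poison the whole result is exactly the open question in this thread; this sketch follows the amended wording, which skips them.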
|          | required | required | **`field-id`**  | `JSON int`    | 1000         |
| required | required | required | **`name`**      | `JSON string` | `id_bucket`  |
| required | required | required | **`transform`** | `JSON string` | `bucket[16]` |

Notes:

1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
---
Would it be more consistent to always use `source-ids` for v3 writers? Or do you want to use `source-id` to allow v3 writers to be compatible with v2 readers when no v3 features are used?
---
This makes the serialization logic much easier, since we don't need to know about the table version. Otherwise, we would need to plumb down the table version into the serializer.
In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms.

Notes:

1. For sort fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
---
Did you mean to make the `transform` row refer to this footnote? I would expect it to be the `source-id` and `source-ids` rows.
---
This was the idea:
But I guess it is not obvious? I think @wgtmac raised the same question in #12644 (comment)?
---
Is there a specific reason to choose `xor` rather than other arithmetic operations?
---
I took inspiration from Apache Hive. I think it plays well with the hashing.
@@ -1453,13 +1457,15 @@ Each sort field in the fields list is stored as an object with the following properties:

| V1       | V2       | V3        | Field           | JSON representation | Example     |
|----------|----------|-----------|-----------------|---------------------|-------------|
| required | required | required  | **`transform`** | `JSON string`       | `bucket[4]` |
| required | required | omitted   | **`source-id`** | `JSON int`          | 1           |
| required | required | required¹ | **`transform`** | `JSON string`       | `bucket[4]` |
---
I guess the note is in the wrong place?
Co-authored-by: Gang Wu <[email protected]>
Changes around the multi-argument transforms, mainly two things: