Spec: Allow the use of source-id in V3 #12644

Open · wants to merge 2 commits into `main` · showing changes from 1 commit
29 changes: 15 additions & 14 deletions format/spec.md
@@ -494,7 +494,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
| **`identity`** | Source value, unmodified | Any except for `geometry`, `geography`, and `variant` | Source type |
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
| **`bucket[N]`**   | Hash of value, mod `N` (see below)                           | Any combination of the following: `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
| **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string`, `binary` | Source type |
| **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
@@ -540,7 +540,7 @@ Notes:
2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
3. Strings are truncated to a valid UTF-8 string with no more than `L` code points.
4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.

5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`.

I think it would be good to be explicit about how to handle nulls and dropped columns. Should it be calculated as if hash(null) == 0?

It also might be worth adding parentheses for those who don't recall the precedence order between XOR and modulo.

Also - good to meet you, and thanks for working on this! I'm a developer at Snowflake working on Iceberg functionality, and we really appreciate how detailed the spec is!

Member

Based on the fact that bucket(null) returns null, if any value is null then the entire hash value should be null? I agree that this should be explicit.

Contributor Author
@Fokko Fokko Mar 27, 2025

Great to meet you as well @sfc-gh-bhannel, thanks for jumping in here.

It also might be worth adding parentheses for those who don't recall the precedence order between XOR and modulo.

Thanks, there was a bracket missing actually ;)

The spec states (just below the table):

> All transforms must return null for a null input value.

I think we can amend this for the multi-arg transforms so that null values are left out, but we still hash the remaining non-null fields.

Suggested change
5. For multi-argument bucketing, the hashes are `xor`'ed: `hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`.
5. For multi-argument bucketing, the hashes for the not-null input values are `xor`'ed: `(hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % W`. The transform will return `null` when all input values are `null`.
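A minimal sketch of the suggested rule, with an injectable hash function standing in for Iceberg's 32-bit Murmur3 bucket hash (the function and parameter names here are illustrative, not from the spec):

```python
from functools import reduce

def multi_arg_bucket(values, num_buckets, hash_fn):
    # XOR the hashes of the non-null inputs, then take the positive
    # remainder mod num_buckets, mirroring bucket[N]'s (hash & max_int) % N.
    hashes = [hash_fn(v) for v in values if v is not None]
    if not hashes:
        return None  # all inputs null -> the transform returns null
    combined = reduce(lambda a, b: a ^ b, hashes)
    return (combined & 0x7FFFFFFF) % num_buckets
```

With an identity `hash_fn`, `multi_arg_bucket([5, 3], 16, lambda v: v)` yields `(5 ^ 3) % 16 = 6`, and an all-null input yields `None`.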

Member

Is there a specific reason to choose `xor` rather than other arithmetic operations?

Contributor Author

I took inspiration from Apache Hive. I think it plays well with the hashing.


#### Partition Evolution

@@ -1414,12 +1414,16 @@ Each partition field in `fields` is stored as a JSON object with the following p

| V1 | V2 | V3 | Field | JSON representation | Example |
|----------|----------|----------|------------------|---------------------|--------------|
| required | required | omitted | **`source-id`** | `JSON int` | 1 |
Member

Can't we still have a reader which reads both `source-id` and `source-ids`? In keeping with our motto of reading potentially out-of-spec metadata but writing in-spec, we can always check whether `source-id` is set as well as `source-ids`. The only time we need to break is if the two don't match.

Contributor

What should the value be if `source-ids` is present?

Member

These are implementation details, but I would suggest, if `source-id` and `source-ids` are both there:

- `source-ids == list(source-id)`: fine on read; on write, omit `source-id`
- `source-ids != list(source-id)`: error on read
- `source-id = x && source-ids == null`: fine on read; promote to `list(source-id)` on write
- `source-id == null && source-ids = list`: spec-compliant config
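That read-path resolution could be sketched as follows (a hypothetical helper operating on an already-parsed JSON object, not the reference implementation):

```python
def resolve_source_ids(field: dict) -> list:
    # Tolerate out-of-spec metadata that carries both keys, erroring
    # only when source-id and source-ids actually disagree.
    single = field.get("source-id")
    many = field.get("source-ids")
    if many is not None:
        if single is not None and many != [single]:
            raise ValueError("source-id and source-ids do not match")
        return many
    if single is not None:
        return [single]  # promote to a list; a writer would emit source-ids
    raise ValueError("one of source-id / source-ids is required")
```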

Contributor Author

Yes, I agree that we should break if they don't match. We have a similar situation with `schema` and `schemas`, where `current-schema-id` might point to two different schemas.

When we write both, we can resolve that, but I think we don't need them both?

Member

Yeah, my suggestion is that we still say in V3 that `source-id` is omitted, but in the implementation we can still handle out-of-spec metadata if we come across it. That way the spec reader code doesn't have to be version-dependent.

Contributor Author
@Fokko Fokko Mar 31, 2025

Ah, I see the confusion. Some examples. In the case of a single-arg transform:

{"field-id": 1000, "source-id": 1, "name": "bucket_user_id", "transform": "bucket[16]"} 

In the case of a multi-arg:

{"field-id": 1000, "source-ids": [1, 2], "name": "bucket_user_id", "transform": "bucket[16]"} 

The benefit is that we don't have to carry the format-version down to the serializers. So they are mutually exclusive, and exactly one of them is required, indicated by the ¹:

*(screenshot of the field table omitted)*

Member

I'm not sure I understand why we would need the format version, though? Can't we always just read both, as I stated above? Then we only write `source-ids`. That would also be version-independent.

Contributor Author
@Fokko Fokko Mar 31, 2025

`source-ids` is format-version ≥ 3; for v1 and v2 you need to write `source-id`, otherwise you might break v1 and v2 readers that are not aware of `source-ids`.

Member

Ah you meant that we need the format version for the write path! Got it, I thought we were talking about the read side, let me think about that.

Contributor Author

Sorry for not being clearer! Give it some time to simmer and let us know :)

| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| required | required | required¹ | **`source-id`** | `JSON int` | 1 |
| | | required¹ | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| | required | required | **`field-id`** | `JSON int` | 1000 |
| required | required | required | **`name`** | `JSON string` | `id_bucket` |
| required | required | required | **`transform`** | `JSON string` | `bucket[16]` |

Notes:

1. For partition fields whose transform takes a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
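The note above implies a version-independent write path: pick the JSON key by transform arity. A sketch under that assumption (hypothetical function, not the reference implementation):

```python
def partition_field_to_json(field_id, source_ids, name, transform):
    # Single-arg transforms keep the v1/v2-compatible source-id key;
    # multi-arg transforms use source-ids, which only v3+ readers accept.
    obj = {"field-id": field_id, "name": name, "transform": transform}
    if len(source_ids) == 1:
        obj["source-id"] = source_ids[0]
    else:
        obj["source-ids"] = source_ids
    return obj
```

Because the key choice depends only on the arity, the serializer never needs to know the table's format version.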


Would it be more consistent to always use source-ids for v3 writers? Or do you want to use source-id to allow v3 writers to be compatible with v2 readers if you don't use v3 features?

Contributor Author

This makes the serialization logic much easier, since we don't need to know about the table version. Otherwise, we would need to plumb down the table version into the serializer.


Supported partition transforms are listed below.

|Transform or Field|JSON representation|Example|
@@ -1453,13 +1457,15 @@ Each sort field in the fields list is stored as an object with the following pro

| V1 | V2 | V3 | Field | JSON representation | Example |
|----------|----------|----------|------------------|---------------------|-------------|
| required | required | required | **`transform`** | `JSON string` | `bucket[4]` |
| required | required | omitted | **`source-id`** | `JSON int` | 1 |
| required | required | required¹ | **`transform`** | `JSON string` | `bucket[4]` |
Member

I guess the note is in the wrong place?

| required | required | required¹ | **`source-id`** | `JSON int` | 1 |
| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| required | required | required | **`direction`** | `JSON string` | `asc` |
| required | required | required | **`null-order`** | `JSON string` | `nulls-last`|

In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms.
Notes:

1. For sort fields whose transform takes a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.


Did you mean to make the transform row refer to this footnote? I would expect it to be the source-id and source-ids rows.

Contributor Author

This was the idea:

*(screenshot omitted)*

But I guess it is not obvious? I think @wgtmac raised the same question in #12644 (comment)?


Older versions of the reference implementation can read tables with transforms unknown to them, ignoring them. But other implementations may break if they encounter unknown transforms. All v3 readers are required to read tables with unknown transforms, ignoring them.

@@ -1605,13 +1611,8 @@ All readers are required to read tables with unknown partition transforms, ignor
Writing v3 metadata:

* Partition Field and Sort Field JSON:
* `source-ids` was added and is required
* `source-id` is no longer required and should be omitted; always use `source-ids` instead

Reading v1 or v2 metadata for v3:

* Partition Field and Sort Field JSON:
* `source-ids` should default to a single-value list of the value of `source-id`
* `source-ids` was added and is required for multi-argument transforms.
* `source-id` should still be written for single-argument transforms.

Row-level delete changes:
