Spec: Allow the use of source-id in V3 #12644

Merged: 11 commits, Apr 22, 2025
Changes from 1 commit
29 changes: 15 additions & 14 deletions format/spec.md
@@ -494,7 +494,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
| **`identity`** | Source value, unmodified | Any except for `geometry`, `geography`, and `variant` | Source type |
- | **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
+ | **`bucket[N]`** | Hash of value, mod `N` (see below) | Any combination of the following: `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
| **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string`, `binary` | Source type |
| **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
@@ -540,7 +540,7 @@ Notes:
2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
3. Strings are truncated to a valid UTF-8 string with no more than `L` code points.
4. In contrast to strings, binary values do not have an assumed encoding and are truncated to `L` bytes.

+ 5. For multi-argument bucketing, the hashes are `xor`'ed: `(hash(col1) ⊕ hash(col2) ⊕ ... ⊕ hash(colN)) % N`.
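
The XOR combination in note 5 can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: `multi_arg_bucket` is a hypothetical helper, the per-column hashes are assumed to be precomputed signed 32-bit integers, and clearing the sign bit before the modulo is an assumption borrowed from the single-argument bucket definition rather than something this note states.

```python
from functools import reduce
from operator import xor


def multi_arg_bucket(column_hashes: list[int], num_buckets: int) -> int:
    """Combine per-column hashes with XOR, then reduce modulo the bucket count.

    Hypothetical helper: `column_hashes` holds one precomputed 32-bit hash per
    source column; the hash function itself is out of scope for this sketch.
    """
    if not column_hashes:
        raise ValueError("at least one column hash is required")
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    combined = reduce(xor, column_hashes)
    # Assumption: clear the sign bit so the bucket index is never negative.
    return (combined & 0x7FFFFFFF) % num_buckets
```

Because XOR is commutative, the bucket does not depend on the order in which the columns are listed, and with a single column the sketch reduces to the single-argument case.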

#### Partition Evolution

@@ -1414,12 +1414,16 @@ Each partition field in `fields` is stored as a JSON object with the following p

| V1 | V2 | V3 | Field | JSON representation | Example |
|----------|----------|----------|------------------|---------------------|--------------|
- | required | required | omitted | **`source-id`** | `JSON int` | 1 |
- | | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
+ | required | required | required¹ | **`source-id`** | `JSON int` | 1 |
+ | | | required¹ | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| | required | required | **`field-id`** | `JSON int` | 1000 |
| required | required | required | **`name`** | `JSON string` | `id_bucket` |
| required | required | required | **`transform`** | `JSON string` | `bucket[16]` |

+ Notes:

+ 1. For partition fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.
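
As an illustration of the note above, a writer might choose between the two properties like this. The sketch is hedged: `partition_field_json` is a hypothetical helper, the field name `id_cat_bucket` in the usage note is invented for the example, and leaving `source-id` out for multi-argument transforms is an assumption, since the note only covers the single-argument case.

```python
def partition_field_json(
    source_ids: list[int], field_id: int, name: str, transform: str
) -> dict:
    """Build a v3 partition field object (hypothetical helper)."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source column ID")
    field: dict = {}
    if len(source_ids) == 1:
        # Single-argument transform: write `source-id`, omit `source-ids`.
        field["source-id"] = source_ids[0]
    else:
        # Multi-argument transform: write `source-ids`.
        # Assumption: `source-id` is left out in this case.
        field["source-ids"] = list(source_ids)
    field["field-id"] = field_id
    field["name"] = name
    field["transform"] = transform
    return field
```

Under these assumptions, `partition_field_json([1], 1000, "id_bucket", "bucket[16]")` yields an object carrying `source-id`, while `partition_field_json([1, 2], 1001, "id_cat_bucket", "bucket[16]")` yields one carrying `source-ids`.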

Supported partition transforms are listed below.

|Transform or Field|JSON representation|Example|
@@ -1453,13 +1457,15 @@ Each sort field in the fields list is stored as an object with the following pro

| V1 | V2 | V3 | Field | JSON representation | Example |
|----------|----------|----------|------------------|---------------------|-------------|
- | required | required | required | **`transform`** | `JSON string` | `bucket[4]` |
- | required | required | omitted | **`source-id`** | `JSON int` | 1 |
+ | required | required | required¹ | **`transform`** | `JSON string` | `bucket[4]` |
+ | required | required | required¹ | **`source-id`** | `JSON int` | 1 |
| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| required | required | required | **`direction`** | `JSON string` | `asc` |
| required | required | required | **`null-order`** | `JSON string` | `nulls-last`|

- In v3 metadata, writers must use only `source-ids` because v3 requires reader support for multi-arg transforms.
+ Notes:

+ 1. For sort fields with a transform with a single argument, the ID of the source field is set on `source-id`, and `source-ids` is omitted.

Older versions of the reference implementation can read tables with transforms unknown to it, ignoring them. But other implementations may break if they encounter unknown transforms. All v3 readers are required to read tables with unknown transforms, ignoring them.

@@ -1605,13 +1611,8 @@ All readers are required to read tables with unknown partition transforms, ignor
Writing v3 metadata:

* Partition Field and Sort Field JSON:
-   * `source-ids` was added and is required
-   * `source-id` is no longer required and should be omitted; always use `source-ids` instead

- Reading v1 or v2 metadata for v3:

- * Partition Field and Sort Field JSON:
-   * `source-ids` should default to a single-value list of the value of `source-id`
+   * `source-ids` was added and is required in case of multi-argument transforms.
+   * `source-id` should still be written in the case of single-argument transforms.
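
On the read side, one way to cope with both properties is to normalize a partition or sort field to a list of source column IDs. The following is only a sketch: `resolve_source_ids` is a hypothetical helper, and preferring `source-ids` when both properties are present is an assumption, not something the spec text in this diff states.

```python
def resolve_source_ids(field: dict) -> list[int]:
    """Return the source column IDs of a partition or sort field object.

    Hypothetical helper for illustration only.
    """
    if "source-ids" in field:
        # Multi-argument transform (assumed to win if both keys are present).
        return list(field["source-ids"])
    if "source-id" in field:
        # Single-argument transform: treat the ID as a one-element list.
        return [field["source-id"]]
    raise ValueError("field has neither `source-id` nor `source-ids`")
```

For example, `resolve_source_ids({"source-id": 1})` gives a one-element list, so downstream code can treat single-argument and multi-argument transforms uniformly.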

Row-level delete changes:
