Commit 743f73d

Merge branch 'apache:master' into master
2 parents 6293891 + 347c45e commit 743f73d

533 files changed

Lines changed: 34514 additions & 2602 deletions

.github/workflows/publish_snapshot-jdk17.yml

Lines changed: 2 additions & 2 deletions
@@ -63,8 +63,8 @@ jobs:
           echo "<password>$ASF_PASSWORD</password>" >> $tmp_settings
           echo "</server></servers></settings>" >> $tmp_settings
 
-          mvn --settings $tmp_settings -ntp clean install -Dgpg.skip -Drat.skip -DskipTests -Papache-release,spark4,flink1 -pl org.apache.paimon:paimon-spark-4.0_2.13 -am
+          mvn --settings $tmp_settings -ntp clean install -Dgpg.skip -Drat.skip -DskipTests -Papache-release,spark4,flink1 -pl org.apache.paimon:paimon-spark-4.0_2.13,org.apache.paimon:paimon-spark-4.1_2.13 -am
           # skip deploy paimon-spark-common_2.13 since they are already deployed in publish-snapshot.yml
-          mvn --settings $tmp_settings -ntp clean deploy -Dgpg.skip -Drat.skip -DskipTests -Papache-release,spark4,flink1 -pl org.apache.paimon:paimon-spark4-common_2.13,org.apache.paimon:paimon-spark-ut_2.13,org.apache.paimon:paimon-spark-4.0_2.13
+          mvn --settings $tmp_settings -ntp clean deploy -Dgpg.skip -Drat.skip -DskipTests -Papache-release,spark4,flink1 -pl org.apache.paimon:paimon-spark4-common_2.13,org.apache.paimon:paimon-spark-ut_2.13,org.apache.paimon:paimon-spark-4.0_2.13,org.apache.paimon:paimon-spark-4.1_2.13
 
           rm $tmp_settings

.github/workflows/stale-pr.yml

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Posts a single reminder comment on pull requests that have seen no
+# activity for 90 days. No auto-close; a maintainer decides whether to
+# close, ping again, or leave the PR open. Issues are not in scope.
+#
+# See dev@paimon.apache.org "Stale PR cleanup for Paimon" thread.
+
+name: Stale PR reminder
+
+on:
+  schedule:
+    - cron: '0 0 * * *'
+  workflow_dispatch:
+
+permissions:
+  pull-requests: write
+
+jobs:
+  stale-pr:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/stale@v9
+        with:
+          # PRs: nudge once at 90 days of inactivity, never auto-close.
+          days-before-pr-stale: 90
+          days-before-pr-close: -1
+          stale-pr-label: stale
+          stale-pr-message: >
+            This pull request has had no activity for 90 days. If you'd
+            like to keep it open, please push a new commit or leave a
+            comment. Thanks for the contribution.
+          remove-stale-when-updated: true
+
+          # Issues are not in scope for this workflow.
+          days-before-issue-stale: -1
+          days-before-issue-close: -1
+
+          operations-per-run: 100
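
Reviewer note: the 90-day threshold configured above can be illustrated with a small shell sketch. This is only an illustration of the arithmetic, not how `actions/stale` is implemented; the `is_stale` helper, its arguments, and the choice of "at least 90 full days" as the boundary are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of actions/stale):
#   is_stale LAST_ACTIVITY_EPOCH NOW_EPOCH [THRESHOLD_DAYS]
# Succeeds when at least THRESHOLD_DAYS full days separate the two timestamps.
is_stale() {
  local last_activity=$1 now=$2 threshold_days=${3:-90}
  # Integer division truncates partial days.
  local elapsed_days=$(( (now - last_activity) / 86400 ))
  [ "$elapsed_days" -ge "$threshold_days" ]
}

# Example: compare a PR's last activity against "now" (both in epoch seconds).
now=$(date +%s)
if is_stale $(( now - 100 * 86400 )) "$now"; then
  echo "PR would receive the reminder comment"
fi
```

Because `days-before-pr-close` is `-1` in the workflow, nothing in this sketch models closing; only the reminder threshold is shown.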

.github/workflows/utitcase-rust-native.yml

Lines changed: 7 additions & 3 deletions
@@ -51,11 +51,13 @@ jobs:
           distribution: 'temurin'
 
       - name: Install Rust toolchain
-        uses: dtolnay/rust-toolchain@stable
+        run: |
+          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable --profile minimal
+          echo "$HOME/.cargo/bin" >> $GITHUB_PATH
 
       - name: Clone and build Vortex native library
         run: |
-          git clone --depth 1 https://github.com/spiraldb/vortex.git ${RUNNER_TEMP}/vortex
+          git clone --depth 1 -b 0.69.0 https://github.com/spiraldb/vortex.git ${RUNNER_TEMP}/vortex
           cd ${RUNNER_TEMP}/vortex
           cargo build --package vortex-jni --release
 
@@ -87,7 +89,9 @@ jobs:
           distribution: 'temurin'
 
       - name: Install Rust toolchain
-        uses: dtolnay/rust-toolchain@stable
+        run: |
+          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable --profile minimal
+          echo "$HOME/.cargo/bin" >> $GITHUB_PATH
 
       - name: Build Tantivy native library
         run: |

.github/workflows/utitcase-spark-4.x.yml

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ jobs:
           jvm_timezone=$(random_timezone)
           echo "JVM timezone is set to $jvm_timezone"
           test_modules=""
-          for suffix in ut 4.0; do
+          for suffix in ut 4.0 4.1; do
             test_modules+="org.apache.paimon:paimon-spark-${suffix}_2.13,"
           done
           test_modules="${test_modules%,}"
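
Reviewer note: to see what the updated loop produces, the snippet below runs the same three statements outside the workflow and prints the result. It is a standalone sketch; how the workflow later consumes `test_modules` is not visible in this hunk and is not assumed here.

```shell
#!/usr/bin/env bash
# Build the comma-separated module list exactly as the workflow loop does.
test_modules=""
for suffix in ut 4.0 4.1; do
  # Braces keep "_2.13" from being parsed as part of the variable name.
  test_modules+="org.apache.paimon:paimon-spark-${suffix}_2.13,"
done
# Strip the trailing comma left by the last iteration.
test_modules="${test_modules%,}"
echo "$test_modules"
# prints: org.apache.paimon:paimon-spark-ut_2.13,org.apache.paimon:paimon-spark-4.0_2.13,org.apache.paimon:paimon-spark-4.1_2.13
```

Adding another Spark version to the test matrix only requires appending its suffix to the `for` list.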

docs/content/append-table/blob.md

Lines changed: 76 additions & 14 deletions
@@ -71,7 +71,7 @@ For details about the blob file format structure, see [File Format - BLOB]({{< r
 
 ## Storage Modes
 
-Paimon supports three storage modes for BLOB fields:
+Paimon supports four storage modes for BLOB fields:
 
 1. **Default blob storage**
    Blob bytes are written to Paimon-managed `.blob` files under the table path.
@@ -82,7 +82,10 @@ Paimon supports three storage modes for BLOB fields:
 3. **External-storage descriptor mode**
    Fields configured in `blob-external-storage-field` are a subset of `blob-descriptor-field`. At write time, Paimon writes the raw blob data to the configured `blob-external-storage-path` and stores only serialized `BlobDescriptor` bytes inline in data files.
 
-This allows one table to mix raw-data BLOB fields, descriptor-only BLOB fields, and descriptor-based BLOB fields backed by external storage.
+4. **Blob view storage**
+   Fields configured in `blob-view-field` store serialized `BlobViewStruct` bytes inline in data files. The struct points to a BLOB value in an upstream table by table identifier, BLOB field, and row id. The actual blob bytes are resolved from the upstream table at read time.
+
+This allows one table to mix raw-data BLOB fields, descriptor-only BLOB fields, descriptor-based BLOB fields backed by external storage, and view fields that reference upstream BLOB values.
 
 ## Table Options
 
@@ -123,6 +126,17 @@ This allows one table to mix raw-data BLOB fields, descriptor-only BLOB fields,
 some BLOB fields in <code>.blob</code> files and some as descriptor references.
 </td>
 </tr>
+<tr>
+<td><h5>blob-view-field</h5></td>
+<td>No</td>
+<td style="word-wrap: break-word;">(none)</td>
+<td>String</td>
+<td>
+Comma-separated BLOB field names stored as serialized <code>BlobViewStruct</code> bytes inline in normal data files.
+The field values reference BLOB values in upstream tables and are resolved at read time.
+This option must be a subset of <code>blob-field</code> and must not overlap with <code>blob-descriptor-field</code>.
+</td>
+</tr>
 <tr>
 <td><h5>blob-external-storage-field</h5></td>
 <td>No</td>
@@ -279,30 +293,75 @@ ALTER TABLE blob_table SET ('blob-as-descriptor' = 'false');
 SELECT image FROM blob_table;
 ```
 
-### External-Storage Descriptor Fields
+### Blob View
+
+Blob view is useful when a downstream table should reference BLOB values already stored in an upstream table, without copying the bytes or creating new `.blob` files. A blob view field stores only a small `BlobViewStruct` inline. When the field is read, Paimon resolves the referenced BLOB from the upstream table.
+
+Blob view requires:
 
-If you want Paimon to accept raw BLOB input, write the data to an external location, and store only descriptor bytes inline, configure the target field(s) like this:
+- the upstream table to have row tracking enabled, so each row has a stable `_ROW_ID`
+- the downstream field to be listed in both `blob-field` and `blob-view-field`
+- writes to provide a serialized `BlobViewStruct`; in Flink SQL, use the built-in `sys.blob_view` function
+
+The Flink SQL function signature is:
 
 ```sql
-'blob-descriptor-field' = 'image',
-'blob-external-storage-field' = 'image',
-'blob-external-storage-path' = 's3://my-bucket/paimon-external-blobs/'
+sys.blob_view(table_name, field_name, row_id)
 ```
 
-For these configured fields:
+Arguments:
+
+- `table_name`: the upstream table name. It must be fully qualified as `database.table` or `catalog.database.table`. Unqualified table names are rejected.
+- `field_name`: the upstream BLOB field name.
+- `row_id`: the `_ROW_ID` value from the upstream row-tracking table.
+
+The following example writes a downstream table whose `image_ref` field views the `image` field in `image_table`:
+
+```sql
+CREATE TABLE image_table (
+    id INT,
+    name STRING,
+    image BYTES
+) WITH (
+    'row-tracking.enabled' = 'true',
+    'data-evolution.enabled' = 'true',
+    'blob-field' = 'image'
+);
+
+CREATE TABLE image_view_table (
+    id INT,
+    label STRING,
+    image_ref BYTES
+) WITH (
+    'row-tracking.enabled' = 'true',
+    'data-evolution.enabled' = 'true',
+    'blob-field' = 'image_ref',
+    'blob-view-field' = 'image_ref'
+);
+
+INSERT INTO image_view_table
+SELECT
+    id,
+    name AS label,
+    sys.blob_view('default.image_table', 'image', _ROW_ID)
+FROM `image_table$row_tracking`;
+```
 
-- Paimon writes the raw blob data to `blob-external-storage-path`
-- Paimon stores serialized `BlobDescriptor` bytes inline in normal data files
-- the field remains descriptor-based when reading and updating
-- orphan file cleanup is not applied to the external storage path
+If the current Paimon catalog name is included in the table name, the function also accepts `catalog.database.table`:
+
+```sql
+SELECT sys.blob_view('my_catalog.default.image_table', 'image', _ROW_ID)
+FROM `image_table$row_tracking`;
+```
+
+Reads from `image_view_table.image_ref` return the referenced BLOB bytes in the same way as normal blob fields. The referenced upstream table and row must remain available for the view to be resolved.
 
 ### MERGE INTO Support
 
 For Data Evolution writes in Flink and Spark:
 
 - raw-data BLOB columns are still rejected in partial-column `MERGE INTO` updates
 - descriptor-based BLOB columns are allowed
-- fields configured in `blob-external-storage-field` are also allowed because they are descriptor-based fields
 
 ## Java API Usage
 
@@ -661,6 +720,7 @@ For these configured fields:
 3. **No Statistics**: Statistics collection is not supported for blob columns.
 4. **Required Options**: `row-tracking.enabled` and `data-evolution.enabled` must be set to `true`.
 5. **External Storage Cleanup**: Files written through `blob-external-storage-path` are outside Paimon's orphan file cleanup scope.
+6. **Blob View Dependency**: Blob view fields depend on the referenced upstream table and row. If the upstream data is removed or no longer readable, the view cannot be resolved.
 
 ## Best Practices
 
@@ -674,4 +734,6 @@ For these configured fields:
 
 5. **Manage External Storage Lifecycle Separately**: Files written to `blob-external-storage-path` are not cleaned up by Paimon, so retention and deletion should be managed externally.
 
-6. **Use Partitioning**: Partition your blob tables by date or other dimensions to improve query performance and data management.
+6. **Use Blob View to Avoid Copying BLOB Data**: Configure `blob-view-field` when a downstream table only needs to reference BLOB values from an upstream table.
+
+7. **Use Partitioning**: Partition your blob tables by date or other dimensions to improve query performance and data management.
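
Reviewer note: the documented constraints on `blob-view-field` (it must be a subset of `blob-field`, must not overlap `blob-descriptor-field`, and `sys.blob_view` requires a `database.table` or `catalog.database.table` name) can be checked before submitting DDL. The sketch below is a hypothetical pre-flight script, not Paimon's own validation; the function names are invented for this example, option values are assumed to be comma-separated without spaces, and the table-name check only counts dot-separated segments.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks; these helpers are NOT part of Paimon.

# contains LIST ITEM: succeeds when ITEM is one of the comma-separated names in LIST.
contains() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_blob_view_options BLOB_FIELD BLOB_VIEW_FIELD BLOB_DESCRIPTOR_FIELD
# Every view field must appear in blob-field and must not appear in blob-descriptor-field.
check_blob_view_options() {
  local blob_fields=$1 view_fields=$2 descriptor_fields=$3
  local fields field
  IFS=',' read -ra fields <<< "$view_fields"
  for field in "${fields[@]}"; do
    if ! contains "$blob_fields" "$field"; then
      echo "$field is listed in blob-view-field but not in blob-field"
      return 1
    fi
    if contains "$descriptor_fields" "$field"; then
      echo "$field overlaps with blob-descriptor-field"
      return 1
    fi
  done
  return 0
}

# is_qualified_table_name NAME: succeeds for two or three dot-separated segments.
is_qualified_table_name() {
  local parts
  IFS='.' read -ra parts <<< "$1"
  [ "${#parts[@]}" -eq 2 ] || [ "${#parts[@]}" -eq 3 ]
}

# Options taken from the image_view_table example in the hunk above.
check_blob_view_options "image_ref" "image_ref" "" && echo "options look consistent"
is_qualified_table_name "default.image_table" && echo "table name is qualified"
```

Both checks pass for the `image_view_table` example; an unqualified name such as `image_table` fails the second check, matching the documented rejection of unqualified table names.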

docs/content/append-table/global-index.md

Lines changed: 5 additions & 1 deletion
@@ -41,6 +41,8 @@ Global indexes work on top of Data Evolution tables. To use global indexes, your
 - `'row-tracking.enabled' = 'true'`
 - `'data-evolution.enabled' = 'true'`
 
+> Global index queries may not be exact when the index only covers part of the table data. If a query predicate matches the index, Paimon returns only the results from the indexed portion. Matching records in data that has not been indexed yet will not be returned.
+
 ## Prerequisites
 
 Create a table with the required properties:
@@ -95,11 +97,13 @@ Generation) applications.
 CALL sys.create_global_index(
     table => 'db.my_table',
     index_column => 'embedding',
-    index_type => 'lumina-vector-ann',
+    index_type => 'lumina',
     options => 'lumina.index.dimension=128'
 );
 ```
 
+The legacy index type `lumina-vector-ann` is still accepted for existing tables and SQL compatibility.
+
 **Vector Search**
 
 {{< tabs "vector-search" >}}

docs/content/concepts/system-tables.md

Lines changed: 5 additions & 5 deletions
@@ -432,11 +432,11 @@ You can query the partition files of the table.
 SELECT * FROM my_table$partitions;
 
 /*
-+-----------+--------------+-------------------+------------+---------------------+---------------------+------------+------------+---------+
-| partition | record_count | file_size_in_bytes| file_count | last_update_time    | created_at          | created_by | updated_by | options |
-+-----------+--------------+-------------------+------------+---------------------+---------------------+------------+------------+---------+
-| {1}       | 1            | 645               | 1          | 2024-06-24 10:25:57 | 2024-06-24 10:20:00 | admin      | test_user  | {}      |
-+-----------+--------------+-------------------+------------+---------------------+---------------------+------------+------------+---------+
++-----------+--------------+-------------------+------------+---------------------+---------------------+------------+------------+---------+---------------+-------+
+| partition | record_count | file_size_in_bytes| file_count | last_update_time    | created_at          | created_by | updated_by | options | total_buckets | done  |
++-----------+--------------+-------------------+------------+---------------------+---------------------+------------+------------+---------+---------------+-------+
+| {1}       | 1            | 645               | 1          | 2024-06-24 10:25:57 | 2024-06-24 10:20:00 | admin      | test_user  | {}      | 1             | false |
++-----------+--------------+-------------------+------------+---------------------+---------------------+------------+------------+---------+---------------+-------+
 */
 ```
 
442442

docs/content/learn-paimon/scenario-guide.md

Lines changed: 3 additions & 1 deletion
@@ -451,14 +451,16 @@ Schema schema = Schema.newBuilder()
 CALL sys.create_global_index(
     table => 'db.doc_embeddings',
     index_column => 'embedding',
-    index_type => 'lumina-vector-ann',
+    index_type => 'lumina',
     options => 'lumina.index.dimension=768'
 );
 
 -- Search for top-5 nearest neighbors
 SELECT * FROM vector_search('doc_embeddings', 'embedding', array(0.1f, 0.2f, ...), 5);
 ```
 
+The legacy index type `lumina-vector-ann` is still accepted for existing tables and SQL compatibility.
+
 **Why:** The [Global Index]({{< ref "append-table/global-index" >}}) with DiskANN provides high-performance ANN search.
 Vector data is stored in dedicated `.vector.lance` files optimized for dense vectors, while scalar columns stay in
 Parquet. You can also build a **BTree Index** on scalar columns for efficient filtering:

docs/content/primary-key-table/pk-clustering-override.md

Lines changed: 18 additions & 1 deletion
@@ -50,6 +50,23 @@ CREATE TABLE my_table (
 );
 ```
 
+For `first-row` merge engine, deletion vectors are already built-in, so you don't need to enable them explicitly:
+
+```sql
+CREATE TABLE my_table (
+    id BIGINT,
+    dt STRING,
+    city STRING,
+    amount DOUBLE,
+    PRIMARY KEY (id) NOT ENFORCED
+) WITH (
+    'pk-clustering-override' = 'true',
+    'clustering.columns' = 'city',
+    'merge-engine' = 'first-row',
+    'bucket' = '4'
+);
+```
+
 After this, data files within each bucket will be physically sorted by `city` instead of `id`. Queries like
 `SELECT * FROM my_table WHERE city = 'Beijing'` can skip irrelevant data files by checking their min/max statistics
 on the clustering column.
@@ -60,7 +77,7 @@ on the clustering column.
 |--------|-------------|
 | `pk-clustering-override` | `true` |
 | `clustering.columns` | Must be set (one or more non-primary-key columns) |
-| `deletion-vectors.enabled` | Must be `true` |
+| `deletion-vectors.enabled` | Must be `true` (not required for `first-row` merge engine) |
 | `merge-engine` | `deduplicate` (default) or `first-row` only |
 
 ## When to Use

docs/content/project/download.md

Lines changed: 2 additions & 0 deletions
@@ -41,6 +41,7 @@ This documentation is a guide for downloading Paimon Jars.
 | Flink 1.17 | [paimon-flink-1.17-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-flink-1.17/{{< version >}}/) |
 | Flink 1.16 | [paimon-flink-1.16-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-flink-1.16/{{< version >}}/) |
 | Flink Action | [paimon-flink-action-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-flink-action/{{< version >}}/) |
+| Spark 4.1 | [paimon-spark-4.1_2.13-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-4.1_2.13/{{< version >}}/) |
 | Spark 4.0 | [paimon-spark-4.0_2.13-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-4.0_2.13/{{< version >}}/) |
 | Spark 3.5 | [paimon-spark-3.5_2.12-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.5_2.12/{{< version >}}/) |
 | Spark 3.4 | [paimon-spark-3.4_2.12-{{< version >}}.jar](https://repository.apache.org/snapshots/org/apache/paimon/paimon-spark-3.4_2.12/{{< version >}}/) |
@@ -68,6 +69,7 @@ This documentation is a guide for downloading Paimon Jars.
 | Flink 1.17 | [paimon-flink-1.17-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-flink-1.17/{{< version >}}/paimon-flink-1.17-{{< version >}}.jar) |
 | Flink 1.16 | [paimon-flink-1.16-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-flink-1.16/{{< version >}}/paimon-flink-1.16-{{< version >}}.jar) |
 | Flink Action | [paimon-flink-action-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-flink-action/{{< version >}}/paimon-flink-action-{{< version >}}.jar) |
+| Spark 4.1 | [paimon-spark-4.1_2.13-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-4.1_2.13/{{< version >}}/paimon-spark-4.1_2.13-{{< version >}}.jar) |
 | Spark 4.0 | [paimon-spark-4.0_2.13-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-4.0_2.13/{{< version >}}/paimon-spark-4.0_2.13-{{< version >}}.jar) |
 | Spark 3.5 | [paimon-spark-3.5_2.12-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.5_2.12/{{< version >}}/paimon-spark-3.5_2.12-{{< version >}}.jar) |
 | Spark 3.4 | [paimon-spark-3.4_2.12-{{< version >}}.jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-spark-3.4_2.12/{{< version >}}/paimon-spark-3.4_2.12-{{< version >}}.jar) |
