Commit 3615eed

[tantivy] Introduce paimon-tantivy for full text search index
1 parent 2e1d569 commit 3615eed

27 files changed: +2665 -3 lines changed


.github/workflows/utitcase-vortex.yml renamed to .github/workflows/utitcase-rust-native.yml

Lines changed: 38 additions & 1 deletion
@@ -16,15 +16,17 @@
1616
# limitations under the License.
1717
################################################################################
1818

19-
name: UTCase Vortex
19+
name: UTCase Rust Native
2020

2121
on:
2222
push:
2323
paths:
2424
- 'paimon-vortex/**'
25+
- 'paimon-tantivy/**'
2526
pull_request:
2627
paths:
2728
- 'paimon-vortex/**'
29+
- 'paimon-tantivy/**'
2830

2931
env:
3032
JDK_VERSION: 8
@@ -70,3 +72,38 @@ jobs:
7072
mvn -B -ntp verify -pl paimon-vortex/paimon-vortex-jni,paimon-vortex/paimon-vortex-format -Dcheckstyle.skip=true -Dspotless.check.skip=true
7173
env:
7274
MAVEN_OPTS: -Xmx4096m
75+
76+
tantivy_test:
77+
runs-on: ubuntu-latest
78+
79+
steps:
80+
- name: Checkout code
81+
uses: actions/checkout@v6
82+
83+
- name: Set up JDK ${{ env.JDK_VERSION }}
84+
uses: actions/setup-java@v5
85+
with:
86+
java-version: ${{ env.JDK_VERSION }}
87+
distribution: 'temurin'
88+
89+
- name: Install Rust toolchain
90+
uses: dtolnay/rust-toolchain@stable
91+
92+
- name: Build Tantivy native library
93+
run: |
94+
cd paimon-tantivy/paimon-tantivy-jni/rust
95+
cargo build --release
96+
97+
- name: Copy native library to resources
98+
run: |
99+
RESOURCE_DIR=paimon-tantivy/paimon-tantivy-jni/src/main/resources/native/linux-amd64
100+
mkdir -p ${RESOURCE_DIR}
101+
cp paimon-tantivy/paimon-tantivy-jni/rust/target/release/libtantivy_jni.so ${RESOURCE_DIR}/
102+
103+
- name: Build and test Tantivy modules
104+
timeout-minutes: 30
105+
run: |
106+
mvn -T 2C -B -ntp clean install -DskipTests
107+
mvn -B -ntp verify -pl paimon-tantivy/paimon-tantivy-jni,paimon-tantivy/paimon-tantivy-index -Dcheckstyle.skip=true -Dspotless.check.skip=true
108+
env:
109+
MAVEN_OPTS: -Xmx4096m

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -44,6 +44,9 @@ paimon-python/dev/log
4444
*.swp
4545
.cache
4646

47+
### Rust ###
48+
Cargo.lock
4749
### Vortex lib ###
48-
4950
*libvortex_jni*
51+
### Tantivy lib ###
52+
*libtantivy_jni*

docs/content/append-table/global-index.md

Lines changed: 54 additions & 1 deletion
@@ -33,6 +33,7 @@ without full-table scans. Paimon supports multiple global index types:
3333

3434
- **BTree Index**: A B-tree based index for scalar column lookups. Supports equality, IN, range predicates, and can be combined across multiple columns with AND/OR logic.
3535
- **Vector Index**: An approximate nearest neighbor (ANN) index powered by DiskANN for vector similarity search.
36+
- **Full-Text Index**: A full-text search index powered by Tantivy for text retrieval. Supports term matching and relevance scoring.
3637

3738
Global indexes work on top of Data Evolution tables. To use global indexes, your table **must** have:
3839

@@ -48,7 +49,8 @@ Create a table with the required properties:
4849
CREATE TABLE my_table (
4950
id INT,
5051
name STRING,
51-
embedding ARRAY<FLOAT>
52+
embedding ARRAY<FLOAT>,
53+
content STRING
5254
) TBLPROPERTIES (
5355
'bucket' = '-1',
5456
'row-tracking.enabled' = 'true',
@@ -133,3 +135,54 @@ try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)
133135
{{< /tab >}}
134136

135137
{{< /tabs >}}
138+
139+
## Full-Text Index
140+
141+
Full-Text Index provides text search capabilities powered by Tantivy. It is suitable for text retrieval scenarios
142+
such as document search, log analysis, and content-based filtering.
143+
144+
**Build Full-Text Index**
145+
146+
```sql
147+
-- Create full-text index on 'content' column
148+
CALL sys.create_global_index(
149+
table => 'db.my_table',
150+
index_column => 'content',
151+
index_type => 'tantivy-fulltext'
152+
);
153+
```
154+
155+
**Full-Text Search**
156+
157+
{{< tabs "fulltext-search" >}}
158+
159+
{{< tab "Spark SQL" >}}
160+
```sql
161+
-- Search for top-10 documents matching the query
162+
SELECT * FROM full_text_search('my_table', 'content', 'paimon lake format', 10);
163+
```
164+
{{< /tab >}}
165+
166+
{{< tab "Java API" >}}
167+
```java
168+
Table table = catalog.getTable(identifier);
169+
170+
// Step 1: Build and execute the full-text search
171+
GlobalIndexResult result = table.newFullTextSearchBuilder()
172+
.withQueryText("paimon lake format")
173+
.withLimit(10)
174+
.withTextColumn("content")
175+
.executeLocal();
176+
177+
// Step 2: Read matching rows using the search result
178+
ReadBuilder readBuilder = table.newReadBuilder();
179+
TableScan.Plan plan = readBuilder.newScan().withGlobalIndexResult(result).plan();
180+
try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) {
181+
reader.forEachRemaining(row -> {
182+
System.out.println("id=" + row.getInt(0) + ", content=" + row.getString(3));
183+
});
184+
}
185+
```
186+
{{< /tab >}}
187+
188+
{{< /tabs >}}
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
1+
# Paimon Tantivy Index
2+
3+
Full-text search global index for Apache Paimon, powered by [Tantivy](https://github.com/quickwit-oss/tantivy) (a Rust full-text search engine).
4+
5+
## Overview
6+
7+
This module provides full-text search capabilities for Paimon's Data Evolution (append) tables through the Global Index framework. It consists of two sub-modules:
8+
9+
- **paimon-tantivy-jni**: Rust/JNI bridge that wraps Tantivy's indexing and search APIs as native methods callable from Java.
10+
- **paimon-tantivy-index**: Java integration layer that implements Paimon's `GlobalIndexer` SPI, handling index building, archive packing, and query execution.
11+
12+
### Architecture
13+
14+
```
15+
┌─────────────────────────────────────────────────────┐
16+
│ Paimon Engine │
17+
│ (FullTextSearchBuilder / FullTextScan / FullTextRead)│
18+
└──────────────────────┬──────────────────────────────┘
19+
│ GlobalIndexer SPI
20+
┌──────────────────────▼──────────────────────────────┐
21+
│ paimon-tantivy-index │
22+
│ TantivyFullTextGlobalIndexWriter (build index) │
23+
│ TantivyFullTextGlobalIndexReader (search index) │
24+
└──────────────────────┬──────────────────────────────┘
25+
│ JNI
26+
┌──────────────────────▼──────────────────────────────┐
27+
│ paimon-tantivy-jni │
28+
│ TantivyIndexWriter (write docs via JNI) │
29+
│ TantivySearcher (search via JNI / stream I/O) │
30+
└──────────────────────┬──────────────────────────────┘
31+
│ FFI
32+
┌──────────────────────▼──────────────────────────────┐
33+
│ Rust (lib.rs + jni_directory.rs) │
34+
│ Tantivy index writer / reader / query parser │
35+
└─────────────────────────────────────────────────────┘
36+
```
37+
38+
### Index Schema
39+
40+
The Tantivy index uses a fixed two-field schema:
41+
42+
| Field | Tantivy Type | Description |
43+
|-----------|-------------|--------------------------------------------------|
44+
| `row_id` | u64 (stored, indexed) | Paimon's global row ID, used to map search results back to table rows |
45+
| `text` | TEXT (tokenized, indexed) | The text content from the indexed column |
46+
47+
## Archive File Format
48+
49+
The writer produces a **single archive file** that bundles all Tantivy segment files into one sequential stream. This format is designed to be stored on any Paimon-supported file system (HDFS, S3, OSS, etc.) and read back without extracting to local disk.
50+
51+
### Layout
52+
53+
All integers are **big-endian**.
54+
55+
```
56+
┌─────────────────────────────────────────────────┐
57+
│ File Count (4 bytes, int32) │
58+
├─────────────────────────────────────────────────┤
59+
│ File Entry 1 │
60+
│ ┌─────────────────────────────────────────────┐│
61+
│ │ Name Length (4 bytes, int32) ││
62+
│ │ Name (N bytes, UTF-8) ││
63+
│ │ Data Length (8 bytes, int64) ││
64+
│ │ Data (M bytes, raw) ││
65+
│ └─────────────────────────────────────────────┘│
66+
├─────────────────────────────────────────────────┤
67+
│ File Entry 2 │
68+
│ ┌─────────────────────────────────────────────┐│
69+
│ │ Name Length (4 bytes, int32) ││
70+
│ │ Name (N bytes, UTF-8) ││
71+
│ │ Data Length (8 bytes, int64) ││
72+
│ │ Data (M bytes, raw) ││
73+
│ └─────────────────────────────────────────────┘│
74+
├─────────────────────────────────────────────────┤
75+
│ ... │
76+
└─────────────────────────────────────────────────┘
77+
```
78+
79+
### Field Details
80+
81+
| Field | Size | Type | Description |
82+
|-------------|---------|--------|------------------------------------------------|
83+
| File Count | 4 bytes | int32 | Number of files in the archive |
84+
| Name Length | 4 bytes | int32 | Byte length of the file name |
85+
| Name | N bytes | UTF-8 | Tantivy index file name (e.g. `meta.json`, `*.term`, `*.pos`, `*.store`) |
86+
| Data Length | 8 bytes | int64 | Byte length of the file data |
87+
| Data | M bytes | raw | Raw file content |
88+
89+
### Write Path
90+
91+
1. `TantivyFullTextGlobalIndexWriter` receives text values via `write(Object)`, one per row.
92+
2. Each non-null text is passed to `TantivyIndexWriter` (JNI) as `addDocument(rowId, text)`, where `rowId` is a 0-based sequential counter.
93+
3. On `finish()`, the Tantivy index is committed and all files in the local temp directory are packed into the archive format above.
94+
4. The archive is written as a single file to Paimon's global index file system.
95+
5. The local temp directory is deleted.
96+
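The packing in step 3 can be sketched as follows. This is a minimal illustration of the byte layout only, not the module's actual writer: the class name `ArchivePackSketch` and the in-memory `Map` input are assumptions made for the example, whereas the real writer packs files from a local temp directory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: packs named byte blobs into the archive layout described above. */
final class ArchivePackSketch {

    static byte[] pack(Map<String, byte[]> files) {
        // First pass: compute the total archive size.
        int size = Integer.BYTES; // File Count
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            int nameLength = e.getKey().getBytes(StandardCharsets.UTF_8).length;
            size += Integer.BYTES + nameLength + Long.BYTES + e.getValue().length;
        }

        // Second pass: write every field big-endian, in layout order.
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.BIG_ENDIAN);
        buf.putInt(files.size());                       // File Count
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            buf.putInt(name.length);                    // Name Length
            buf.put(name);                              // Name
            buf.putLong(e.getValue().length);           // Data Length
            buf.put(e.getValue());                      // Data
        }
        return buf.array();
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("meta.json", "{}".getBytes(StandardCharsets.UTF_8));
        files.put("seg.term", new byte[] {1, 2, 3});
        System.out.println("archive size: " + pack(files).length + " bytes");
    }
}
```

Holding the whole archive in memory keeps the sketch short. For large indexes, a streaming writer such as `java.io.DataOutputStream`, which is also big-endian, would likely be a better fit.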
97+
### Read Path
98+
99+
1. `TantivyFullTextGlobalIndexReader` opens the archive file as a `SeekableInputStream`.
100+
2. The archive's file count and per-entry name and length fields are parsed to build a file layout table (name → offset, length).
101+
3. A `TantivySearcher` is created with the layout and a `StreamFileInput` callback. Tantivy reads file data on demand through JNI callbacks that call `seek()` and `read()` on the stream, so no temp files are created.
102+
4. Search queries are executed via Tantivy's `QueryParser` with BM25 scoring, returning `(rowId, score)` pairs.
103+
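Step 2 of the read path can be sketched as below. This is a hedged illustration, not the reader's real code: `ArchiveLayoutSketch` and `Range` are hypothetical names, and the archive is held in an in-memory `ByteBuffer` instead of a `SeekableInputStream`. Because the entries are stored inline, the scan reads each name and data length and then skips over the data bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds a (name -> offset, length) table from an archive. */
final class ArchiveLayoutSketch {

    /** Position and size of one packed file's data inside the archive. */
    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    static Map<String, Range> readLayout(ByteBuffer archive) {
        // Work on a duplicate so the caller's position is left untouched.
        ByteBuffer buf = archive.duplicate().order(ByteOrder.BIG_ENDIAN);
        int fileCount = buf.getInt();
        Map<String, Range> layout = new LinkedHashMap<>();
        for (int i = 0; i < fileCount; i++) {
            byte[] name = new byte[buf.getInt()];
            buf.get(name);
            long dataLength = buf.getLong();
            // The data starts at the current position; record it, then skip it.
            layout.put(new String(name, StandardCharsets.UTF_8),
                    new Range(buf.position(), dataLength));
            buf.position(buf.position() + Math.toIntExact(dataLength));
        }
        return layout;
    }

    public static void main(String[] args) {
        byte[] name = "meta.json".getBytes(StandardCharsets.UTF_8);
        byte[] data = "{}".getBytes(StandardCharsets.UTF_8);
        ByteBuffer archive = ByteBuffer.allocate(4 + 4 + name.length + 8 + data.length);
        archive.putInt(1).putInt(name.length).put(name).putLong(data.length).put(data);
        archive.flip();
        readLayout(archive).forEach((file, r) ->
                System.out.println(file + " -> offset=" + r.offset + ", length=" + r.length));
    }
}
```

With such a table, a read of file `name` at relative position `p` becomes a seek to `offset + p` on the underlying stream, which is what makes on-demand reads possible without unpacking the archive.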
104+
## Usage
105+
106+
### Build Index
107+
108+
```sql
109+
CALL sys.create_global_index(
110+
table => 'db.my_table',
111+
index_column => 'content',
112+
index_type => 'tantivy-fulltext'
113+
);
114+
```
115+
116+
### Search
117+
118+
```sql
119+
SELECT * FROM full_text_search('my_table', 'content', 'search query', 10);
120+
```
121+
122+
### Java API
123+
124+
```java
125+
Table table = catalog.getTable(identifier);
126+
127+
GlobalIndexResult result = table.newFullTextSearchBuilder()
128+
.withQueryText("search query")
129+
.withLimit(10)
130+
.withTextColumn("content")
131+
.executeLocal();
132+
133+
ReadBuilder readBuilder = table.newReadBuilder();
134+
TableScan.Plan plan = readBuilder.newScan()
135+
.withGlobalIndexResult(result).plan();
136+
try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) {
137+
reader.forEachRemaining(row -> System.out.println(row));
138+
}
139+
```
140+
141+
## SPI Registration
142+
143+
The index type `tantivy-fulltext` is registered via Java SPI:
144+
145+
```
146+
META-INF/services/org.apache.paimon.globalindex.GlobalIndexerFactory
147+
→ org.apache.paimon.tantivy.index.TantivyFullTextGlobalIndexerFactory
148+
```
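
For readers unfamiliar with Java SPI, the lookup behind such a registration can be sketched roughly as follows. The `IndexerFactory` interface and its `identifier()` method are hypothetical stand-ins invented for this example and are not Paimon's actual `GlobalIndexerFactory` API. Only `java.util.ServiceLoader` is a real JDK class.

```java
import java.util.Collections;
import java.util.Optional;
import java.util.ServiceLoader;

/** Hypothetical sketch of resolving an index type name to a factory via Java SPI. */
final class SpiLookupSketch {

    /** Stand-in for a factory SPI; NOT Paimon's real interface. */
    interface IndexerFactory {
        String identifier();
    }

    /** Returns the first factory whose identifier matches the requested index type. */
    static Optional<IndexerFactory> find(Iterable<? extends IndexerFactory> factories, String type) {
        for (IndexerFactory factory : factories) {
            if (factory.identifier().equals(type)) {
                return Optional.of(factory);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // ServiceLoader reads META-INF/services/<interface name> files on the classpath.
        // No provider file exists for this stand-in interface, so the lookup is empty here.
        System.out.println(find(ServiceLoader.load(IndexerFactory.class), "tantivy-fulltext").isPresent());

        // With a factory supplied directly, the same lookup succeeds.
        IndexerFactory tantivy = () -> "tantivy-fulltext";
        System.out.println(find(Collections.singletonList(tantivy), "tantivy-fulltext").isPresent());
    }
}
```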
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
1+
<?xml version="1.0" encoding="UTF-8"?>
2+
<!--
3+
Licensed to the Apache Software Foundation (ASF) under one
4+
or more contributor license agreements. See the NOTICE file
5+
distributed with this work for additional information
6+
regarding copyright ownership. The ASF licenses this file
7+
to you under the Apache License, Version 2.0 (the
8+
"License"); you may not use this file except in compliance
9+
with the License. You may obtain a copy of the License at
10+
11+
http://www.apache.org/licenses/LICENSE-2.0
12+
13+
Unless required by applicable law or agreed to in writing,
14+
software distributed under the License is distributed on an
15+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
16+
KIND, either express or implied. See the License for the
17+
specific language governing permissions and limitations
18+
under the License.
19+
-->
20+
<project xmlns="http://maven.apache.org/POM/4.0.0"
21+
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
22+
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
23+
<modelVersion>4.0.0</modelVersion>
24+
25+
<parent>
26+
<artifactId>paimon-tantivy</artifactId>
27+
<groupId>org.apache.paimon</groupId>
28+
<version>1.5-SNAPSHOT</version>
29+
</parent>
30+
31+
<artifactId>paimon-tantivy-index</artifactId>
32+
<name>Paimon : Tantivy Index</name>
33+
34+
<dependencies>
35+
<dependency>
36+
<groupId>org.apache.paimon</groupId>
37+
<artifactId>paimon-tantivy-jni</artifactId>
38+
<version>${project.version}</version>
39+
</dependency>
40+
41+
<dependency>
42+
<groupId>org.apache.paimon</groupId>
43+
<artifactId>paimon-common</artifactId>
44+
<version>${project.version}</version>
45+
<scope>provided</scope>
46+
</dependency>
47+
48+
<!-- test dependencies -->
49+
<dependency>
50+
<groupId>org.junit.jupiter</groupId>
51+
<artifactId>junit-jupiter</artifactId>
52+
<version>${junit5.version}</version>
53+
<scope>test</scope>
54+
</dependency>
55+
56+
<dependency>
57+
<groupId>org.apache.paimon</groupId>
58+
<artifactId>paimon-core</artifactId>
59+
<version>${project.version}</version>
60+
<scope>test</scope>
61+
</dependency>
62+
63+
<dependency>
64+
<groupId>org.apache.paimon</groupId>
65+
<artifactId>paimon-format</artifactId>
66+
<version>${project.version}</version>
67+
<scope>test</scope>
68+
</dependency>
69+
70+
<dependency>
71+
<groupId>org.apache.paimon</groupId>
72+
<artifactId>paimon-test-utils</artifactId>
73+
<version>${project.version}</version>
74+
<scope>test</scope>
75+
</dependency>
76+
</dependencies>
77+
78+
<build>
79+
<plugins>
80+
<plugin>
81+
<groupId>org.apache.maven.plugins</groupId>
82+
<artifactId>maven-surefire-plugin</artifactId>
83+
<configuration>
84+
<forkCount>1</forkCount>
85+
<redirectTestOutputToFile>true</redirectTestOutputToFile>
86+
<parallel>none</parallel>
87+
</configuration>
88+
</plugin>
89+
</plugins>
90+
</build>
91+
</project>
