Commit 4f5b984

[python] Introduce Full Text Search and Tantivy index in Python
1 parent 7275b0a commit 4f5b984

26 files changed, +1557 -5 lines changed
.github/workflows/paimon-python-checks.yml

Lines changed: 25 additions & 0 deletions
@@ -79,6 +79,20 @@ jobs:
           java -version
           mvn -version

+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Build Tantivy native library
+        run: |
+          cd paimon-tantivy/paimon-tantivy-jni/rust
+          cargo build --release
+
+      - name: Copy Tantivy native library to resources
+        run: |
+          RESOURCE_DIR=paimon-tantivy/paimon-tantivy-jni/src/main/resources/native/linux-amd64
+          mkdir -p ${RESOURCE_DIR}
+          cp paimon-tantivy/paimon-tantivy-jni/rust/target/release/libtantivy_jni.so ${RESOURCE_DIR}/
+
       - name: Verify Python version
         run: python --version

@@ -118,6 +132,17 @@ jobs:
             fi
           fi
           df -h
+
+      - name: Build and install tantivy-py from source
+        if: matrix.python-version != '3.6.15'
+        shell: bash
+        run: |
+          pip install maturin
+          git clone -b support_directory https://github.com/JingsongLi/tantivy-py.git /tmp/tantivy-py
+          cd /tmp/tantivy-py
+          maturin build --release
+          pip install target/wheels/tantivy-*.whl
+
       - name: Run lint-python.sh
         shell: bash
         run: |

docs/content/append-table/global-index.md

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -211,4 +211,25 @@ try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)
211211
```
212212
{{< /tab >}}
213213

214+
{{< tab "Python SDK" >}}
215+
```python
216+
table = catalog.get_table('db.my_table')
217+
218+
# Step 1: Build full-text search
219+
builder = table.new_full_text_search_builder()
220+
builder.with_text_column('content')
221+
builder.with_query_text('paimon lake format')
222+
builder.with_limit(10)
223+
result = builder.execute_local()
224+
225+
# Step 2: Read matching rows using the search result
226+
read_builder = table.new_read_builder()
227+
scan = read_builder.new_scan().with_global_index_result(result)
228+
plan = scan.plan()
229+
table_read = read_builder.new_read()
230+
pa_table = table_read.to_arrow(plan.splits())
231+
print(pa_table)
232+
```
233+
{{< /tab >}}
234+
214235
{{< /tabs >}}
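
The two steps in the Python SDK tab can be folded into one helper. The following is a minimal sketch and not part of the commit: `search_rows` is a hypothetical name, `table` is assumed to be the object returned by `catalog.get_table(...)`, and only builder and reader methods that appear in this diff are used (`is_empty()` is taken from the CLI code in `cli_table.py`).

```python
def search_rows(table, column, query, limit=10):
    """Hypothetical helper: run a full-text search, then read the matching rows.

    Returns None when the index reports no matches, otherwise whatever
    `to_arrow` returns for the planned splits.
    """
    # Step 1: evaluate the query against the full-text index.
    builder = table.new_full_text_search_builder()
    builder.with_text_column(column)
    builder.with_query_text(query)
    builder.with_limit(limit)
    result = builder.execute_local()

    # Skip the scan entirely when nothing matched.
    if result.is_empty():
        return None

    # Step 2: plan a scan restricted to the rows named by the index result.
    read_builder = table.new_read_builder()
    scan = read_builder.new_scan().with_global_index_result(result)
    plan = scan.plan()
    return read_builder.new_read().to_arrow(plan.splits())
```

Returning early on an empty result mirrors what the CLI command does before it builds a scan.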

docs/content/pypaimon/cli.md

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -362,6 +362,48 @@ Table 'mydb.old_name' renamed to 'mydb.new_name' successfully.
362362

363363
**Note:** Both filesystem and REST catalogs support table rename. For filesystem catalogs, the rename is performed by renaming the underlying table directory.
364364

365+
### Table Full-Text Search
366+
367+
Perform full-text search on a Paimon table with a Tantivy full-text index and display matching rows.
368+
369+
```shell
370+
paimon table full-text-search mydb.articles --column content --query "paimon lake"
371+
```
372+
373+
**Options:**
374+
375+
- `--column, -c`: Text column to search on - **Required**
376+
- `--query, -q`: Query text to search for - **Required**
377+
- `--limit, -l`: Maximum number of results to return (default: 10)
378+
- `--select, -s`: Select specific columns to display (comma-separated)
379+
- `--format, -f`: Output format: `table` (default) or `json`
380+
381+
**Examples:**
382+
383+
```shell
384+
# Basic full-text search
385+
paimon table full-text-search mydb.articles -c content -q "paimon lake"
386+
387+
# Search with limit
388+
paimon table full-text-search mydb.articles -c content -q "streaming data" -l 20
389+
390+
# Search with column projection
391+
paimon table full-text-search mydb.articles -c content -q "paimon" -s "id,title,content"
392+
393+
# Output as JSON
394+
paimon table full-text-search mydb.articles -c content -q "paimon" -f json
395+
```
396+
397+
Output:
398+
```
399+
id content
400+
0 Apache Paimon is a streaming data lake platform
401+
2 Paimon supports real-time data ingestion and...
402+
4 Data lake platforms like Paimon handle large-...
403+
```
404+
405+
**Note:** The table must have a Tantivy full-text index built on the target column. See [Global Index]({{< ref "append-table/global-index" >}}) for how to create full-text indexes.
406+
365407
### Table Drop
366408

367409
Drop a table from the catalog. This will permanently delete the table and all its data.
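
The `--format json` option of `table full-text-search` serializes one object per matching row. A stdlib-only sketch of that serialization follows, assuming the rows are already available as a list of dictionaries (the CLI code in this commit obtains them from a pandas DataFrame via `to_dict(orient='records')`). The `records` value is made-up sample data, not output from a real table.

```python
import json

# Made-up sample rows standing in for df.to_dict(orient='records').
records = [
    {"id": 0, "content": "Apache Paimon is a streaming data lake platform"},
    {"id": 2, "content": "Paimon 支持实时数据写入"},
]

# ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes.
output = json.dumps(records, ensure_ascii=False)
print(output)
```

The result is a single JSON array on one line, which is convenient to pipe into other tools.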

paimon-python/dev/run_mixed_tests.sh

Lines changed: 50 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -138,9 +138,6 @@ run_java_read_test() {
138138

139139
cd "$PROJECT_ROOT"
140140

141-
PYTHON_VERSION=$(python -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')" 2>/dev/null || echo "unknown")
142-
echo "Detected Python version: $PYTHON_VERSION"
143-
144141
# Run Java test for Parquet/Orc/Avro format in paimon-core
145142
echo "Running Maven test for JavaPyE2ETest.testReadPkTable (Java Read Parquet/Orc/Avro)..."
146143
echo "Note: Maven may download dependencies on first run, this may take a while..."
@@ -171,6 +168,7 @@ run_java_read_test() {
171168
return 1
172169
fi
173170
}
171+
174172
run_pk_dv_test() {
175173
echo -e "${YELLOW}=== Step 5: Running Primary Key & Deletion Vector Test (testPKDeletionVectorWriteRead) ===${NC}"
176174

@@ -244,6 +242,30 @@ run_compressed_text_test() {
244242
fi
245243
}
246244

245+
# Function to run Tantivy full-text index test (Java write index, Python read and search)
246+
run_tantivy_fulltext_test() {
247+
echo -e "${YELLOW}=== Step 8: Running Tantivy Full-Text Index Test (Java Write, Python Read) ===${NC}"
248+
249+
cd "$PROJECT_ROOT"
250+
251+
echo "Running Maven test for JavaPyTantivyE2ETest.testTantivyFullTextIndexWrite..."
252+
if mvn test -Dtest=org.apache.paimon.tantivy.index.JavaPyTantivyE2ETest#testTantivyFullTextIndexWrite -pl paimon-tantivy/paimon-tantivy-index -q -Drun.e2e.tests=true; then
253+
echo -e "${GREEN}✓ Java test completed successfully${NC}"
254+
else
255+
echo -e "${RED}✗ Java test failed${NC}"
256+
return 1
257+
fi
258+
cd "$PAIMON_PYTHON_DIR"
259+
echo "Running Python test for JavaPyReadWriteTest.test_read_tantivy_full_text_index..."
260+
if python -m pytest java_py_read_write_test.py::JavaPyReadWriteTest::test_read_tantivy_full_text_index -v; then
261+
echo -e "${GREEN}✓ Python test completed successfully${NC}"
262+
return 0
263+
else
264+
echo -e "${RED}✗ Python test failed${NC}"
265+
return 1
266+
fi
267+
}
268+
247269
# Main execution
248270
main() {
249271
local java_write_result=0
@@ -253,6 +275,12 @@ main() {
253275
local pk_dv_result=0
254276
local btree_index_result=0
255277
local compressed_text_result=0
278+
local tantivy_fulltext_result=0
279+
280+
# Detect Python version
281+
PYTHON_VERSION=$(python -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')" 2>/dev/null || echo "unknown")
282+
PYTHON_MINOR=$(python -c "import sys; print(sys.version_info.minor)" 2>/dev/null || echo "0")
283+
echo "Detected Python version: $PYTHON_VERSION"
256284

257285
echo -e "${YELLOW}Starting mixed language test execution...${NC}"
258286
echo ""
@@ -311,6 +339,18 @@ main() {
311339

312340
echo ""
313341

342+
# Run Tantivy full-text index test (requires Python >= 3.10)
343+
if [[ "$PYTHON_MINOR" -ge 10 ]]; then
344+
if ! run_tantivy_fulltext_test; then
345+
tantivy_fulltext_result=1
346+
fi
347+
else
348+
echo -e "${YELLOW}⏭ Skipping Tantivy Full-Text Index Test (requires Python >= 3.10, current: $PYTHON_VERSION)${NC}"
349+
tantivy_fulltext_result=0
350+
fi
351+
352+
echo ""
353+
314354
echo -e "${YELLOW}=== Test Results Summary ===${NC}"
315355

316356
if [[ $java_write_result -eq 0 ]]; then
@@ -355,12 +395,18 @@ main() {
355395
echo -e "${RED}✗ Compressed Text Test (Java Write, Python Read): FAILED${NC}"
356396
fi
357397

398+
if [[ $tantivy_fulltext_result -eq 0 ]]; then
399+
echo -e "${GREEN}✓ Tantivy Full-Text Index Test (Java Write, Python Read): PASSED${NC}"
400+
else
401+
echo -e "${RED}✗ Tantivy Full-Text Index Test (Java Write, Python Read): FAILED${NC}"
402+
fi
403+
358404
echo ""
359405

360406
# Clean up warehouse directory after all tests
361407
cleanup_warehouse
362408

363-
if [[ $java_write_result -eq 0 && $python_read_result -eq 0 && $python_write_result -eq 0 && $java_read_result -eq 0 && $pk_dv_result -eq 0 && $btree_index_result -eq 0 && $compressed_text_result -eq 0 ]]; then
409+
if [[ $java_write_result -eq 0 && $python_read_result -eq 0 && $python_write_result -eq 0 && $java_read_result -eq 0 && $pk_dv_result -eq 0 && $btree_index_result -eq 0 && $compressed_text_result -eq 0 && $tantivy_fulltext_result -eq 0 ]]; then
364410
echo -e "${GREEN}🎉 All tests passed! Java-Python interoperability verified.${NC}"
365411
return 0
366412
else
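
The script gates the Tantivy test on the Python minor version alone (`"$PYTHON_MINOR" -ge 10`), which is sufficient as long as the major version is 3. A sketch of the same gate in Python, comparing the full `(major, minor)` pair, is shown below. `meets_minimum` is a hypothetical helper used for illustration, not something the commit adds.

```python
import sys


def meets_minimum(version_info=None, minimum=(3, 10)):
    """Return True when the interpreter version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Tuple comparison checks the major version first, then the minor.
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum():
    print("Skipping Tantivy full-text index test (requires Python >= 3.10)")
```

Comparing tuples avoids treating a hypothetical 2.11 interpreter as new enough, which a minor-only check would do.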

paimon-python/pypaimon/cli/cli_table.py

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -147,6 +147,83 @@ def cmd_table_read(args):
147147
print(df.to_string(index=False))
148148

149149

150+
def cmd_table_full_text_search(args):
151+
"""
152+
Execute the 'table full-text-search' command.
153+
154+
Performs full-text search on a Paimon table and displays matching rows.
155+
156+
Args:
157+
args: Parsed command line arguments.
158+
"""
159+
from pypaimon.cli.cli import load_catalog_config, create_catalog
160+
161+
config_path = args.config
162+
config = load_catalog_config(config_path)
163+
catalog = create_catalog(config)
164+
165+
table_identifier = args.table
166+
parts = table_identifier.split('.')
167+
if len(parts) != 2:
168+
print(f"Error: Invalid table identifier '{table_identifier}'. "
169+
f"Expected format: 'database.table'", file=sys.stderr)
170+
sys.exit(1)
171+
172+
database_name, table_name = parts
173+
174+
try:
175+
table = catalog.get_table(f"{database_name}.{table_name}")
176+
except Exception as e:
177+
print(f"Error: Failed to get table '{table_identifier}': {e}", file=sys.stderr)
178+
sys.exit(1)
179+
180+
# Build full-text search
181+
text_column = args.column
182+
query_text = args.query
183+
limit = args.limit
184+
185+
try:
186+
builder = table.new_full_text_search_builder()
187+
builder.with_text_column(text_column)
188+
builder.with_query_text(query_text)
189+
builder.with_limit(limit)
190+
result = builder.execute_local()
191+
except Exception as e:
192+
print(f"Error: Full-text search failed: {e}", file=sys.stderr)
193+
sys.exit(1)
194+
195+
if result.is_empty():
196+
print("No matching rows found.")
197+
return
198+
199+
# Read matching rows using global index result
200+
read_builder = table.new_read_builder()
201+
202+
select_columns = args.select
203+
if select_columns:
204+
projection = [col.strip() for col in select_columns.split(',')]
205+
available_fields = set(field.name for field in table.table_schema.fields)
206+
invalid_columns = [col for col in projection if col not in available_fields]
207+
if invalid_columns:
208+
print(f"Error: Column(s) {invalid_columns} do not exist in table '{table_identifier}'.",
209+
file=sys.stderr)
210+
sys.exit(1)
211+
read_builder = read_builder.with_projection(projection)
212+
213+
scan = read_builder.new_scan().with_global_index_result(result)
214+
plan = scan.plan()
215+
splits = plan.splits()
216+
read = read_builder.new_read()
217+
df = read.to_pandas(splits)
218+
219+
output_format = getattr(args, 'format', 'table')
220+
if output_format == 'json':
221+
import json
222+
print(json.dumps(df.to_dict(orient='records'), ensure_ascii=False))
223+
else:
224+
print(df.to_string(index=False))
225+
226+
150227
def cmd_table_get(args):
151228
"""
152229
Execute the 'table get' command.
@@ -773,6 +850,43 @@ def add_table_subcommands(table_parser):
773850

774851
# table rename command
775852
rename_parser = table_subparsers.add_parser('rename', help='Rename a table')
853+
854+
# table full-text-search command
855+
fts_parser = table_subparsers.add_parser('full-text-search', help='Full-text search on a table')
856+
fts_parser.add_argument(
857+
'table',
858+
help='Table identifier in format: database.table'
859+
)
860+
fts_parser.add_argument(
861+
'--column', '-c',
862+
required=True,
863+
help='Text column to search on'
864+
)
865+
fts_parser.add_argument(
866+
'--query', '-q',
867+
required=True,
868+
help='Query text to search for'
869+
)
870+
fts_parser.add_argument(
871+
'--limit', '-l',
872+
type=int,
873+
default=10,
874+
help='Maximum number of results to return (default: 10)'
875+
)
876+
fts_parser.add_argument(
877+
'--select', '-s',
878+
type=str,
879+
default=None,
880+
help='Select specific columns to display (comma-separated, e.g., "id,name,content")'
881+
)
882+
fts_parser.add_argument(
883+
'--format', '-f',
884+
type=str,
885+
choices=['table', 'json'],
886+
default='table',
887+
help='Output format: table (default) or json'
888+
)
889+
fts_parser.set_defaults(func=cmd_table_full_text_search)
776890
rename_parser.add_argument(
777891
'table',
778892
help='Source table identifier in format: database.table'
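
To see how the new subcommand's arguments resolve, the parser definition can be exercised on its own. The sketch below rebuilds only the `full-text-search` arguments from this hunk on a standalone `argparse` parser. The `prog` name and the sample command line are illustrative assumptions; in the commit the parser is registered under the `table` subcommand.

```python
import argparse

# Standalone stand-in for the `paimon table full-text-search` subparser.
parser = argparse.ArgumentParser(prog='full-text-search')
parser.add_argument('table', help='Table identifier in format: database.table')
parser.add_argument('--column', '-c', required=True)
parser.add_argument('--query', '-q', required=True)
parser.add_argument('--limit', '-l', type=int, default=10)
parser.add_argument('--select', '-s', type=str, default=None)
parser.add_argument('--format', '-f', type=str, choices=['table', 'json'], default='table')

# Illustrative command line; note the space after the comma in --select.
args = parser.parse_args(
    ['mydb.articles', '-c', 'content', '-q', 'paimon lake', '-s', 'id, title']
)

# Same identifier split and projection parsing as cmd_table_full_text_search.
parts = args.table.split('.')
projection = [col.strip() for col in args.select.split(',')] if args.select else None

print(parts, args.limit, args.format, projection)
```

Because each projected name is stripped, `-s "id, title"` and `-s "id,title"` yield the same column list.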

paimon-python/pypaimon/globalindex/__init__.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@
1919
from pypaimon.globalindex.global_index_result import GlobalIndexResult
2020
from pypaimon.globalindex.global_index_reader import GlobalIndexReader, FieldRef
2121
from pypaimon.globalindex.vector_search import VectorSearch
22+
from pypaimon.globalindex.full_text_search import FullTextSearch
2223
from pypaimon.globalindex.vector_search_result import (
2324
ScoredGlobalIndexResult,
2425
DictBasedScoredIndexResult,
@@ -27,19 +28,22 @@
2728
from pypaimon.globalindex.global_index_meta import GlobalIndexMeta, GlobalIndexIOMeta
2829
from pypaimon.globalindex.global_index_evaluator import GlobalIndexEvaluator
2930
from pypaimon.globalindex.global_index_scanner import GlobalIndexScanner
31+
from pypaimon.globalindex.offset_global_index_reader import OffsetGlobalIndexReader
3032
from pypaimon.utils.range import Range
3133

3234
__all__ = [
3335
'GlobalIndexResult',
3436
'GlobalIndexReader',
3537
'FieldRef',
3638
'VectorSearch',
39+
'FullTextSearch',
3740
'ScoredGlobalIndexResult',
3841
'DictBasedScoredIndexResult',
3942
'ScoreGetter',
4043
'GlobalIndexMeta',
4144
'GlobalIndexIOMeta',
4245
'GlobalIndexEvaluator',
4346
'GlobalIndexScanner',
47+
'OffsetGlobalIndexReader',
4448
'Range',
4549
]
