| 1 | +# Paimon Tantivy Index |
| 2 | + |
| 3 | +Full-text search global index for Apache Paimon, powered by [Tantivy](https://github.com/quickwit-oss/tantivy) (a Rust full-text search engine). |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +This module provides full-text search capabilities for Paimon's Data Evolution (append) tables through the Global Index framework. It consists of two sub-modules: |
| 8 | + |
| 9 | +- **paimon-tantivy-jni**: Rust/JNI bridge that wraps Tantivy's indexing and search APIs as native methods callable from Java. |
| 10 | +- **paimon-tantivy-index**: Java integration layer that implements Paimon's `GlobalIndexer` SPI, handling index building, archive packing, and query execution. |
| 11 | + |
| 12 | +### Architecture |
| 13 | + |
| 14 | +``` |
| 15 | +┌─────────────────────────────────────────────────────┐ |
| 16 | +│ Paimon Engine │ |
| 17 | +│ (FullTextSearchBuilder / FullTextScan / FullTextRead)│ |
| 18 | +└──────────────────────┬──────────────────────────────┘ |
| 19 | +                       │ GlobalIndexer SPI |
| 20 | +┌──────────────────────▼──────────────────────────────┐ |
| 21 | +│ paimon-tantivy-index │ |
| 22 | +│ TantivyFullTextGlobalIndexWriter (build index) │ |
| 23 | +│ TantivyFullTextGlobalIndexReader (search index) │ |
| 24 | +└──────────────────────┬──────────────────────────────┘ |
| 25 | +                       │ JNI |
| 26 | +┌──────────────────────▼──────────────────────────────┐ |
| 27 | +│ paimon-tantivy-jni │ |
| 28 | +│ TantivyIndexWriter (write docs via JNI) │ |
| 29 | +│ TantivySearcher (search via JNI / stream I/O) │ |
| 30 | +└──────────────────────┬──────────────────────────────┘ |
| 31 | +                       │ FFI |
| 32 | +┌──────────────────────▼──────────────────────────────┐ |
| 33 | +│ Rust (lib.rs + jni_directory.rs) │ |
| 34 | +│ Tantivy index writer / reader / query parser │ |
| 35 | +└─────────────────────────────────────────────────────┘ |
| 36 | +``` |
| 37 | + |
| 38 | +### Index Schema |
| 39 | + |
| 40 | +The Tantivy index uses a fixed two-field schema: |
| 41 | + |
| 42 | +| Field | Tantivy Type | Description | |
| 43 | +|-----------|-------------|--------------------------------------------------| |
| 44 | +| `row_id` | u64 (stored, indexed) | Paimon's global row ID, used to map search results back to table rows | |
| 45 | +| `text` | TEXT (tokenized, indexed) | The text content from the indexed column | |
| 46 | + |
| 47 | +## Archive File Format |
| 48 | + |
| 49 | +The writer produces a **single archive file** that bundles all Tantivy segment files into one sequential stream. This format is designed to be stored on any Paimon-supported file system (HDFS, S3, OSS, etc.) and read back without extracting to local disk. |
| 50 | + |
| 51 | +### Layout |
| 52 | + |
| 53 | +All integers are **big-endian**. |
| 54 | + |
| 55 | +``` |
| 56 | +┌─────────────────────────────────────────────────┐ |
| 57 | +│ File Count (4 bytes, int32) │ |
| 58 | +├─────────────────────────────────────────────────┤ |
| 59 | +│ File Entry 1 │ |
| 60 | +│ ┌─────────────────────────────────────────────┐│ |
| 61 | +│ │ Name Length (4 bytes, int32) ││ |
| 62 | +│ │ Name (N bytes, UTF-8) ││ |
| 63 | +│ │ Data Length (8 bytes, int64) ││ |
| 64 | +│ │ Data (M bytes, raw) ││ |
| 65 | +│ └─────────────────────────────────────────────┘│ |
| 66 | +├─────────────────────────────────────────────────┤ |
| 67 | +│ File Entry 2 │ |
| 68 | +│ ┌─────────────────────────────────────────────┐│ |
| 69 | +│ │ Name Length (4 bytes, int32) ││ |
| 70 | +│ │ Name (N bytes, UTF-8) ││ |
| 71 | +│ │ Data Length (8 bytes, int64) ││ |
| 72 | +│ │ Data (M bytes, raw) ││ |
| 73 | +│ └─────────────────────────────────────────────┘│ |
| 74 | +├─────────────────────────────────────────────────┤ |
| 75 | +│ ... │ |
| 76 | +└─────────────────────────────────────────────────┘ |
| 77 | +``` |
| 78 | + |
| 79 | +### Field Details |
| 80 | + |
| 81 | +| Field | Size | Type | Description | |
| 82 | +|-------------|---------|--------|------------------------------------------------| |
| 83 | +| File Count | 4 bytes | int32 | Number of files in the archive | |
| 84 | +| Name Length | 4 bytes | int32 | Byte length of the file name | |
| 85 | +| Name | N bytes | UTF-8 | Tantivy segment file name (e.g. `meta.json`, `*.term`, `*.pos`, `*.store`) | |
| 86 | +| Data Length | 8 bytes | int64 | Byte length of the file data | |
| 87 | +| Data | M bytes | raw | Raw file content | |
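
As a worked example of the layout, consider an archive holding a single file. The file name `a` and its one-byte content `0xFF` are hypothetical values chosen only to keep the dump short. Applying the field sizes from the table gives 4 + 4 + 1 + 8 + 1 = 18 bytes:

```
00 00 00 01                  File Count  = 1
00 00 00 01                  Name Length = 1
61                           Name        = "a" (UTF-8)
00 00 00 00 00 00 00 01      Data Length = 1
FF                           Data
```

The data of the first entry therefore starts at byte offset 17, which is the kind of value a reader needs when it builds its name-to-offset table.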
| 88 | + |
| 89 | +### Write Path |
| 90 | + |
| 91 | +1. `TantivyFullTextGlobalIndexWriter` receives text values via `write(Object)`, one per row. |
| 92 | +2. Each non-null text is passed to `TantivyIndexWriter` (JNI) as `addDocument(rowId, text)`, where `rowId` is a 0-based sequential counter. |
| 93 | +3. On `finish()`, the Tantivy index is committed and all files in the local temp directory are packed into the archive format above. |
| 94 | +4. The archive is written as a single file to Paimon's global index file system. |
| 95 | +5. The local temp directory is deleted. |
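
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the class `PackDirectorySketch` and the method `packAndDelete` are hypothetical names, the target is a plain `OutputStream` rather than Paimon's file system API, the directory is assumed to be flat, and sorting the files by name is an arbitrary choice made for determinism.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: packs a flat index directory into the archive layout. */
final class PackDirectorySketch {

    static void packAndDelete(Path indexDir, OutputStream target) {
        try {
            List<Path> files;
            try (Stream<Path> listing = Files.list(indexDir)) {
                // Assumption: only regular files, ordered by name.
                files = listing.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
            }
            // DataOutputStream writes int and long values big-endian.
            DataOutputStream out = new DataOutputStream(target);
            out.writeInt(files.size());                    // File Count
            for (Path file : files) {
                byte[] name = file.getFileName().toString()
                        .getBytes(StandardCharsets.UTF_8);
                out.writeInt(name.length);                 // Name Length
                out.write(name);                           // Name
                out.writeLong(Files.size(file));           // Data Length
                Files.copy(file, out);                     // Data, streamed from disk
            }
            out.flush();
            // Clean up the local temp directory once the archive is written.
            for (Path file : files) {
                Files.delete(file);
            }
            Files.delete(indexDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Streaming each file with `Files.copy` avoids loading whole index files into memory, which matters because the `Data Length` field is an int64 and individual files may be large.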
| 96 | + |
| 97 | +### Read Path |
| 98 | + |
| 99 | +1. `TantivyFullTextGlobalIndexReader` opens the archive file as a `SeekableInputStream`. |
| 100 | +2. The archive is scanned entry by entry, reading each name and data length and skipping over the data, to build a file layout table (name → offset, length). |
| 101 | +3. A `TantivySearcher` is created with the layout and a `StreamFileInput` callback — Tantivy reads file data on demand via JNI callbacks to `seek()` + `read()` on the stream. No temp files are created. |
| 102 | +4. Search queries are executed via Tantivy's `QueryParser` with BM25 scoring, returning `(rowId, score)` pairs. |
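
The layout table from step 2 and the on-demand range reads from step 3 can be sketched as below. This is only an illustration under assumptions: `ArchiveLayoutSketch`, `Extent`, `parseLayout`, and `readRange` are hypothetical names, and an in-memory `ByteBuffer` stands in for the `SeekableInputStream`, with `position(...)` playing the role of `seek()` and `get(...)` the role of `read()`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds a name -> (offset, length) table and reads ranges. */
final class ArchiveLayoutSketch {

    /** Where one packed file's data lives inside the archive. */
    static final class Extent {
        final long offset;
        final long length;

        Extent(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    static Map<String, Extent> parseLayout(ByteBuffer archive) {
        ByteBuffer buf = archive.duplicate().order(ByteOrder.BIG_ENDIAN);
        buf.rewind();
        int fileCount = buf.getInt();
        Map<String, Extent> layout = new LinkedHashMap<>();
        for (int i = 0; i < fileCount; i++) {
            byte[] name = new byte[buf.getInt()];
            buf.get(name);
            long dataLength = buf.getLong();
            // The data starts right after the Data Length field.
            layout.put(new String(name, StandardCharsets.UTF_8),
                    new Extent(buf.position(), dataLength));
            // Skip the data itself; only its position is recorded.
            buf.position(Math.toIntExact(buf.position() + dataLength));
        }
        return layout;
    }

    /** Reads {@code length} bytes starting at {@code offsetInFile} of one packed file. */
    static byte[] readRange(ByteBuffer archive, Extent file, long offsetInFile, int length) {
        if (offsetInFile < 0 || length < 0 || offsetInFile + length > file.length) {
            throw new IllegalArgumentException("range outside packed file");
        }
        ByteBuffer buf = archive.duplicate();
        buf.position(Math.toIntExact(file.offset + offsetInFile)); // "seek"
        byte[] out = new byte[length];
        buf.get(out);                                              // "read"
        return out;
    }
}
```

Because the table stores absolute offsets, each read translates a (file, offset-in-file) request into a single seek on the archive, which is why no extraction to temp files is needed.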
| 103 | + |
| 104 | +## Usage |
| 105 | + |
| 106 | +### Build Index |
| 107 | + |
| 108 | +```sql |
| 109 | +CALL sys.create_global_index( |
| 110 | + table => 'db.my_table', |
| 111 | + index_column => 'content', |
| 112 | + index_type => 'tantivy-fulltext' |
| 113 | +); |
| 114 | +``` |
| 115 | + |
| 116 | +### Search |
| 117 | + |
| 118 | +```sql |
| 119 | +SELECT * FROM full_text_search('my_table', 'content', 'search query', 10); |
| 120 | +``` |
| 121 | + |
| 122 | +### Java API |
| 123 | + |
| 124 | +```java |
| 125 | +Table table = catalog.getTable(identifier); |
| 126 | + |
| 127 | +GlobalIndexResult result = table.newFullTextSearchBuilder() |
| 128 | + .withQueryText("search query") |
| 129 | + .withLimit(10) |
| 130 | + .withTextColumn("content") |
| 131 | + .executeLocal(); |
| 132 | + |
| 133 | +ReadBuilder readBuilder = table.newReadBuilder(); |
| 134 | +TableScan.Plan plan = readBuilder.newScan() |
| 135 | + .withGlobalIndexResult(result).plan(); |
| 136 | +try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) { |
| 137 | + reader.forEachRemaining(row -> System.out.println(row)); |
| 138 | +} |
| 139 | +``` |
| 140 | + |
| 141 | +## SPI Registration |
| 142 | + |
| 143 | +The index type `tantivy-fulltext` is registered via Java SPI: |
| 144 | + |
| 145 | +``` |
| 146 | +META-INF/services/org.apache.paimon.globalindex.GlobalIndexerFactory |
| 147 | + → org.apache.paimon.tantivy.index.TantivyFullTextGlobalIndexerFactory |
| 148 | +``` |