Commit f3d1c4a

[python] Add block-level local disk cache for file reads
Introduce a CachingFileIO wrapper that transparently caches remote file reads at block granularity on local disk. Files are classified by FileType (ported from Java) and only META, GLOBAL_INDEX, BUCKET_INDEX types are cached; DATA and FILE_INDEX are read directly. Enable via table.copy({"file-cache.enabled": "true"}).
1 parent ec15528 commit f3d1c4a

8 files changed

Lines changed: 1136 additions & 1 deletion

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
---
title: "Local Disk Cache"
weight: 7
type: docs
aliases:
- /pypaimon/file-cache.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Local Disk Cache

When reading files from remote storage (S3, OSS, HDFS, etc.), each seek+read goes over the network. PyPaimon provides a block-level local disk cache that transparently caches file reads on local disk, significantly reducing remote I/O for repeated access patterns.

## Cached File Types

The cache automatically classifies files by type and only caches the following:

| File Type | Examples | Cached |
|-----------|----------|--------|
| META | snapshot, schema, manifest, statistics, tag | Yes |
| GLOBAL_INDEX | BTree, Lumina, Tantivy index files | Yes |
| BUCKET_INDEX | Hash, deletion vector index files | Yes |
| DATA | Data files (ORC, Parquet, etc.) | No |
| FILE_INDEX | Data-file level bloom filter, bitmap | No |

Data files and file-level index files are typically large and accessed sequentially, so they are read directly without caching.

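The caching decision in the table above can be sketched as a simple membership test. This is an illustration only: the enum member names come from the table, but the enum values and the `should_cache` helper are hypothetical and are not taken from the PyPaimon source.

```python
from enum import Enum


class FileType(Enum):
    """File categories named in the table above (string values are hypothetical)."""
    META = "meta"
    GLOBAL_INDEX = "global-index"
    BUCKET_INDEX = "bucket-index"
    DATA = "data"
    FILE_INDEX = "file-index"


# Only these categories go through the block cache; the rest are read directly.
CACHED_TYPES = frozenset(
    {FileType.META, FileType.GLOBAL_INDEX, FileType.BUCKET_INDEX}
)


def should_cache(file_type: FileType) -> bool:
    """Hypothetical helper: True if reads of this file type should be cached."""
    return file_type in CACHED_TYPES
```

Keeping the cached set in one place means a wrapper would only need a single check per opened file before choosing the cached or the direct read path.
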
## Enable Cache

Use `table.copy()` to pass cache options as dynamic parameters:

```python
table = catalog.get_table("db.my_table")

# Enable cache with dynamic options
table = table.copy({
    "file-cache.enabled": "true",
    # optional: customize cache directory and limits
    "file-cache.dir": "/tmp/paimon-file-cache",
    "file-cache.max-size": "2gb",
    "file-cache.block-size": "1mb",
})

# All subsequent reads on this table instance will use the cache
```

## Cache Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `file-cache.enabled` | Boolean | false | Whether to enable local disk block cache. |
| `file-cache.dir` | String | `<tmpdir>/paimon-file-cache` | Directory for storing cached blocks. |
| `file-cache.max-size` | MemorySize | unlimited | Maximum total size of the cache. When exceeded, the least recently used blocks are evicted. |
| `file-cache.block-size` | MemorySize | 1 mb | Block size for caching. Files are logically divided into fixed-size blocks and cached independently. |

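As a rough illustration of how the defaults in the table might be applied, the sketch below resolves a plain options dictionary into concrete values. Both `parse_size` and `resolve_cache_options` are hypothetical helpers, not PyPaimon functions, and the sketch assumes that size units are binary (`1 mb` = 1,048,576 bytes).

```python
import os
import re
import tempfile

# Assumption: "kb", "mb", "gb" and "tb" are powers of 1024.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}


def parse_size(text: str) -> int:
    """Parse strings such as '2gb' or '1 mb' into a byte count (hypothetical)."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", text.lower())
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    number, unit = match.group(1), match.group(2) or "b"
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(number) * _UNITS[unit]


def resolve_cache_options(options: dict) -> dict:
    """Fill in the documented defaults for any option the caller left out."""
    max_size = options.get("file-cache.max-size")
    return {
        "enabled": options.get("file-cache.enabled", "false").lower() == "true",
        "dir": options.get("file-cache.dir")
        or os.path.join(tempfile.gettempdir(), "paimon-file-cache"),
        # None stands in for "unlimited" in this sketch.
        "max_size": parse_size(max_size) if max_size else None,
        "block_size": parse_size(options.get("file-cache.block-size", "1 mb")),
    }
```

Calling `resolve_cache_options({"file-cache.enabled": "true"})` would enable the cache with a directory under the system temp directory, a 1 MiB block size, and no size limit.
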
## How It Works

- Files are logically divided into fixed-size blocks (default 1 MB).
- On the first read, blocks are downloaded from remote storage and saved to local disk.
- Subsequent reads of the same block are served from local disk, skipping remote I/O.
- Cache files are keyed by remote file path and block offset, so they persist across process restarts and can be reused.
- When the cache exceeds `max-size`, the least recently used blocks are evicted automatically.
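
The steps above can be sketched with a small in-memory model. This is not the `CachingFileIO` implementation: it keeps blocks in a dictionary instead of on local disk, and the `BlockCache` class, the `block_range` helper, and the `fetch` callback are hypothetical names chosen for the example.

```python
from collections import OrderedDict
from typing import Callable


def block_range(offset: int, length: int, block_size: int) -> range:
    """Indices of the blocks overlapping the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


class BlockCache:
    """In-memory stand-in for the disk cache: LRU over (path, block index) keys."""

    def __init__(self, fetch: Callable[[str, int, int], bytes],
                 block_size: int, max_size: int):
        self.fetch = fetch  # fetch(path, offset, length) performs the remote read
        self.block_size = block_size
        self.max_size = max_size
        self.blocks = OrderedDict()  # oldest entry first, newest last
        self.size = 0

    def _block(self, path: str, index: int) -> bytes:
        key = (path, index)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        data = self.fetch(path, index * self.block_size, self.block_size)
        self.blocks[key] = data
        self.size += len(data)
        # Evict least recently used blocks, but never the one just added.
        while self.size > self.max_size and len(self.blocks) > 1:
            _, evicted = self.blocks.popitem(last=False)
            self.size -= len(evicted)
        return data

    def read(self, path: str, offset: int, length: int) -> bytes:
        out = bytearray()
        for index in block_range(offset, length, self.block_size):
            out += self._block(path, index)
        start = offset % self.block_size  # position inside the first block
        return bytes(out[start:start + length])
```

A read that spans two blocks fetches both on first access and neither on the second. A persistent version would write each block to a file derived from the remote path and block offset, which is what lets cached blocks survive process restarts.
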
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
---
title: "Global Index"
weight: 6
type: docs
aliases:
- /pypaimon/global-index.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Global Index

PyPaimon supports querying global indexes built on Data Evolution (append) tables. Three index types are available:

- **BTree Index**: B-tree based index for scalar column lookups. Supports equality, IN, range, and combined predicates.
- **Vector Index (Lumina)**: Approximate nearest neighbor (ANN) index for vector similarity search.
- **Full-Text Index (Tantivy)**: Full-text search index for text retrieval with relevance scoring.

> Global indexes must be built beforehand (e.g., via Spark or Flink). See [Global Index]({{< ref "append-table/global-index" >}}) for how to create indexes.

## BTree Index

The BTree index is used automatically during scan when a filter predicate matches the indexed column. No special API is needed; just set a filter on the read builder.

```python
import pypaimon

catalog = pypaimon.create_catalog(...)
table = catalog.get_table("db.my_table")

# BTree index is used automatically when filtering on indexed columns
read_builder = table.new_read_builder()
read_builder = read_builder.with_filter(
    pypaimon.PredicateBuilder(table.fields)
    .in_("name", ["a200", "a300"])
)

scan = read_builder.new_scan()
read = read_builder.new_read()
splits = scan.plan().splits
data = read.to_arrow(splits)
```

Supported predicates: `equal`, `not_equal`, `less_than`, `less_or_equal`, `greater_than`, `greater_or_equal`, `in_`, `not_in`, `between`, `is_null`, `is_not_null`.

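To see why an ordered index can answer equality, IN, and range predicates without scanning every row, consider the sketch below. It is a conceptual illustration using a sorted list and binary search, not Paimon's BTree format; the `SortedIndex` class and its method names are hypothetical and only mirror the predicate names listed above.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List, Tuple


class SortedIndex:
    """Sorted (key, row_id) pairs: a flat stand-in for a B-tree's ordered leaves."""

    def __init__(self, pairs: Iterable[Tuple[str, int]]):
        self._pairs = sorted(pairs)
        self._keys = [key for key, _ in self._pairs]

    def between(self, low: str, high: str) -> List[int]:
        """Row IDs whose key lies in the inclusive range [low, high]."""
        lo = bisect_left(self._keys, low)
        hi = bisect_right(self._keys, high)
        return [row_id for _, row_id in self._pairs[lo:hi]]

    def equal(self, key: str) -> List[int]:
        """Equality is a range lookup with identical bounds."""
        return self.between(key, key)

    def in_(self, keys: Iterable[str]) -> List[int]:
        """IN is the union of one equality lookup per listed key."""
        rows = set()
        for key in keys:
            rows.update(self.equal(key))
        return sorted(rows)
```

Each lookup costs a binary search plus the size of the result, which is why an index on the filtered column lets a scan skip rows that cannot match.
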
## Vector Index (Lumina)

Use `VectorSearchBuilder` to perform approximate nearest neighbor search on a vector column, then read the matched rows.

```python
table = catalog.get_table("db.my_table")

# Step 1: vector search to get matching row IDs
builder = table.new_vector_search_builder()
index_result = (
    builder
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0, ...])
    .with_limit(10)
    .execute_local()
)

# Step 2: read actual data for matched rows
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
```

You can also add a scalar filter to pre-filter rows before vector search:

```python
predicate = (
    pypaimon.PredicateBuilder(table.fields)
    .equal("category", "electronics")
)

index_result = (
    table.new_vector_search_builder()
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0, ...])
    .with_limit(10)
    .with_filter(predicate)
    .execute_local()
)

read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
```

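The effect of pre-filtering can be illustrated with an exact, brute-force search over a handful of in-memory rows. This sketch is not how Lumina works internally: it assumes Euclidean distance, scans every candidate instead of using an ANN structure, and `exact_top_k` is a hypothetical helper.

```python
import heapq
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Row = Tuple[int, Dict[str, str], Sequence[float]]  # (row_id, attributes, vector)


def exact_top_k(
    rows: List[Row],
    query: Sequence[float],
    k: int,
    predicate: Optional[Callable[[Dict[str, str]], bool]] = None,
) -> List[int]:
    """Return the row IDs of the k rows closest to `query`.

    If `predicate` is given, rows that fail it are dropped before any
    distance is computed, so they can never appear in the result.
    """
    candidates = [r for r in rows if predicate is None or predicate(r[1])]
    scored = ((math.dist(vec, query), row_id) for row_id, _, vec in candidates)
    return [row_id for _, row_id in heapq.nsmallest(k, scored)]
```

Because the filter runs first, the limit applies to matching rows only. Filtering after the search could instead return fewer than `k` rows when some of the nearest neighbors fail the predicate.
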
## Full-Text Index (Tantivy)

Use `FullTextSearchBuilder` to perform full-text search on a text column, then read the matched rows.

```python
table = catalog.get_table("db.my_table")

# Step 1: full-text search to get matching row IDs
builder = table.new_full_text_search_builder()
index_result = (
    builder
    .with_text_column("content")
    .with_query_text("search keywords")
    .with_limit(20)
    .execute_local()
)

# Step 2: read actual data for matched rows
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
```

For better performance when reading from remote storage, consider enabling the [Local Disk Cache]({{< ref "pypaimon/file-cache" >}}).

paimon-python/pypaimon/common/options/core_options.py

Lines changed: 43 additions & 0 deletions
@@ -388,6 +388,37 @@ class CoreOptions:
         )
     )

+    FILE_CACHE_ENABLED: ConfigOption[bool] = (
+        ConfigOptions.key("file-cache.enabled")
+        .boolean_type()
+        .default_value(False)
+        .with_description("Whether to enable local disk block cache for file reads.")
+    )
+
+    FILE_CACHE_DIR: ConfigOption[str] = (
+        ConfigOptions.key("file-cache.dir")
+        .string_type()
+        .no_default_value()
+        .with_description(
+            "Directory for file block cache. "
+            "Defaults to a 'paimon-file-cache' subdirectory under the system temp directory."
+        )
+    )
+
+    FILE_CACHE_MAX_SIZE: ConfigOption[MemorySize] = (
+        ConfigOptions.key("file-cache.max-size")
+        .memory_type()
+        .default_value(MemorySize.MAX_VALUE)
+        .with_description("Maximum total size of the local disk block cache. Unlimited by default.")
+    )
+
+    FILE_CACHE_BLOCK_SIZE: ConfigOption[MemorySize] = (
+        ConfigOptions.key("file-cache.block-size")
+        .memory_type()
+        .default_value(MemorySize.of_mebi_bytes(1))
+        .with_description("Block size for local disk cache.")
+    )
+
     READ_BATCH_SIZE: ConfigOption[int] = (
         ConfigOptions.key("read.batch-size")
         .int_type()
@@ -580,6 +611,18 @@ def global_index_enabled(self, default=None):
     def global_index_thread_num(self) -> Optional[int]:
         return self.options.get(CoreOptions.GLOBAL_INDEX_THREAD_NUM)

+    def file_cache_enabled(self) -> bool:
+        return self.options.get(CoreOptions.FILE_CACHE_ENABLED)
+
+    def file_cache_dir(self) -> Optional[str]:
+        return self.options.get(CoreOptions.FILE_CACHE_DIR)
+
+    def file_cache_max_size(self) -> MemorySize:
+        return self.options.get(CoreOptions.FILE_CACHE_MAX_SIZE)
+
+    def file_cache_block_size(self) -> MemorySize:
+        return self.options.get(CoreOptions.FILE_CACHE_BLOCK_SIZE)
+
     def read_batch_size(self, default=None) -> int:
         return self.options.get(CoreOptions.READ_BATCH_SIZE, default or 1024)
