Commit 53d3a6f

feat(cache): support manifest file cache
1 parent 0dbbda6 commit 53d3a6f

64 files changed

Lines changed: 1153 additions & 417 deletions

docs/source/user_guide.rst

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ User Guide
    user_guide/schema
    user_guide/snapshot
    user_guide/manifest
+   user_guide/manifest_cache
    user_guide/data_types
    user_guide/primary_key_table
    user_guide/append_only_table
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+.. Copyright 2026-present Alibaba Inc.
+
+.. Licensed under the Apache License, Version 2.0 (the "License");
+.. you may not use this file except in compliance with the License.
+.. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+Manifest Cache
+==============
+
+Overview
+--------
+
+paimon-cpp caches raw manifest file bytes at the ``ObjectsFile<T>::Read()``
+layer. The cache uses the public ``Cache`` abstraction and is injected through
+``ScanContextBuilder`` or ``ReadContextBuilder``. The cache covers data
+manifests, manifest lists, and index manifests because they all read through
+``ObjectsFile<T>``.
+
+For repeated ``get``, ``scan``, or batch ``get/scan -f`` requests in the same
+process, the same snapshot often reads the same manifest files repeatedly. On a
+cache hit, the read path skips remote filesystem ``open/read``, builds an
+in-memory input stream from cached bytes, and still runs the format reader,
+Arrow decoding, and object deserialization. This design primarily reduces
+remote IO latency and bandwidth while keeping cache weight aligned with the
+actual cached bytes.
+
+Configuration
+-------------
+
+Manifest caching is disabled by default. Embedding applications that need it can
+provide a custom ``Cache`` implementation and inject it through ``WithCache``.
+The builder wraps the cache as ``CacheKind::Manifest`` internally, so callers do
+not need to pass the cache kind through scan or read contexts. The same cache
+instance can be reused across multiple scan or read contexts when process-local
+sharing is desired.
+
+Example:
+
+.. code-block:: cpp
+
+   auto manifest_cache = std::make_shared<MyCache>();
+
+   paimon::ScanContextBuilder scan_builder(table_path);
+   scan_builder.WithCache(manifest_cache);
+
+   paimon::ReadContextBuilder read_builder(table_path);
+   read_builder.WithCache(manifest_cache);
+
+Passing ``nullptr`` or omitting ``WithCache()`` leaves manifest caching disabled.
+
+Implementation Notes
+--------------------
+
+- The cache key is an internal whole-file position key with ``path``,
+  ``offset = 0``, and ``length = -1``, so cache hits do not need an extra
+  file-status lookup.
+- The cache value stores full manifest file bytes as a ``MemorySegment``.
+- The cache instance is supplied by the embedding application and passed through
+  read or scan contexts.
+- On a cache hit, ``ObjectsFile::Read()`` keeps the cached ``Bytes`` alive while
+  ``ByteArrayInputStream`` reads from it without copying. Caller filters are
+  applied after records are decoded.
+- ``Read()`` keeps its existing append semantics, which is required when base
+  and delta manifests are loaded into the same result vector.
+
+Test Coverage
+-------------
+
+- ``ScanContextTest`` and ``ReadContextTest`` validate cache injection through
+  ``WithCache(cache)``.
+- ``ManifestFileTest.TestReadUsesManifestCache`` validates cache hits on
+  repeated reads with an injected manifest cache.
+- ``ManifestFileTest.TestManifestCacheIsDisabledWithoutInjectedCache`` validates
+  that, when no cache is injected, files are still reopened for every read.
+- ``ManifestFileTest.TestManifestCacheCanBeDisabled`` validates that, when the
+  option is omitted, files are still reopened for every read.
+
+Future Optimizations
+--------------------
+
+- Expose hit, miss, bypass, and eviction counts through the read trace or
+  metrics.
+- Add single-flight loading for high-concurrency misses on the same manifest
+  path.
+- Evaluate a decoded-records second-level cache, configurable as a
+  CPU-vs-memory tradeoff.
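The read path described in the Overview and Implementation Notes above can be sketched as a small read-through byte cache. This is an illustration under assumptions, not paimon-cpp code: `WholeFileKey`, `WholeFileKeyHash`, `ByteCache`, `Loader`, and `ReadManifestBytes` are hypothetical names, a plain `std::unordered_map` stands in for the injected `Cache`, and the sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the whole-file position key (path, offset 0, length -1).
struct WholeFileKey {
    std::string path;
    int64_t offset;
    int32_t length;  // -1 means "whole file", so no file-status lookup is needed
    bool operator==(const WholeFileKey& other) const {
        return path == other.path && offset == other.offset && length == other.length;
    }
};

struct WholeFileKeyHash {
    size_t operator()(const WholeFileKey& key) const {
        return std::hash<std::string>()(key.path) ^ (std::hash<int64_t>()(key.offset) << 1) ^
               (std::hash<int32_t>()(key.length) << 2);
    }
};

using Bytes = std::vector<uint8_t>;
// Hypothetical loader: opens the file on the (remote) filesystem and reads all bytes.
using Loader = std::function<std::shared_ptr<const Bytes>(const std::string&)>;

class ByteCache {
 public:
    // Returns cached bytes on a hit; on a miss, runs the loader once and stores the result.
    std::shared_ptr<const Bytes> GetOrLoad(const WholeFileKey& key, const Loader& loader) {
        assert(loader != nullptr);
        auto it = entries_.find(key);
        if (it != entries_.end()) {
            return it->second;  // hit: no open/read on the filesystem
        }
        auto bytes = loader(key.path);
        entries_.emplace(key, bytes);
        return bytes;
    }

 private:
    std::unordered_map<WholeFileKey, std::shared_ptr<const Bytes>, WholeFileKeyHash> entries_;
};

// With a null cache every call goes to the loader, mirroring "disabled by default".
// Decoding of the returned bytes would still run on every call in either case.
std::shared_ptr<const Bytes> ReadManifestBytes(const std::string& path, ByteCache* cache,
                                               const Loader& open_and_read) {
    if (cache == nullptr) {
        return open_and_read(path);
    }
    return cache->GetOrLoad(WholeFileKey{path, 0, -1}, open_and_read);
}
```

Returning a `std::shared_ptr<const Bytes>` keeps the cached buffer alive for as long as a reader holds it, which is the same property the notes describe for the zero-copy input stream on a cache hit.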
Lines changed: 44 additions & 21 deletions
@@ -15,25 +15,58 @@
  */

 #pragma once
+
+#include <cstddef>
 #include <cstdint>
 #include <functional>
 #include <memory>
 #include <string>

-#include "paimon/common/io/cache/cache_key.h"
-#include "paimon/common/memory/memory_segment.h"
+#include "paimon/memory/memory_segment.h"
 #include "paimon/result.h"
+#include "paimon/visibility.h"

 namespace paimon {

 class CacheValue;
+enum class CacheKind {
+    Manifest,
+};
+
+class PAIMON_EXPORT CacheKey {
+ public:
+    static std::shared_ptr<CacheKey> ForPosition(const std::string& file_path, int64_t position,
+                                                 int32_t length, bool is_index);
+
+ public:
+    virtual ~CacheKey() = default;
+
+    virtual bool IsIndex() const = 0;
+
+    void SetKind(CacheKind kind) {
+        kind_ = kind;
+    }
+
+    CacheKind GetKind() const {
+        return kind_;
+    }
+
+    virtual bool Equals(const CacheKey& other) const = 0;
+
+    virtual size_t HashCode() const = 0;
+
+ private:
+    CacheKind kind_ = CacheKind::Manifest;
+};

-/// Callback invoked when a cache entry is evicted by the LRU policy.
 using CacheCallback = std::function<void(const std::shared_ptr<CacheKey>&)>;

 class PAIMON_EXPORT Cache {
  public:
+    static std::shared_ptr<Cache> WarpKind(CacheKind kind, const std::shared_ptr<Cache>& cache);
+
     virtual ~Cache() = default;
+
     virtual Result<std::shared_ptr<CacheValue>> Get(
         const std::shared_ptr<CacheKey>& key,
         std::function<Result<std::shared_ptr<CacheValue>>(const std::shared_ptr<CacheKey>&)>
@@ -49,31 +82,21 @@ class PAIMON_EXPORT Cache {
     virtual size_t Size() const = 0;
 };

-class CacheValue {
+class PAIMON_EXPORT CacheValue {
  public:
-    CacheValue(const MemorySegment& segment, CacheCallback callback)
-        : segment_(segment), callback_(std::move(callback)) {}
+    CacheValue(const MemorySegment& segment, CacheCallback callback);

-    const MemorySegment& GetSegment() const {
-        return segment_;
-    }
+    ~CacheValue();

-    /// Invoke the eviction callback, if one was registered.
-    void OnEvict(const std::shared_ptr<CacheKey>& key) const {
-        if (callback_) {
-            callback_(key);
-        }
-    }
+    const MemorySegment& GetSegment() const;

-    bool operator==(const CacheValue& other) const {
-        if (this == &other) {
-            return true;
-        }
-        return segment_ == other.segment_;
-    }
+    void OnEvict(const std::shared_ptr<CacheKey>& key) const;
+
+    bool operator==(const CacheValue& other) const;

  private:
     MemorySegment segment_;
     CacheCallback callback_;
 };
+
 } // namespace paimon
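A custom `Cache` implementation usually has to index its entries by `std::shared_ptr<CacheKey>`. One possible approach, sketched below under assumptions, is to forward the container's hash and equality functors to the virtual `HashCode()` and `Equals()` methods. The `Key` interface is a hypothetical miniature of `CacheKey` (it omits `IsIndex()` and the kind), and `PathKey`, `KeyHash`, `KeyEq`, and `KeyedSizes` are illustrative names that do not appear in this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical miniature of the CacheKey interface: only equality and hashing.
class Key {
 public:
    virtual ~Key() = default;
    virtual bool Equals(const Key& other) const = 0;
    virtual size_t HashCode() const = 0;
};

// Illustrative concrete key identified by a file path only.
class PathKey : public Key {
 public:
    explicit PathKey(std::string path) : path_(std::move(path)) {}

    bool Equals(const Key& other) const override {
        // Keys of a different concrete type never compare equal.
        auto* that = dynamic_cast<const PathKey*>(&other);
        return that != nullptr && that->path_ == path_;
    }

    size_t HashCode() const override {
        return std::hash<std::string>()(path_);
    }

 private:
    std::string path_;
};

// Functors that let standard containers compare keys by value, not by pointer.
struct KeyHash {
    size_t operator()(const std::shared_ptr<Key>& key) const {
        assert(key != nullptr);
        return key->HashCode();
    }
};

struct KeyEq {
    bool operator()(const std::shared_ptr<Key>& a, const std::shared_ptr<Key>& b) const {
        return a->Equals(*b);
    }
};

// Example index: key -> cached size in bytes.
using KeyedSizes = std::unordered_map<std::shared_ptr<Key>, size_t, KeyHash, KeyEq>;
```

Without the two functors, `std::unordered_map` would hash and compare the pointer values, so two distinct key objects for the same manifest path would never hit the same entry.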
Lines changed: 1 addition & 8 deletions
@@ -22,7 +22,6 @@
 #include <memory>
 #include <type_traits>

-#include "paimon/common/utils/math.h"
 #include "paimon/io/byte_order.h"
 #include "paimon/memory/bytes.h"
 #include "paimon/visibility.h"
@@ -138,13 +137,7 @@ class PAIMON_EXPORT MemorySegment {
         std::memcpy(MutableData() + index, &value, sizeof(T));
     }

-    inline uint64_t GetLongBigEndian(int32_t index) const {
-        auto value = GetValue<uint64_t>(index);
-        if constexpr (SystemByteOrder() == ByteOrder::PAIMON_LITTLE_ENDIAN) {
-            return EndianSwapValue(value);
-        }
-        return value;
-    }
+    uint64_t GetLongBigEndian(int32_t index) const;

     void CopyTo(int32_t offset, MemorySegment* target, int32_t target_offset,
                 int32_t num_bytes) const {

include/paimon/read_context.h

Lines changed: 12 additions & 1 deletion
@@ -23,6 +23,7 @@
 #include <string>
 #include <vector>

+#include "paimon/cache/cache.h"
 #include "paimon/predicate/predicate.h"
 #include "paimon/result.h"
 #include "paimon/type_fwd.h"
@@ -54,7 +55,8 @@ class PAIMON_EXPORT ReadContext {
                 const std::shared_ptr<FileSystem>& specific_file_system,
                 const std::map<std::string, std::string>& fs_scheme_to_identifier_map,
                 const std::map<std::string, std::string>& options,
-                PrefetchCacheMode prefetch_cache_mode, const CacheConfig& cache_config);
+                PrefetchCacheMode prefetch_cache_mode, const CacheConfig& cache_config,
+                const std::shared_ptr<Cache>& cache);
     ~ReadContext();

     const std::string& GetPath() const {
@@ -124,6 +126,10 @@ class PAIMON_EXPORT ReadContext {
         return cache_config_;
     }

+    std::shared_ptr<Cache> GetCache() const {
+        return cache_;
+    }
+
  private:
     std::string path_;
     std::string branch_;
@@ -144,6 +150,7 @@ class PAIMON_EXPORT ReadContext {
     std::map<std::string, std::string> options_;
     PrefetchCacheMode prefetch_cache_mode_;
     CacheConfig cache_config_;
+    std::shared_ptr<Cache> cache_;
 };

 /// `ReadContextBuilder` used to build a `ReadContext`, has input validation.
@@ -339,6 +346,10 @@ class PAIMON_EXPORT ReadContextBuilder {
     /// @note If not set, use default file system (configured in `Options::FILE_SYSTEM`)
     ReadContextBuilder& WithFileSystem(const std::shared_ptr<FileSystem>& file_system);

+    /// Inject a cache for read operations. Passing nullptr disables cache.
+    /// @return Reference to this builder for method chaining.
+    ReadContextBuilder& WithCache(const std::shared_ptr<Cache>& cache);
+
     /// Build and return a `ReadContext` instance with input validation.
     /// @return Result containing the constructed `ReadContext` or an error status.
     Result<std::unique_ptr<ReadContext>> Finish();

include/paimon/scan_context.h

Lines changed: 12 additions & 1 deletion
@@ -23,6 +23,7 @@
 #include <string>
 #include <vector>

+#include "paimon/cache/cache.h"
 #include "paimon/global_index/global_index_result.h"
 #include "paimon/predicate/predicate.h"
 #include "paimon/result.h"
@@ -48,7 +49,8 @@ class PAIMON_EXPORT ScanContext {
                 const std::shared_ptr<MemoryPool>& memory_pool,
                 const std::shared_ptr<Executor>& executor,
                 const std::shared_ptr<FileSystem>& specific_file_system,
-                const std::map<std::string, std::string>& options);
+                const std::map<std::string, std::string>& options,
+                const std::shared_ptr<Cache>& cache);

     ~ScanContext();

@@ -86,6 +88,10 @@ class PAIMON_EXPORT ScanContext {
         return specific_file_system_;
     }

+    std::shared_ptr<Cache> GetCache() const {
+        return cache_;
+    }
+
  private:
     std::string path_;
     bool is_streaming_mode_;
@@ -96,6 +102,7 @@ class PAIMON_EXPORT ScanContext {
     std::shared_ptr<Executor> executor_;
     std::shared_ptr<FileSystem> specific_file_system_;
     std::map<std::string, std::string> options_;
+    std::shared_ptr<Cache> cache_;
 };

 /// Filter configuration for table scan operations
@@ -178,6 +185,10 @@ class PAIMON_EXPORT ScanContextBuilder {
     /// @note If not set, use default file system (configured in `Options::FILE_SYSTEM`)
     ScanContextBuilder& WithFileSystem(const std::shared_ptr<FileSystem>& file_system);

+    /// Inject a cache for scan operations. Passing nullptr disables cache.
+    /// @return Reference to this builder for method chaining.
+    ScanContextBuilder& WithCache(const std::shared_ptr<Cache>& cache);
+
     /// Build and return a `ScanContext` instance with input validation.
     /// @return Result containing the constructed `ScanContext` or an error status.
     Result<std::unique_ptr<ScanContext>> Finish();
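The injection pattern added to both builders (store a shared pointer, return `*this` for chaining, treat `nullptr` as disabled) can be sketched in isolation. The types below are hypothetical miniatures, not the paimon-cpp classes: `MiniCache`, `MiniContext`, `MiniContextBuilder`, and `CacheEnabled()` are illustrative names, and `Finish()` returns a bare `std::unique_ptr` rather than a `Result`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical placeholder for an application-provided cache.
struct MiniCache {};

class MiniContext {
 public:
    MiniContext(std::string path, std::shared_ptr<MiniCache> cache)
        : path_(std::move(path)), cache_(std::move(cache)) {}

    const std::string& GetPath() const {
        return path_;
    }

    std::shared_ptr<MiniCache> GetCache() const {
        return cache_;
    }

    // A null cache means caching is disabled for this context.
    bool CacheEnabled() const {
        return cache_ != nullptr;
    }

 private:
    std::string path_;
    std::shared_ptr<MiniCache> cache_;
};

class MiniContextBuilder {
 public:
    explicit MiniContextBuilder(std::string path) : path_(std::move(path)) {}

    // Stores the cache and returns the builder so that calls can be chained.
    MiniContextBuilder& WithCache(const std::shared_ptr<MiniCache>& cache) {
        cache_ = cache;
        return *this;
    }

    // Simplified: the sketch asserts on an empty path instead of returning an error.
    std::unique_ptr<MiniContext> Finish() {
        assert(!path_.empty());
        return std::unique_ptr<MiniContext>(new MiniContext(path_, cache_));
    }

 private:
    std::string path_;
    std::shared_ptr<MiniCache> cache_;  // stays null unless WithCache is called
};
```

Because the contexts hold a `std::shared_ptr`, a single cache instance passed to several builders is shared by every resulting context and lives as long as any of them does.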

src/paimon/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@ set(PAIMON_COMMON_SRCS
     common/io/data_output_stream.cpp
     common/io/memory_segment_output_stream.cpp
     common/io/offset_input_stream.cpp
+    common/io/cache/cache.cpp
     common/io/cache/cache_key.cpp
     common/io/cache/cache_manager.cpp
     common/io/cache/lru_cache.cpp

src/paimon/common/data/abstract_binary_writer.h

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 #include <string_view>

 #include "paimon/common/data/binary_writer.h"
-#include "paimon/common/memory/memory_segment.h"
+#include "paimon/memory/memory_segment.h"

 namespace paimon {
 class BinaryArray;

src/paimon/common/data/binary_array.cpp

Lines changed: 1 addition & 1 deletion
@@ -22,8 +22,8 @@

 #include "paimon/common/data/binary_array_writer.h"
 #include "paimon/common/data/binary_data_read_utils.h"
-#include "paimon/common/memory/memory_segment.h"
 #include "paimon/memory/memory_pool.h"
+#include "paimon/memory/memory_segment.h"

 namespace paimon {

src/paimon/common/data/binary_array_writer.h

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@

 #include "arrow/api.h"
 #include "paimon/common/data/abstract_binary_writer.h"
-#include "paimon/common/memory/memory_segment.h"
+#include "paimon/memory/memory_segment.h"
 namespace paimon {
 class BinaryArray;
 class MemoryPool;
