Description
Describe the enhancement requested
Background
Currently, all Arrow chunk readers (VertexPropertyArrowChunkReader, AdjListArrowChunkReader, AdjListOffsetArrowChunkReader, AdjListPropertyArrowChunkReader) discard the loaded chunk_table_ every time the chunk position changes via seek(), next_chunk(), or seek_chunk_index(). This means that if a user seeks back to a previously loaded chunk, the entire Parquet file must be re-opened, its metadata parsed, and its data decoded again, even though the data hasn't changed.
This is particularly costly in graph traversal workloads (BFS, PageRank, label filtering) where vertex/edge access patterns exhibit strong locality, causing the same chunks to be read repeatedly.
Proposal
Introduce a generic LruCache<Key, Value> and integrate it into all four chunk reader classes. When a chunk is loaded from disk, it is stored in the cache. On subsequent seeks to the same chunk, the cached arrow::Table is returned directly, avoiding file I/O entirely.
Benchmark Results
With a capacity-4 LRU cache on the LDBC sample dataset (Release build, macOS ARM):
TODO
- Integrate LruCache<IdType, shared_ptr> into VertexPropertyArrowChunkReader
- Integrate LruCache<pair<IdType,IdType>, shared_ptr, PairHash> into AdjListArrowChunkReader
- ...
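The adjacency-list readers are keyed by a pair of indices, and std::pair has no std::hash specialization, which is presumably why the TODO mentions PairHash. The following is one possible sketch of such a hasher; the IdType alias and the hash-combine formula are assumptions for illustration and may differ from what the actual change uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Assumption: IdType is a 64-bit signed integer; the real alias lives in the library.
using IdType = std::int64_t;

// Hypothetical hasher for (vertex chunk index, edge chunk index) keys.
struct PairHash {
  std::size_t operator()(const std::pair<IdType, IdType>& p) const noexcept {
    std::size_t h1 = std::hash<IdType>{}(p.first);
    std::size_t h2 = std::hash<IdType>{}(p.second);
    // Boost-style hash combine, so that (a, b) and (b, a) usually hash differently.
    return h1 ^ (h2 + 0x9e3779b97f4a7c15ULL + (h1 << 6) + (h1 >> 2));
  }
};
```

Any hasher that satisfies the standard Hash requirements can be passed as the third template argument of an unordered container keyed by std::pair<IdType, IdType>.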
Component(s)
C++