[Proposal][C++]: Add LRU chunk cache to Arrow chunk readers to avoid redundant file I/O #860

@SYaoJun

Description

Describe the enhancement requested

Background

Currently, all Arrow chunk readers (VertexPropertyArrowChunkReader, AdjListArrowChunkReader, AdjListOffsetArrowChunkReader, AdjListPropertyArrowChunkReader) discard the loaded chunk_table_ every time the chunk position changes via seek(), next_chunk(), or seek_chunk_index(). This means that if a user seeks back to a previously loaded chunk, the entire Parquet file must be re-opened, its metadata parsed, and its data decoded again, even though the data has not changed.

This is particularly costly in graph traversal workloads (BFS, PageRank, label filtering) where vertex/edge access patterns exhibit strong locality, causing the same chunks to be read repeatedly.

Proposal

Introduce a generic LruCache<Key, Value> and integrate it into all four chunk reader classes. When a chunk is loaded from disk, it is stored in the cache. On subsequent seeks to the same chunk, the cached arrow::Table is returned directly, avoiding file I/O entirely.
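Below is a minimal sketch of what the generic LruCache<Key, Value> could look like, assuming the usual std::list + std::unordered_map layout. The optional Hash parameter mirrors the PairHash mentioned in the TODO list; the Get/Put method names are illustrative, not a final API.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <optional>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value, typename Hash = std::hash<Key>>
class LruCache {
 public:
  explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

  // Returns the cached value (if any) and marks the entry as most recently used.
  std::optional<Value> Get(const Key& key) {
    auto it = index_.find(key);
    if (it == index_.end()) {
      return std::nullopt;
    }
    // Move the hit entry to the front (most-recently-used position).
    entries_.splice(entries_.begin(), entries_, it->second);
    return it->second->second;
  }

  // Inserts or refreshes an entry, evicting the least recently used
  // one when capacity is exceeded.
  void Put(const Key& key, Value value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(value);
      entries_.splice(entries_.begin(), entries_, it->second);
      return;
    }
    entries_.emplace_front(key, std::move(value));
    index_[key] = entries_.begin();
    if (index_.size() > capacity_) {
      index_.erase(entries_.back().first);  // drop the LRU entry at the back
      entries_.pop_back();
    }
  }

 private:
  using Entry = std::pair<Key, Value>;
  std::size_t capacity_;
  std::list<Entry> entries_;  // MRU at front, LRU at back
  std::unordered_map<Key, typename std::list<Entry>::iterator, Hash> index_;
};
```

The list-plus-hash-map combination keeps both Get and Put at amortized O(1), and the cached values would be std::shared_ptr<arrow::Table>, so eviction only drops a reference rather than copying table data.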

Benchmark Results

With a capacity-4 LRU cache on the LDBC sample dataset (Release build, macOS ARM):

[Benchmark results image]
TODO

  • Integrate LruCache<IdType, std::shared_ptr<arrow::Table>> into VertexPropertyArrowChunkReader (see the integration sketch after this list)
  • Integrate LruCache<std::pair<IdType, IdType>, std::shared_ptr<arrow::Table>, PairHash> into AdjListArrowChunkReader
  • ...
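To make the TODO items concrete, here is a hypothetical integration sketch building on the LruCache above. It is not GraphAr's actual reader code: only chunk_table_, seek(), and the idea of a chunk-index key come from this issue, while ChunkReaderSketch, FakeTable, LoadChunkFromFile(), and IdType = int64_t are placeholders for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

struct FakeTable {};  // stand-in for arrow::Table in this sketch

using IdType = int64_t;

class ChunkReaderSketch {
 public:
  explicit ChunkReaderSketch(std::size_t cache_capacity) : cache_(cache_capacity) {}

  // seek() consults the cache before touching the Parquet file.
  void seek(IdType chunk_index) {
    if (auto cached = cache_.Get(chunk_index)) {
      chunk_table_ = *cached;  // cache hit: no file I/O
      return;
    }
    chunk_table_ = LoadChunkFromFile(chunk_index);  // cache miss: read and decode as today
    cache_.Put(chunk_index, chunk_table_);
  }

 private:
  std::shared_ptr<FakeTable> LoadChunkFromFile(IdType /*chunk_index*/) {
    return std::make_shared<FakeTable>();  // placeholder for Parquet open + decode
  }

  std::shared_ptr<FakeTable> chunk_table_;
  LruCache<IdType, std::shared_ptr<FakeTable>> cache_;
};
```

The adjacency-list readers would follow the same pattern, but keyed on a std::pair<IdType, IdType> with the custom PairHash listed in the TODO items.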

Component(s)

C++
