enhance: cache reusable format reader metadata (#552)
This change makes repeated reads cheaper by keeping reusable
format-reader metadata around for the lifetime of a Reader. The goal is
to avoid reloading the same file metadata every time a different read
path, projection, predicate, chunk read, or take call touches the same
column group files.
The implementation adds a typed metadata cache for Parquet, Iceberg,
Lance, and Vortex readers. Each format now separates the stable file
metadata loading step from the stateful reader creation step, so schema,
row group layout, footer data, Lance fragment metadata, Vortex file
handles, and Iceberg positional delete information can be reused while
each actual reader still gets its own projection and predicate state. At
a high level, the cache now sits behind ReaderImpl and is shared by all
column-group read paths. The important part is that we cache only
immutable metadata, then create a fresh reader from that metadata
whenever the caller brings a different projection or predicate.
```
ReaderImpl
  owns MetadataCache
    owns FormatReaderMetadataCaches<Parquet, Iceberg, Lance, Vortex>
      owns FormatReaderMetadataCache<ReaderT>
        entries_[cache_key] = immutable FormatReaderMetadata<Payload>
        in_flight_loads_[cache_key] = singleflight loader state

read flow
  metadata = cache.get_or_open(key, load_metadata)
  reader = FormatReader::create_from_metadata(metadata, file, projection, predicate)
```
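The split between shared metadata and per-reader state can be sketched
roughly as follows. This is a minimal illustration, not the code in this
commit: FileMetadata and SketchReader are hypothetical stand-ins, and the
real create_from_metadata also takes a file and a predicate, which are
left out here to keep the example short.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the immutable per-file payload
// (schema, row group layout, footer data, ...).
struct FileMetadata {
  std::vector<std::string> columns;
  int num_row_groups = 0;
};

// Hypothetical stand-in for a format reader: it shares the metadata
// but owns its projection, so two readers never share mutable state.
class SketchReader {
 public:
  static SketchReader create_from_metadata(
      std::shared_ptr<const FileMetadata> metadata,
      std::vector<std::string> projection) {
    assert(metadata != nullptr);
    return SketchReader(std::move(metadata), std::move(projection));
  }

  const FileMetadata& metadata() const { return *metadata_; }
  const std::vector<std::string>& projection() const { return projection_; }

 private:
  SketchReader(std::shared_ptr<const FileMetadata> metadata,
               std::vector<std::string> projection)
      : metadata_(std::move(metadata)), projection_(std::move(projection)) {}

  std::shared_ptr<const FileMetadata> metadata_;  // shared, read-only
  std::vector<std::string> projection_;           // per-reader state
};
```

Holding the payload through a shared_ptr to const is one way to make the
"immutable metadata" rule visible in the types: any number of readers can
point at the same object, and none of them can modify it.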
The cache also tracks in-flight loads per
key, so concurrent requests for the same file wait on the same metadata
load instead of doing duplicate work. It is wired through ReaderImpl,
ColumnGroupReader, and ColumnGroupLazyReader, so batch reads, chunk
readers, and lazy take reads all use the same metadata source while
keeping the caller-facing reader behavior unchanged.
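
The per-key in-flight tracking can be sketched roughly as follows. This
is an assumption-laden illustration, not the implementation in this
commit: SingleFlightCache and InFlight are hypothetical names, the key is
simplified to a std::string, and only get_or_open, entries_, and
in_flight_loads_ mirror names from the outline above. The first caller
for a key runs the loader outside the lock, and later callers for the
same key wait for that result.

```cpp
#include <cassert>
#include <condition_variable>
#include <exception>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

template <typename Payload>
class SingleFlightCache {
  using Ptr = std::shared_ptr<const Payload>;

  // State shared between the loading caller and any waiters.
  struct InFlight {
    bool done = false;
    Ptr value;
    std::exception_ptr error;
  };

 public:
  Ptr get_or_open(const std::string& key, const std::function<Ptr()>& load) {
    assert(load && "loader must be callable");
    std::shared_ptr<InFlight> flight;
    {
      std::unique_lock<std::mutex> lock(mu_);
      auto hit = entries_.find(key);
      if (hit != entries_.end()) return hit->second;  // already cached

      auto pending = in_flight_loads_.find(key);
      if (pending != in_flight_loads_.end()) {
        // Another caller is loading this key: wait for its result.
        flight = pending->second;
        cv_.wait(lock, [&] { return flight->done; });
        if (flight->error) std::rethrow_exception(flight->error);
        return flight->value;
      }
      // This caller becomes the loader for the key.
      flight = std::make_shared<InFlight>();
      in_flight_loads_.emplace(key, flight);
    }

    // Run the load without holding the lock so other keys are not blocked.
    Ptr value;
    std::exception_ptr error;
    try {
      value = load();
    } catch (...) {
      error = std::current_exception();
    }

    {
      std::lock_guard<std::mutex> lock(mu_);
      in_flight_loads_.erase(key);
      if (!error) entries_[key] = value;  // failed loads are not cached
      flight->value = value;
      flight->error = error;
      flight->done = true;
    }
    cv_.notify_all();
    if (error) std::rethrow_exception(error);
    return value;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::unordered_map<std::string, Ptr> entries_;
  std::unordered_map<std::string, std::shared_ptr<InFlight>> in_flight_loads_;
};
```

In this sketch a failed load is reported to every waiter but is not
stored, so a later call retries the load. Whether the real cache retries
or remembers failures is not stated here.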
---------
Signed-off-by: jiaqizho <jiaqi.zhou@zilliz.com>