Problem Statement
Currently, the Iceberg metadata caching architecture in StarRocks has a gap that causes unnecessary I/O during query planning, even when all of the relevant metadata is already cached.
Current Architecture
The caching layer consists of:
- CachingIcebergCatalog.tables — caches Table objects (parsed from metadata.json)
- CachingIcebergCatalog.dataFileCache — caches manifest path → Set<DataFile>
- CachingIcebergCatalog.deleteFileCache — caches manifest path → Set<DeleteFile>
- IcebergCachingFileIO — caches raw file bytes (memory + optional disk)
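For orientation, here is a minimal sketch of the shape of these caches. The field names follow the list above, but the Caffeine usage, sizes, and exact key/value types are illustrative assumptions, not the actual StarRocks declarations:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.Table;

import java.util.Set;

class IcebergMetadataCaches {
    // metadata.json parsed once into a Table and reused across queries
    final Cache<String, Table> tables =
            Caffeine.newBuilder().maximumSize(1_000).build();

    // manifest path -> DataFile entries parsed out of that manifest
    final Cache<String, Set<DataFile>> dataFileCache =
            Caffeine.newBuilder().maximumSize(100_000).build();

    // manifest path -> DeleteFile entries parsed out of that manifest
    final Cache<String, Set<DeleteFile>> deleteFileCache =
            Caffeine.newBuilder().maximumSize(100_000).build();

    // IcebergCachingFileIO additionally cached the raw bytes of metadata files
    // (memory + optional disk); it sits below these object-level caches.
}
```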
The Gap
There is no cache for the mapping snapshotId → List<ManifestFile>.
During query planning, the following always happens:
```java
// StarRocksIcebergTableScan.java:209
List<ManifestFile> dataManifests = snapshot.dataManifests(io());
```
This call reads and parses the manifest-list file (snap-{snapshotId}-{uuid}.avro) on every query, even though:
- The Table object is cached
- All DataFile/DeleteFile entries are cached in dataFileCache/deleteFileCache
- The manifest-list content is immutable for a given snapshot
The caching flow is incomplete:
```
metadata.json → Table               → cached     (CachingIcebergCatalog.tables)
snap-*.avro   → List<ManifestFile>  → NOT cached (parsed every time)
manifest.avro → Set<DataFile>       → cached     (dataFileCache)
```
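A cache covering the missing middle row could look roughly like the sketch below. This is only an illustration of the gap: ManifestListCache, manifestListCache, and the Caffeine-based implementation are hypothetical names and assumptions, not existing StarRocks code or a proposed patch.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.io.FileIO;

import java.util.List;

class ManifestListCache {
    // snapshotId -> parsed manifest-list entries
    private final Cache<Long, List<ManifestFile>> manifestListCache =
            Caffeine.newBuilder().maximumSize(10_000).build();

    List<ManifestFile> dataManifests(Snapshot snapshot, FileIO io) {
        // Parse snap-{snapshotId}-{uuid}.avro only on a cache miss;
        // subsequent queries on the same snapshot reuse the parsed list.
        return manifestListCache.get(snapshot.snapshotId(), id -> snapshot.dataManifests(io));
    }
}
```

Because the manifest list of a given snapshot is immutable, such an entry never goes stale; size-based eviction is sufficient.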
Previously, IcebergCachingFileIO cached raw bytes of all metadata files, including:
- Manifest-list files (snap-*.avro)
- Manifest files (*.avro)
This partially mitigated the issue: Avro parsing still occurred on every query, but the file bytes were at least served from the memory/disk cache, avoiding remote I/O. However, IcebergCachingFileIO is now deprecated. #61966