@eshishki
Contributor

Why I'm doing:

Avoid I/O during query planning when using the caching Iceberg catalog

What I'm doing:

Add missing manifest list cache

  When planning an Iceberg table scan, the following caches are consulted in order:

  1. Table (from tables cache)
     └─ CachingIcebergCatalog.getTable()
     └─ Cache: tables (IcebergTableCacheKey -> Table)
     └─ On miss: reads metadata.json from S3/HDFS

  2. Snapshot (from Table object)
     └─ table.currentSnapshot() or table.snapshot(snapshotId)
     └─ Not cached separately — part of the Table object

  3. List<ManifestFile> (from manifestListCache) ← NEW
     └─ StarRocksIcebergTableScan.getDataManifestsWithCache()
     └─ Cache: manifestListCache (snapshotId -> SnapshotManifests)
     └─ On miss: snapshot.dataManifests(io()) — reads snap-*.avro

  4. Manifest filtering
     └─ StarRocksIcebergTableScan.findMatchingDataManifests()
     └─ Applies partition filter to ManifestFile.partitions()
     └─ Prunes manifests that cannot contain matching data

  5. Set<DataFile> per manifest
     └─ ManifestReader / StarRocksManifestReader
     └─ Cache: dataFileCache (manifest path -> Set<DataFile>)
     └─ On miss: reads *.avro manifest file
     └─ Records path in metaFileCacheMap for invalidation

  6. DataFile filtering
     └─ Applies residual filter to each DataFile
     └─ Checks column statistics (min/max/null count)

  7. Same flow for DeleteFiles
     └─ getDeleteManifestsWithCache() → deleteFileCache

  Cache hierarchy:
  metadata.json ─[tables cache]─► Table
                                    │
                                    ▼
                                Snapshot
                                    │
                      ┌─[manifestListCache]─┐
                      │                     │
                      ▼                     ▼
              List<ManifestFile>     List<ManifestFile>
                (data)                 (delete)
                      │                     │
           ┌─[dataFileCache]─┐    ┌─[deleteFileCache]─┐
           │                 │    │                   │
           ▼                 ▼    ▼                   ▼
      Set<DataFile>    Set<DataFile>  Set<DeleteFile> ...
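
For illustration, the `manifestListCache` layer in the hierarchy above could be sketched as a read-through cache keyed by snapshot id. This is a minimal sketch under assumptions, not the code in this PR: `ManifestListCacheSketch`, the simplified `SnapshotManifests` holder (manifest paths as plain strings), and the `loader` callback that stands in for reading `snap-*.avro` are all hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

// Hypothetical stand-in for the cached value: data and delete manifest paths of one snapshot.
final class SnapshotManifests {
    private final List<String> dataManifests;
    private final List<String> deleteManifests;

    SnapshotManifests(List<String> dataManifests, List<String> deleteManifests) {
        this.dataManifests = dataManifests;
        this.deleteManifests = deleteManifests;
    }

    List<String> dataManifests() { return dataManifests; }
    List<String> deleteManifests() { return deleteManifests; }
}

// Illustrative read-through cache (snapshotId -> SnapshotManifests); not the StarRocks implementation.
final class ManifestListCacheSketch {
    private final Map<Long, SnapshotManifests> cache = new ConcurrentHashMap<>();
    private final LongFunction<SnapshotManifests> loader; // assumed to read the manifest list file
    private final AtomicInteger loads = new AtomicInteger();

    ManifestListCacheSketch(LongFunction<SnapshotManifests> loader) {
        this.loader = loader;
    }

    // Returns the cached entry, invoking the loader only on a miss.
    SnapshotManifests get(long snapshotId) {
        return cache.computeIfAbsent(snapshotId, id -> {
            loads.incrementAndGet();
            return loader.apply(id);
        });
    }

    // Number of times the loader ran, i.e. the number of simulated manifest list reads.
    int loadCount() { return loads.get(); }
}
```

Because a snapshot's manifest list is immutable once written, keying by snapshot id means entries never go stale; a real cache would still need a size or time bound, which this sketch omits.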

Fixes #67033

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.0
    • 3.5
    • 3.4
    • 3.3

@eshishki eshishki requested a review from a team as a code owner December 19, 2025 23:06
@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@Wenjun7J
Contributor

  @Override
  public List<ManifestFile> dataManifests(FileIO fileIO) {
    if (dataManifests == null) {
      cacheManifests(fileIO);
    }
    return dataManifests;
  }

I think BaseSnapshot will cache the results?
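
To make the point concrete, here is a minimal sketch of the lazy-initialization pattern in the snippet above. It is an illustration under assumptions, not Iceberg's `BaseSnapshot`: `LazySnapshotSketch`, the `manifestListReader` supplier (standing in for the manifest list read), and `readCount()` are hypothetical. It shows that repeated calls on the same snapshot object trigger only one read, so as long as the `Table` (and its snapshots) stays in the tables cache, the manifest list is effectively cached already.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of lazy memoization on a snapshot object; not Iceberg's BaseSnapshot.
final class LazySnapshotSketch {
    private final Supplier<List<String>> manifestListReader; // stands in for reading snap-*.avro
    private List<String> dataManifests;                       // null until the first access
    private int reads;                                        // counts simulated file reads

    LazySnapshotSketch(Supplier<List<String>> manifestListReader) {
        this.manifestListReader = manifestListReader;
    }

    // First call performs the read; later calls return the memoized list.
    // Not thread-safe as written; this only demonstrates the memoization itself.
    List<String> dataManifests() {
        if (dataManifests == null) {
            reads++;
            dataManifests = manifestListReader.get();
        }
        return dataManifests;
    }

    int readCount() { return reads; }
}
```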

@eshishki
Contributor Author

@Wenjun7J you are absolutely right, I missed that in the Iceberg class.
Sorry for the noise.

@eshishki eshishki closed this Dec 26, 2025
@eshishki eshishki deleted the feature/manifest-list-cache branch December 26, 2025 09:42
Development

Successfully merging this pull request may close these issues.

Add Iceberg Snapshot-to-ManifestFile Mapping Cache

2 participants