Parallelize and bulk-resolve getViews in the Iceberg JDBC catalog#29751
Open
tbaeg wants to merge 4 commits into
Open
Parallelize and bulk-resolve getViews in the Iceberg JDBC catalog#29751tbaeg wants to merge 4 commits into
tbaeg wants to merge 4 commits into
Conversation
d5e21b1 to
49687d2
Compare
TrinoJdbcCatalog.getViews listed views across every namespace in the catalog, ignoring the namespace argument it was given. Route it through the listNamespaces helper so it only scans the requested namespace. The helper now returns an empty list for a present but non-existent namespace rather than falling back to listing all namespaces. Callers already verify that a namespace exists, so this is not a correctness change for listTables/listIcebergTables/listViews; it just avoids issuing extra JDBC queries for a namespace that does not exist. Add TestTrinoJdbcCatalog, a BaseTrinoCatalogTest implementation backed by the Postgres-based TestingIcebergJdbcServer, and add testNamespaceFilter to the shared base test covering listTables, listViews and getViews honoring the namespace filter. TrinoRestCatalog.getViews throws for a non-existent namespace, so TestTrinoRestCatalog overrides the test to document that behavior until it is normalized. Co-authored-by: jdchandler88 <jdchandler88@gmail.com>
TrinoJdbcCatalog.getViews loaded each view sequentially. Every loadView resolves the view metadata pointer over JDBC and then reads the view metadata JSON from the underlying file system (e.g. an S3 GET), so listing views for information_schema.views incurred one network round-trip per view, serialized. Fan out the per-view loads across the shared @ForIcebergMetadata executor via processWithAdditionalThreads, bounded by iceberg.metadata-parallelism, mirroring TrinoGlueCatalog. Because the work is I/O-wait bound, this collapses the wall-clock cost from N sequential round-trips to N/parallelism. Extract the load-and-convert logic into loadViewDefinition, which skips the viewExists check since getViews has already listed the view, halving its JDBC calls. getView keeps the viewExists check for callers that pass an unverified name. loadViewDefinition treats NoSuchViewException as an absent view to stay graceful when a view is dropped concurrently with listing. Co-authored-by: jdchandler88 <jdchandler88@gmail.com>
getViews loaded each view via jdbcCatalog.loadView, whose JdbcViewOperations.doRefresh issues a per-view JDBC query (JdbcUtil.loadView) to resolve the metadata pointer before reading the metadata JSON. So even after the prior parallelization there was still one JDBC round trip per view, just spread across threads. Add IcebergJdbcClient.getViewMetadataLocations, which resolves every view's metadata_location for a namespace in a single query (views are stored in iceberg_tables with iceberg_type = 'VIEW'). getViews now runs that one bulk query per namespace - also subsuming the listViews call - and parses each view's metadata JSON directly via ViewMetadataParser on the metadata-fetching executor, mirroring how tables already resolve a location through IcebergJdbcClient and then read JSON via FileIO. The view-definition conversion is factored into createViewDefinition, shared by the bulk path and the single-view getView path (which keeps using loadView). A view dropped concurrently with listing is treated as absent (NotFoundException), matching the prior NoSuchViewException handling. Co-authored-by: jdchandler88 <jdchandler88@gmail.com>
49687d2 to
32040eb
Compare
Member
Author
|
Failure doesn't appear to be related. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
#29738 should be merged first.
Problem
Listing views in the Iceberg JDBC catalog (e.g. querying information_schema.views) was slow and got linearly worse as the number of views grew.
Loading each view requires two network round trips:
For a schema with N views, this meant N database queries plus N file reads, all performed one after another. Listing views in a busy schema therefore hammered the metastore DB and stacked up file-read latency serially.
Solution
Two changes, one per commit:
Load views in parallel.
Instead of loading views one at a time, the per-view loads are fanned out across the shared metadata executor (bounded by iceberg.metadata-parallelism). The file reads now happen concurrently rather than back-to-back.
Resolve all view metadata pointers in a single bulk query.
Parallelizing still left one separate database query per view. This adds a bulk lookup that resolves every view's metadata location for a schema in one query, replacing the N per-view queries.
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: