Skip to content

Parallelize and bulk-resolve getViews in the Iceberg JDBC catalog#29751

Open
tbaeg wants to merge 4 commits into
trinodb:masterfrom
tbaeg:perf/jdbc-parallel-get-views
Open

Parallelize and bulk-resolve getViews in the Iceberg JDBC catalog#29751
tbaeg wants to merge 4 commits into
trinodb:masterfrom
tbaeg:perf/jdbc-parallel-get-views

Conversation

@tbaeg

@tbaeg tbaeg commented Jun 4, 2026

Copy link
Copy Markdown
Member

Description

#29738 should be merged first.

Problem

Listing views in the Iceberg JDBC catalog (e.g. querying information_schema.views) was slow and got linearly worse as the number of views grew.

Loading each view requires two network round trips:

  1. A query to the metastore database to resolve where the view's metadata file lives.
  2. A read of that metadata file from the underlying file system (e.g. an S3 GET).

For a schema with N views, this meant N database queries plus N file reads, all performed one after another. Listing views in a busy schema therefore hammered the metastore DB and stacked up file-read latency serially.

Solution

Two changes, one per commit:

  1. Load views in parallel.
    Instead of loading views one at a time, the per-view loads are fanned out across the shared metadata executor (bounded by iceberg.metadata-parallelism). The file reads now happen concurrently rather than back-to-back.

  2. Resolve all view metadata pointers in a single bulk query.
    Parallelizing still left one separate database query per view. This adds a bulk lookup that resolves every view's metadata location for a schema in one query, replacing the N per-view queries.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@github-actions github-actions Bot added iceberg Iceberg connector cla-signed labels Jun 4, 2026
@tbaeg tbaeg force-pushed the perf/jdbc-parallel-get-views branch from d5e21b1 to 49687d2 Compare June 6, 2026 03:13
@tbaeg tbaeg changed the title Parallelize getViews in Iceberg JDBC catalog Parallelize and bulk-resolve getViews in the Iceberg JDBC catalog Jun 6, 2026
tbaeg and others added 4 commits June 6, 2026 10:40
TrinoJdbcCatalog.getViews listed views across every namespace in the
catalog, ignoring the namespace argument it was given. Route it through
the listNamespaces helper so it only scans the requested namespace.

The helper now returns an empty list for a present but non-existent
namespace rather than falling back to listing all namespaces. Callers
already verify that a namespace exists, so this is not a correctness
change for listTables/listIcebergTables/listViews; it just avoids issuing
extra JDBC queries for a namespace that does not exist.

Add TestTrinoJdbcCatalog, a BaseTrinoCatalogTest implementation backed by
the Postgres-based TestingIcebergJdbcServer, and add testNamespaceFilter
to the shared base test covering listTables, listViews and getViews
honoring the namespace filter. TrinoRestCatalog.getViews throws for a
non-existent namespace, so TestTrinoRestCatalog overrides the test to
document that behavior until it is normalized.

Co-authored-by: jdchandler88 <jdchandler88@gmail.com>
TrinoJdbcCatalog.getViews loaded each view sequentially. Every loadView
resolves the view metadata pointer over JDBC and then reads the view
metadata JSON from the underlying file system (e.g. an S3 GET), so
listing views for information_schema.views incurred one network
round-trip per view, serialized.

Fan out the per-view loads across the shared @ForIcebergMetadata
executor via processWithAdditionalThreads, bounded by
iceberg.metadata-parallelism, mirroring TrinoGlueCatalog. Because the
work is I/O-wait bound, this collapses the wall-clock cost from N
sequential round-trips to N/parallelism.

Extract the load-and-convert logic into loadViewDefinition, which skips
the viewExists check since getViews has already listed the view, halving
its JDBC calls. getView keeps the viewExists check for callers that pass
an unverified name. loadViewDefinition treats NoSuchViewException as an
absent view to stay graceful when a view is dropped concurrently with
listing.

Co-authored-by: jdchandler88 <jdchandler88@gmail.com>
getViews loaded each view via jdbcCatalog.loadView, whose
JdbcViewOperations.doRefresh issues a per-view JDBC query
(JdbcUtil.loadView) to resolve the metadata pointer before reading the
metadata JSON. So even after the prior parallelization there was still
one JDBC round trip per view, just spread across threads.

Add IcebergJdbcClient.getViewMetadataLocations, which resolves every
view's metadata_location for a namespace in a single query (views are
stored in iceberg_tables with iceberg_type = 'VIEW'). getViews now runs
that one bulk query per namespace - also subsuming the listViews call -
and parses each view's metadata JSON directly via ViewMetadataParser on
the metadata-fetching executor, mirroring how tables already resolve a
location through IcebergJdbcClient and then read JSON via FileIO.

The view-definition conversion is factored into createViewDefinition,
shared by the bulk path and the single-view getView path (which keeps
using loadView). A view dropped concurrently with listing is treated as
absent (NotFoundException), matching the prior NoSuchViewException
handling.

Co-authored-by: jdchandler88 <jdchandler88@gmail.com>
@tbaeg tbaeg force-pushed the perf/jdbc-parallel-get-views branch from 49687d2 to 32040eb Compare June 6, 2026 17:32
@tbaeg

tbaeg commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

Failure doesn't appear to be related.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed iceberg Iceberg connector

Development

Successfully merging this pull request may close these issues.

1 participant