[KYUUBI #7305] Fix JDBC engine returning tables from all databases when a database is specified in the URL #7419

Open
wangzhigang1999 wants to merge 4 commits into apache:master from wangzhigang1999:KYUUBI-7305

Conversation

@wangzhigang1999
Contributor

Why are the changes needed?

This is a bug fix.

The Hive JDBC driver rewrites a null schemaPattern to "%" in GetTables, and the JDBC engine forwards it directly into WHERE TABLE_SCHEMA LIKE '%'. As a result, DBeaver — which always passes null once a connection is established — sees tables from every database in the backend rather than only the one in its connection URL.

Compounding this, the JDBC engine never applies the URL-level database to the underlying JDBC connection, so there is no "current database" on the backend for a filter to fall back to.

The fix introduces a dialect-level setCurrentDatabase hook, tracks an effectiveDatabase on the session, and routes blank/"%" schema patterns through it in GetTablesOperation. MySQLDialect additionally implements getCatalogsOperation / getSchemasOperation with JDBC-standard column labels (TABLE_CATALOG / TABLE_SCHEM) so the schemas tree in DBeaver populates correctly.

Non-MySQL dialects inherit the default no-op setCurrentDatabase and effectiveDatabase stays None, so PostgreSQL / Oracle / etc. keep their existing behavior.

Note: getSchemas() continues to return all accessible schemas per the JDBC spec — the URL-level database only affects the default query context and the blank/"%" fallback in getTables. This matches the behavior of the native MySQL JDBC driver. Clients that want to restrict the visible schema list should do so via client-side filters (e.g. DBeaver's "Schemas" filter).
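The routing rule above can be sketched as a pure function (all names here are illustrative, not the actual Kyuubi signatures — the real hook lives in JdbcOperationManager.newGetTablesOperation):

```java
import java.util.Optional;

public class SchemaPatternRouting {

    /**
     * Illustrative sketch of the fallback rule: a null, blank, or "%" schema
     * pattern means "no client-side preference" (the Hive driver rewrites
     * null to "%"), so fall back to the session's effective database when
     * one was applied at session open.
     */
    static String resolveSchemaPattern(String schemaPattern, Optional<String> effectiveDatabase) {
        boolean unspecified = schemaPattern == null
                || schemaPattern.trim().isEmpty()
                || "%".equals(schemaPattern);
        if (unspecified) {
            // Sessions without an effective database (PostgreSQL, Oracle, ...)
            // resolve to null here, i.e. the previous no-filter behavior.
            return effectiveDatabase.orElse(null);
        }
        // An explicit pattern from the client always wins.
        return schemaPattern;
    }

    public static void main(String[] args) {
        System.out.println(resolveSchemaPattern("%", Optional.of("doctor")));     // doctor
        System.out.println(resolveSchemaPattern("info%", Optional.of("doctor"))); // info%
        System.out.println(resolveSchemaPattern("%", Optional.empty()));          // null
    }
}
```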

How was this patch tested?

Added unit / integration tests in MySQLOperationSuite covering:

  • getTables with a database in the URL returns only that database's tables
  • getTables with a "%" schema pattern is routed to the session's effective database
  • getSchemas returns results with JDBC-standard TABLE_CATALOG / TABLE_SCHEM column labels
  • getCatalogs behavior

Also verified end-to-end against a live StarRocks backend using Spark's Hive-JDBC beeline:

```text
# URL with a database → only that database's tables
jdbc:hive2://127.0.0.1:10009/doctor              → !tables : 3 rows, all TABLE_SCHEMA=doctor
jdbc:hive2://127.0.0.1:10009/information_schema  → !tables : only information_schema views

# URL without a database → previous no-filter behavior preserved
jdbc:hive2://127.0.0.1:10009/                    → !tables : tables from every database

# DatabaseMetaData.getSchemas() returns JDBC-standard columns
columns: TABLE_CATALOG, TABLE_SCHEM
rows:    def/doctor, def/information_schema, def/sys, ...
```

Was this patch authored or co-authored using generative AI tooling?

Human-authored core design and implementation. Tests, verification, and PR description drafting were assisted by Claude Code.

Generated-by: Claude Code (Opus 4.7)

Closes #7305. Supersedes the stale #7307 per @pan3793's suggestion there.
Co-authored-by: zhzhenqin <zhzhenqin@users.noreply.github.com>

…ses when a database is specified in the URL

When DBeaver (or any Hive-JDBC client) connects to the Kyuubi JDBC engine with
a database path in the URL, DatabaseMetaData.getTables should return only the
tables from that database. Previously, the Hive JDBC driver converted a null
schemaPattern to "%" before sending the request, which the engine forwarded
straight into a WHERE TABLE_SCHEMA LIKE '%' clause on the backend, returning
tables from every database.

- Introduce JdbcDialect.setCurrentDatabase(connection, database) as an
  extension point; the default is a no-op returning false. MySQLDialect
  overrides it with USE `db` so MySQL/StarRocks/Doris switch the backend
  connection on session open.
- JdbcSessionImpl records the successfully applied database in
  effectiveDatabase and only keeps it when setCurrentDatabase returns true.
  "default" (the Hive JDBC driver's fallback when no database is specified in
  the URL) is tolerated on USE failure to stay compatible with backends that
  have no "default" database.
- JdbcOperationManager.newGetTablesOperation treats a blank schema or "%" as
  "use the session's effective database", routing through
  effectiveDatabase.orNull. Backends without an effective database
  (PostgreSQL, Oracle, etc.) keep the previous no-filter behavior.
- MySQLDialect now implements getCatalogsOperation and getSchemasOperation
  (previously featureNotSupported). getSchemasOperation aliases result columns
  to the JDBC-standard TABLE_CATALOG / TABLE_SCHEM so
  DatabaseMetaData.getSchemas works with clients such as DBeaver.

Closes apache#7305.
@wangzhigang1999 wangzhigang1999 marked this pull request as ready for review April 24, 2026 10:54
```scala
 * The return value is used by `JdbcOperationManager` to decide whether the requested
 * database can be treated as the effective schema filter for metadata operations.
 */
def setCurrentDatabase(connection: Connection, database: String): Boolean = false
```
Member

does Connection.setSchema work?

Contributor Author

Good question — I looked into this and it's a bit subtle. setSchema doesn't reliably work for the MySQL family in mysql-connector-j's default config. From ConnectionImpl.java (9.x):

```java
// line 2625
public void setSchema(String schema) throws SQLException {
    checkClosed();
    if (this.propertySet.<DatabaseTerm>getEnumProperty(PropertyKey.databaseTerm).getValue() == DatabaseTerm.SCHEMA) {
        setDatabase(schema);
    }
}

// line 2143
public void setCatalog(final String catalog) throws SQLException {
    if (this.propertySet.<DatabaseTerm>getEnumProperty(PropertyKey.databaseTerm).getValue() == DatabaseTerm.CATALOG) {
        setDatabase(catalog);
    }
}
```

The driver gates both methods on the databaseTerm connection property:

| databaseTerm | setCatalog(x) | setSchema(x) |
| --- | --- | --- |
| CATALOG (default) | sends COM_INIT_DB | silent no-op |
| SCHEMA | silent no-op | sends COM_INIT_DB |
So calling setSchema alone would silently do nothing under the default config (and reproduce the original bug). Note this also affects StarRocksDialect / DorisDialect since they extend MySQLDialect and use the same driver — databaseTerm is a driver-side property, not a server feature.

Two options I'm considering, happy to take direction:

Option A — keep USE db as-is in MySQLDialect (current PR). Simple, always works regardless of driver config, but it's hand-rolled SQL.

Option B — go through JDBC standard API, MySQL-family only:

Keep JdbcDialect.setCurrentDatabase as the default no-op returning false (PG / Oracle / etc. unchanged from this PR). Only MySQLDialect overrides it:

```scala
// MySQLDialect: call both — under mysql-connector-j, exactly one is a real op
// for either databaseTerm value; the other is a guaranteed no-op.
// Inherited by StarRocksDialect / DorisDialect.
override def setCurrentDatabase(connection: Connection, database: String): Boolean = {
  connection.setCatalog(database)
  connection.setSchema(database)
  true
}
```

This still avoids the hand-rolled USE SQL and gets the driver hooks (useLocalSessionState, ConnectionLifecycleInterceptor), but doesn't touch any non-MySQL dialect's behavior.

WDYT?

Contributor Author

Update — went with Option B and pushed it as ff51841.

MySQLDialect.setCurrentDatabase now calls connection.setCatalog(database) followed by connection.setSchema(database) instead of executing USE ourselves. JdbcDialect's default stays a no-op returning false, and no other dialect is touched.

Why B over A:

  • Goes through JDBC standard API, addressing the spirit of your review — no more hand-rolled SQL, no backtick escaping concerns.
  • Engages the driver's internal hooks (useLocalSessionState cache, ConnectionLifecycleInterceptor) so connection state stays coherent.
  • Double-call cost is effectively zero: under either databaseTerm value, the non-matching call is just a failed property check that returns immediately, with no network IO.
  • Functionally equivalent for MySQL / StarRocks / Doris — verified locally with DBeaver: logs show catalog=null → catalog=doctor, confirming setCatalog actually sends COM_INIT_DB.

Why I didn't extend setSchema to the default impl for other dialects: Phoenix / Impala drivers are likely to throw SQLFeatureNotSupportedException from setSchema, which would regress session open. Keeping the change scoped to MySQL family.

Also added three StarRocks regression tests covering URL-scoped getTables, no-filter when URL has no db, and session-open failure on a non-existent db.

PTAL.

Member

@pan3793 pan3793 Apr 28, 2026

@wangzhigang1999 how about

  • replicate the API with the JDBC standard API, with an additional leading conn: Connection parameter; the default impl calls the JDBC standard API
  • override the impl for known JDBC drivers that do not support this API
  • don't need to care about the return value, always treat it as if it takes effect.

Member

or investigate how DBeaver handles that

Contributor Author

Looked into DBeaver source. The root cause isn't on the client side — both DBeaver and the Hive driver are spec-correct (null schemaPattern legitimately means "all schemas"). The real issue is that the engine was discarding the URL's database entirely: not switching the backend connection, not using it as the default metadata scope. Engine-side fix is the right place.

Will follow up with the API cleanup you suggested (default setSchema, override Phoenix/Impala if needed, drop the boolean).

…tandard setCatalog/setSchema

Replace the hand-rolled `USE \`db\`` with `Connection.setCatalog` plus
`Connection.setSchema` in MySQLDialect.setCurrentDatabase. mysql-connector-j
gates the two methods on its `databaseTerm` connection property: with the
default CATALOG, setCatalog issues COM_INIT_DB and setSchema is a silent
no-op; with SCHEMA it is the inverse. Calling both means exactly one fires
regardless of how the user configured the URL, while still going through
standard JDBC API (and therefore the driver's session-state hooks instead of
hand-rolled SQL). StarRocksDialect / DorisDialect inherit this via
MySQLDialect.

Also add three StarRocks regression tests covering: scoping when URL
specifies a database, no-filter behavior when URL has no database, and
session-open failure on a non-existent database.
…Dialect

Per review feedback, replicate the JDBC `Connection#setSchema(String)` API on
JdbcDialect with a leading `conn` parameter. The default forwards to the
standard API; dialects whose driver does not support it can override to a
no-op. Drop the boolean return value — callers treat the call as if it always
takes effect. JdbcSessionImpl now records `effectiveDatabase` unconditionally
after the call, decoupling "backend connection switched" from "engine
metadata scope".
```scala
// Calling both means exactly one fires regardless of how the user configured the
// URL. StarRocksDialect / DorisDialect inherit this via MySQLDialect.
conn.setCatalog(schema)
conn.setSchema(schema)
```
Member

Hmm... is this really safe? I think StarRocks and Doris support both catalog and schema ...

Contributor Author

Good point to think through. The dual setCatalog + setSchema is safe because mysql-connector-j picks one of the two to actually send and silently swallows the other inside the driver, based on its databaseTerm connection property: with the default CATALOG, setCatalog sends a "switch database" command and setSchema does nothing; with SCHEMA it's the reverse. So no matter how the user configured the URL, the StarRocks/Doris server only ever receives one "switch database" command, never two.

Confirmed by reading the driver source — both methods funnel into the same setDatabase(...) call after the databaseTerm check, see ConnectionImpl.java#L2145. setCatalog is literally if (databaseTerm == CATALOG) setDatabase(catalog); and setSchema is the symmetric if (databaseTerm == SCHEMA) setDatabase(schema); — exactly one branch fires per call site, neither throws when its branch is skipped.

The server treats that command as "switch the current database" — it does not treat it as "switch catalog" (catalog switching on StarRocks/Doris is a separate SET CATALOG SQL statement, not something the MySQL protocol can do). The name we pass in comes from USE_DATABASE, which the Hive JDBC driver parses out of the URL path. The Hive URL syntax has no concept of catalog, so this name is always a database in the current (default) catalog.
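To make the last point concrete, here is a rough illustration of the URL-path-to-database extraction (this is not the Hive driver's actual parser — it ignores session parameters after `;` and is only meant to show that the path carries a single database name, never a catalog):

```java
import java.net.URI;

public class HiveUrlDatabase {

    /**
     * Illustrative sketch: the first path segment of a jdbc:hive2:// URL
     * names a database. The Hive URL syntax has no catalog component, so
     * whatever the engine receives here is a database in the default catalog.
     */
    static String databaseFromUrl(String jdbcUrl) {
        // Strip the "jdbc:" prefix so java.net.URI can parse the remainder.
        String path = URI.create(jdbcUrl.substring("jdbc:".length())).getPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            // The Hive driver's documented fallback when no database is given.
            return "default";
        }
        return path.substring(1).split("/")[0];
    }

    public static void main(String[] args) {
        System.out.println(databaseFromUrl("jdbc:hive2://127.0.0.1:10009/doctor")); // doctor
        System.out.println(databaseFromUrl("jdbc:hive2://127.0.0.1:10009/"));       // default
    }
}
```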

```scala
} else {
  schemaName
}
val query = dialect.getTablesQuery(catalogName, effectiveSchema, tableName, tableTypes)
```
Member

I don't recall the details of the initial code of this part, but the API looks problematic.

the Hive Thrift GetTablesOperation API is derived from the JDBC API, and that's why they look almost identical, for JDBC engine, such a call should directly map to the JDBC API instead of requesting a SQL (if you take a look at the JDBC driver implementation, you may find some impls are non-SQL-based), the current design complicates the whole thing.

Contributor Author

@wangzhigang1999 wangzhigang1999 Apr 28, 2026

Thanks for raising this — you're right, and I want to confirm I took it seriously rather than hand-wave it away. To make sure I understood, I ran the full set of DatabaseMetaData calls against three different drivers (mysql-connector-j 8.4.0, postgresql 42.7.2, clickhouse-jdbc 0.6.5) using a small Java probe and docker containers. Test fixtures were tailored to each backend (MySQL/PG had PK + FK + a function; ClickHouse used MergeTree which has no native PK/FK).

Cross-driver behavior of each API:

| API | MySQL (databaseTerm=CATALOG) | MySQL (databaseTerm=SCHEMA) | PostgreSQL | ClickHouse |
| --- | --- | --- | --- | --- |
| getCatalogs() | ✓ implemented | ⚠ empty (driver databaseTerm quirk) | ✓ implemented | ✓ implemented |
| getSchemas() | ⚠ empty (driver databaseTerm quirk) | ✓ implemented | ✓ implemented | ✓ empty — model has no schema layer |
| getTableTypes() | ✓ | ✓ | ✓ | ✓ |
| getTables(...) | ✓ | ✓ | ✓ | ✓ |
| getColumns(...) | ✓ | ✓ | ✓ | ✓ |
| getPrimaryKeys(...) | ✓ | ✓ | ✓ | ✓ empty — MergeTree has no PK |
| getImported / Exported / CrossReference | ✓ | ✓ | ✓ | ✓ empty — no FK in CK |
| getFunctions(...) | ✓ | ✓ | ✓ | ✓ (no fixture created) |
| getTypeInfo() | ✓ | ✓ | ✓ | ✓ |

Things that hold across all three drivers:

  • No SQLFeatureNotSupportedException anywhere — every JDBC-standard metadata method is implemented and returns a ResultSet rather than throwing, even when the data model doesn't populate it.
  • Column names follow the JDBC spec (TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE, KEY_SEQ, PK_NAME, ...) — which is exactly what GetTablesOperation / GetColumnsOperation / etc. on the Hive Thrift side expect, since the Thrift API was derived from the JDBC API in the first place.
  • Driver-specific quirks are minor and well-defined: databaseTerm gating is unique to mysql-connector-j; PG returns column labels in lowercase while MySQL/CK use uppercase (the JDBC spec says labels are case-insensitive, but engine-side string comparisons would need to match).
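The label-case point means any engine-side lookup by column label must compare case-insensitively. A minimal sketch (hypothetical helper, shown on plain string arrays rather than a live ResultSetMetaData):

```java
public class ColumnLabels {

    /**
     * Hypothetical helper: find a column's index by label, case-insensitively,
     * since drivers disagree on the case they report ("table_schem" on PG,
     * "TABLE_SCHEM" on MySQL/ClickHouse) while the JDBC spec treats labels
     * as case-insensitive.
     */
    static int indexOf(String[] labels, String wanted) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].equalsIgnoreCase(wanted)) {
                return i;
            }
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        String[] pg = {"table_catalog", "table_schem"};  // PostgreSQL style
        String[] my = {"TABLE_CATALOG", "TABLE_SCHEM"};  // MySQL / ClickHouse style
        System.out.println(indexOf(pg, "TABLE_SCHEM"));  // 1
        System.out.println(indexOf(my, "table_schem"));  // 1
    }
}
```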

So the JDBC standard API really does cover every operation the dialect models today, across very different DBMS families. Concretely, this means the JdbcDialect.getXxxQuery family of methods could be deleted in favor of the engine calling connection.getMetaData.getXxx(...) directly — and dialects whose driver implements metadata via a non-SQL path (Avatica, etc.) would automatically be used the way they were designed to.

The few wrinkles I hit are all driver-side and resolvable in a few lines of engine-level adapter code, not by reaching for SQL:

  • getCatalogs / getSchemas are mutually exclusive on mysql-connector-j (gated by databaseTerm) — handled by checking the property and routing, or by calling both and taking whichever is non-empty.
  • DATA_TYPE is int in the JDBC spec but the current SQL returns a string — java.sql.Types-to-name mapping is one helper.
  • The current SQL returns extra MySQL-specific columns (ENGINE, TABLE_ROWS, ...) but the Hive Thrift GetTables result schema is fixed to JDBC-standard columns, so those extras are already dropped during serialization today and aren't reaching clients — no real loss.
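The DATA_TYPE wrinkle really is about one helper. A sketch using java.sql.JDBCType (the vendor-specific alias table is an assumption left out here — names like MySQL's "INT" or "TEXT" would need an extra lookup):

```java
import java.sql.JDBCType;
import java.sql.Types;

public class DataTypeMapping {

    /**
     * Sketch: map a type-name string (as the current SQL-based path returns)
     * to the integer java.sql.Types code the JDBC DATA_TYPE column requires.
     * JDBCType enum constant names match the java.sql.Types constant names,
     * so the standard names come for free; vendor aliases fall to OTHER.
     */
    static int toTypesCode(String typeName) {
        try {
            return JDBCType.valueOf(typeName.trim().toUpperCase()).getVendorTypeNumber();
        } catch (IllegalArgumentException e) {
            return Types.OTHER; // unknown / vendor-specific name
        }
    }

    public static void main(String[] args) {
        System.out.println(toTypesCode("varchar"));  // 12 == java.sql.Types.VARCHAR
        System.out.println(toTypesCode("geometry")); // 1111 == java.sql.Types.OTHER
    }
}
```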

I'd like to keep this PR scoped to the URL-database bug since that's what it set out to fix, and migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test — not something I'd want to fold into a bug fix. Happy to open a follow-up issue tracking the migration if you agree with that split. WDYT?

Member

migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test

if we do that first, do we still need this patch?

Member

@pan3793 pan3793 Apr 28, 2026

No SQLFeatureNotSupportedException anywhere

Actually, the JDBC API Javadocs define behavior in a very clear way: a method should not throw SQLFeatureNotSupportedException unless the Javadocs say it can. Some tools, like DBeaver, may tolerate a JDBC driver vendor not respecting the API contract, but tools like DataGrip may not.

Member

BTW, we MUST avoid copying the code from mysql-connector-j, it's GPL-licensed ... try to read the docs or use black box testing to tackle the mysql-connector-j-related issue. (this will be a horrible experience ...)


Development

Successfully merging this pull request may close these issues.

[Bug] DBeaver returned all tables in all databases when connecting.
