[KYUUBI #7305] Fix JDBC engine returning tables from all databases when a database is specified in the URL #7419
Conversation
…ses when a database is specified in the URL

When DBeaver (or any Hive-JDBC client) connects to the Kyuubi JDBC engine with a database path in the URL, `DatabaseMetaData.getTables` should return only the tables from that database. Previously, the Hive JDBC driver converted a null `schemaPattern` to `"%"` before sending the request, which the engine forwarded straight into a `WHERE TABLE_SCHEMA LIKE '%'` clause on the backend, returning tables from every database.

- Introduce `JdbcDialect.setCurrentDatabase(connection, database)` as an extension point; the default is a no-op returning false. `MySQLDialect` overrides it with ``USE `db` `` so MySQL/StarRocks/Doris switch the backend connection on session open.
- `JdbcSessionImpl` records the successfully applied database in `effectiveDatabase` and only keeps it when `setCurrentDatabase` returns true. `"default"` (the Hive JDBC driver's fallback when no database is specified in the URL) is tolerated on `USE` failure to stay compatible with backends that have no `"default"` database.
- `JdbcOperationManager.newGetTablesOperation` treats a blank schema or `"%"` as "use the session's effective database", routing through `effectiveDatabase.orNull`. Backends without an effective database (PostgreSQL, Oracle, etc.) keep the previous no-filter behavior.
- `MySQLDialect` now implements `getCatalogsOperation` and `getSchemasOperation` (previously `featureNotSupported`). `getSchemasOperation` aliases result columns to the JDBC-standard `TABLE_CATALOG` / `TABLE_SCHEM` so `DatabaseMetaData.getSchemas` works with clients such as DBeaver.

Closes apache#7305.
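The blank/`"%"` routing described above can be sketched as a small pure helper. This is an illustration only — the names (`SchemaPatternRouting`, `effectiveSchemaFilter`, the `Optional` parameter) are mine, not the actual Kyuubi code, which lives in `JdbcOperationManager.newGetTablesOperation`:

```java
import java.util.Optional;

/** Illustrative sketch of the blank/"%" schema-pattern routing described above. */
public class SchemaPatternRouting {

    /**
     * If the client sent no real schema filter (null, blank, or the Hive JDBC
     * driver's "%" rewrite of null), fall back to the session's effective
     * database; otherwise honor the client's pattern unchanged.
     */
    static String effectiveSchemaFilter(String schemaPattern, Optional<String> effectiveDatabase) {
        boolean noRealFilter = schemaPattern == null
                || schemaPattern.isBlank()
                || "%".equals(schemaPattern);
        if (noRealFilter) {
            // Backends without an effective database keep the old no-filter behavior.
            return effectiveDatabase.orElse(null);
        }
        return schemaPattern;
    }

    public static void main(String[] args) {
        System.out.println(effectiveSchemaFilter("%", Optional.of("doctor")));    // doctor
        System.out.println(effectiveSchemaFilter("db_%", Optional.of("doctor"))); // db_%
        System.out.println(effectiveSchemaFilter("%", Optional.empty()));         // null
    }
}
```

A genuine pattern like `db_%` passes through untouched, so only the "no filter at all" case is rescoped.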
```scala
   * The return value is used by `JdbcOperationManager` to decide whether the requested
   * database can be treated as the effective schema filter for metadata operations.
   */
  def setCurrentDatabase(connection: Connection, database: String): Boolean = false
```
does Connection.setSchema work?
Good question — I looked into this and it's a bit subtle. setSchema doesn't reliably work for the MySQL family in mysql-connector-j's default config. From ConnectionImpl.java (9.x):
```java
// line 2625
public void setSchema(String schema) throws SQLException {
    checkClosed();
    if (this.propertySet.<DatabaseTerm>getEnumProperty(PropertyKey.databaseTerm).getValue() == DatabaseTerm.SCHEMA) {
        setDatabase(schema);
    }
}

// line 2143
public void setCatalog(final String catalog) throws SQLException {
    if (this.propertySet.<DatabaseTerm>getEnumProperty(PropertyKey.databaseTerm).getValue() == DatabaseTerm.CATALOG) {
        setDatabase(catalog);
    }
}
```

The driver gates both methods on the `databaseTerm` connection property:
| `databaseTerm` | `setCatalog(x)` | `setSchema(x)` |
|---|---|---|
| CATALOG (default) | sends COM_INIT_DB | silent no-op |
| SCHEMA | silent no-op | sends COM_INIT_DB |
So calling setSchema alone would silently do nothing under the default config (and reproduce the original bug). Note this also affects StarRocksDialect / DorisDialect since they extend MySQLDialect and use the same driver — databaseTerm is a driver-side property, not a server feature.
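To make the gating concrete without reproducing any driver code, here is a toy stand-in (emphatically not the real mysql-connector-j implementation — class and field names are invented) showing why the dual call yields exactly one database switch under either `databaseTerm` value:

```java
/** Toy model of mysql-connector-j's databaseTerm gating (not the real driver). */
public class DatabaseTermDemo {
    enum DatabaseTerm { CATALOG, SCHEMA }

    static class FakeConnection {
        final DatabaseTerm term;
        int initDbCommands = 0;   // how many COM_INIT_DB-equivalents were "sent"
        String currentDatabase;

        FakeConnection(DatabaseTerm term) { this.term = term; }

        void setCatalog(String catalog) {
            if (term == DatabaseTerm.CATALOG) setDatabase(catalog); // else silent no-op
        }
        void setSchema(String schema) {
            if (term == DatabaseTerm.SCHEMA) setDatabase(schema);   // else silent no-op
        }
        private void setDatabase(String db) {
            initDbCommands++;
            currentDatabase = db;
        }
    }

    public static void main(String[] args) {
        for (DatabaseTerm term : DatabaseTerm.values()) {
            FakeConnection conn = new FakeConnection(term);
            // The dual-call pattern from the dialect: exactly one branch fires.
            conn.setCatalog("doctor");
            conn.setSchema("doctor");
            System.out.println(term + ": switches=" + conn.initDbCommands
                    + ", db=" + conn.currentDatabase);
        }
        // Prints switches=1 for both CATALOG and SCHEMA: never zero, never two.
    }
}
```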
Two options I'm considering, happy to take direction:
Option A — keep USE db as-is in MySQLDialect (current PR). Simple, always works regardless of driver config, but it's hand-rolled SQL.
Option B — go through JDBC standard API, MySQL-family only:
Keep JdbcDialect.setCurrentDatabase as the default no-op returning false (PG / Oracle / etc. unchanged from this PR). Only MySQLDialect overrides it:
```scala
// MySQLDialect: call both — under mysql-connector-j, exactly one is a real op
// for either databaseTerm value, the other is a guaranteed no-op.
// Inherited by StarRocksDialect / DorisDialect.
override def setCurrentDatabase(connection: Connection, database: String): Boolean = {
  connection.setCatalog(database)
  connection.setSchema(database)
  true
}
```

This still avoids the hand-rolled `USE` SQL and gets the driver hooks (`useLocalSessionState`, `ConnectionLifecycleInterceptor`), but doesn't touch any non-MySQL dialect's behavior.
WDYT?
Update — went with Option B and pushed it as ff51841.
MySQLDialect.setCurrentDatabase now calls connection.setCatalog(database) followed by connection.setSchema(database) instead of executing USE ourselves. JdbcDialect's default stays a no-op returning false, and no other dialect is touched.
Why B over A:
- Goes through JDBC standard API, addressing the spirit of your review — no more hand-rolled SQL, no backtick escaping concerns.
- Engages the driver's internal hooks (`useLocalSessionState` cache, `ConnectionLifecycleInterceptor`) so connection state stays coherent.
- Double-call cost is effectively zero: under any `databaseTerm`, the non-matching branch is an `if (false)` early return with no network IO.
- Functionally equivalent for MySQL / StarRocks / Doris — verified locally with DBeaver: logs show `catalog=null → catalog=doctor`, confirming `setCatalog` actually sends `COM_INIT_DB`.
Why I didn't extend setSchema to the default impl for other dialects: Phoenix / Impala drivers are likely to throw SQLFeatureNotSupportedException from setSchema, which would regress session open. Keeping the change scoped to MySQL family.
Also added three StarRocks regression tests covering URL-scoped getTables, no-filter when URL has no db, and session-open failure on a non-existent db.
PTAL.
@wangzhigang1999 how about
- replicate the API with the JDBC standard API, with an additional leading `conn: Connection` parameter; default impl calls the JDBC standard API
- override the impl for known JDBC drivers that do not support this API
- don't need to care about the return value, always treat it as if it takes effect.
or investigate how DBeaver handles that
Looked into DBeaver source. The root cause isn't on the client side — both DBeaver and the Hive driver are spec-correct (null schemaPattern legitimately means "all schemas"). The real issue is that the engine was discarding the URL's database entirely: not switching the backend connection, not using it as the default metadata scope. Engine-side fix is the right place.
Will follow up with the API cleanup you suggested (default setSchema, override Phoenix/Impala if needed, drop the boolean).
…tandard setCatalog/setSchema

Replace the hand-rolled ``USE `db` `` with `Connection.setCatalog` plus `Connection.setSchema` in `MySQLDialect.setCurrentDatabase`. mysql-connector-j gates the two methods on its `databaseTerm` connection property: with the default CATALOG, `setCatalog` issues COM_INIT_DB and `setSchema` is a silent no-op; with SCHEMA it is the inverse. Calling both means exactly one fires regardless of how the user configured the URL, while still going through the standard JDBC API (and therefore the driver's session-state hooks instead of hand-rolled SQL). StarRocksDialect / DorisDialect inherit this via MySQLDialect.

Also add three StarRocks regression tests covering: scoping when the URL specifies a database, no-filter behavior when the URL has no database, and session-open failure on a non-existent database.
…Dialect

Per review feedback, replicate the JDBC `Connection#setSchema(String)` API on JdbcDialect with a leading `conn` parameter. The default forwards to the standard API; dialects whose driver does not support it can override to a no-op. Drop the boolean return value — callers treat the call as if it always takes effect. JdbcSessionImpl now records `effectiveDatabase` unconditionally after the call, decoupling "backend connection switched" from "engine metadata scope".
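The reworked hook can be rendered in Java roughly as follows — the real code is the Scala trait `JdbcDialect`, so every name in this sketch (including the dynamic-proxy stand-in connection) is illustrative:

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;

/** Illustrative Java rendering of the reworked dialect hook; not the Kyuubi source. */
public class DialectHookSketch {

    interface DialectSketch {
        /**
         * Mirrors Connection#setSchema(String) with a leading conn parameter.
         * The default forwards to the standard JDBC API; a dialect whose
         * driver throws SQLFeatureNotSupportedException overrides to a no-op.
         * No boolean return — callers treat it as always taking effect.
         */
        default void setCurrentDatabase(Connection conn, String database) throws SQLException {
            conn.setSchema(database);
        }
    }

    /** Runs the default hook against a recording stand-in connection. */
    static String applySchema(String database) {
        String[] applied = new String[1];
        Connection conn = (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                (proxy, method, args) -> {
                    if ("setSchema".equals(method.getName())) {
                        applied[0] = (String) args[0];
                    }
                    return null; // safe here: only void methods are invoked
                });
        try {
            new DialectSketch() {}.setCurrentDatabase(conn, database);
        } catch (SQLException unreachableForTheProxy) {
            throw new IllegalStateException(unreachableForTheProxy);
        }
        return applied[0];
    }

    public static void main(String[] args) {
        System.out.println(applySchema("doctor")); // doctor
    }
}
```

The proxy only demonstrates the call contract; a real dialect would be exercised against its actual driver.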
```scala
    // Calling both means exactly one fires regardless of how the user configured the
    // URL. StarRocksDialect / DorisDialect inherit this via MySQLDialect.
    conn.setCatalog(schema)
    conn.setSchema(schema)
```
Hmm... is this really safe? I think StarRocks and Doris support both catalog and schema ...
Good point to think through. The dual setCatalog + setSchema is safe because mysql-connector-j picks one of the two to actually send and silently swallows the other inside the driver, based on its databaseTerm connection property: with the default CATALOG, setCatalog sends a "switch database" command and setSchema does nothing; with SCHEMA it's the reverse. So no matter how the user configured the URL, the StarRocks/Doris server only ever receives one "switch database" command, never two.
Confirmed by reading the driver source — both methods funnel into the same `setDatabase(...)` call after the `databaseTerm` check, see ConnectionImpl.java#L2145. `setCatalog` is literally `if (databaseTerm == CATALOG) setDatabase(catalog);` and `setSchema` is the symmetric `if (databaseTerm == SCHEMA) setDatabase(schema);` — exactly one branch fires per call site, and neither throws when its branch is skipped.
The server treats that command as "switch the current database" — it does not treat it as "switch catalog" (catalog switching on StarRocks/Doris is a separate SET CATALOG SQL statement, not something the MySQL protocol can do). The name we pass in comes from USE_DATABASE, which the Hive JDBC driver parses out of the URL path. The Hive URL syntax has no concept of catalog, so this name is always a database in the current (default) catalog.
```scala
    } else {
      schemaName
    }
    val query = dialect.getTablesQuery(catalogName, effectiveSchema, tableName, tableTypes)
```
I don't recall the details of the initial code of this part, but the API looks problematic.
the Hive Thrift GetTablesOperation API is derived from the JDBC API, and that's why they look almost identical, for JDBC engine, such a call should directly map to the JDBC API instead of requesting a SQL (if you take a look at the JDBC driver implementation, you may find some impls are non-SQL-based), the current design complicates the whole thing.
Thanks for raising this — you're right, and I want to confirm I took it seriously rather than hand-wave it away. To make sure I understood, I ran the full set of DatabaseMetaData calls against three different drivers (mysql-connector-j 8.4.0, postgresql 42.7.2, clickhouse-jdbc 0.6.5) using a small Java probe and docker containers. Test fixtures were tailored to each backend (MySQL/PG had PK + FK + a function; ClickHouse used MergeTree which has no native PK/FK).
Cross-driver behavior of each API:
| API | MySQL (`databaseTerm=CATALOG`) | MySQL (`databaseTerm=SCHEMA`) | PostgreSQL | ClickHouse |
|---|---|---|---|---|
| `getCatalogs()` | ✓ implemented | ⚠ empty (driver `databaseTerm` quirk) | ✓ implemented | ✓ implemented |
| `getSchemas()` | ⚠ empty (driver `databaseTerm` quirk) | ✓ implemented | ✓ implemented | — model has no schema layer |
| `getTableTypes()` | ✓ | ✓ | ✓ | ✓ |
| `getTables(...)` | ✓ | ✓ | ✓ | ✓ |
| `getColumns(...)` | ✓ | ✓ | ✓ | ✓ |
| `getPrimaryKeys` | ✓ | ✓ | ✓ | — MergeTree has no PK |
| `getImported` / `Exported` / `CrossReference` | ✓ | ✓ | ✓ | — no FK in CK |
| `getFunctions(...)` | ✓ | ✓ | ✓ | (no fixture created) |
| `getTypeInfo()` | ✓ | ✓ | ✓ | ✓ |
Things that hold across all three drivers:
- No `SQLFeatureNotSupportedException` anywhere — every JDBC-standard metadata method is implemented and returns a `ResultSet` rather than throwing, even when the data model doesn't populate it.
- Column names follow the JDBC spec (`TABLE_CAT`, `TABLE_SCHEM`, `TABLE_NAME`, `COLUMN_NAME`, `DATA_TYPE`, `KEY_SEQ`, `PK_NAME`, ...) — which is exactly what `GetTablesOperation` / `GetColumnsOperation` / etc. on the Hive Thrift side expect, since the Thrift API was derived from the JDBC API in the first place.
- Driver-specific quirks are minor and well-defined — `databaseTerm` gating is unique to mysql-connector-j; PG returns column labels in lowercase while MySQL/CK use uppercase (the JDBC spec says case-insensitive, but engine-side string comparisons would need to match).
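The label-case quirk in the last bullet is the kind of thing a few lines of engine-side adapter can absorb. A sketch — the class and method names are illustrative, not engine code:

```java
import java.util.List;

/**
 * Illustrative case-insensitive column-label lookup, matching the JDBC spec's
 * case-insensitive treatment of metadata column names.
 */
public class LabelLookup {

    /** Returns the 1-based index of the label matching ignoring case, or -1. */
    static int findColumn(List<String> labels, String wanted) {
        for (int i = 0; i < labels.size(); i++) {
            if (labels.get(i).equalsIgnoreCase(wanted)) {
                return i + 1; // JDBC column indices are 1-based
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // PostgreSQL-style lowercase labels still match JDBC-standard names.
        List<String> pgLabels = List.of("table_cat", "table_schem", "table_name");
        System.out.println(findColumn(pgLabels, "TABLE_SCHEM")); // 2
    }
}
```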
So the JDBC standard API really does cover every operation the dialect models today, across very different DBMS families. Concretely, this means the `JdbcDialect.getXxxQuery` family of methods could be deleted in favor of the engine calling `connection.getMetaData.getXxx(...)` directly — and dialects whose drivers implement metadata via a non-SQL path (Avatica, etc.) would automatically work the way they were designed to.
The few wrinkles I hit are all driver-side and resolvable in a few lines of engine-level adapter code, not by reaching for SQL:
- `getCatalogs` / `getSchemas` are mutually exclusive on mysql-connector-j (gated by `databaseTerm`) — handled by checking the property and routing, or by calling both and taking whichever is non-empty.
- `DATA_TYPE` is `int` in the JDBC spec but the current SQL returns a string — a `java.sql.Types`-to-name mapping is one helper.
- The current SQL returns extra MySQL-specific columns (`ENGINE`, `TABLE_ROWS`, ...) but the Hive Thrift `GetTables` result schema is fixed to JDBC-standard columns, so those extras are already dropped during serialization today and aren't reaching clients — no real loss.
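For the `DATA_TYPE` wrinkle, `java.sql.JDBCType` already knows the standard names, so the helper could be roughly this — the fallback behavior for backend-specific names (e.g. MySQL's `MEDIUMTEXT`) is my assumption, not something the engine does today:

```java
import java.sql.JDBCType;
import java.sql.Types;

/** Sketch of a type-name-to-java.sql.Types mapping for the DATA_TYPE column. */
public class DataTypeMapping {

    /**
     * Maps a SQL type name (e.g. "VARCHAR") to its java.sql.Types constant,
     * falling back to Types.OTHER for names JDBCType does not know.
     */
    static int toJdbcType(String sqlTypeName) {
        try {
            return JDBCType.valueOf(sqlTypeName.toUpperCase()).getVendorTypeNumber();
        } catch (IllegalArgumentException unknownName) {
            return Types.OTHER; // backend-specific name, e.g. MySQL "MEDIUMTEXT"
        }
    }

    public static void main(String[] args) {
        System.out.println(toJdbcType("varchar") == Types.VARCHAR);    // true
        System.out.println(toJdbcType("mediumtext") == Types.OTHER);   // true
    }
}
```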
I'd like to keep this PR scoped to the URL-database bug since that's what it set out to fix, and migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test — not something I'd want to fold into a bug fix. Happy to open a follow-up issue tracking the migration if you agree with that split. WDYT?
migrating the dialect API to call DatabaseMetaData directly will touch every dialect, the operation manager, and every metadata test
if we do that first, do we still need this patch?
No SQLFeatureNotSupportedException anywhere
Actually, the JDBC API Javadocs define behavior in a very clear way, it should not throw SQLFeatureNotSupportedException unless Javadocs say it can. Some tools, like DBeaver, may tolerate it if the JDBC driver vendor does not respect the API contract, but tools like DataGrip may not.
BTW, we MUST avoid copying the code from mysql-connector-j, it's GPL-licensed ... try to read the docs or use black box testing to tackle the mysql-connector-j-related issue. (this will be a horrible experience ...)
Why are the changes needed?
This is a bug fix.
The Hive JDBC driver rewrites a `null` `schemaPattern` to `"%"` in `GetTables`, and the JDBC engine forwards it directly into `WHERE TABLE_SCHEMA LIKE '%'`, so DBeaver — which always passes `null` once a connection is established — sees tables from every database in the backend rather than the one in its connection URL. Compounding this, the JDBC engine never applies the URL-level database to the underlying JDBC connection, so there is no "current database" on the backend for a filter to fall back to.

The fix introduces a dialect-level `setCurrentDatabase` hook, tracks an `effectiveDatabase` on the session, and routes blank / `"%"` schema patterns through it in `GetTablesOperation`. `MySQLDialect` additionally implements `getCatalogsOperation` / `getSchemasOperation` with JDBC-standard column labels (`TABLE_CATALOG` / `TABLE_SCHEM`) so the schemas tree in DBeaver populates correctly.

Non-MySQL dialects inherit the default no-op `setCurrentDatabase` and `effectiveDatabase` stays `None`, so PostgreSQL / Oracle / etc. keep their existing behavior.

Note: `getSchemas()` continues to return all accessible schemas per the JDBC spec — the URL-level database only affects the default query context and the blank / `"%"` fallback in `getTables`. This matches the behavior of the native MySQL JDBC driver. Clients that want to restrict the visible schema list should do so via client-side filters (e.g. DBeaver's "Schemas" filter).

How was this patch tested?

Added unit / integration tests in `MySQLOperationSuite` covering:

- `getTables` with a database in the URL returns only that database's tables
- `getTables` with a `"%"` schema pattern is routed to the session's effective database
- `getSchemas` returns results with JDBC-standard `TABLE_CATALOG` / `TABLE_SCHEM` column labels
- `getCatalogs` behavior

Also verified end-to-end against a live StarRocks backend using Spark's Hive-JDBC beeline.
Was this patch authored or co-authored using generative AI tooling?
Human-authored core design and implementation. Tests, verification, and PR description drafting were assisted by Claude Code.
Generated-by: Claude Code (Opus 4.7)
Closes #7305. Supersedes the stale #7307 per @pan3793's suggestion there.
Co-authored-by: zhzhenqin <zhzhenqin@users.noreply.github.com>