DbType2#1632
Conversation
…ableColumnMetadata` in `generateTypeInformation()`. This contains potential pre- and post-processing logic for any type
…recursive preprocessing
6b39cc6 to
9c9a699
Compare
…s, that use JdbcTypeMapping
…some types by default
| val columnKTypes = buildColumnKTypes(tableColumns, dbType) | ||
| val columnData = readAllRowsFromResultSet(rs, tableColumns, columnKTypes, dbType, limit) | ||
| val dataFrame = buildDataFrameFromColumnData(columnData, tableColumns, columnKTypes, dbType, inferNullability) | ||
| val expectedJdbcTypes = getExpectedJdbcTypes( |
There was a problem hiding this comment.
ExpectedJdbcTypes could be more complex structure, but better to keep order of column according indicies for debugging, not name-based, it could be edge object with fields: index, name, KType for example
There was a problem hiding this comment.
That's certainly possible! Though that would look a bit more like AdvancedDbType which, for each column, requires you to provide an AnyJdbcToDataFrameConverter containing all information needed to read and convert that column. Maybe the concept could be merged
| dbType = dbType, | ||
| tableColumns = tableColumns, | ||
| ) | ||
| val preprocessedValueTypes = getPreprocessedValueTypes( |
There was a problem hiding this comment.
Describe somewhere the processes
Expected->Preprocessed (what's the difference and why we need this step, what does it give)
| ): DataFrameSchema { | ||
| val determinedDbType = dbType ?: extractDBTypeFromConnection(connection) | ||
|
|
||
| // TODO don't need to read 1 row, take it just from TableColumnMetadatas |
There was a problem hiding this comment.
It's very safe and cheap way for any database, I moved from taking info from TableColumnMetadata, also lead to less error - prone in our codebase and make it flexible
There was a problem hiding this comment.
Except for empty databases that just have a schema and no data. When building just the schema, we shouldn't have to look at the actual database contents. It's up to the DbType implementor to provide a watertight getExpectedJdbcType(), getPreprocessedValueType(), and getTargetColumnSchema() from just TableColumnMetadata. So if we manage to construct TableColumnMetadatas from connection, tableName and dbType, we can simply call those functions to get a schema without accessing any data.
There was a problem hiding this comment.
...and for databases you may have no query access to
| columnIndex: Int, | ||
| tableColumnMetadata: TableColumnMetadata, | ||
| expectedJdbcType: KType, | ||
| ): J? = |
There was a problem hiding this comment.
J-DBC. It's clearer in the context of AdvancedDbType:
J-DBC typeD-ataFrame typeP-ost processed type
…ed. Added `getDataFrameCompatibleColumnNames()` function to handle missing or duplicate names, as they can apparently appear from sql but will break DF
…lder-like pattern
|
@zaleslaw I'm not sure what to do about |
…classes for each module
There was a problem hiding this comment.
Pull request overview
This PR represents a major refactoring of the JDBC module's type inference and conversion system, introducing a new three-stage pipeline architecture for converting database types to DataFrame columns. The changes address several long-standing issues related to type conversions, nested types (STRUCT, ARRAY, MAP), and date/time handling.
Changes:
- Introduced
JdbcToDataFrameConverterto encapsulate the three-stage type conversion pipeline (JDBC type → preprocessed value → final DataColumn) - Added
AdvancedDbTypeabstract class for databases with complex type systems (like DuckDB) - Refactored
DbTypeAPI: removedconvertSqlTypeToKType/convertSqlTypeToColumnSchemaValuein favor ofgetExpectedJdbcType/getPreprocessedValueType/getTargetColumnSchema/buildDataColumn - Updated DuckDB implementation to support STRUCT, MAP, ARRAY, and JSON types with proper conversions
- Changed column name deduplication from underscore-separated ("name_1") to no separator ("name1")
- Updated BuildConfig package naming to include project name
- Updated DuckDB version from 1.3.1.0 to 1.4.2.0
- Converted Java types to Kotlin equivalents (java.sql.Timestamp → kotlin.time.Instant, java.util.UUID → kotlin.uuid.Uuid, etc.)
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/DbType.kt | Core refactoring: new API methods for type conversion pipeline, removed old methods |
| dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/AdvancedDbType.kt | New abstract class for databases with complex type mappings using JdbcToDataFrameConverter |
| dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/JdbcToDataFrameConverter.kt | New converter infrastructure with type-safe preprocessing and column building |
| dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/DuckDb.kt | Major refactoring to extend AdvancedDbType, add support for STRUCT/MAP/ARRAY/JSON types |
| dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/readJdbc.kt | Refactored to use new three-stage pipeline, added schema validation in DEBUG mode |
| dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/readDataFrameSchema.kt | Updated to use new API methods |
| dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/*.kt | Updated all database implementations (PostgreSql, MySql, MariaDb, MsSql, H2, Sqlite) to new API |
| dataframe-jdbc/src/test/kotlin/org/jetbrains/kotlinx/dataframe/io/local/duckDbTest.kt | Updated tests for new type conversions and nested types support |
| dataframe-jdbc/src/test/kotlin/org/jetbrains/kotlinx/dataframe/io/h2/*.kt | Updated expected types to match Kotlin conversions (Timestamp → Instant) |
| build-logic/src/main/kotlin/dfbuild.buildConfig.gradle.kts | Changed BuildConfig package naming to include project name |
| gradle/libs.versions.toml | Updated DuckDB version to 1.4.2.0 |
| core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/ColumnNameGenerator.kt | Column name deduplication now uses numeric suffixes without separators |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| schema.compare(Person.expected.schema()).isSuperOrMatches() shouldBe true | ||
| withClue({ |
There was a problem hiding this comment.
It looks like we have code in commonTestScenarios with the same intention, could we move this schema validation there unify it together somehow?
9159c44 to
6b1802e
Compare
|
@zaleslaw I reverted Instead, I fixed #537 and its "money" type by only calling All local tests pass now |
Fixes #1273
Fixes #1587
Fixes #461
Fixes #537
Helps #387
Takes care of #462, but only for DuckDB.
Follows up on #1266 and #462
DbTypewas overhauled greatly:convertSqlTypeToColumnSchemaValueandconvertSqlTypeToKTypein favor ofgetExpectedJdbcTypeextractValueFromResultSetin favor ofgetValueFromResultSetResultSetinDbType:getValueFromResultSet()retrieves a value from aResultSetgiven a column and row with the type given bygetExpectedJdbcType()preprocessValue()with the new type given bygetPreprocessedValueType()buildDataColumn()with the new schema given bygetTargetColumnSchema()ColumnNameGeneratorfor consistent column name repair across IO readers #387LocalDateTimeto Kotlin'sLocalDateTime,java.sql.TimeStampto KotlinInstantand UUIDs to KotlinUuid. More preprocessing might follow later.AdvancedDbType
AdvancedDbType+JdbcToDataFrameConverterfor easier Type-conversion logicDuckDB
AdvancedDbTypeOther:
BuildConfignow includes the module name to not clash with other modules.