@akkik04 akkik04 commented Nov 26, 2025

Summary

This PR implements the helper requested in #588 for creating ClickHouse tables from PyArrow schemas.

  • Adds a new utility arrow_schema_to_column_defs(schema: pa.Schema) -> list[TableColumnDef] that converts a pyarrow.Schema into TableColumnDef instances.
  • Adds create_table_from_arrow_schema(table_name, schema, engine, engine_params) as a convenience wrapper that reuses the existing create_table helper.
  • Supports core scalar Arrow types and maps them to ClickHouse types:
    • pa.int8/16/32/64 → Int8/16/32/64
    • pa.uint8/16/32/64 → UInt8/16/32/64
    • pa.float16/float32 → Float32
    • pa.float64 → Float64
    • pa.string()/pa.large_string() → String
    • pa.bool_() → Bool
  • For other Arrow types, the helper raises TypeError so callers are explicitly aware that automatic mapping is not yet implemented.

This allows patterns like:

arrow_table = pa.table(...)
col_defs = arrow_schema_to_column_defs(arrow_table.schema)
ddl = create_table("my_table", col_defs, "MergeTree", {"ORDER BY": "id"})
client.command(ddl)
client.insert_arrow("my_table", arrow_table)
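
The scalar mapping can be illustrated with a minimal sketch (keys are Arrow type-name strings for brevity; the actual helper dispatches on pyarrow type objects, and this dict and function name are illustrative, not the library's code):

```python
# Simplified sketch of the Arrow -> ClickHouse scalar type mapping described
# above. Keys are Arrow type names as strings; the real helper inspects
# pyarrow type objects. This mapping is illustrative only.
_ARROW_TO_CH = {
    "int8": "Int8", "int16": "Int16", "int32": "Int32", "int64": "Int64",
    "uint8": "UInt8", "uint16": "UInt16", "uint32": "UInt32", "uint64": "UInt64",
    "halffloat": "Float32",  # pa.float16 prints as "halffloat" in Arrow
    "float": "Float32",      # pa.float32
    "double": "Float64",     # pa.float64
    "string": "String", "large_string": "String",
    "bool": "Bool",
}

def arrow_type_name_to_ch(type_name: str) -> str:
    """Map an Arrow type name to a ClickHouse type, or raise TypeError."""
    try:
        return _ARROW_TO_CH[type_name]
    except KeyError:
        raise TypeError(
            f"Unsupported Arrow type for automatic mapping: {type_name}"
        ) from None
```

Anything outside the supported scalar set raises TypeError, matching the explicit-failure behavior described above.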

Checklist:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

Closes #588

@CLAassistant
Copy link

CLAassistant commented Nov 26, 2025

CLA assistant check
All committers have signed the CLA.

Copilot AI left a comment

Pull request overview

This PR adds functionality to create ClickHouse tables from PyArrow schema objects, addressing issue #588. It introduces two new helper functions that map PyArrow types to ClickHouse types and generate CREATE TABLE statements.

Key changes:

  • New arrow_schema_to_column_defs() function converts PyArrow schemas to ClickHouse column definitions with support for core scalar types (integers, floats, strings, booleans)
  • New create_table_from_arrow_schema() convenience wrapper that combines schema conversion with table creation
  • Comprehensive integration tests covering basic type mappings, unsupported types, and DDL generation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File descriptions:

  • clickhouse_connect/driver/ddl.py: Implements PyArrow-to-ClickHouse type mapping and schema conversion functions
  • tests/integration_tests/test_pyarrow_ddl.py: Adds integration tests for PyArrow schema conversion and table creation
  • CHANGELOG.md: Documents the new feature in the unreleased improvements section


@akkik04 akkik04 force-pushed the feature/arrow-schema-to-column-defs branch from 77bcd68 to a3396c0 Compare December 2, 2025 00:48

akkik04 commented Dec 2, 2025

Addressed some changes; can I get some eyes on it whenever y'all get a chance? 🙌
cc: @joe-clickhouse @mshustov

@joe-clickhouse

Sure thing @akkik04, thanks for the contribution! I'll review by tomorrow.

@akkik04 akkik04 force-pushed the feature/arrow-schema-to-column-defs branch from 71a8274 to 1c4cf8b Compare December 2, 2025 05:29
@joe-clickhouse joe-clickhouse left a comment

Thanks @akkik04! In general I think this looks pretty good. One thing that did cross my mind during review that we'll need to discuss/work through is nullable behavior. Arrow fields are nullable by default. However, this implementation creates non-nullable columns in ClickHouse. I wouldn't want to just automatically wrap everything in Nullable() since that does have performance implications. There are a couple of ways to approach this depending on what the expected workflow here is.

The tests just build up a schema from scratch (no data, just metadata) and use that to convert to TableColumnDef objects. This is fine for unit tests but I'm going to assume that the user will most likely have an actual arrow table with data already (if you guys envision this differently, let me know) otherwise, I'd argue they should just build the ClickHouse table in the traditional way with raw SQL in a client.command() statement.

Assuming they do have a table with data, we can go one of two ways:

  1. Null checks on arrow columns should be super cheap, so we can check each column:

     for column in arrow_table.columns:
         has_nulls = column.null_count > 0

     If no nulls are found, then everything is fine as it was. If nulls are found, wrap the type name in Nullable() before creating the TableColumnDef:

     ch_type_name = f"Nullable({ch_type_name})"
  2. take an 'optimistic non-null' approach and create non-nullable columns by default. If the user tries to insert an arrow table with nulls, they'll get a clear ClickHouse error indicating which column has nulls, and they can adjust the DDL accordingly. This keeps the common case (no nulls) performant while still catching issues immediately. Note that this approach assumes the nullability characteristics of the initial data are representative of future inserts. Users can always manually adjust the DDL if their data patterns change.

I'm inclined at this point to take approach 2 because it's simpler and optimizes for the common case (no nulls). The error will be immediate and actionable.
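
To make approach 1 concrete, here is a minimal pure-Python sketch. Columns are modeled as (name, ch_type, null_count) tuples so the example runs without PyArrow installed; the function name is hypothetical, and a real implementation would read null_count from the columns of an actual pyarrow.Table.

```python
# Sketch of approach 1: wrap a column's ClickHouse type in Nullable() only
# when the Arrow column actually contains nulls. Columns are modeled as
# (name, ch_type, null_count) tuples; with pyarrow installed you would use
# arrow_table.column(i).null_count instead.
def column_types_with_nullability(columns):
    result = []
    for name, ch_type, null_count in columns:
        if null_count > 0:  # cheap: Arrow tracks a per-column null count
            ch_type = f"Nullable({ch_type})"
        result.append((name, ch_type))
    return result
```

For example, a table with a fully populated id column and a note column containing nulls would yield Int64 and Nullable(String) respectively.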

Comment on lines 43 to 46
if pa is None:
    raise ImportError(
        "PyArrow is required, but it is not installed."
    )

There's actually a utility in driver/options.py that'll do this for you. So you can replace the try/except above on lines 5-8 with from clickhouse_connect.driver.options import check_arrow and then here in the _arrow_type_to_ch function, just replace the if pa is None check/raise with pa = check_arrow()
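
As a rough sketch of what such a utility amounts to (the body below is assumed; the actual check_arrow in driver/options.py may differ in message and details):

```python
# Rough sketch of a check_arrow-style utility: return the pyarrow module if
# available, otherwise raise a clear ImportError. The exact behavior of
# clickhouse_connect.driver.options.check_arrow may differ.
def check_arrow():
    try:
        import pyarrow as pa  # optional dependency
    except ImportError as ex:
        raise ImportError("PyArrow is not installed") from ex
    return pa
```

This centralizes the optional-dependency check so each helper can simply call pa = check_arrow() instead of repeating its own try/except import.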

@akkik04 replied:

This is nice and convenient, will work it in.


akkik04 commented Dec 3, 2025

> Thanks @akkik04! In general I think this looks pretty good. One thing that did cross my mind during review that we'll need to discuss/work through is nullable behavior. [...]

Thanks for calling out the nullable behaviour, that makes sense.

Right now, the helper is effectively following your option (2): it always generates non-nullable ClickHouse types and doesn't infer Nullable from the PyArrow schema or data. If a user inserts a PyArrow table with nulls, ClickHouse will raise an error and they can adjust the DDL (e.g., wrap specific columns in Nullable(...)).

No functional changes are needed to accommodate this chosen behaviour; however, I'll update the docstring to explicitly document the "optimistic non-null" behaviour so the intent is clear.


akkik04 commented Dec 3, 2025

bundled those changes we discussed into a commit. feel free to take a look when you get a chance @joe-clickhouse 🙌

@joe-clickhouse joe-clickhouse left a comment

Thanks for the fixes! Only thing you'll need to do now is run pylint tests and pylint clickhouse_connect and address those issues as the workflow won't even run until those issues are addressed.


akkik04 commented Dec 3, 2025

> Thanks for the fixes! Only thing you'll need to do now is run pylint tests and pylint clickhouse_connect and address those issues [...]

Done.

@joe-clickhouse joe-clickhouse left a comment

Looks good! Thanks for the contribution.

@joe-clickhouse joe-clickhouse merged commit 70bd2d2 into ClickHouse:main Dec 3, 2025
35 checks passed