@akkik04 akkik04 commented Nov 26, 2025

Summary

This PR implements the helper requested in #588 for creating ClickHouse tables from PyArrow schemas.

  • Adds a new utility arrow_schema_to_column_defs(schema: pa.Schema) -> list[TableColumnDef] that converts a pyarrow.Schema into TableColumnDef instances.
  • Adds create_table_from_arrow_schema(table_name, schema, engine, engine_params) as a convenience wrapper that reuses the existing create_table helper.
  • Supports core scalar Arrow types and maps them to ClickHouse types:
    • pa.int8/16/32/64 → Int8/16/32/64
    • pa.uint8/16/32/64 → UInt8/16/32/64
    • pa.float16/float32 → Float32
    • pa.float64 → Float64
    • pa.string()/pa.large_string() → String
    • pa.bool_() → Bool
  • For other Arrow types, the helper raises TypeError so callers are explicitly aware that automatic mapping is not yet implemented.

This allows patterns like:

arrow_table = pa.table(...)
col_defs = arrow_schema_to_column_defs(arrow_table.schema)
ddl = create_table("my_table", col_defs, "MergeTree", {"ORDER BY": "id"})
client.command(ddl)
client.insert_arrow("my_table", arrow_table)
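
The scalar mapping can be illustrated with a minimal sketch (keys are Arrow type-name strings for brevity; the actual helper dispatches on pyarrow type objects, and this dict and function name are illustrative, not the library's code):

```python
# Simplified sketch of the Arrow -> ClickHouse scalar type mapping described
# above. Keys are Arrow type names as strings; the real helper inspects
# pyarrow type objects. This mapping is illustrative only.
_ARROW_TO_CH = {
    "int8": "Int8", "int16": "Int16", "int32": "Int32", "int64": "Int64",
    "uint8": "UInt8", "uint16": "UInt16", "uint32": "UInt32", "uint64": "UInt64",
    "halffloat": "Float32",  # pa.float16 prints as "halffloat" in Arrow
    "float": "Float32",      # pa.float32
    "double": "Float64",     # pa.float64
    "string": "String", "large_string": "String",
    "bool": "Bool",
}

def arrow_type_name_to_ch(type_name: str) -> str:
    """Map an Arrow type name to a ClickHouse type, or raise TypeError."""
    try:
        return _ARROW_TO_CH[type_name]
    except KeyError:
        raise TypeError(
            f"Unsupported Arrow type for automatic mapping: {type_name}"
        ) from None
```

Anything outside the supported scalar set raises TypeError, matching the explicit-failure behavior described above.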

Checklist:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

Closes #588

@CLAassistant
Copy link

CLAassistant commented Nov 26, 2025

CLA assistant check
All committers have signed the CLA.

Copilot AI left a comment

Pull request overview

This PR adds functionality to create ClickHouse tables from PyArrow schema objects, addressing issue #588. It introduces two new helper functions that map PyArrow types to ClickHouse types and generate CREATE TABLE statements.

Key changes:

  • New arrow_schema_to_column_defs() function converts PyArrow schemas to ClickHouse column definitions with support for core scalar types (integers, floats, strings, booleans)
  • New create_table_from_arrow_schema() convenience wrapper that combines schema conversion with table creation
  • Comprehensive integration tests covering basic type mappings, unsupported types, and DDL generation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File descriptions:

  • clickhouse_connect/driver/ddl.py: Implements PyArrow-to-ClickHouse type mapping and schema conversion functions
  • tests/integration_tests/test_pyarrow_ddl.py: Adds integration tests for PyArrow schema conversion and table creation
  • CHANGELOG.md: Documents the new feature in the unreleased improvements section


@akkik04 akkik04 force-pushed the feature/arrow-schema-to-column-defs branch from 77bcd68 to a3396c0 Compare December 2, 2025 00:48

akkik04 commented Dec 2, 2025

Addressed some changes; can I get some eyes on it whenever y'all get a chance? 🙌
cc: @joe-clickhouse @mshustov

@joe-clickhouse

Sure thing @akkik04, thanks for the contribution! I'll review by tomorrow.

@akkik04 akkik04 force-pushed the feature/arrow-schema-to-column-defs branch from 71a8274 to 1c4cf8b Compare December 2, 2025 05:29
@joe-clickhouse joe-clickhouse left a comment

Thanks @akkik04! In general I think this looks pretty good. One thing that did cross my mind during review that we'll need to discuss/work through is nullable behavior. Arrow fields are nullable by default. However, this implementation creates non-nullable columns in ClickHouse. I wouldn't want to just automatically wrap everything in Nullable() since that does have performance implications. There are a couple of ways to approach this depending on what the expected workflow here is.

The tests just build up a schema from scratch (no data, just metadata) and use that to convert to TableColumnDef objects. This is fine for unit tests but I'm going to assume that the user will most likely have an actual arrow table with data already (if you guys envision this differently, let me know) otherwise, I'd argue they should just build the ClickHouse table in the traditional way with raw SQL in a client.command() statement.

Assuming they do have a table with data, we can go one of two ways:

  1. Null checks on arrow columns should be super cheap, so we can check each column:

     for column in arrow_table.columns:
         has_nulls = column.null_count > 0

     If no nulls are found, then everything is fine as it was. If nulls are found, wrap the type name in Nullable() before creating the TableColumnDef:

     ch_type_name = f"Nullable({ch_type_name})"
  2. take an 'optimistic non-null' approach and create non-nullable columns by default. If the user tries to insert an arrow table with nulls, they'll get a clear ClickHouse error indicating which column has nulls, and they can adjust the DDL accordingly. This keeps the common case (no nulls) performant while still catching issues immediately. Note that this approach assumes the nullability characteristics of the initial data are representative of future inserts. Users can always manually adjust the DDL if their data patterns change.

I'm inclined at this point to take approach 2 because it's simpler and optimizes for the common case (no nulls). The error will be immediate and actionable.
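
To make approach 1 concrete, here is a minimal pure-Python sketch. Columns are modeled as (name, ch_type, null_count) tuples so the example runs without PyArrow installed; the function name is hypothetical, and a real implementation would read null_count from the columns of an actual pyarrow.Table.

```python
# Sketch of approach 1: wrap a column's ClickHouse type in Nullable() only
# when the Arrow column actually contains nulls. Columns are modeled as
# (name, ch_type, null_count) tuples; with pyarrow installed you would use
# arrow_table.column(i).null_count instead.
def column_types_with_nullability(columns):
    result = []
    for name, ch_type, null_count in columns:
        if null_count > 0:  # cheap: Arrow tracks a per-column null count
            ch_type = f"Nullable({ch_type})"
        result.append((name, ch_type))
    return result
```

For example, a table with a fully populated id column and a note column containing nulls would yield Int64 and Nullable(String) respectively.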

Comment on lines 43 to 46
if pa is None:
    raise ImportError(
        "PyArrow is required, but it is not installed."
    )

There's actually a utility in driver/options.py that'll do this for you. So you can replace the try/except above on lines 5-8 with from clickhouse_connect.driver.options import check_arrow and then here in the _arrow_type_to_ch function, just replace the if pa is None check/raise with pa = check_arrow()
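
As a rough sketch of what such a utility amounts to (the body below is assumed; the actual check_arrow in driver/options.py may differ in message and details):

```python
# Rough sketch of a check_arrow-style utility: return the pyarrow module if
# available, otherwise raise a clear ImportError. The exact behavior of
# clickhouse_connect.driver.options.check_arrow may differ.
def check_arrow():
    try:
        import pyarrow as pa  # optional dependency
    except ImportError as ex:
        raise ImportError("PyArrow is not installed") from ex
    return pa
```

This centralizes the optional-dependency check so each helper can simply call pa = check_arrow() instead of repeating its own try/except import.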

@akkik04 replied:

This is nice and convenient, will work it in.


akkik04 commented Dec 3, 2025

> Thanks @akkik04! In general I think this looks pretty good. One thing that did cross my mind during review that we'll need to discuss/work through is nullable behavior. [...]

Thanks for calling out the nullable behaviour, that makes sense.

Right now, the helper is effectively following your option (2): it always generates non-nullable ClickHouse types and doesn't infer Nullable from the PyArrow schema or data. If a user inserts a PyArrow table with nulls, ClickHouse will raise an error and they can adjust the DDL (e.g., wrap specific columns in Nullable(...)).

No functional changes are needed to accommodate this chosen behaviour; however, I'll update the docstring to explicitly document the "optimistic non-null" behaviour so the intent is clear.


akkik04 commented Dec 3, 2025

bundled those changes we discussed into a commit. feel free to take a look when you get a chance @joe-clickhouse 🙌

@joe-clickhouse joe-clickhouse left a comment

Thanks for the fixes! Only thing you'll need to do now is run pylint tests and pylint clickhouse_connect and address those issues as the workflow won't even run until those issues are addressed.


akkik04 commented Dec 3, 2025

> Thanks for the fixes! Only thing you'll need to do now is run pylint tests and pylint clickhouse_connect and address those issues [...]

Done.

@joe-clickhouse joe-clickhouse left a comment

Looks good! Thanks for the contribution.

@joe-clickhouse joe-clickhouse merged commit 70bd2d2 into ClickHouse:main Dec 3, 2025
35 checks passed