add feature to create tables from pyarrow objects #597
Conversation
Pull request overview
This PR adds functionality to create ClickHouse tables from PyArrow schema objects, addressing issue #588. It introduces two new helper functions that map PyArrow types to ClickHouse types and generate CREATE TABLE statements.
Key changes:
- New `arrow_schema_to_column_defs()` function converts PyArrow schemas to ClickHouse column definitions with support for core scalar types (integers, floats, strings, booleans)
- New `create_table_from_arrow_schema()` convenience wrapper that combines schema conversion with table creation
- Comprehensive integration tests covering basic type mappings, unsupported types, and DDL generation
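For illustration, the scalar mapping those helpers rely on could be written roughly like this (a sketch assembled from the type list in this PR's summary; the actual structure in clickhouse_connect/driver/ddl.py may differ):

```python
import pyarrow as pa

# Illustrative PyArrow -> ClickHouse scalar type mapping, per this PR's summary;
# note pa.float16 widens to Float32 in the stated mapping
_ARROW_TO_CH = {
    pa.int8(): 'Int8', pa.int16(): 'Int16', pa.int32(): 'Int32', pa.int64(): 'Int64',
    pa.uint8(): 'UInt8', pa.uint16(): 'UInt16', pa.uint32(): 'UInt32', pa.uint64(): 'UInt64',
    pa.float16(): 'Float32', pa.float32(): 'Float32', pa.float64(): 'Float64',
    pa.string(): 'String', pa.large_string(): 'String',
    pa.bool_(): 'Bool',
}
```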
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| clickhouse_connect/driver/ddl.py | Implements PyArrow-to-ClickHouse type mapping and schema conversion functions |
| tests/integration_tests/test_pyarrow_ddl.py | Adds integration tests for PyArrow schema conversion and table creation |
| CHANGELOG.md | Documents the new feature in the unreleased improvements section |
addressed some changes, can i get some eyes on it whenever y'all get a chance 🙌

Sure thing @akkik04, thanks for the contribution! I'll review by tomorrow.
joe-clickhouse
left a comment
Thanks @akkik04! In general I think this looks pretty good. One thing that did cross my mind during review that we'll need to discuss/work through is nullable behavior. Arrow fields are nullable by default. However, this implementation creates non-nullable columns in ClickHouse. I wouldn't want to just automatically wrap everything in Nullable() since that does have performance implications. There are a couple of ways to approach this depending on what the expected workflow here is.
The tests just build up a schema from scratch (no data, just metadata) and use that to convert to TableColumnDef objects. This is fine for unit tests, but I'm going to assume that the user will most likely have an actual arrow table with data already (if you guys envision this differently, let me know); otherwise, I'd argue they should just build the ClickHouse table in the traditional way with raw SQL in a client.command() statement.
Assuming they do have a table with data, we can go one of two ways:
- Null checks on arrow columns should be super cheap, so we can check each column:

  ```python
  for column in arrow_table.columns:
      has_nulls = column.null_count > 0
  ```

  If no nulls are found, then everything is fine as it was. If nulls are found, then wrap the type name in Nullable() before creating the TableColumnDef:

  ```python
  ch_type_name = f"Nullable({ch_type_name})"
  ```

- Take an 'optimistic non-null' approach and create non-nullable columns by default. If the user tries to insert an arrow table with nulls, they'll get a clear ClickHouse error indicating which column has nulls, and they can adjust the DDL accordingly. This keeps the common case (no nulls) performant while still catching issues immediately. Note that this approach assumes the nullability characteristics of the initial data are representative of future inserts. Users can always manually adjust the DDL if their data patterns change.
I'm inclined at this point to take approach 2 because it's simpler and optimizes for the common case (no nulls). The error will be immediate and actionable.
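For concreteness, here is a minimal self-contained sketch of what the approach-1 variant could look like; the helper name and the small type_map are illustrative, not the PR's actual code:

```python
import pyarrow as pa

def ch_types_with_null_check(arrow_table: pa.Table) -> list:
    """Hypothetical helper: wrap a column's ClickHouse type in Nullable()
    only when the Arrow column actually contains nulls."""
    # Illustrative subset of the scalar mapping described in this PR
    type_map = {pa.int64(): 'Int64', pa.float64(): 'Float64',
                pa.string(): 'String', pa.bool_(): 'Bool'}
    ch_types = []
    for field, column in zip(arrow_table.schema, arrow_table.columns):
        ch_type_name = type_map[field.type]
        if column.null_count > 0:  # metadata lookup, no full data scan
            ch_type_name = f"Nullable({ch_type_name})"
        ch_types.append(ch_type_name)
    return ch_types

table = pa.table({'id': [1, 2, 3], 'name': ['a', None, 'c']})
print(ch_types_with_null_check(table))  # ['Int64', 'Nullable(String)']
```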
clickhouse_connect/driver/ddl.py
Outdated
```python
if pa is None:
    raise ImportError(
        "PyArrow is required, but it is not installed."
    )
```
There's actually a utility in `driver/options.py` that'll do this for you. You can replace the try/except above on lines 5-8 with `from clickhouse_connect.driver.options import check_arrow`, and then here in the `_arrow_type_to_ch` function, just replace the `if pa is None` check/raise with `pa = check_arrow()`.
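A rough sketch of the suggested shape (the function body is elided; this assumes `check_arrow()` returns the pyarrow module, as the comment's `pa = check_arrow()` usage suggests):

```python
from clickhouse_connect.driver.options import check_arrow

def _arrow_type_to_ch(arrow_type):
    # check_arrow() handles the missing-dependency case, raising a clear
    # error if pyarrow isn't installed, which replaces both the manual
    # try/except import and the explicit `if pa is None` raise
    pa = check_arrow()
    ...  # type mapping logic continues here
```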
This is nice and convenient, will work it in.
Thanks for calling out the nullable behaviour, that makes sense. Right now, the helper is effectively following your option (2): it always generates non-nullable ClickHouse types and doesn't infer nullability, so I don't have to make any functional changes to accommodate this chosen/desired behaviour. However, I'll update the docstring to explicitly document this "optimistic non-null" behaviour so the intent is clear.

bundled those changes we discussed into a commit. feel free to take a look when you get a chance @joe-clickhouse 🙌
joe-clickhouse
left a comment
Thanks for the fixes! The only thing you'll need to do now is run `pylint tests` and `pylint clickhouse_connect` and address any reported issues, as the CI workflow won't run until those are resolved.
Done.
joe-clickhouse
left a comment
Looks good! Thanks for the contribution.
Summary
This PR implements the helper requested in #588 for creating ClickHouse tables from PyArrow schemas.
- `arrow_schema_to_column_defs(schema: pa.Schema) -> list[TableColumnDef]` converts a `pyarrow.Schema` into `TableColumnDef` instances.
- `create_table_from_arrow_schema(table_name, schema, engine, engine_params)` is a convenience wrapper that reuses the existing `create_table` helper.
- Supported type mappings:
  - `pa.int8/16/32/64` → `Int8/16/32/64`
  - `pa.uint8/16/32/64` → `UInt8/16/32/64`
  - `pa.float16/float32` → `Float32`
  - `pa.float64` → `Float64`
  - `pa.string()` / `pa.large_string()` → `String`
  - `pa.bool_()` → `Bool`
- Unsupported Arrow types raise a `TypeError` so callers are explicitly aware that automatic mapping is not yet implemented.

This allows patterns like:
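(The concrete example wasn't preserved above, so the following is a hedged sketch of the intended pattern: the helper name and signature follow the summary, while the connection settings, table contents, the `engine_params` format, and the assumption that the helper returns a DDL string for `client.command()` are illustrative.)

```python
import pyarrow as pa
import clickhouse_connect
from clickhouse_connect.driver.ddl import create_table_from_arrow_schema

client = clickhouse_connect.get_client(host='localhost')  # illustrative connection

arrow_table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})

# Build CREATE TABLE DDL from the Arrow schema (engine_params format assumed)
ddl = create_table_from_arrow_schema('demo_table', arrow_table.schema,
                                     'MergeTree', {'order_by': 'id'})
client.command(ddl)

# Insert the Arrow data into the newly created table
client.insert_arrow('demo_table', arrow_table)
```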
Closes #588