Feat/2200 add clickhouse distributed support #2573
base: devel
Conversation
thanks for this @zstipanicev !
what I'd ideally do here:
- try to implement all DDL statements by changing `clickhouse`-specific code without rewriting the queries:
  - CREATE/ALTER TABLE: in `_get_table_update_sql` you can call SQL generation twice with different table names. you can also temporarily change the database name for the distributed table. I do not think you need to create a distributed pair for the SENTINEL table (we mark existence of the dataset with it)
  - DROP: in `drop_tables`. we also have duplicate code in `drop_dataset`, you can just call `drop_tables` from there.
Other queries are indeed best rewritten. Are you familiar with SQLGlot? we have it as a required dependency. This is, for example, what you need to do to add a GLOBAL join:
```python
import sqlglot
from sqlglot import exp

def add_global_joins(sql: str) -> str:
    # 1. Parse using the ClickHouse dialect
    tree = sqlglot.parse_one(sql, read="clickhouse")

    # 2. Walk the AST once and flip the `global` flag
    def _add_global(node: exp.Expression) -> exp.Expression:
        if isinstance(node, exp.Join) and not node.args.get("global"):
            node.set("global", True)  # <- the magic line
        return node

    tree = tree.transform(_add_global)

    # 3. Re-emit SQL for ClickHouse
    return tree.sql(dialect="clickhouse")
```
renaming tables is also easy. we are trying to port all our SQL generation to sqlglot (slowly)
this OFC depends on how much time you have. if you cannot do those changes, we'll take over (this fix makes a lot of sense) but that will take ~2 releases to complete
"""Set to True if ClickHouse tables are distributed/shareded across multiple nodes, this will enable creating base and distributed tables.""" | ||
cluster: Optional[str] = None | ||
"""Cluster name for sharded tables. This is used in ON CLUSTER clause for sharded distributed tables""" | ||
base_table_database_prefix: Optional[str] = None |
why do we need to have two databases/catalogs?
At my current company, this is our internal rule/best practice: we store base tables in "_db" with an added "_base" postfix, and distributed tables in "db". Only distributed tables are accessed directly by tools (dbt, reporting, ...)
If the database prefix is left blank, the same database is used for both types of tables
OK this makes sense! we'll need to document this workflow in our clickhouse docs (clickhouse.md AFAIK, also the default behavior)
"""Cluster name for sharded tables. This is used in ON CLUSTER clause for sharded distributed tables""" | ||
base_table_database_prefix: Optional[str] = None | ||
"""Prefix for the database name of the base table. This is used for sharded distributed tables.""" | ||
base_table_name_postfix: Optional[str] = None |
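For reference, a hypothetical TOML snippet showing how these PR-proposed fields might be wired up to match the workflow described above (the section path and field names are assumptions based on this diff; the final naming may differ):

```toml
[destination.clickhouse]
# base tables land in "_<database>" with a "_base" postfix,
# distributed tables stay in "<database>"
cluster = "my_cluster"
base_table_database_prefix = "_"
base_table_name_postfix = "_base"
```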
if we have a separate database, why must table names differ?
I'm just trying to comply with my current company's rules; it can be kept blank to keep the same name.
Having both options adds flexibility to support various setups in different companies.
```python
# db name for the base table
base_db = self.config.base_table_database_prefix + self.credentials.database
```
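One thing to watch in the line above: `base_table_database_prefix` is `Optional[str]`, so concatenating while it is `None` raises `TypeError`. A minimal defensive sketch (helper name is mine, not from the PR):

```python
from typing import Optional

def base_database(prefix: Optional[str], database: str) -> str:
    # Fall back to "" so an unset prefix means "same database for both tables".
    return (prefix or "") + database
```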
```python
if self.contains_string(qry, "CREATE") and self.contains_string(qry, "ENGINE = Memory"):
```
we can just change it in `_to_temp_table` for all types of tables if that performs better
That is also a solution.
The change here is not about performance: Memory engine tables exist on a single replica only, and that doesn't work with a distributed setup. If the next query reading from that table is connected to a different replica than the one used for inserting data, it would see an empty table.
The change here is to ensure we always see the data.
Sorry for not documenting that in the comment as well.
After looking at the `_to_temp_table` function I don't think it's possible to make the change there, as it doesn't have access to the configuration, and without it there is no way to determine whether it's a distributed setup or not.
Is it possible to change the class or the function to have access to the config?
yes we can do that. in your PR you can change it to ReplicatedMergeTree, and then we can do the final tweak (passing the config will need a code refactor with which you should IMO not bother).
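As a rough illustration of the suggested interim change, a sketch of emitting the temp-table DDL with ReplicatedMergeTree instead of Memory, so the inserted rows are visible from any replica (function name and signature are assumptions, not dlt's actual `_to_temp_table`):

```python
def to_temp_table_sql(select_sql: str, table_name: str) -> str:
    # ReplicatedMergeTree replicates the temp data across nodes; Memory
    # tables live on a single replica only (hypothetical helper, not dlt code).
    return (
        f"CREATE TABLE {table_name} "
        "ENGINE = ReplicatedMergeTree ORDER BY tuple() "
        f"AS {select_sql}"
    )
```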
Thx for all the suggestions @rudolfix!
please keep me posted, this PR is pretty cool so if you get stuck or don't have time lmk.
… sqlglot to alter queries when possible
Hey @rudolfix I adjusted the PR.
I decided not to change the code within `_to_temp_table` because it's not only about the engine: the ON CLUSTER clause would also need to be added, but without access to the configured cluster name it cannot be made generic.

I used SQLGlot where possible. Unfortunately, SQLGlot doesn't fully support every ClickHouse statement, and I clearly stated this in the comments.

P.S. SQLGlot changes ` to ", I couldn't find a way to change that behaviour

Edit:
Edit 2:
…orking on distributed setup without GLOBAL
Description
When ClickHouse is set up with replicas and distributed tables, DDL & DML statements need to be modified.
This was achieved by adding additional configuration parameters for ClickHouse and modifying queries in the `execute_query` function in the `sql_client` file.
This way we didn't change any core dlt functionality; all changes are restricted to the clickhouse destination, and to a single function.
For ClickHouse, the changes are applied only if the configuration is set for a distributed ClickHouse setup.
To test the changes, a ClickHouse cluster with a few replicated nodes is needed.
Related Issues
Additional Context