
Feat/2200 add clickhouse distributed support #2573


Open · zstipanicev wants to merge 7 commits into devel

Conversation

zstipanicev

Description

When ClickHouse is set up with replicas and distributed tables, DDL & DML statements need to be modified:

  • ON CLUSTER needs to be added after the table name
  • A pair of tables needs to be created: a base table and a distributed table
  • ALTER and DROP need to be executed for both tables, base and distributed
  • Deletes need to be done on the base table
  • GLOBAL needs to be added to all joins, as we can't know which ones don't need it

This was achieved by adding additional configuration parameters for ClickHouse and modifying queries in the execute_query function in the sql_client file.
This way we didn't change any core dlt functionality; all changes are restricted to the ClickHouse destination, and to a single function.
For ClickHouse, the changes are applied only if the configuration is set for a distributed ClickHouse setup.

To test the changes, a ClickHouse cluster with a few replicated nodes is needed.
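
For illustration, here is a rough sketch of the kind of DDL pair this implies. The database, table, cluster, and engine names below are made up, and the exact statements generated by the destination may differ:

# Illustrative only: names and engines are hypothetical, not the exact DDL from this PR.
cluster = "my_cluster"
base_db, dist_db = "_analytics", "analytics"

# Base (local) table, created on every node via ON CLUSTER.
create_base = f"""
CREATE TABLE {base_db}.events_base ON CLUSTER {cluster} (
    event_id String,
    created_at DateTime
)
ENGINE = ReplicatedMergeTree
ORDER BY event_id
"""

# Distributed table that routes reads and writes to the base table on each shard.
create_distributed = f"""
CREATE TABLE {dist_db}.events ON CLUSTER {cluster}
AS {base_db}.events_base
ENGINE = Distributed({cluster}, {base_db}, events_base, rand())
"""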


netlify bot commented Apr 29, 2025

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 5593683
Latest deploy log: https://app.netlify.com/projects/dlt-hub-docs/deploys/6841c5ac9f976d00082ff682

@rudolfix rudolfix (Collaborator) left a comment

thanks for this @zstipanicev !

what I'd ideally do here:

  1. try to implement all DDL statements by changing ClickHouse-specific code without rewriting the queries:
  • CREATE/ALTER TABLE: in def _get_table_update_sql() you can call SQL generation twice with different table names; you can also temporarily change the database name for the distributed table (see the sketch after this list)
  • I do not think you need to create a distributed pair for the SENTINEL table (we mark the existence of a dataset with it)
  • DROP: in drop_tables. We also have duplicate code in drop_dataset; you can just call drop_tables from there.
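
For concreteness, a minimal sketch of what "call SQL generation twice" could look like. The stub base class and the exact _get_table_update_sql signature are assumptions made to keep the example self-contained, not dlt's actual API:

from typing import List, Sequence


class _StubSqlJobClient:
    """Stand-in for dlt's SQL job client, only here to keep the sketch runnable."""

    def _get_table_update_sql(
        self, table_name: str, new_columns: Sequence[dict], generate_alter: bool
    ) -> List[str]:
        verb = "ALTER" if generate_alter else "CREATE"
        return [f"{verb} TABLE {table_name} (...)"]  # placeholder DDL


class DistributedClickHouseClient(_StubSqlJobClient):
    base_table_name_postfix = "_base"  # assumed config value

    def _get_table_update_sql(
        self, table_name: str, new_columns: Sequence[dict], generate_alter: bool
    ) -> List[str]:
        # Generate DDL once for the base (local) table under its postfixed name ...
        base_sql = super()._get_table_update_sql(
            table_name + self.base_table_name_postfix, new_columns, generate_alter
        )
        # ... and once more for the distributed table, then emit both statements.
        dist_sql = super()._get_table_update_sql(table_name, new_columns, generate_alter)
        return base_sql + dist_sql


print(DistributedClickHouseClient()._get_table_update_sql("events", [], False))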

Other queries are indeed best rewritten. Are you familiar with SQLGlot? We have it as a required dependency. This is, for example, what you need to do to add a GLOBAL join:

import sqlglot
from sqlglot import exp


def add_global_joins(sql: str) -> str:
    # 1.  Parse using the ClickHouse dialect
    tree = sqlglot.parse_one(sql, read="clickhouse")

    # 2.  Walk the AST once and flip the `global` flag
    def _add_global(node: exp.Expression) -> exp.Expression:
        if isinstance(node, exp.Join) and not node.args.get("global"):
            node.set("global", True)  # <- the magic line
        return node

    tree = tree.transform(_add_global)

    # 3.  Re-emit SQL for ClickHouse
    return tree.sql(dialect="clickhouse")

Renaming tables is also easy; we are trying to port all our SQL generation to sqlglot (slowly).
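
For example, a small sqlglot sketch that repoints table references to a base database and postfixed table name; the helper name and the example values are made up, this is only an illustration and not the PR's code:

import sqlglot
from sqlglot import exp


def point_to_base_tables(sql: str, base_db: str, postfix: str) -> str:
    # Repoint every table reference to <base_db>.<table><postfix>.
    tree = sqlglot.parse_one(sql, read="clickhouse")
    for table in tree.find_all(exp.Table):
        table.set("this", exp.to_identifier(table.name + postfix))
        table.set("db", exp.to_identifier(base_db))
    return tree.sql(dialect="clickhouse")


print(point_to_base_tables("DELETE FROM analytics.events WHERE id = 1", "_analytics", "_base"))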

This of course depends on how much time you have. If you cannot do those changes, we'll take over (this fix makes a lot of sense), but that will take ~2 releases to complete.

"""Set to True if ClickHouse tables are distributed/shareded across multiple nodes, this will enable creating base and distributed tables."""
cluster: Optional[str] = None
"""Cluster name for sharded tables. This is used in ON CLUSTER clause for sharded distributed tables"""
base_table_database_prefix: Optional[str] = None
rudolfix (Collaborator):

Why do we need to have two databases/catalogs?

zstipanicev (Author):

At my current company, this is our internal rule/best practice: we store base tables in "_db" with an added "_base" postfix, and distributed tables in "db". Only distributed tables are accessed directly by tools (dbt, reporting, ...).
If the database prefix is left blank, the same database is used for both types of tables.
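
To make the convention concrete, a tiny illustrative example (the values are hypothetical) of how the names combine:

# Hypothetical values, only to illustrate the naming convention described above.
database = "analytics"
table = "events"
base_table_database_prefix = "_"
base_table_name_postfix = "_base"

base_db = base_table_database_prefix + database  # "_analytics" holds the base tables
base_table = table + base_table_name_postfix     # "events_base"
distributed = f"{database}.{table}"              # "analytics.events", what tools query
print(f"base: {base_db}.{base_table}  distributed: {distributed}")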

rudolfix (Collaborator):

OK, this makes sense! We'll need to document this workflow in our ClickHouse docs (clickhouse.md AFAIK), and also the default behavior.

"""Cluster name for sharded tables. This is used in ON CLUSTER clause for sharded distributed tables"""
base_table_database_prefix: Optional[str] = None
"""Prefix for the database name of the base table. This is used for sharded distributed tables."""
base_table_name_postfix: Optional[str] = None
rudolfix (Collaborator):

If we have a separate database, why must the table names differ?

zstipanicev (Author):

I'm just trying to comply with my current company's rules; it can be kept blank to keep the same name.
Having both options adds flexibility to support various setups at different companies.

# db name for the base table
base_db = self.config.base_table_database_prefix + self.credentials.database

if (self.contains_string(qry, "CREATE") and self.contains_string(qry, "ENGINE = Memory")):
rudolfix (Collaborator):

we can just change it in _to_temp_table for all types of tables if that performs better

zstipanicev (Author):

That is also a solution.
The change here is not about performance: Memory-engine tables exist on a single replica only, and that doesn't work with a distributed setup. If the next query reading from that table is connected to a different replica than the one used for inserting data, it would see an empty table.
The change here is to ensure we always see the data.
Sorry for not documenting that in the comment as well.

zstipanicev (Author):

After looking at the _to_temp_table function, I don't think it's possible to make the change there, as it doesn't have access to the configuration, and without it it's not possible to determine whether it's a distributed setup or not.
Is it possible to change the class or the function to have access to the config?

rudolfix (Collaborator):

Yes, we can do that. In your PR you can change it to ReplicatedMergeTree, and then we can do the final tweak (passing the config will need a code refactor with which you should IMO not bother).
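
For reference, a purely illustrative before/after of the temp-table DDL under that suggestion; the real _to_temp_table signature and the final engine choice may well differ:

def to_temp_table_memory(select_sql: str, temp_name: str) -> str:
    # current behaviour: Memory engine, rows visible only on the replica that ran the insert
    return f"CREATE TABLE {temp_name} ENGINE = Memory AS {select_sql}"


def to_temp_table_replicated(select_sql: str, temp_name: str, cluster: str) -> str:
    # suggested behaviour: a replicated table created on the whole cluster, so a follow-up
    # query hitting any replica sees the same rows
    return (
        f"CREATE TABLE {temp_name} ON CLUSTER {cluster} "
        f"ENGINE = ReplicatedMergeTree ORDER BY tuple() "
        f"AS {select_sql}"
    )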

zstipanicev (Author):

Thx for all the suggestions @rudolfix!
I'll have a look at those.


rudolfix commented May 2, 2025

Thx for all the suggestions @rudolfix! I'll have a look at those.

please keep me posted, this PR is pretty cool so if you get stuck or don't have time lmk.


zstipanicev commented May 15, 2025

Hey @rudolfix

I adjusted the PR.

  • Changed CREATE, ALTER, and DROP statements in functions where they are defined
  • Added new config options to clickhouse.md with an additional explanation.

I decided not to change the code within _to_temp_table because it's not only about the engine. The ON CLUSTER clause would also need to be added, but without access to the configured cluster name, it cannot be made generic.
I'm also unsure whether ReplicatedMergeTree would work for setups with a single node, as I don't have an instance to test it.
This will still be handled in execute_query.

I used SQLGlot where possible. Unfortunately, SQLGlot doesn't fully support every ClickHouse statement, and I clearly stated this in the comments.
The solution looks a bit like Frankenstein's monster, combining sqlglot and RegExp.
I tested the code with GSheet, Stripe, and Salesforce, and it runs without issue.
Let me know if you have any suggestions or questions.

P.S. SQLGlot changes ` to "; I couldn't find a way to change that behaviour.
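
If it only matters for the final emitted string, one blunt post-processing workaround is a plain text swap after sqlglot renders the query; this assumes string literals only use single quotes (the ClickHouse default), so every double quote in the output is an identifier quote:

def requote_identifiers(sql: str) -> str:
    # Crude workaround; breaks if a string literal contains a double quote.
    return sql.replace('"', "`")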

Edit:
The latest commit fixes a bug with TRUNCATE statements: they need to run on the base tables, not the distributed tables, or the data is not removed (also stated in the comment). The good thing about testing in production is that all bugs are caught... eventually...

Edit 2:
I had initially missed that it's not allowed to run updates on the distributed table; those need to be executed on the base table. This is now adjusted. Sorry for missing this initially. I am now testing merge with scd2 and this came up.
Also, I could not find the function which generates update statements. I found gen_update_table_prefix, and that one currently doesn't have access to the config, so I can't make it work for both a single-node and a distributed setup. Again, I'm handling this in execute_query.
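
As a rough illustration of that kind of rewrite (the real logic in execute_query is more involved, and the database and postfix values below are made up):

import re

BASE_DB = "_analytics"  # hypothetical base database
POSTFIX = "_base"       # hypothetical base table postfix


def route_to_base_table(sql: str) -> str:
    # Redirect statements that must run on the base table to <BASE_DB>.<table><POSTFIX>.
    # A real implementation would distinguish ALTER ... UPDATE from schema ALTERs,
    # which have to run on both tables.
    pattern = r"^(TRUNCATE TABLE|ALTER TABLE)\s+(\w+)\.(\w+)"

    def _repl(m: re.Match) -> str:
        return f"{m.group(1)} {BASE_DB}.{m.group(3)}{POSTFIX}"

    return re.sub(pattern, _repl, sql, count=1, flags=re.IGNORECASE)


print(route_to_base_table("ALTER TABLE analytics.events UPDATE x = 1 WHERE id = 2"))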

rudolfix added the "ci from fork" label (run ci workflows on a pr even if they are from a fork) on May 26, 2025

Successfully merging this pull request may close these issues:

  • Cluster support for clickhouse
2 participants