clickhouse_loader.py versus direct load from S3 #1162

@Bens-ct

Description

I'm at the stage where I think the sink connector is absolutely the right choice for me, but the one downside I've encountered is the initial sync: since the data isn't dumped to Kafka to be picked up at leisure, if the initial snapshot fails for any reason the whole process essentially starts again.

I have ~500 MySQL databases with ~75 tables each to sync, and I'm breaking them down into batches of 50 with a connector for each, so getting them all done as an initial sync is proving quite challenging.

The clickhouse_loader script looks like a winner here, but I also see that ClickHouse supports direct import from S3, so I was wondering whether anyone has insights as to whether one option is better than the other?
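For reference, the direct S3 path I have in mind is just ClickHouse's `s3()` table function. A minimal sketch of building the statement per table (bucket, database, and table names below are placeholders, not anything from my actual setup):

```python
# Hypothetical helper: builds the ClickHouse SQL I'd run to import one
# table's dump from a public/unauthenticated S3 path. Credentials and the
# Parquet format choice would need adjusting for a real bucket.

def s3_import_sql(bucket: str, db: str, table: str) -> str:
    """Return an INSERT ... SELECT FROM s3(...) statement for one table."""
    url = f"https://{bucket}.s3.amazonaws.com/{db}/{table}/*.parquet"
    return (
        f"INSERT INTO {db}.{table} "
        f"SELECT * FROM s3('{url}', 'Parquet')"
    )

print(s3_import_sql("my-dump-bucket", "shop_001", "orders"))
```

Looping that over ~37,500 tables is trivial to script, which is part of the appeal.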

Ideally I'll script the whole process: run the initial syncs from a registry of databases stored in ClickHouse, then automatically switch on CDC once each sync is completed.
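The batching side of that script is straightforward; roughly what I mean, as a sketch (the registry contents and the `database.include.list` key are placeholders following the Debezium convention, not the connector's confirmed settings):

```python
# Sketch: split the database registry into groups of 50 and build one
# connector config stub per group. Config keys are assumptions to verify
# against the sink connector's documentation.
from typing import Iterator


def batches(dbs: list[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` databases."""
    for i in range(0, len(dbs), size):
        yield dbs[i:i + size]


def connector_config(name: str, dbs: list[str]) -> dict:
    """Build a minimal connector config covering one batch of databases."""
    return {
        "name": name,
        "config": {
            "database.include.list": ",".join(dbs),
        },
    }


registry = [f"shop_{i:03d}" for i in range(500)]  # stand-in for the registry table
configs = [connector_config(f"sync-{n}", group)
           for n, group in enumerate(batches(registry))]
print(len(configs))  # 10 connectors, 50 databases each
```

Each config would then be POSTed to the Kafka Connect REST endpoint once its batch's initial sync reports complete.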

The scripts here look reasonably well tested, although from a quick look I don't see an obvious way for the MySQL dumper script to capture the binlog position in a single transaction.
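For what it's worth, mysqldump itself can do this: `--single-transaction` combined with `--master-data=2` takes a consistent snapshot and writes the binlog coordinates as a commented `CHANGE MASTER TO` line in the dump header. A rough sketch of the invocation and a parser for that line (database name is a placeholder):

```python
import re

# mysqldump flags that take a consistent snapshot and record the binlog
# position as a commented CHANGE MASTER TO line in the dump header.
DUMP_CMD = [
    "mysqldump",
    "--single-transaction",      # consistent snapshot without locking tables
    "--master-data=2",           # write binlog coordinates as a comment
    "--databases", "shop_001",   # placeholder database name
]

CHANGE_MASTER_RE = re.compile(
    r"CHANGE MASTER TO MASTER_LOG_FILE='(?P<file>[^']+)',\s*"
    r"MASTER_LOG_POS=(?P<pos>\d+)"
)


def binlog_position(dump_header: str) -> tuple[str, int]:
    """Extract (log_file, log_pos) from the dump's CHANGE MASTER TO comment."""
    m = CHANGE_MASTER_RE.search(dump_header)
    if m is None:
        raise ValueError("no binlog coordinates found in dump header")
    return m.group("file"), int(m.group("pos"))


sample = "-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=1337;"
print(binlog_position(sample))  # ('mysql-bin.000042', 1337)
```

That position could then seed the CDC connector's starting offset once the bulk load finishes.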

In theory, dumping the databases to S3 should be reasonably straightforward, so in the absence of suggestions I'm probably at the point of tossing a coin.
