clickhouse_loader.py versus direct load from S3 #1162

@Bens-ct

Description

I'm at the stage where I think the sink connector is absolutely the right choice for me, but the one downside I've encountered is the initial sync: since the data isn't dumped to Kafka to be picked up at leisure, if the initial snapshot fails for any reason the whole process essentially starts again.

I have ~500 MySQL databases with ~75 tables each to sync, and I'm breaking them down into batches of 50 with a connector for each, so getting them all done as an initial sync is proving quite challenging.

The clickhouse_loader script looks like a winner here, but I also see that ClickHouse supports direct import from S3, so I was wondering whether anyone has insights as to whether one option is better than the other?
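For reference, the direct S3 path I have in mind is just ClickHouse's `s3()` table function. A minimal sketch of building the statement per table (bucket, database, and table names below are placeholders, not anything from my actual setup):

```python
# Hypothetical helper: builds the ClickHouse SQL I'd run to import one
# table's dump from a public/unauthenticated S3 path. Credentials and the
# Parquet format choice would need adjusting for a real bucket.

def s3_import_sql(bucket: str, db: str, table: str) -> str:
    """Return an INSERT ... SELECT FROM s3(...) statement for one table."""
    url = f"https://{bucket}.s3.amazonaws.com/{db}/{table}/*.parquet"
    return (
        f"INSERT INTO {db}.{table} "
        f"SELECT * FROM s3('{url}', 'Parquet')"
    )

print(s3_import_sql("my-dump-bucket", "shop_001", "orders"))
```

Looping that over ~37,500 tables is trivial to script, which is part of the appeal.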

Ideally I'll script the whole process: run the initial syncs from a registry of databases stored in ClickHouse, then automatically switch on CDC once each sync is completed.
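The batching side of that script is straightforward; roughly what I mean, as a sketch (the registry contents and the `database.include.list` key are placeholders following the Debezium convention, not the connector's confirmed settings):

```python
# Sketch: split the database registry into groups of 50 and build one
# connector config stub per group. Config keys are assumptions to verify
# against the sink connector's documentation.
from typing import Iterator


def batches(dbs: list[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` databases."""
    for i in range(0, len(dbs), size):
        yield dbs[i:i + size]


def connector_config(name: str, dbs: list[str]) -> dict:
    """Build a minimal connector config covering one batch of databases."""
    return {
        "name": name,
        "config": {
            "database.include.list": ",".join(dbs),
        },
    }


registry = [f"shop_{i:03d}" for i in range(500)]  # stand-in for the registry table
configs = [connector_config(f"sync-{n}", group)
           for n, group in enumerate(batches(registry))]
print(len(configs))  # 10 connectors, 50 databases each
```

Each config would then be POSTed to the Kafka Connect REST endpoint once its batch's initial sync reports complete.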

The scripts here look reasonably well tested, although from a quick look I don't see an obvious way for the MySQL dumper script to capture the binlog position in a single transaction.
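For what it's worth, mysqldump itself can do this: `--single-transaction` combined with `--master-data=2` takes a consistent snapshot and writes the binlog coordinates as a commented `CHANGE MASTER TO` line in the dump header. A rough sketch of the invocation and a parser for that line (database name is a placeholder):

```python
import re

# mysqldump flags that take a consistent snapshot and record the binlog
# position as a commented CHANGE MASTER TO line in the dump header.
DUMP_CMD = [
    "mysqldump",
    "--single-transaction",      # consistent snapshot without locking tables
    "--master-data=2",           # write binlog coordinates as a comment
    "--databases", "shop_001",   # placeholder database name
]

CHANGE_MASTER_RE = re.compile(
    r"CHANGE MASTER TO MASTER_LOG_FILE='(?P<file>[^']+)',\s*"
    r"MASTER_LOG_POS=(?P<pos>\d+)"
)


def binlog_position(dump_header: str) -> tuple[str, int]:
    """Extract (log_file, log_pos) from the dump's CHANGE MASTER TO comment."""
    m = CHANGE_MASTER_RE.search(dump_header)
    if m is None:
        raise ValueError("no binlog coordinates found in dump header")
    return m.group("file"), int(m.group("pos"))


sample = "-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=1337;"
print(binlog_position(sample))  # ('mysql-bin.000042', 1337)
```

That position could then seed the CDC connector's starting offset once the bulk load finishes.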

In theory, dumping the databases to S3 should be reasonably straightforward, so in the absence of suggestions I'm probably at the point of tossing a coin.
